Did the makers of Devin AI lie about their capabilities?
Their glamorous intro videos have a few holes that people are missing
Executive Highlights
The release of Devin by Cognition Labs raised a lot of interest among software engineers, investors, and other tech professionals. Devin’s supposedly amazing capabilities once again have many people panicking about the loss of Software Engineering as a career.
I originally was not planning to cover Devin and its hype for two reasons. Firstly, I’ve already done a deep-dive into LLM Coders and their technical limitations over here. Unless something changes dramatically, I would be mostly repeating myself.
Secondly, I don’t have particularly profound insight into software engineering workflows. Creators like Rahul Pandey, Ryan Peterman, Luca Rossi, and Logan Thorneloe have all done some great work on the importance of communication and stakeholder management in software engineering, and on how AI tools like Devin do nothing there. Compared to them, I know very little and would add almost nothing to the conversation (no doubt this is shocking to many of you who regard me as an Omniscient Deity to whom you should sacrifice your firstborn child. I am, in fact, human).
I think this misconception comes from the people thinking software engineers are paid well to code. We are paid well to solve difficult problems. This is why we call ourselves engineers even though we aren’t accredited in any way. Engineers solve problems using the tools at their disposal.
In a largely digital world, that tool is often code and that’s why software engineers exist. We use code to solve problems. It’s as simple as that. We’re paid well because that skill is in demand and highly valuable to a lot of companies. Companies need people who can identify how to address user needs and design the systems to do so. This becomes an increasingly difficult problem to solve when factoring in integration with other systems, scalability, reliability, local laws and regulations, etc. The software engineering skill that’s valuable is solving the problems — not writing the code.
-I cite this article a lot, but Logan really created greatness here.
However, a recent video by the YouTuber Internet of Bugs changed my mind. In it, Karl walks through some borderline deceptive practices used by Cognition Labs in their infamous demo video of Devin performing a request on Upwork. This prompted me to look deeper into some of the communications around Devin, and I spotted similar techniques that were being used to generate hype for Devin. The demos for Devin rely on extremely cherry-picked examples, omit key limitations, and employ other tactics that are commonly used to build hype.
In this article, we will be studying these tactics. While I have nothing against Devin or Cognition Labs (and some of the demos were very impressive), I deeply dislike the culture of clickbait, hero worship and hype that has taken over tech these days. In my opinion, this leads to several issues:
- Hype-based environments hide the real issues with a particular technology or solution. One need look no further than Crypto for a recently devastating example. Many people who got into Crypto did so without understanding the limitations (or how regulations can be a good thing). We’ve seen something similar in AI, with lawyers blindly trusting ChatGPT in court cases. And there are always the environmental impacts that get overlooked-
In 2016, the average business saved and stored 347.56 terabytes of data, according to research from HubSpot. Keeping that amount of data stored would generate nearly 700 tons of carbon dioxide each year.
-Source. Blind reliance on Gen AI will only shoot this number up.
- Hype diverts attention from alternatives, which can stifle progress as everyone rushes to adopt the same approach/method. Academia has been suffering from this for a while, with a flood of derivative papers that sacrifice innovation to play it safe in order to get accepted to a journal. In the worst cases, hype can remove resources from unsexy solutions that work- worsening the problem they set out to solve.
- Hype occasionally leads to upper management sanctioning projects that adversely impact their employees' careers. JP Morgan folk found out the hard way when JPM released WADU- an AI surveillance system that was meant to track employee productivity. Spoiler alert: the system created a lot of problems, including worsening mental health and false flagging. Instead of scrapping it, management has doubled down on it.
- Hype preys on the people who are most vulnerable to it or unaware of it. The 2008 crisis hit the financially illiterate who bought into the story that real estate never goes down (many of the people who pushed this agenda walked away rich). Laypeople lost money when they invested in the SPACs that charismatic salesmen like Chamath pushed. Even now, companies go public to capitalize on hype, even when they’re not profitable.
Once again, it’s the laypeople w/o the ability to see past the hype that lose out in this case.
Even if Devin isn’t meant to be a scam, this hype-based approach to pushing their products perpetuates these same problems. This creates an environment that makes it easier for scammers to prey on people (FOMO is a bitch). Through this article, I hope to make you more aware of some of the tactics used to create this hype and guard against it better.
I lost 2K Dollars in online scams
-A message I got from a reader after my work on the Business of AI Hype
With that out of the way, let’s look through some of Cognition’s videos to see the techniques used to inaccurately portray (lionize) Devin and its capabilities.
PS: If you are associated with Cognition and want to reach out to clarify/dispute my article, you are more than welcome to do so. My contact links are at the end of the article.
Devin and Upwork
Starting with the main attraction, let’s analyze the infamous video of Devin solving a task on Upwork to help Cognition cover the costs of development. It’s very impressive at face value, but upon digging into it, we start to see some interesting trends.
Take a look at the screenshot of the task description. You can clearly see that “road damage” was explicitly searched for. Looks like this task was cherry-picked to put Devin in the best light. Not terrible- this is a fairly common tactic. But to me this is already a little strange, b/c Devin is meant to be a software engineer. The bare minimum you expect from a software engineer (or any professional) is the ability to pick tasks that they are capable of solving.
Next, the video directly jumps to the person prompting Devin with the task (copying the first two lines). This skips the entire client communication part that an SWE would have to participate in, but we’re not going to dwell on it. Devin does something (we will cover this in a sec), and we finally have the finished output. Look at it.
Now look back at what was defined as the deliverable. Look back at this output. Look real close, b/c this is fairly common in a lot of demo/hype pieces on the internet.
Unless all that CTE from MMA is catching up to me, Devin did not even meet the requirement. It was expected to deliver instructions on how to set up something on AWS. Instead, it ran everything locally and diagnosed the results. I’ve rewatched the whole video a few times, but this gap between what was asked for and what was delivered is never acknowledged. This is a bait and switch. We’re lured in with the promise of Devin doing one thing, and it does something else entirely.
Either nobody in Cognition realized this very glaring hole, or they knew what they were doing. In either case, I’d be more skeptical of any big claims. Just to compare, I gave a screenshot of the task to Gemini, and it did a much better job at doing what it was meant to do. For something billed as a software engineer, it’s not very good at its job. But here’s where things get even more fun. Let’s look at what Devin does (a git clone) and more interestingly, how it does what it does.
When setting up, Devin runs into an error. This is where we see one of the coolest parts of this Devin demo- the iterative run-debug cycle it uses to clean house on error messages. It’s able to constantly run print statements to diagnose bugs and write code to fix them. However, the nature of the bugs in this case comes with a huge asterisk.
Here is Devin fixing a broken file in the task-
Notice the file name. Now head on over to the repo. Try to look for that file over there. Refresh your page, and try again. Restart your computer. Buy a new one. No matter what you do, this file will not show up.
Turns out Devin has cracked the secret to debugging that is only accessible to a few wise beings: you can fix 1000s of bugs if you’re the one who causes them. Now that is the kind of AGI I can get behind.
What’s funny about this whole thing is that Devin allegedly can read documentation. If that were the case, cloning and local setup would be a breeze, b/c the repo is pretty clear. There would be no need to create your own files in this manner.
Let’s reiterate: Devin creates its own bugs. It fixes those bugs. Cognition completely omits this in their demo, making it seem like Devin was fixing issues with the system. The influencers/media hyping up Devin never found out about this (or didn’t care to report it). One of Devin’s most impressive ‘qualities’ turns out to be a lie (or at least a strong misdirection).
Karl’s video (I strongly recommend a watch, he does an excellent job) provides an excellent counterpoint. He sets up the repo successfully and never runs into errors, since he doesn’t invent new files with errors in them. He also mentions that the file-reading code Devin writes in the demo is poorly done. The approach is outdated, most likely a result of old C code being heavily represented in the training data and forcing an outdated style.
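Karl doesn’t reproduce Devin’s exact snippet, so the code below is a hypothetical illustration of the general pattern he’s criticizing: a C-influenced manual readline loop, versus the idiomatic Python that just iterates over the file object. Both produce the same result; one just reads like it was translated from 1990s C.

```python
from pathlib import Path

# Create a small input file so the example is self-contained.
Path("data.txt").write_text("alpha\nbeta\n")

# C-influenced style: manual readline loop with explicit open/close.
f = open("data.txt")
old_style = []
line = f.readline()
while line:
    old_style.append(line.rstrip("\n"))
    line = f.readline()
f.close()

# Idiomatic Python: a context manager and direct iteration over the file.
with open("data.txt") as f:
    new_style = [line.rstrip("\n") for line in f]
```

Both lists end up identical; the idiomatic version is shorter, closes the file even on errors, and is the style you’d expect any working Python engineer (human or otherwise) to produce.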
Finally, let’s take a look at how long it takes to complete this.
This is the start time
and this is the report generation
Not exactly the performance of champions.
I get it, AI is hard. The same creativity we laud in AlphaGo and LLMs became the Hallucinations that everyone is out to remove. There’s a fine line between memorization and training data leakage. Stochasticity makes no sense. We’re essentially trying to build skyscrapers on shifting sands. If the people behind Devin sold it as a coding assistant, or were more honest about what Devin was actually doing and its limitations, I wouldn’t care. But these omissions feel deliberate, and that crosses a line for me. It starts to feel like an attempt to shock and awe the viewer so that they take things at face value and don’t ask too many questions.
Let’s look at some of their other videos to see if this is a pattern. Don’t worry, we’ll do these relatively quickly since the other demos are much shorter.
Can Devin Fix Bugs That the Developer Couldn’t Catch?
Next, let’s look at the video “AI finds and fixes a bug that I didn’t catch!” The developer in the demo has a GitHub repo where he shares code for competitive programming algorithms. However, one of his algorithms has a bug, and Devin is able to fix it.
Right off the bat, we see cherry-picking again. Competitive Programming plays to the strengths of AI-based coders (a clear problem, expected inputs + outputs, and plenty of prior training data). Not saying it’s easy, but it’s not stunning the way it once was (try running the problems in Gemini/GPT). But this is a demo, so maybe that isn’t the slam-dunk I’m pretending it is.
Next, take a look at the prompts given to the bot. Andrew (the demo guy) uses Devin to write tests for the code after he has already debugged it. This is a different task from using Devin to find a bug unprompted.
Notably, the prompts themselves are fairly detailed. For a lowly code assistant, this is great practice. But for the mighty ‘fully autonomous AI software engineer’ this feels a little too hand-holdy. But maybe that’s just me.
Devin runs into an error with the test cases, and it’s able to both source and resolve the errors on its own by adding some code. That’s amazing, no questions. However, this ‘spectacular’ demonstration of Devin’s ability to work autonomously really just showcases its ability to build on top of the foundations that Andrew had laid out. At the risk of redundancy: this is not so much an autonomous engineer as it is a very powerful coding assistant. The main innovation seems to be a run-debug loop that can work on its own iteratively. While this is cool, similar mechanisms also work extremely well with ChatGPT (anecdotally: this is one area where even free ChatGPT is far ahead of Gemini Advanced- mainly b/c of the weird alignment with Gemini that makes it not complete the code it’s generating).
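Cognition hasn’t published how Devin’s run-debug loop works internally, but the general mechanism is easy to sketch: execute the code, capture stderr, feed the error back to a model for a patched version, and repeat until the run is clean or a budget runs out. In the sketch below, `suggest_fix` is a placeholder for the LLM call, and `toy_fix` is a deliberately trivial stand-in so the loop can actually run; both names are my own, not Devin’s.

```python
import os
import subprocess
import sys
import tempfile

def run_debug_loop(code: str, suggest_fix, max_iters: int = 5):
    """Run `code` as a script; on failure, ask `suggest_fix(code, stderr)`
    for a patched version and retry, up to `max_iters` attempts."""
    for _ in range(max_iters):
        # Write the current candidate to a temp file and execute it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True, timeout=30,
            )
        finally:
            os.unlink(path)
        if result.returncode == 0:
            return code, True  # clean run: stop iterating
        # Feed the error back to the "model" for a patch.
        code = suggest_fix(code, result.stderr)
    return code, False  # budget exhausted without a clean run

# Toy "fixer" standing in for the model: patch one known bad identifier.
def toy_fix(code: str, stderr: str) -> str:
    if "NameError" in stderr:
        return code.replace("pritn", "print")
    return code

fixed, ok = run_debug_loop("pritn('hello')", toy_fix)
```

The first iteration fails with a NameError, the fixer patches the typo, and the second iteration runs cleanly. Note that a loop like this only validates what the error messages surface: if the code (or the model itself) introduced the bug in the first place, the loop will happily "fix" it and look impressive doing so.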
The time taken here seems to be about one hour. Again, quite slow. And a facet that was not mentioned in the demo.
I also went through Our AI software engineer fixes a bug in Python algebra system. It’s an amazing demo and shows how useful such systems could be for housekeeping. Devin solves a bug and saves the engineer some time, but the problem is once again very well-defined, with a straightforward solution. The demo doesn’t show an ability to deal with ambiguity, make decisions about architectural tradeoffs, or handle any real client communication.
At the risk of beating a dead horse, the pattern seems pretty clear: Cognition overpromises with Devin, refuses to touch upon critical limitations of the systems, and relies on demos that feel very bait and switch-y: the product that we are sold isn’t fully congruent with the product we see. Or maybe it’s just me.
I will end by stressing something crucial: just b/c Devin has limitations doesn’t mean it’s not useful. The goal here isn’t to tell you not to use it, but rather to highlight how one-sided media communications can be. Once you start to recognize techniques like cherry-picking and bait and switches, and start to look for omissions in PR campaigns, it becomes easier to guard against the negative effects of hype. The goal is always to be more critical of the information shared online (especially in unproven, hype-heavy fields) to make better decisions.
If you liked this article and wish to share it, please refer to the following guidelines.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
I put a lot of effort into creating work that is informative, useful, and independent from undue influence. If you’d like to support my writing, please consider becoming a paid subscriber to this newsletter. Doing so helps me put more effort into writing/research, reach more people, and supports my crippling chocolate milk addiction. Help me democratize the most important ideas in AI Research and Engineering to over 100K readers weekly.
PS- We follow a “pay what you can” model, which allows you to support within your means. Check out this post for more details and to find a plan that works for you.
I regularly share mini-updates on what I read on the Microblogging sites X(https://twitter.com/Machine01776819), Threads(https://www.threads.net/@iseethings404), and TikTok(https://www.tiktok.com/@devansh_ai_made_simple)- so follow me there if you’re interested in keeping up with my learnings.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819