OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
99 comments
· February 24, 2025 · rurp
deegles
I'm not convinced LLMs will evolve into general AI. The promise that it's just around the corner feels increasingly like a big scam.
kristopolous
I was never on board with it. It feels like the same kind of step change Google was - there was a time, around 1998, when it was just miles ahead of everything else out there. The first time you used it, it was like "geez, you got it right, I didn't know that was possible". It's big and it changed things, but it wasn't the end-of-history event a bunch of people are utterly convinced this is.
petesergeant
Depends what you mean by evolve. I don't think we'll get general AI by scaling, but I think general AI, if it arrives, will be able to trace its lineage very much back to LLMs. Journeys through the embedding space very much feel like the way forward to me, and that's what LLMs are.
brookst
I mean it’s been a couple of years!
It may or may not happen but “scam” means intentional deceit. I don’t think anyone actually knows where LLMs are going with enough certainty to use that pejorative.
krupan
Is it intentional deceit to tell everyone it's leading to something when, as you correctly point out, nobody actually knows if it will?
sesteel
It has made me stop using Google and StackOverflow. I can look most things up quickly, without rubber ducking with other people, and thus I am more efficient. It is also good at spotting bugs in a function if the APIs are known and the API version is something it was trained on. If I need to understand what something is doing, it can help annotate the lines.
I use it to improve my code, but I still cannot get it to do anything that is moderately complex. The paper tracks with what I've experienced.
I do think it will continue to rapidly evolve, but it is probably more of a cognitive aid than a replacement. I try to only use it when I am tight on time or need a crutch to help me keep going.
jameslk
I had to do something similar with BigQuery and some open source datasets recently.
As you mentioned, I had bad results with Claude. It kept hallucinating parts of the docs for the open datasets, coming up with nonsense columns, and not fixing errors even when presented with the error text and more context. I had a similar outcome with 4o.
But I tried the same with o1 and it was much better consistently, with full generations of queries and alterations. I fed it in some parts of docs anytime it struggled and it figured it out.
Ultimately I was able to achieve what I was trying to do with o1. I’m guessing the reasoning helped, especially when I confronted it about hallucinations and provided bits of the docs.
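As an aside, one cheap guard against hallucinated columns is to pull the real schema first and check the generated SQL against it. A rough sketch with the BigQuery Python client, using the public Shakespeare sample as a stand-in for whatever open dataset you're actually working with:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # Pull the actual schema so generated SQL can be checked against real column names
    table = client.get_table("bigquery-public-data.samples.shakespeare")
    print([field.name for field in table.schema])

    # Then run the (verified) query
    sql = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """
    for row in client.query(sql).result():
        print(row.corpus, row.total_words)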
Maybe the model and the lack of CoT could be part of the challenge you ran into?
whilenot-dev
> and provided bits of the docs.
At this point I'd ask myself whether I want my original problem solved or if I just want the LLM to succeed with my requested task.
jameslk
Yes, I imagine some do like to read and then ponder over the BigQuery docs. I like to get my work done. In my case, o1 nailed BigQuery flawlessly, saving me time. I just needed to feed in some parts of the open source dataset docs
jchw
I've had a pretty similar outlook and still kind of do, but I think I do understand the hype a little bit: I've found that Claude and Gemini 2 Pro (experimental) are sometimes able to do things that I genuinely don't expect them to be able to do. Of course, that was already the case before to a lesser extent, and I know that alone doesn't necessarily translate into usefulness.
So, I have been trying Gemini 2 Pro, mainly because I have free access to it for now, and I think it strikes a bit above being interesting and into the territory of being useful. It has the same failure mode issues that LLMs have always had, but honestly it has managed to generate code and answer questions that Google definitely was not helping with. When not dealing with hallucinations/knowledge gaps, the resulting code was shockingly decent, and it could generate hundreds of lines of code without an obvious error or bug at times, depending on what you asked. The main issues were occasionally missing an important detail or overly complicating some aspect. I found the quality of unit tests generated to be sub par, as it often made unit tests that strongly overlapped with each other and didn't necessarily add value (and rarely worked out-of-the-box anyways, come to think of it.)
When trying to use it for real-world tasks where I actually don't know the answers, I've had mixed results. On a couple of occasions it helped me get to the right place when Google searches were going absolutely nowhere, so the value proposition is clearly somewhere. It was good at generating decent mundane code, bash scripts, CMake code, Bazel, etc., which to me looked decently written, though I am not confident enough to actually use its output yet. Once it suggested a non-existent linker flag to solve an issue, but surprisingly it also inadvertently suggested a solution to my problem that did work (it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error with a very old and non-standard build environment, helping me get my DeaDBeeF plugin building with their upstream apbuild-based system).
But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically grab context and use web searches, try compiling or running tests, and maybe even have other LLMs "review" the work and try to get to a better state. I'm sure all of that exists, but I'm not really a huge LLM person so I haven't kept up with it. Personally, with the state frontier models are in, though, I'd like to try this sort of system if it does exist. I'd just like to see what the state of the art is capable of.
Even that aside, though, I can see this being useful especially since Google Search is increasingly unusable.
I do worry, though. If these technologies get better, it's probably going to make a lot of engineers struggle to develop deep problem-solving skills, since you will need them a lot less to get started. Learning to RTFM, dig into code and generally do research is valuable stuff. Having a bot you can use as an infinite lazyweb may not be the greatest thing.
aprilthird2021
I do something like this every day at work lol. It's a good base to start with, but often you'll eventually have to Google or look at the docs to see what it's messing up
hackit2
It makes perfect sense why it couldn't answer your question: you didn't have the vocabulary of relational algebra to correctly prime the model. Any rudimentary field has its own corpus of vocabulary to express ideas and concepts specific to that domain.
krupan
I honestly can't tell if this is a sarcastic reply or not
jasonthorsness
Half of the work is specification and iteration. I think there's a focus on full SWE replacement because it's sensational, but we'll more likely end up with SWEs able to focus on the less patterned or ambiguous work, made way more productive by the LLM handling subtasks more efficiently. I don't see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they probably have just become SWE anyway.
tkgally
> unless non-SWE people using LLMs become technical enough to get what they need out of them
Non-SWE person here. In the past year I've been able to use LLMs to do several tasks for which I previously would have paid a freelancer on Fiverr.
The most complex one, done last spring, involved writing a Python program that I ran on Google Colab to grab the OCR transcriptions of dozens of 19th-century books off the Internet Archive, send the transcriptions to Gemini 1.5, and collect Gemini's five-paragraph summary of each book.
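For the curious, here is a rough sketch of what that kind of pipeline looks like. It is illustrative only, not the actual script: the Internet Archive identifiers are placeholders, it assumes each item exposes its OCR text via the usual _djvu.txt file, and it uses the google-generativeai client.

    import requests
    import google.generativeai as genai

    genai.configure(api_key="YOUR_GEMINI_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Placeholder Internet Archive identifiers for the scanned books
    identifiers = ["book-identifier-1", "book-identifier-2"]

    for item in identifiers:
        # Many scanned IA items publish their OCR text as <identifier>_djvu.txt
        url = f"https://archive.org/download/{item}/{item}_djvu.txt"
        ocr_text = requests.get(url, timeout=60).text

        prompt = ("Summarize the following 19th-century book in five paragraphs:\n\n"
                  + ocr_text[:500_000])  # crude truncation to stay within the context window
        summary = model.generate_content(prompt).text
        print(f"=== {item} ===\n{summary}\n")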
If I had posted the job to Fiverr, I would have been willing to pay several hundred dollars for it. Instead, I was able to do it all myself with no knowledge of Python or previous experience with Google Colab. All it cost was my subscription to ChatGPT Plus (which I would have had anyway) and a few dollars of API usage.
I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
jameslk
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
I think this is the nuance most miss when they think about how AI models will displace work.
Most seem to think “if it can’t fully replace a SWE then it’s not going to happen”
When in reality, it starts by lowering the threshold for someone who's technical but not a SWE to jump in and do the work themselves. Or it makes the job of an existing engineer more efficient. Each hour of work saved, spread across many tasks that would have otherwise gone to an engineer, eventually sums up to a full-time engineer's worth of work. If it's a Fiverr dev whose work you eliminated, that means the Fiverr dev will eventually go after the work that's remaining, putting supply pressure on other devs.
It's the same mistake many made about self-driving cars not happening because they couldn't handle every road. No, they just need to start with 1 road, master that, and then keep expanding to more roads. Until they can do all of SF, and then more and more cities.
ripped_britches
This is a good anecdote but most software engineering is not scripting. It’s getting waist (or neck) deep in a large codebase and many intricacies.
That being said I’m very bullish on AI being able to handle more and more of this very soon. Cursor definitely does a great job giving us a taste of cross codebase understanding.
koito17
Seconded. Zed makes it trivial to provide entire codebases as context to Claude 3.5 Sonnet. That particular model has felt as good as a junior developer when given small, focused tasks. A year ago, I wouldn’t have imagined that my current use of LLMs was even possible.
sanxiyn
Everyone is a typist now, so I don't think it is farfetched that everyone is a SWE in the future.
riffraff
Very few people are typists.
Most people can use a keyboard, but the majority of non-technical people type at a speed which is orders of magnitude less than a professional typist.
Another comment here mentions how they used Colab while not being a SWE, but that is already miles ahead of what average people do with computers.
There are people who have used computers for decades and wouldn't be able to do a sum in a spreadsheet, nor know that is something spreadsheets can do.
jimbob45
What’s the WPM cutoff to be considered a typist?
zitterbewegung
If the LLM can't find me a solution in 3 to 5 tries while I improve the prompt, I fall back to more traditional methods and/or use another model like Gemini.
petesergeant
> in which case they probably have just become SWE anyway
or learn to use something like Bubble
jr-ai-interview
This has been obvious for a couple of years to anyone in the industry who has been faced with an onslaught of PRs to review from AI-enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.
pton_xd
Well, OpenAI does currently have 288 job openings, including plenty of software engineers, so that says something.
pzo
> The models weren't allowed to access the internet
How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I don't think it's the best comparison on which to make any judgement, then. Future benchmarks should test agents that are allowed to solve the problem in 5-10 minutes, with access to the internet, documentation, a linter, and a terminal with MCP servers.
thrw011
> How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
Many. There was a time when SO did not exist and people were able to solve non-trivial problems. There was a time when coding problems on exams had to be solved on paper, and if they did not compile you would not pass.
pzo
You miss my point about zero-shot style, where you have only one shot to compile and execute your code. Even in the old days when people programmed using punched cards, it required a lot of reviews and iterations. This is the reason scripting languages like Python, Ruby, PHP, and JavaScript got popular: you had a very fast feedback loop and could do dozens of mini experiments. The majority of coding problems we have today are not algorithmic in nature.
thrw011
I had one shot at my exams, was writing them on paper, compiling code in my brain.
ipython
What would searching the internet provide the models that they don't already have? Data sources such as Stack Overflow, documentation for the language being targeted, and a variety of relevant forum posts are most likely already part of the training set.
Unless someone else came along and said “here’s how to solve x problem step by step”, I don’t see how additional information past its cutoff point would help. (Perhaps the AI could post on a forum and wait for an answer?)
Yes, iterative programming could help via access to tools- I can see that helping.
brookst
Why do programmers search for specific questions rather than always relying on their inherent knowledge?
I’m a crappy hobbyist programmer but for me it is useful to see if someone has implemented exactly what I need, or debugged the problem I’m having. I don’t think it’s reasonable to expect programmers or LLMs to know everything about every library’s use in every context just from first principles.
ipaddr
I do it to save the limited brain power I have before rest or food is required. You could spend 5 minutes writing a sort (high-level processing) or just use existing code, which might take 5 minutes to find but uses less brain power.
This allows you to use that brain power on the specific things that need you, and let Google remember the format of that specific command or let an AI write out your routing file.
The older I get, the less I'm bound by time, lack of knowledge, or scope, and the more I'm limited by clarity. Delegate tasks where possible and keep the clarity for the overall project and your position.
ipython
But why would that information not be included in the wide crawl already encoded in the model weights before the knowledge cutoff? I believe the article mentions frontier models, so we are talking about models trained on trillions of tokens here.
ipaddr
You sound like someone who never used punch cards.
I think most developers could do that if they trained for it. As someone who learned how to program before the internet, it's just a different mindset and would take some time to adjust.
I am doing that now, where changes take a day to make it to staging and there's no local environment. You roll with it.
CPLX
> You sound like someone who never used punch cards.
I hope HN never changes.
rurp
It depends a lot on the type of problem. If we're talking about fixing a bug or adding a new feature to a large existing code base, which probably describes a huge portion of professional software engineering work, I would say most engineers could do most of those tasks without the internet. Especially if the goal is simply to pass a benchmark test of getting it working, without future considerations.
a2128
I think about this a lot. AI in its current state is like working with an intern stranded on an island with no internet access or compiler: they have to write down all of the code in forward sequence on a piece of paper, and god help them if they have to write any UI while also being blind. None of the "build an app with AI start-to-finish" products work well at all because of this.
sky2224
AI models are trained on data from the internet, so sure, they couldn't use their search feature to scour the internet, but I doubt the material is much different from what the models were already trained on.
Additionally, before the age of Stack Overflow and Google, SWEs cracked open the book or documentation for whatever technology they were using.
JohnKemeny
As one who organises competitive programming contests on a regular basis for university students, I would say almost every single one.
achierius
Isn't this how interviews tend to work? So I think a good number of devs would, yes.
pzo
Interviews like leetcode on a whiteboard only test your reasoning, not whether your solution will execute out of the box in zero-shot style. Humans solve problems iteratively; that's why a fast feedback loop and access to tools are essential. When you start coding, the compiler or linter hints that you forgot to close some braces or missed a semicolon. The compiler tells you that an API changed in a new version, and IntelliSense hints at which methods you can use in the current context and which parameters they take, along with their types. Once you execute the program, you get runtime hints that maybe you missed installing some Node or Python package. When you're installing packages, you get hints that maybe one package has an additional dependency or two package versions are not compatible. Command line tools like `ls` tell you what the project structure is, etc.
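A toy sketch of that feedback loop, where ask_model() is a purely hypothetical placeholder for whatever model API you call; the only real tooling is Python's own byte-compiler serving as the "does it even parse" check:

    import subprocess
    import sys
    from pathlib import Path

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for whatever LLM client/API is in use."""
        raise NotImplementedError

    def generate_with_feedback(task: str, max_iters: int = 5) -> str:
        prompt = task
        code = ""
        for _ in range(max_iters):
            code = ask_model(prompt)
            Path("candidate.py").write_text(code)
            # The same loop a human runs by hand: compile, read the error, retry.
            result = subprocess.run(
                [sys.executable, "-m", "py_compile", "candidate.py"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return code
            prompt = (f"{task}\n\nYour previous attempt failed to compile:\n"
                      f"{result.stderr}\nPlease fix it.")
        return code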
_def
> even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
mrayycombi
Where are the low level CEOs vs high level CEOs?
I'll bet AI could do their jobs right now.
Can SOMEONE please write AI software to replace these people?
richardw
Management consultants. Main attributes are confidence and ability to generate content. No need to stick around to see it through.
ITB
Are you bothered by the fact that software engineers might be easier to automate?
rpmisms
Considering that there are chickens who outperform stockbrokers, no.
throwaway290
It's the opposite. An LLM is better at CEO stuff than at producing working code. A good developer + an LLM instead of a CEO can succeed. A good CEO + an LLM instead of a developer cannot. (For a tech company.)
anonymoushn
Even better, if you click through to the linked source he doesn't say "low-level" at all, or make any claim that is at all like the claim he is cited as making!
realitysballs
Yeah, low-level languages get conflated with low-level coders; it means the opposite in some sense.
mohsen1
I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.
I've been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it's been a failure. Without someone who can read and fully understand the AI-generated code and suggest improvements (an SWE in the loop), AI code is mostly not good.
blindriver
They should feed it bootcamp study materials and Cracking the Coding Interview book in order to improve its ability to code.
Ozzie_osman
If it can master Binary Search Trees, it can master anything.
blindriver
"If you need to improve speed, add Hash Tables."
simonw
I find the framing of this story quite frustrating.
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
WiSaGaN
So this is an in-house benchmark, after their undisclosed partnership with a previous benchmark company. I really hope they don't have their next model vastly outperform on this benchmark in the coming weeks.
spartanatreyu
Link to the original paper: https://arxiv.org/pdf/2502.12115
TL;DR:
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off. (probably due to how expensive it is to run the tests)
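For reference, the usual way "passes more often with more attempts" gets quantified is the pass@k estimator from the original Codex paper. A small sketch, with made-up numbers purely for illustration:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate: the probability that at least one of k
        samples drawn from n attempts (of which c passed) is a pass."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Made-up numbers: 10 attempts per task, 3 of which passed the tests
    print(pass_at_k(10, 3, 1))  # 0.30
    print(pass_at_k(10, 3, 5))  # ~0.92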
Personally, I have other concerns:
- A human asked to review repeated LLM attempts to resolve a problem is going to review things less thoroughly after a few attempts, and over time will let false positives slip through
- An LLM asked to review repeated LLM attempts to resolve a problem is going to end up convincing itself that it is correct, with no regard for the reality of the situation
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
anandnair
Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
DarkmSparks
>How did AI learn to code?
It didn't; it's just very good at copying already existing code and tweaking it a bit.
>Did AI receive mentoring from an experienced senior
It doesn't even comprehend what an experienced senior is; all it cares about is how frequently certain patterns occurred in certain circumstances.
>Did the AI learn through hands-on coding or just by reading Stack Overflow?
it "learnt" by collecting a large database of existing code, most of which is very low quality open source proofs of concept, then spits out the bits that are probably related to a question.
tsimionescu
LLMs can fundamentally only do something similar to learning in the training phase. So by the time you interact with it, it has learned all it can. The question we then care about is whether it has learned enough to be useful for problem X. There's no meaningful concept of "how intelligent" the system is beyond what it has learned, no abstract IQ test decoupled from base knowledge you could even conceive of.
I recently had to do a one-off task using SQL in a way I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax, this seemed like a perfect use case to loop in Claude.
The first couple back and forths went ok but it quickly gave me some SQL that was invalid. I sent back the exact error and line number and it responded by changing all of the aliases but repeated the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.