OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
99 comments
· February 24, 2025 · rurp
deegles
I'm not convinced LLMs will evolve into general AI. The promise that it's just around the corner feels increasingly like a big scam.
kristopolous
I was never on board with it. It feels like the same kind of step change Google was - there was a time, around 1998, when it was just miles ahead of everything else out there. The first time you used it, it was like "geez, you got it right, I didn't know that was possible". It's big and it changed things, but it wasn't the end-of-history event a bunch of people are utterly convinced this is.
petesergeant
Depends what you mean by evolve. I don't think we'll get general AI by scaling, but I think general AI, if it arrives, will be able to trace its lineage very much back to LLMs. Journeys through the embedding space very much feel like the way forward to me, and that's what LLMs are.
brookst
I mean it’s been a couple of years!
It may or may not happen but “scam” means intentional deceit. I don’t think anyone actually knows where LLMs are going with enough certainty to use that pejorative.
krupan
Is it intentional deceit to tell everyone it's leading to something when, as you correctly point out, nobody actually knows if it will?
sesteel
It has made me stop using Google and StackOverflow. I can look most things up quickly, without rubber ducking with other people, and thus I am more efficient. It is also good at spotting bugs in a function if the APIs are known and the API version is something it was trained on. If I need to understand what something is doing, it can help annotate the lines.
I use it to improve my code, but I still cannot get it to do anything that is moderately complex. The paper tracks with what I've experienced.
I do think it will continue to rapidly evolve, but it is probably more of a cognitive aid than a replacement. I try to only use it when I am tight on time or need a crutch to help me keep going.
jameslk
I had to do something similar with BigQuery and some open source datasets recently.
As you mentioned, I had bad results with Claude. It kept hallucinating parts of the docs for the open datasets, coming up with nonsense columns, and not fixing errors even when presented with the error text and more context. I had a similar outcome with 4o.
But I tried the same with o1 and it was much better consistently, with full generations of queries and alterations. I fed it in some parts of docs anytime it struggled and it figured it out.
Ultimately I was able to achieve what I was trying to do with o1. I’m guessing the reasoning helped, especially when I confronted it about hallucinations and provided bits of the docs.
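As an aside, one cheap guard against hallucinated columns is to pull the real schema first and check the generated SQL against it. A rough sketch with the BigQuery Python client, using the public Shakespeare sample as a stand-in for whatever open dataset you're actually working with:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # Pull the actual schema so generated SQL can be checked against real column names
    table = client.get_table("bigquery-public-data.samples.shakespeare")
    print([field.name for field in table.schema])

    # Then run the (verified) query
    sql = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """
    for row in client.query(sql).result():
        print(row.corpus, row.total_words)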
Maybe the model and the lack of CoT could be part of the challenge you ran into?
whilenot-dev
> and provided bits of the docs.
At this point I'd ask myself whether I want my original problem solved or if I just want the LLM to succeed with my requested task.
jameslk
Yes, I imagine some do like to read and then ponder over the BigQuery docs. I like to get my work done. In my case, o1 nailed BigQuery flawlessly, saving me time. I just needed to feed in some parts of the open source dataset docs
jchw
I've had a pretty similar outlook and still kind of do, but I think I do understand the hype a little bit: I've found that Claude and Gemini 2 Pro (experimental) are sometimes able to do things that I genuinely don't expect them to be able to do. Of course, that was already the case before to a lesser extent, and I know that alone doesn't necessarily translate into usefulness.
So, I have been trying Gemini 2 Pro, mainly because I have free access to it for now, and I think it strikes a bit above being interesting and into the territory of being useful. It has the same failure mode issues that LLMs have always had, but honestly it has managed to generate code and answer questions that Google definitely was not helping with. When not dealing with hallucinations/knowledge gaps, the resulting code was shockingly decent, and it could generate hundreds of lines of code without an obvious error or bug at times, depending on what you asked. The main issues were occasionally missing an important detail or overly complicating some aspect. I found the quality of unit tests generated to be sub par, as it often made unit tests that strongly overlapped with each other and didn't necessarily add value (and rarely worked out-of-the-box anyways, come to think of it.)
When trying to use it for real-world tasks where I actually don't know the answers, I've had mixed results. On a couple of occasions it helped me get to the right place when Google searches were going absolutely nowhere, so the value proposition is clearly somewhere. It was good at generating decent mundane code, bash scripts, CMake code, Bazel, etc., which to me looked decently written, though I am not confident enough to actually use its output yet. Once it suggested a non-existent linker flag to solve an issue, but surprisingly it also inadvertently suggested a solution to my problem that did work (it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error with a very old and non-standard build environment, helping me get my DeaDBeeF plugin building with their upstream apbuild-based system).
But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically grab context and use web searches, try compiling or running tests, and maybe even have other LLMs "review" the work and try to get to a better state. I'm sure all of that exists, but I'm not really a huge LLM person so I haven't kept up with it. Personally, with the state frontier models are in, though, I'd like to try this sort of system if it does exist. I'd just like to see what the state of the art is capable of.
Even that aside, though, I can see this being useful especially since Google Search is increasingly unusable.
I do worry, though. If these technologies get better, it's probably going to make a lot of engineers struggle to develop deep problem-solving skills, since you will need them a lot less to get started. Learning to RTFM, dig into code and generally do research is valuable stuff. Having a bot you can use as an infinite lazyweb may not be the greatest thing.
aprilthird2021
I do something like this every day at work lol. It's a good base to start with, but often you'll eventually have to Google or look at the docs to see what it's messing up
hackit2
It makes perfect sense why it couldn't answer your question: you didn't have the vocabulary of relational algebra to correctly prime the model. Any rudimentary field has its own corpus of vocabulary to express ideas and concepts specific to that domain.
krupan
I honestly can't tell if this is a sarcastic reply or not
jasonthorsness
Half of the work is specification and iteration. I think there's a focus on full SWE replacement because it's sensational, but we'll more likely end up with SWEs able to focus on the less patterned or ambiguous work, made way more productive by the LLM handling subtasks more efficiently. I don't see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they probably have just become SWE anyway.
tkgally
> unless non-SWE people using LLMs become technical enough to get what they need out of them
Non-SWE person here. In the past year I've been able to use LLMs to do several tasks for which I previously would have paid a freelancer on Fiverr.
The most complex one, done last spring, involved writing a Python program that I ran on Google Colab to grab the OCR transcriptions of dozens of 19th-century books off the Internet Archive, send the transcriptions to Gemini 1.5, and collect Gemini's five-paragraph summary of each book.
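For the curious, here is a rough sketch of what that kind of pipeline looks like. It is illustrative only, not the actual script: the Internet Archive identifiers are placeholders, it assumes each item exposes its OCR text via the usual _djvu.txt file, and it uses the google-generativeai client.

    import requests
    import google.generativeai as genai

    genai.configure(api_key="YOUR_GEMINI_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Placeholder Internet Archive identifiers for the scanned books
    identifiers = ["book-identifier-1", "book-identifier-2"]

    for item in identifiers:
        # Many scanned IA items publish their OCR text as <identifier>_djvu.txt
        url = f"https://archive.org/download/{item}/{item}_djvu.txt"
        ocr_text = requests.get(url, timeout=60).text

        prompt = ("Summarize the following 19th-century book in five paragraphs:\n\n"
                  + ocr_text[:500_000])  # crude truncation to stay within the context window
        summary = model.generate_content(prompt).text
        print(f"=== {item} ===\n{summary}\n")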
If I had posted the job to Fiverr, I would have been willing to pay several hundred dollars for it. Instead, I was able to do it all myself with no knowledge of Python or previous experience with Google Colab. All it cost was my subscription to ChatGPT Plus (which I would have had anyway) and a few dollars of API usage.
I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
jameslk
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
I think this is the nuance most miss when they think about how AI models will displace work.
Most seem to think “if it can’t fully replace a SWE then it’s not going to happen”
When in reality, it starts by lowering the threshold for someone who's technical but not a SWE to jump in and do the work themselves. Or it makes the job of an existing engineer more efficient. Each hour of work saved, spread across many tasks that would have otherwise gone to an engineer, eventually sums up to a full-time engineer's worth of work. If it's a Fiverr dev whose work you eliminated, that means the Fiverr dev will eventually go after the work that's remaining, putting supply pressure on other devs.
It's the same mistake many made about self-driving cars not happening because they couldn't handle every road. No, they just need to start with 1 road, master that, and then keep expanding to more roads. Until they can do all of SF, and then more and more cities.
ripped_britches
This is a good anecdote but most software engineering is not scripting. It’s getting waist (or neck) deep in a large codebase and many intricacies.
That being said I’m very bullish on AI being able to handle more and more of this very soon. Cursor definitely does a great job giving us a taste of cross codebase understanding.
koito17
Seconded. Zed makes it trivial to provide entire codebases as context to Claude 3.5 Sonnet. That particular model has felt as good as a junior developer when given small, focused tasks. A year ago, I wouldn’t have imagined that my current use of LLMs was even possible.
sanxiyn
Everyone is a typist now, so I don't think it is farfetched that everyone is a SWE in the future.
riffraff
Very few people are typists.
Most people can use a keyboard, but the majority of non-technical people type at a speed which is orders of magnitude less than a professional typist.
Another comment here mentions how they used Colab while not being a SWE, but that is already miles ahead of what average people do with computers.
There are people who have used computers for decades and wouldn't be able to do a sum in a spreadsheet, nor know that is something spreadsheets can do.
jimbob45
What’s the WPM cutoff to be considered a typist?
zitterbewegung
If the LLM can't find me a solution in 3 to 5 tries while I improve the prompt, I fall back to more traditional methods and/or use another model like Gemini.
petesergeant
> in which case they probably have just become SWE anyway
or learn to use something like Bubble
jr-ai-interview
This has been obvious for a couple of years to anyone in the industry who has been faced with an onslaught of PRs to review from AI-enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.
pton_xd
Well, OpenAI does currently have 288 job openings, including plenty of software engineers, so that says something.
pzo
> The models weren't allowed to access the internet
How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I don't think it's the best comparison on which to make any judgement, then. Future benchmarks should test agents that are allowed to solve the problem in 5-10 minutes, with access to the internet, documentation, a linter, and a terminal with MCP servers.
thrw011
> How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
Many. There was a time when SO did not exist and people were able to solve non-trivial problems. There was a time when coding problems on exams had to be solved on paper, and if they did not compile you would not pass.
pzo
You miss my point about zero-shot style, where you have only one shot to compile and execute your code. Even in the old days when people programmed using punched cards, it required a lot of reviews and iterations. This is the reason scripting languages like Python, Ruby, PHP, and JavaScript got popular: you had a very fast feedback loop and could do dozens of mini experiments. The majority of coding problems we have today are not algorithmic in nature.
thrw011
I had one shot at my exams, was writing them on paper, compiling code in my brain.
ipython
What would searching the internet provide the models that they don't already have? Data sources such as Stack Overflow, documentation for the language being targeted, and a variety of relevant forum posts are most likely already part of the training set.
Unless someone else came along and said “here’s how to solve x problem step by step”, I don’t see how additional information past its cutoff point would help. (Perhaps the AI could post on a forum and wait for an answer?)
Yes, iterative programming could help via access to tools- I can see that helping.
brookst
Why do programmers search for specific questions rather than always relying on their inherent knowledge?
I’m a crappy hobbyist programmer but for me it is useful to see if someone has implemented exactly what I need, or debugged the problem I’m having. I don’t think it’s reasonable to expect programmers or LLMs to know everything about every library’s use in every context just from first principles.
ipaddr
I do it to save the limited brain power I have before rest or food is required. You could spend 5 minutes writing a sort (high-level processing) or just use existing code, which might take 5 minutes to find but uses less brain power.
This allows you to use that brain power on the specific things that need you, and let Google remember the format of that specific command or let an AI write out your routing file.
The older I get, the less I'm bound by time, lack of knowledge, or scope, and the more I'm limited by clarity. Delegate tasks where possible and keep the clarity for the overall project and your position.
ipython
But why would that information not be included in the wide crawl already encoded in the model weights before the knowledge cutoff? I believe the article mentions frontier models, so we are talking about models trained on trillions of tokens here.
ipaddr
You sound like someone who never used punch cards.
I think most developers could do that if they trained for it. As someone who learned how to program before the internet, it's just a different mindset and would take some time to adjust.
I am doing that now, where changes take a day to make it to staging and there's no local environment. You roll with it.
CPLX
> You sound like someone who never used punch cards.
I hope HN never changes.
rurp
It depends a lot on the type of problem. If we're talking about fixing a bug or adding a new feature to a large existing code base, which probably describes a huge portion of professional software engineering work, I would say most engineers could do most of those tasks without the internet. Especially if the goal is simply to pass a benchmark test of getting it working, without future considerations.
a2128
I think about this a lot. AI in its current state is like working with an intern stranded on an island with no internet access or compiler: they have to write down all of the code in forward sequence on a piece of paper, and god help them if they have to write any UI while also being blind. None of the "build an app with AI start-to-finish" products work well at all because of this.
sky2224
AI models are trained on data from the internet, so sure, they couldn't use their search feature to scour the internet, but I doubt the material is much different from what the models were already trained on.
Additionally, before the age of Stack Overflow and Google, SWEs cracked open the book or documentation for whatever technology they were using.
JohnKemeny
As one who organises competitive programming contests on a regular basis for university students, I would say almost every single one.
achierius
Isn't this how interviews tend to work? So I think a good number of devs would, yes.
pzo
Interviews like leetcode on a whiteboard only test your reasoning, not whether your solution will execute out of the box in zero-shot style. Humans solve problems iteratively; that's why a fast feedback loop and access to tools are essential. When you start coding, the compiler or linter hints that you forgot to close some braces or missed a semicolon. The compiler tells you that an API changed in a new version, and IntelliSense hints at which methods you can use in the current context and which parameters they take, along with their types. Once you execute the program, you get runtime hints that maybe you missed installing some Node or Python package. When you're installing packages, you get hints that maybe one package has an additional dependency or two package versions are not compatible. Command line tools like `ls` tell you what the project structure is, etc.
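A toy sketch of that feedback loop, where ask_model() is a purely hypothetical placeholder for whatever model API you call; the only real tooling is Python's own byte-compiler serving as the "does it even parse" check:

    import subprocess
    import sys
    from pathlib import Path

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for whatever LLM client/API is in use."""
        raise NotImplementedError

    def generate_with_feedback(task: str, max_iters: int = 5) -> str:
        prompt = task
        code = ""
        for _ in range(max_iters):
            code = ask_model(prompt)
            Path("candidate.py").write_text(code)
            # The same loop a human runs by hand: compile, read the error, retry.
            result = subprocess.run(
                [sys.executable, "-m", "py_compile", "candidate.py"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return code
            prompt = (f"{task}\n\nYour previous attempt failed to compile:\n"
                      f"{result.stderr}\nPlease fix it.")
        return code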
_def
> even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
mrayycombi
Where are the low level CEOs vs high level CEOs?
I'll bet AI could do their jobs right now.
Can SOMEONE please write AI software to replace these people?
richardw
Management consultants. Main attributes are confidence and ability to generate content. No need to stick around to see it through.
ITB
Are you bothered by the fact that software engineers might be easier to automate?
rpmisms
Considering that there are chickens who outperform stockbrokers, no.
throwaway290
It's the opposite. An LLM is better at CEO stuff than at producing working code. A good developer + an LLM instead of a CEO can succeed. A good CEO + an LLM instead of a developer cannot. (For a tech company.)
anonymoushn
Even better, if you click through to the linked source he doesn't say "low-level" at all, or make any claim that is at all like the claim he is cited as making!
realitysballs
Yeah, low-level languages get conflated with low-level coders; it means the opposite in some sense.
mohsen1
I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.
I've been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it's been a failure. Without someone who can read and fully understand the AI-generated code and suggest improvements (an SWE in the loop), AI code is mostly not good.
blindriver
They should feed it bootcamp study materials and Cracking the Coding Interview book in order to improve its ability to code.
Ozzie_osman
If it can master Binary Search Trees, it can master anything.
blindriver
"If you need to improve speed, add Hash Tables."
simonw
I find the framing of this story quite frustrating.
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
WiSaGaN
So this is an in-house benchmark, after their undisclosed partnership with a previous benchmark company. I really hope they don't have their next model vastly outperform on this benchmark in the coming weeks.
spartanatreyu
Link to the original paper: https://arxiv.org/pdf/2502.12115
TL;DR:
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off. (probably due to how expensive it is to run the tests)
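For reference, the usual way "passes more often with more attempts" gets quantified is the pass@k estimator from the original Codex paper. A small sketch, with made-up numbers purely for illustration:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate: the probability that at least one of k
        samples drawn from n attempts (of which c passed) is a pass."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Made-up numbers: 10 attempts per task, 3 of which passed the tests
    print(pass_at_k(10, 3, 1))  # 0.30
    print(pass_at_k(10, 3, 5))  # ~0.92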
Personally, I have other concerns:
- A human asked to review repeated LLM attempts to resolve a problem is going to review things less thoroughly after a few attempts, and over time will let false positives slip through
- An LLM asked to review repeated LLM attempts to resolve a problem is going to end up convincing itself that it is correct, with no regard for the reality of the situation
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
anandnair
Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
DarkmSparks
>How did AI learn to code?
It didn't; it's just very good at copying already existing code and tweaking it a bit.
>Did AI receive mentoring from an experienced senior
It doesn't even comprehend what an experienced senior is; all it cares about is how frequently certain patterns occurred in certain circumstances.
>Did the AI learn through hands-on coding or just by reading Stack Overflow?
it "learnt" by collecting a large database of existing code, most of which is very low quality open source proofs of concept, then spits out the bits that are probably related to a question.
tsimionescu
LLMs can fundamentally only do something similar to learning in the training phase. So by the time you interact with it, it has learned all it can. The question we then care about is whether it has learned enough to be useful for problem X. There's no meaningful concept of "how intelligent" the system is beyond what it has learned, no abstract IQ test decoupled from base knowledge you could even conceive of.
I recently had to do a one-off task using SQL in a way I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax, this seemed like a perfect use case to loop in Claude.
The first couple back and forths went ok but it quickly gave me some SQL that was invalid. I sent back the exact error and line number and it responded by changing all of the aliases but repeated the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.