AI Blindspots – Blindspots in LLMs I've noticed while AI coding
31 comments · March 19, 2025
antasvara
This highlights a thing I've seen with LLMs generally: they make different mistakes than humans do. This makes catching the errors much more difficult.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLMs, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLMs "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
MostlyStable
This is, I think, a better way to think about LLM mistakes compared to the usual "hallucinations". I think of them as similar to human optical illusions. There are things about the human visual cortex (and other sensory systems; see the McGurk effect [0]) that, when presented with certain kinds of inputs, will consistently produce wrong interpretations/outputs. Even when we are 100% aware of the issue, we can't prevent our brains from generating the incorrect interpretation.
LLMs seem to have similar issues, but along dramatically different axes: axes where humans are not used to seeing these kinds of mistakes, where nearly no human would make this kind of mistake, and so we interpret it (in my opinion incorrectly) as a lack of ability or intelligence.
Because these are engineered systems, we may figure out ways to solve these problems (although I personally think the best we will ever do is decrease their prevalence). More important, though, is probably learning to recognize the places where LLMs are likely to make these errors and, as your comment suggests, designing workflows and systems that can deal with them.
tharkun__
I don't think that's universally true. We have different humans with different levels of ability to catch errors. I see that with my teams. Some people can debug. Some can't. Some people can write tests. Some can't. Some people can catch stuff in reviews. Some can't.
I asked Sonnet 3.7 in Cursor to fix a failing test. While it made the necessary fix, it also updated a hard-coded expected constant to instead be computed using the same algorithm as the original file, instead of preserving the constant as the test was originally written.
Guess what? Guess how many times I've had to correct humans for doing exactly that in their tests over my career!
And guess where the models learned the bad behavior from.
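For illustration, a minimal sketch of the difference in Jest/TypeScript (totalPrice and the numbers are made up):

    // Hypothetical function under test.
    function totalPrice(unitCents: number, quantity: number): number {
      return unitCents * quantity;
    }

    // Original style: the expected value is a hand-checked constant, so a
    // future bug in totalPrice() will actually be caught by this test.
    test('totals three items at 250 cents (pinned constant)', () => {
      expect(totalPrice(250, 3)).toBe(750);
    });

    // The rewrite to avoid: the expected value re-derives the same arithmetic
    // as the implementation, so the assertion can never fail and proves nothing.
    test('totals three items at 250 cents (tautological)', () => {
      const expected = 250 * 3; // mirrors the implementation
      expect(totalPrice(250, 3)).toBe(expected);
    });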
__MatrixMan__
I agree. I've been struck by how remarkably understandable the errors are. It's quite often something that I'd have done myself if I wasn't paying attention to the right thing.
sorokod
You may find this interesting: "AI Mistakes Are Very Different from Human Mistakes"
https://www.schneier.com/blog/archives/2025/01/ai-mistakes-a...
woopwoop
Agree, but I would point out that the errors I make are selected for the fact that I don't notice I'm making them, which tips the scale toward LLM errors not being as bad.
taberiand
Based on the list, LLMs are at a "very smart junior programmer" level of coding - though with a much broader knowledge base than you'd expect from even a senior. They lack bigger-picture thinking, and default to doing what is asked of them instead of what needs to be done.
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
dataviz1000
Are you using Cursor? I'm using GitHub Copilot in VS Code, and I'm wondering if I would get more efficiency from a different coding assistant.
ezyang
Hi Hacker News! One of the things about this blog that has gotten a bit unwieldy as I've added more entries is that it's a sort of undifferentiated pile of posts. I want some sort of organization system but I haven't found one that's good. Very open to suggestions!
joshka
What about adding a bit more structure and investing in a pattern-language approach, like what you might find in a book by Fowler or on a site like https://refactoring.guru/? You're much of the way there with the naming and content, but you could refactor the content a bit further into headings (Problem, Symptoms, Examples, Mitigation, Related, etc.).
You could even pretty easily use an LLM to do most of the work for you in fixing it up.
Add a short 1-2 sentence summary[1] to each item and render that on the index page.
rav
My suggestion: Change the color of visited links! Adding a "visited" color for links will make it easier for visitors to see which posts they have already read.
datadrivenangel
Maybe organize them with a clearer split between observed pitfalls/blindspots and prescriptions. Some of the articles ("Use automatic formatting") are practice-forward, while others are pitfall-forward. I like how many of the articles have examples!
cookie_monsta
Some sort of navigation would be nice: a prev/next link, or some way to avoid having to go back to the links page all the time.
All of the pages that I visited were small enough that you could probably wrap them in <details> tags[1] and avoid navigation altogether.
[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/de...
smusamashah
How about listing all of these on one single page? That would make them easy to navigate and find.
ezyang
They are listed on one page right now! Haha
elicash
They're indexed on one page, but you can't scan/scroll through these short posts without clicking because the content itself isn't all on a single page, at least not that I can find.
(I also like the other idea of separating out pitfalls vs. prescriptions.)
sfink
When I saw the title, I knew what this was going to be. It made me want to immediately write a corresponding "Human Blindspots" blog post to counteract it, because I knew it was going to be the usual drivel about how the LLMs understand <X> but sometimes they don't quite manage to get the reasoning right, but not to worry because you can nudge them and their logical brains will then figure it out and do the right thing. They'll stop hallucinating and start functioning properly, and if they don't, just wait for the next generation and everything will be fine.
I was wrong. This is great! I really appreciate how you not only describe the problems, but also describe why they happen using terminology that shows you understand how these things work (rather than the usual crap that is based on how people imagine them to work or want them to work). Also, the examples are excellent.
It would be a bunch of work, but the organization I would like to see (alongside the current, not replacing it, because the one-page list works for me already) would require sketching out some kind of taxonomy of topics. Categories of ways that Sonnet gets things wrong, and perhaps categories of things that humans would like them to do (eg types of tasks, or skill/sophistication levels of users, or starting vs fixing vs summarizing/reviewing vs teaching, or whatever). But I haven't read through all of the posts yet, so I don't have a good sense for how applicable these categorizations might be.
I personally don't have nearly enough experience using LLMs to be able to write it up myself. So far, I haven't found LLMs very useful for the type of code I write (except when I'm playing with learning Rust; they're pretty good for that). I know I need to try them out more to really get a feel for their capabilities, but your writeups are the first I've found that I feel I can learn from without having to experience it all for myself first.
(Sorry if this sounds like spam. Too gushing with the praise? Are you bracing yourself for some sketchy URL to a gambling site?)
boredtofears
Great read, I can definitely confirm a lot of these myself. Would be nice to see this aggregated into some kind of "best practices" document (although hard to say how quickly it'd be out of date).
datadrivenangel
Almost all of these are good things to consider with human coders as well. Product managers take note!
https://ezyang.github.io/ai-blindspots/requirements-not-solu...
teraflop
> I had some test cases with hard coded numbers that had wobbled and needed updating. I simply asked the LLM to keep rerunning the test and updating the numbers as necessary.
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
ezyang
In fact, the test framework I was using at the time (Jest) does support this. But the person who originally wrote the tests hadn't had the foresight to use snapshot tests for this failing test!
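For anyone who hasn't used it: Jest's snapshot matchers record the value on the first run and compare against the stored snapshot on later runs, and running jest -u (or pressing u in watch mode) rewrites the snapshots when a change is intentional. A minimal sketch (renderReport and ./report are made-up names):

    // Snapshot stored in a __snapshots__ file next to the test.
    import { renderReport } from './report'; // hypothetical module under test

    test('report output matches snapshot', () => {
      expect(renderReport({ total: 750 })).toMatchSnapshot();
    });

    // Inline variant: Jest writes the recorded value back into the test file itself.
    test('report output matches inline snapshot', () => {
      expect(renderReport({ total: 750 })).toMatchInlineSnapshot();
    });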
fizx
The community seems rather divided as to whether these are intrinsic limitations, or whether we can solve them with today's tech plus more training, heuristics, and workarounds.
logicchains
I found Gemini Flash Thinking Experimental is almost unusable in an agent workflow because it'll eventually accidentally remove a closing bracket, breaking compilation, and be unable to identify and fix the issue even with many attempts. Maybe it has trouble counting/matching braces due to fewer layers?
ezyang
Yeah, Sonnet 3.5/3.7 are doing the heavy lifting. Maybe the SOTA Gemini models would do better; I haven't tried them. Generating correct patches is a funny minigame that isn't really solved, despite how easy it is to RL on.
mystified5016
Recently I've been writing a resume/hire-me website. I'm not a stellar writer, but I'm alright, so I've been asking various LLMs to review it by just dropping the HTML file in.
Every single one has completely ignored the "Welcome to nginx!" header at the top of the page. I'd left it in half as a joke to amuse myself, but I expected it would get some kind of reaction from the LLMs, even if just an "it seems you may have forgotten this line".
Kinda weird. I even tried guiding them into seeing it without explicitly mentioning it and I could not get a response.
SparkyMcUnicorn
Have you tried "Let's get this production ready" as a prompt for this or any other coding tasks?
Sometimes when I ask for "production ready" it can go a bit too far, but I've found it'll usually catch things like this that I might miss.
Mc91
One thing I do is go to LeetCode, see the optimal big-O time and space solutions, then give the LLM the LeetCode medium/hard problem, limit it to the optimal big-O time/space solution, and suggest the method (bidirectional BFS). I ask for the solution in some fairly mainstream modern language (although not JavaScript, Java, or Python). I also say to make it as compact as possible. Sometimes I reiterate that.
It's just a function usually, but it does not always compile. I'd set this as a low bar for programming. We haven't even gotten into classes, architecture, badly-defined specifications and so on.
LLMs are useful for programming, but I'd want them to clear this low hurdle first.
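For concreteness, the kind of answer being asked for might look like the following: a bidirectional BFS solution to LeetCode's Word Ladder problem, written compactly. The problem choice is illustrative, and TypeScript here merely stands in for "some fairly mainstream modern language":

    // Word Ladder: length of the shortest transformation sequence from
    // beginWord to endWord, changing one letter at a time through wordList.
    // Bidirectional BFS: grow a frontier from each end, always expanding the
    // smaller one, and stop when they meet. Roughly O(N * L * 26) time.
    function ladderLength(beginWord: string, endWord: string, wordList: string[]): number {
      const dict = new Set(wordList);
      if (!dict.has(endWord)) return 0;

      let begin = new Set<string>([beginWord]);
      let end = new Set<string>([endWord]);
      let steps = 1;
      const letters = "abcdefghijklmnopqrstuvwxyz";

      while (begin.size > 0 && end.size > 0) {
        // Expand the smaller frontier to keep the two searches balanced.
        if (begin.size > end.size) [begin, end] = [end, begin];

        const next = new Set<string>();
        for (const word of begin) {
          for (let i = 0; i < word.length; i++) {
            for (const c of letters) {
              if (c === word[i]) continue;
              const candidate = word.slice(0, i) + c + word.slice(i + 1);
              if (end.has(candidate)) return steps + 1; // frontiers meet
              if (dict.has(candidate)) {
                dict.delete(candidate); // mark visited
                next.add(candidate);
              }
            }
          }
        }
        begin = next;
        steps++;
      }
      return 0; // no transformation sequence exists
    }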
bongodongobob
You're using a shitty model then, or you're lying. 4o one- or two-shotted the first 12 days of Advent of Code for me with nothing other than the problem description.