AI Blindspots – Blindspots in LLMs I've noticed while AI coding
245 comments
·March 19, 2025antasvara
MostlyStable
This is, I think, a better way to think about LLM mistakes compared to the usual "hallucinations". I think of them as similar to human optical illusions. There are things about the human visual cortex (and also other sensory systems, see the McGurk Effect [0]), that, when presented with certain kinds of inputs, will consistently produce wrong interpretations/outputs. Even when we are 100% ware of the issue, we can't prevent our brains from generating the incorrect interpretation.
LLMs seem to have similar issues along dramatically different axes, axes that humans are not used to seeing these kinds of mistakes; where nearly no human would make this kind of mistake and so we interpret it (in my opinion incorrectly) as lack of ability or intelligence.
Because these are engineered systems, we may figure out ways to solve these problems (although I personally think the best we will ever do is decrease their prevalence), but more important is probably learning to recognize the places that LLMs are likely to make these errors, and, as your comment suggests, design work flows and systems that can deal with them.
grey-area
LLMs are incapable of solving even simple logic puzzles or maths puzzles they haven't seen before, they don't have a model of the world which is key to intelligence. What they are good at is reproducing things in their dataset with slight modifications and (sometimes) responding to queries well which make them seem creative but there is no understanding or intelligence there, in spite of appearances.
They are very good at fooling people; perhaps Turing's Test is not a good measure of intelligence after all, it can easily be gamed and we find it hard to differentiate apparent facility with language and intelligence/knowledge.
rcxdude
I think it's not very helpful to just declare that such a model doesn't exist: there's a decent amount of evidence that LLMs do in fact form models of the world internally, and use them during inference. However, while these models are very large and complex, they aren't necessarily accurate and LLMs struggle with actually manipulating them at inference time, forming new models or adjusting existing ones is generally something they are quite bad at at those stages (which generally results in the 'high knowledge' which impresses people and is often confused with intelligence, while they're still fundamentally quite dumb despite having a huge depth of knowledge: I don't think it's something you can categorically say 'zero intelligence' - even relatively simpler and less capable systems can be said to have some intelligence, it's just in many aspects LLM intelligence is still worse than a good fraction of mammals)
fragmede
> they don't have a model of the world
Must it have one? The words "artificial intelligence" are a poor description of a thing when we've not rigorously defined it. It's certainly artificial, there's no question about that, but is it intelligent? It can do all sorts of things that we consider a feature of intelligence and pass all sorts of tests, but it also falls down flat on its face when prompted with a just-so brainteaser. It's certainly useful, for some people. If, by having inhaled all of the Internet and written books that have been scanned as its training data, it's able to generate essays on anything and everything, at the drop of a hat, why does it matter if we can find a brainteaser it hasn't seen yet? It's like it has a ginormous box of Legos, and it can build whatever your ask for with these Lego blocks, but pointing out it's unable create its own Lego blocks from scratch has somehow become critically important to point out, as if that makes this all total dead end and it's all a waste of money omg people wake up oh if only they'd listen to me. Why don't people listen to me?
Crows are believed to have a theory of mind, and they can count up to 30. I haven't tried it with Claude, but I'm pretty sure it can count at least that high. LLMs are artificial, they're alien, of course they're going to look different. In the analogy where they're simply a next word guesser, one imagines standing at a fridge with a bag of magnetic words, and just pulling a random one from the bag to make ChatGPT. But when you put your hand inside a bag inside a bag inside a bag, twenty times (to represent the dozens of layers in an LLM model), and there are a few hundred million pieces in each bag (for parameters per layer), one imagines that there's a difference; some sort of leap, similar to when life evolved from being a single celled bacterium to a multi-cellular organism.
Or maybe we're all just rubes, and some PhD's have conned the world into giving them a bunch of money, because they figured out how to represent essays as a math problem, then wrote some code to solve them, like they did with chess.
jychang
LLMs clearly do have a world model though. They represent those ideas at higher level features in the feedforward layer. The lower level layers are neurons that describe words, syntax, and local structures in the text, while the upper levels capture more abstract ideas, such as semantic meaning, relationships between concepts, and even implicit reasoning patterns.
Applejinx
Along these lines one model that might help is to consider LLMs 'wikipedia of all possible correct articles'. Start with Wikipedia and assume (already a tricky proposition!) that it's perfectly correct. Then, begin resynthesizing articles based on what's already there. Do your made-up articles have correctness?
I'm going to guess that sometimes they will: driven onto areas where there's no existing article, some of the time you'll get made-up stuff that follows the existing shapes of correct articles and produces articles that upon investigation will turn out to be correct. You'll also reproduce existing articles: in the world of creating art, you're just ripping them off, but in the world of Wikipedia articles you're repeating a correct thing (or the closest facsimile that process can produce)
When you get into articles on exceptions or new discoveries, there's trouble. It can't resynthesize the new thing: the 'tokens' aren't there to represent it. The reality is the hallucination, but an unreachable one.
So the LLMs can be great at fooling people by presenting 'new' responses that fall into recognized patterns because they're a machine for doing that, and Turing's Test is good at tracking how that goes, but people have a tendency to think if they're reading preprogrammed words based on a simple algorithm (think 'Eliza') they're confronting an intelligence, a person.
They're going to be historically bad at spotting Holmes-like clues that their expected 'pattern' is awry. The circumstantial evidence of a trout in the milk might lead a human to conclude the milk is adulterated with water as a nefarious scheme, but to an LLM that's a hallucination on par with a stone in the milk: it's going to have a hell of a time 'jumping' to a consistent but very uncommon interpretation, and if it does get there it'll constantly be gaslighting itself and offering other explanations than the truth.
admiralrohan
Hallucinating is fine but overconfidence is the problem. But I heard it's not an easy problem to solve.
Terr_
> overconfidence is the problem.
The problem is a bit deeper than that, because what we perceive as "confidence" is itself also an illusion.
The (real) algorithm takes documents and makes them longer, and some humans configured a document that looks like a conversation between "User" and "AssistantBot", and they also wrote some code to act-out things that look like dialogue for one of the characters. The (real) trait of confidence involves next-token statistics.
In contrast, the character named AssistantBot is "overconfident" in exactly the same sense that a character named Count Dracula is "immortal", "brooding", or "fearful" of garlic, crucifixes, and sunlight. Fictional traits we perceive on fictional characters from reading text.
Yes, we can set up a script where the narrator periodically re-describes AssistantBot as careful and cautious, and that might help a bit with stopping humans from over-trusting the story they are being read. But trying to ensure logical conclusions arise from cautious reasoning is... well, indirect at best, much like trying to make it better at math by narrating "AssistantBot was good at math and diligent at checking the numbers."
> Hallucinating
P.S.: "Hallucinations" and prompt-injection are non-ironic examples of "it's not a bug, it's a feature". There's no minor magic incantation that'll permanently banish them without damaging how it all works.
patates
Hallucinating is a confidence problem, no?
Say, they should be 100% confident that "0.3" follows "0.2 + 0.1 =", but a lot of floating point examples on the internet make them less confident.
On a much more nuanced problem, "0.30000000000000004" may get more and more confidence.
This is what makes them "hallucinate", did I get it wrong? (in other words, am I hallucinating myself? :) )
jacksnipe
Unfortunately, in the system most of us work in today, I think overconfidence is an intelligent behavior.
ForTheKidz
> I think of them as similar to human optical illusions.
What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans than anything to do with what we refer to as "hallucinations" in humans—only they don't have the ability to analyze whether or not they're making up something or accurately parameterizing the vibes. The only connection between the two concepts is that the initial imagery from DeepDream was super trippy.
majormajor
Inventiveness/creativity/imagination are deliberate things. LLM "hallucinations" are more akin to a student looking at a test over material they only 70% remember grabbing at what they think is the most likely correct answer. More "willful hope in the face of forgetting" than "creativity." Many LLM hallucinations - especially of the coding sort - are ones that would be obviously-wrong based on the training material, but the hundreds of languages/libraries/frameworks the thing was trained on start to blur together and there is not precise 100%-memorization recall but instead a "probably something like this" guess.
It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
Applejinx
If and only if the LLM is able to bring the novel, unexpected connection into itself and see whether it forms other consistent networks that lead to newly common associations and paths.
A lot of us have had that experience. We use that ability to distinguish between 'genius thinkers' and 'kid overdosing on DMT'. It's not the ability to turn up the weird connections and go 'ooooh sparkly', it's whether you can build new associations that prove to be structurally sound.
If that turns out to be something self-modifying large models (not necessarily 'language' models!) can do, that'll be important indeed. I don't see fiddling with the 'temperature' as the same thing, that's more like the DMT analogy.
You can make the static model take a trip all you like, but if nothing changes nothing changes.
AdieuToLogic
> What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans ...
No.
What people call LLM "hallucinations" is the result of a PRNG[0] influencing an algorithm to pursue a less statistically probable branch without regard nor understanding.
0 - https://en.wikipedia.org/wiki/Pseudorandom_number_generator
cratermoon
the word you're looking for is "confabulation"
pydry
I dunno hallucinations seem like a pretty human type of mistake to me.
when i try to remember something my brain often synthesizes new things by filling in the gaps.
This would be where I often say "i might be imagining it, but..." or "i could have sworn there was a..."
In such cases the thing that saves the human brain is double checking against reality (e.g. googling it to make sure).
Miscounting the number of r's in strawberry by glancing at the word also seems like a pretty human mistake.
gitaarik
But it's different kinds of hallucinations.
AI doesn't have a base understanding of how physics work. So they think it's acceptible if in a video some element on the background in a next frame might appear in front of another element that is on the foreground.
So it's always necessary to keep correcting LLMs, because they only learn by example, and you can't express any possible outcome of any physical process just by example, because physical processes can be in infinate variations. LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
So you can never really trust an LLM. If we want to make an AI that doesn't make errors, it should understand how physics works.
antasvara
To be clear, I'm not saying that LLM's exclusively make non-human errors. I'm more saying that most errors are happening for different "reasons" than humans.
Think about the strawberry example. I've seen a lot of articles lately where not all misspellings of the word "strawberry" reliably give letter counting errors. The general sentiment there is human, but the specific pattern of misspelling is really more unique to LLM's (i.e. different spelling errors would impact humans versus LLM's).
The part that makes it challenging is that we don't know these "triggers." You could have a prompt that has 95% accuracy, but that inexplicably drops to 50% if the word "green" is in the question (or something like that).
j45
Some of the errors are caused by humans. Say, due to changing the chat to only pay attention to recent messages and not the middle, omitting critical details.
tharkun__
I don't think that's universally true. We have different humans with different levels of ability to catch errors. I see that with my teams. Some people can debug. Some can't. Some people can write tests. Some can't. Some people can catch stuff in reviews. Some can't.
I asked Sonnet 3.7 in Cursor to fix a failing test. While it made the necessary fix, it also updated a hard-coded expected constant to instead be computed using the same algorithm as the original file, instead of preserving the constant as the test was originally written.
Guess what?Guess the number of times I had to correct this from humans doing it in their tests over my career!
And guess where the models learned the bad behavior from.
fn-mote
> Some people can debug. Some can't. Some people can write tests. Some can't.
Wait… really?
No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
People whose skills you use in other ways because they are more productive? Maybe. But still. Clean up after yourself. It’s something that should be learned in the apprentice phase.
tharkun__
Like my sibling says, you can't always choose. That's one side of that coin.
The other is: Some people are naturally good at writing "green field" (or re-writing everything) and do produce actual good software.
But these same people, which you do want to keep around if that's the best you can get, are next to useless when you throw a customer reported bug at them. Takes them ages to figure anything out and they go down endless rabbit holes chasing the wrong path for hours.
You also have people that are super awesome at debugging. They have knack for seeing some brokenness and having the right idea or an idea of the right direction to investigate in right away, can apply the scientific method to test their theories and have the bug fixed in the time it take one of these other people to go down even a single of the rabbit holes they will go down. But these same people in some cases are next to useless if you ask them to properly structure a new green field feature or rewrite parts of something to use some new library coz the old one is no longer maintained or something and digging through said new library and how it works.
Both of these types of people are not bad in and of themselves. Especially if you can't get the unicorns that can do all of these things well (or well enough), e.g. because your company can't or won't pay for it or only for a few of them, which they might call "Staff level".
And you'd be amazed how easy it is to get quite a few review comments in for even Staff level people if you basically ignore their actual code and just jump right into the tests. It's a pet peeve of mine. I start with the tests and go from there when reviewing :)
What you really don't want is if someone is not good at any of these of course.
groby_b
> No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
Those are almost entry stakes at tier-one companies. (There are still people who can't, it's just much less common)
In your average CRUD/enterprise automation/one-off shellscript factory, the state of skills is... not fun.
There's a reason there's the old saw of "some people have twenty years experience, some have the same year 20 times over". People learn & grow when they are challenged to, and will mostly settle at acquiring the minimum skill level that lets them do their particular work.
And since we as an industry decided to pretend we're a "science", not skills based, we don't have a decent apprenticeship system that would force a minimum bar.
And whenever we discuss LLMs and how they might replace software engineering, I keep remembering that they'll be prompted by the people who set that hiring bar and thought they did well.
30minAdayHN
Little tangent: I realized that currently LLMs can't debug because they only have access to the compile time (just code). Many bugs happen due to run time complex state. If I can make LLMs think like a productive Dev who can debug, then would they become more efficient?
I started hacking a small prototype along those lines: https://github.com/hyperdrive-eng/mcp-nodejs-debugger
Hoping I can avoid debug death loop where I get into this bad loop of copy pasting the error and hoping LLM would get it right this one time :)
ipsento606
I've been a professional engineer for over a decade, and in that time I've only had one position where I was expected to write any tests. All my other positions, we have no automated testing of any kind.
david422
I worked with a new co-worker that ... had trouble writing code, and tests. He would write a test that tested nothing. At first I thought he might be green and just needed some direction - we all start somewhere. But he had on his bio that he had 10 years of experience in software dev in the language we were working in. I couldn't quite figure out what the disconnect was, he ended up leaving a short time later.
hobs
Keyword want - most people don't control who their peers are, and complaining to your boss doesn't get you that far, especially when said useless boss is fostering said useless person.
__MatrixMan__
I agree. I've been been struck by how remarkably understandable the errors are. It's quite often something that I'd have done myself if I wasn't paying attention to the right thing.
skerit
Claude Sonnet 3.7 really, really loves to rewrite tests so they'll pass. I've had it happen many times in a claude-code session, I had to add this to each request (though it did not fix it 100%)
- Never disable, skip, or comment out failing unit tests. If a unit test fails, fix the root cause of the exception.
- Never change the unit test in such a way that it avoids testing the failing feature (e.g., by removing assertions, adding empty try/catch blocks, or making tests trivial).
- Do not mark tests with @Ignore or equivalent annotations.
- Do not introduce conditional logic that skips test cases under certain conditions.
- Always ensure the unit test continues to properly validate the intended functionality.
jsight
I'm guessing this is a side effect of mistakes in the reinforcement learning face. It'd be really easy to build a reward model that favors passing tests, without properly measuring the quality of those tests.
sorokod
You may find this interesting: "AI Mistakes Are Very Different from Human Mistakes"
https://www.schneier.com/blog/archives/2025/01/ai-mistakes-a...
woopwoop
Agree, but I would point out that the errors that I make are selected on the fact that I don't notice I'm making them, which tips the scale toward LLM errors being not as bad.
worldsayshi
Yeah it's the reason pair programming is nice. Now the bugs need to pass two filters instead of one. Although I suppose LLM's aren't that good at catching my bugs without me pointing them out.
diggan
I've found both various ChatGPT and Claude to be pretty good at finding unknown bugs, but you need a somewhat hefty prompt.
Personally I use a prompt that goes something like this (shortened here): "Go through all the code below and analyze everything it's doing step-by-step. Then try to explain the overall purpose of the code based on your analysis. Then think through all the edge-cases and tradeoffs based on the purpose, and finally go through the code again and see if you can spot anything weird"
Basically, I tried to think of what I do when I try to spot bugs in code, then I just wrote a reusable prompt that basically repeats my own process.
vanschelven
Nevermind designing _systems_ that account for this, even just debugging such errors is much harder than ones you create yourself:
fragmede
For that case, it sounds more like having your tools commit for you after each change, as is the default for Aider, is the real winner. "git log -p" would have exposed that crazy import in minutes instead of hours.
commit early, commit often.
danenania
I’m working an AI coding agent[1], and all changes accumulate in a sandbox by default that is isolated from the project.
Auto-commit is also enabled (by default) when you do apply the changes to your project, but I think keeping them separated until you review is better for higher stakes work and goes a long way to protect you from stray edits getting left behind.
zahlman
FTA:
> Note that it took me about two hours to debug this, despite the problem being freshly introduced. (Because I hadn’t committed yet, and had established that the previous commit was fine, I could have just run git diff to see what had changed).
> In fact, I did run git diff and git diff --staged multiple times. But who would think to look at the import statements? The import statement is the last place you’d expect a bug to be introduced.
dpacmittal
I just prompted cursor to remove a string from a svelte app. It created a boolean variable showString, set it as false and then proceeded to use that to hide the string
rzk
> The LLM knows nothing about your requirements. When you ask it to do something without specifying all of the constraints, it will fill in all the blanks with the most probable answers from the universe of its training set. Maybe this is fine. But if you need something more custom, it’s up to you to actually tell the LLM about it.
Reminds of the saying:
“To replace programmers with AI, clients will have to accurately describe what they want.
We're safe.”
jonahx
> “To replace programmers with AI, clients will have to accurately describe what they want. We're safe.”
I've had similar sentiments often and it gets to the heart of things.
And it's true... for now.
The caveat is that LLMs already can, in some cases, notice that you are doing something in a non-standard way, or even sub-optimal way, and make "Perhaps what you meant was..." type of suggestions. Similarly, they'll offer responses like "Option 1", "Option 2", etc. Ofc, most clients want someone else to sort through the options...
Also, LLMs don't seem to be good at assessment across multiple abstraction levels. Meaning, they'll notice a better option given the approach directly suggested by your question, but not that the whole approach is misguided and should be re-thought. The classic XY problem (https://en.wikipedia.org/wiki/XY_problem).
In theory, though, I don't see why they couldn't keep improving across these dimensions. With that said, even if they do, I suspect many people will still pay a human to interact with the LLM for them for complex tasks, until the difference between human UI and LLM UI all but vanishes.
daxfohl
Yeah, the difference having a human in the loop makes is the ability to have that feedback. Did you think about X? Requirement Y is vague. Z and W seem to conflict.
Up to now, all our attempts to "compile" requirements to code have failed, because it turns out that specifying every nuance into a requirements doc in one shot is unreasonable; you may as well skip the requirements in English and just write them in Java at that point.
But with AI assistants, they can (eventually, presumptively) enable that feedback loop, do the code, and iterate on the requirements, all much faster and more precisely than a human could.
Whether that's possible remains to be seen, but I'd not say human coders are out of the woods just yet.
colonCapitalDee
> Preparatory Refactoring says that you should first refactor to make a change easy, and then make the change. The refactor change can be quite involved, but because it is semantics preserving, it is easier to evaluate than the change itself.
> In human software engineering, a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are. Often, your problem space is constrained enough that once you write down all of the requirements, the solution is uniquely determined; without the requirements, it’s easy to devolve into a haze of arguing over particular solutions.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary. But at some point, it’s a good idea to just slog through reading the docs from top-to-bottom, to get a full understanding of what is and is not possible in the software.
> The Walking Skeleton is the minimum, crappy implementation of an end-to-end system that has all of the pieces you need. The point is to get the end-to-end system working first, and only then start improving the various pieces.
> When there is a bug, there are broadly two ways you can try to fix it. One way is to randomly try things based on vibes and hope you get lucky. The other is to systematically examine your assumptions about how the system works and figure out where reality mismatches your expectations.
> The Rule of Three in software says that you should be willing to duplicate a piece of code once, but on the third copy you should refactor. This is a refinement on DRY (Don’t Repeat Yourself) accounting for the fact that it might not necessarily be obvious how to eliminate a duplication, and waiting until the third occurrence might clarify.
These are lessons that I've learned the hard way (for some definition of "learned", these things are simple but not easy), but I've never seen them phrased to succinctly and accurately before. Well done OP!
duxup
"Preparatory Refactoring says that you should first refactor to make a change easy, and then make the change. "
Amen. I'll be refactoring something and a coworker will say "Wow you did that fast." and I'll tell them I'm not done... those PRs were just to prepare for the final work.
Sometimes after all my testing I'll even leave the "prepared" changes in production for a bit just to be 100% sure something strange wasn't missed. THEN the real changes can begin.
skydhash
> a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are.
This is a quick way to determine if you're in the wrong team. When you're trying to determine the requirements and the manager/client is evading you. As if you're supposed to magically have all the answers.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary.
I tried to use the guides and code examples instead (if they exists). One thing that helps a lot when the library is complex, is to have a prototype that you can poke at to learn the domain. Very ugly code, but will help to learn where all the pieces are.
Singletoned
"The Rule of Three" I have been expressing as "it takes 3 points to make a straight line".
Any two points will look as if they are on a straight line, but you need a third point to confirm the pattern as being a straight line
taberiand
Based on the list, LLMs are at a "very smart junior programmer" level of coding - though with a much broader knowledge base than you'd expect from even a senior. They lack bigger-picture thinking, and default to doing what is asked of them instead of what needs to be done.
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
nomel
> and default to doing what is asked of them instead of what needs to be done.
I don't think it's that simple.
From what I've found, there are "attractors" in the statistics. If a part of your problem is too similar to a very common problem, that the LLM saw a million times, the output will be attracted to those overwhelming statistical next-words, which is understandable. That is the problem I run into most often.
Groxx
It's a constant struggle for me too, both "in the large" and small situations. Using a library which provides special-cased versions of common concepts, like "futures"? You'll get non-stop mistakes and misuses, even if you've got correct ones right next to it, or feed it reams of careful documentation. Got a variable with a name that sounds like it might be a dictionary (e.g. `storesByCity`), but it's actually a list? It'll try to iterate over it like a dictionary, point out "bugs" related to unsorted iteration, and will return `var.Values()` instead of `var` when your func returns a list. Practically every single time, even after multiple rounds of "that's a list"-like feedback or giving it the compilation errors. Got a Clean-Code-like structure in some things but not others? Watch as it assumes everything follows it all the time despite massive evidence to the contrary.
They're rather impressive when building common things in common ways, and a LOT of programming does fit that. But once you step outside that they feel like a pretty strong net negative - some occasional positive surprises, but lots of easy-to-miss mistakes.
techpineapple
I ran into this with cursor a lot. It would keep redoing changes that I explicitly told it I didn’t want. I was coding a game and it would assume things like the players gold should increment at a rate of 5 per tick, then keep putting it back when I said remove it!
taberiand
Oh sure, the flip side of doing what was asked is doing what is known - choosing a solution based on familiarity rather than applicability. Also a common trait in juniors in my experience
nomel
Related, the similarities I see when a human runs out of context window are really interesting.
I do a lot of interviews, and the poor performers usually end up running out of working memory and start behaving very similar to an LLM. Corrections/input from me will go into one ear and fall out the other, they'll start hallucinating aspects of the problem statement in an attractor sort of way, they'll get stuck in loops, etc. I write down when this happens in my notes, and it's very consistently 15 minutes. For all of them, it seems to be the lack of familiarity doesn't allow them to compress/compartmentalize the problem into something that fits in their head. I suspect it's similar for the LLM.
DanHulton
This was my thought when browsing this list, too, and it helped crystalize one of the feelings I had when trying to with with LLMs for coding: I'm a senior developer, and I want to develop as a senior developer does, and turn in senior developer-quality code. I _don't_ want to spend the rest of my career in development simply pairing with/babysitting a junior developer who will never learn from their mistakes. It may be quicker in the short run in some cases, but the code won't be as good and I'm likely to burn out, further amplifying the quality issue.
> I expect the models will continue improving though
I try to push back on this every time I see it as an excuse for current model behaviour, because what if they don't? Like, not appreciably enough to make a real difference? What if this is just a fundamental problem that remains with this class of AI?
Sure, we've seen incredible improvements over a short period of time in model capability, but those improvements have been visibly slowing down, and models have gotten much more expensive to train. Not to mention that a lot of the problem issues mentioned in this list are problems that these models have had for several generations now, and haven't gotten appreciably better, even while other model capabilities have.
I'm saying this not to criticize you, but more to draw attention to our tendency to handwave away LLM problems with a nebulous "but they'll get better so they won't be a problem." We don't actually know that, so we should factor that uncertainly into our analysis, not dismiss it as is commonly done.
ezyang
I definitely agree that for current models, the problem is finding where the LLM has comparative advantage. Usually it's something like (1) something boring, (2) something where you don't have any of the low level syntax or domain knowledge, or (3) you are on manager schedule and you need to delegate actual coding.
threeseed
I wonder if people who say LLMs are a smart junior programmer have ever used LLMs for coding or actually worked with a junior programmer before. Because for me the two are not even remotely comparable.
If I ask Claude to do a basic operation on all files in my codebase it won't do it. Half way through it will get distracted and do something else or simply change the operation. No junior programmer will ever do this. And similar for the other examples in the blog.
taberiand
Right, that is their main limitation currently - unable to consider the full system context when operating on a specific feature. But you must work with excellent juniors (or I work with very poor ones) because getting them to think about changes in the context of the bigger picture is a challenge.
qingcharles
This is definitely a huge factor I see in the mistakes. If I hand an LLM some other parts of the codebase along with my request so that it has more context, it makes less mistakes.
These problems are getting solved as LLMs improve in terms of context length and having the tools send the LLM all the information it needs.
ohgr
Yep. My usual sort of conversation with an LLM is MUCH worse than a junior developer...
Write me a parser in R for nginx logs for kubernetes that loads a log file into a tibble.
Fucks sake not normal nginx logs. nginx-ingress.
Use tidyverse. Why are you using base R? No one does that any more.
Why the hell are you writing a regex? It doesn't handle square brackets and the format you're using is wrong. Use the function read_log instead.
No don't write a function called read_log. Use the one from readr you drunk ass piece of shit.
Ok now we're getting somewhere. Now label all the columns by the fields in original nginx format properly.
What the fuck? What have you done! Fuck you I'm going to just do it myself.
... 5 minutes later I did a better job ...
woah
It's a machine, use it like one
LinXitoW
I mean, I'd never expect a junior to do better for such a highly specific task.
I expect I'd have to hand feed them steps, at which point I imagine the LLM will also do much better.
fn-mote
Except for the first paragraph, I couldn’t tell if you were talking to an incompetent junior or an LLM.
I expected the lack of breadth from the junior, actually.
qingcharles
But at the same time it'll write me 2000 lines of really gnarly text parsing code in a very optimized fashion that would have taken a senior dev all day to crank out.
We have to stop trying to compare them to a human, because they are alien. They make mistakes humans wouldn't, and they complete very difficult tasks that would be tedious and difficult for humans. All in the same output.
I'm net-positive from using AI, though. It can definitely remove a lot of tedium.
curious_cat_163
> If I ask Claude to do a basic operation on all files in my codebase it won't do it.
Not sure exactly how you used Claude for this, but maybe try doing this in Cursor (which also uses Claude by default)?
I have had pretty good luck with it "reasoning" about the entire codebase of a small-ish webapp.
zarathustreal
Since when is “do something on every file in my codebase” considered coding?
andoando
Maybe its not but its a comparatively simple task a junior developer can do.
threeseed
Refactoring has been a thing since well forever.
ohgr
Well that's the hard bit I really want help with because it takes time.
I can do the rest myself because I'm not a dribbling moron.
lelanthran
> I expect the models will continue improving though,
How? They've already been trained on all the code in the world at this point, so that's a dead end.
The only other option I see is increasing the context window, which has diminishing returns already (double the window for a 10% increase in accuracy, for example).
We're in a local maxima here.
dcre
This makes no sense. Claude 3.7 Sonnet is better than Claude 3.5 Sonnet and it’s not because it’s trained on more of the world’s code. The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.
lelanthran
> The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.
I didn't say they weren't improving.
I said there's diminishing returns.
There's been more effort put into LLMs in the last two years than in the two years prior, but the gains in the last two years have been much much smaller than in the two years prior.
That's what I meant by diminishing returns: the gains we see are not proportional to the effort invested.
taberiand
One way is mentioned in the article, expanding and improving MCP integrations - give the models the tools to work more effectively within their limitations on problems in the context of the full system.
ezyang
Hi Hacker News! One of the things about this blog that has gotten a bit unwieldy as I've added more entries is that it's a sort of undifferentiated pile of posts. I want some sort of organization system but I haven't found one that's good. Very open to suggestions!
joshka
What about adding a bit more structure and investing in a pattern language approach like what you might find in a book by Fowler or a site like https://refactoring.guru/. You're much of the way there with the naming and content, but could refactor the content a bit better into headings (Problem, Symptoms, Examples, Mitigation, Related, etc.)
You could even pretty easily use an LLM to do most of the work for you in fixing it up.
Add a short 1-2 sentence summary[1] to each item and render that on the index page.
datadrivenangel
Maybe organize them more clearly split between observed pitfalls/blindspots and prescriptions. Some of the articles (Use automatic formatting) are Practice forward, while others are pitfall forward. I like how many of the articles have examples!
smusamashah
How about listing all if these on 1 single page? Will be easy to navigate/find.
ezyang
They are listed on one page right now! Haha
elicash
They're indexed on one page, but you can't scan/scroll through these short posts without clicking because the content itself isn't all on a single page, at least not that I can find.
(I also like the other idea of separating out pitfalls vs. prescriptions.)
smusamashah
As in, all content on one page where the link just takes you to appropriate heading on the same page. These days you can do a lot on a single html.
rav
My suggestion: Change the color of visited links! Adding a "visited" color for links will make it easier for visitors to see which posts they have already read.
cookie_monsta
Some sort of navigation would be nice a prev/next or some way to avoid having to go back to the links page all the time.
All of the pages that I visited were small enough that you could probably wrap them them <details> tags[1] and avoid navigation altogether
[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/de...
incognito124
There was a blog posted here which had a slider for scoring different features (popularity, personal choice, etc). The rankings updated live with slider moves.
Also, take a look at https://news.ycombinator.com/item?id=40774277
Sxubas
To be honest, current format worked perfectly for me: I ended up reading all entries without feeling something was off in how they were organized. I really really liked that each section had a concrete example, please don't remove that for future entries.
Thank you for sharing your insights! Very generous.
mncharity
In "Keep Files Small", there seems a lacuna: "for example, on Cursor 0.45.17, applying 55 edits on a 64KB file takes)."
duxup
I find LLMs WANT TO ANSWER TOO MUCH. If I give them too little data, they're not curious and they'll try to craft an answer when it's nearly impossible for them to be right.
I'll type and hit enter too early and I get an answer and think "This could never be right because I gave you broken sentences and too little." but there it goes answering away, dead wrong.
I would rather the LLM say "yo I don't know what you're talking I need more" but of course they're not really thinking so they don't do that / likely can't.
The LLM nature to run that word math and string SOMETHING together seems like an very serious footgun. Reminds me of the movie 2010 when they discuss how the HAL 9000 couldn't function correctly because it was told to lie despite its core programming to tell the truth. HAVING to answer seems like a serious impediment for AI. I see similar-ish things on google's gemini AI when I ask a question and it says the answer is "no" but then gives all the reasons the answer is clearly "yes".
jredwards
The most annoying thing I've found is that they always assume I'm right. If I ask a question, they assume the answer is yes, and will bend over backwards in an obsequious manner to ensure that I'm correct.
"Why of course, sir, we should absolutely be trying to compile python to assembly in order to run our tests. Why didn't I think of that? I'll redesign our testing strategy immediately."
j_bum
Ugh , I agree.
I would imagine this all comes from fine tuning, or RLHF, whatever is used.
I’d bet LLMs trained on the internet without the final “tweaking” steps would roast most of my questions … which is exactly what I want when I’m wrong without realizing it.
enraged_camel
>> The most annoying thing I've found is that they always assume I'm right.
Not always. The other day I described the architecture of a file upload feature I have on my website. I then told Claude that I want to change it. The response stunned me: it said "actually, the current architecture is the most common method, and it has these strengths over the other [also well-known] method you're describing..."
The question I asked it wasn't "explain the pros and cons of each approach" or even "should I change it". I had more or less made my decision and was just providing Claude with context. I really didn't expect a "what you have is the better way" type of answer.
hnbad
Similarly, with Claude in Cursor I've found that it will assume it's wrong when I even suggest that it might be: "Are you sure that's right? I've not seen that method before" will be followed by "I need to apologize, let me correct myself" and a wrong answer and this'll loop until eventually arriving at a worse version of what it suggested first even if I tell it "Nevermind, you were right in the first place. Let's go with that one".
mulmboy
Yeah I get this. Often I'll prompt it like "my intern looked at this and said maybe you should x. What do you think?"
Seems to help.
bredren
h/t to @knurlknurl on Reddit today shared these methods:
- “I need you to be my red team”(works really well with, Claude seems to understand the term)
“analyze the plan and highlight any weaknesses, counter arguments and blind spots critically review”
> you can't just say "disagree with me", you have to prompt it into adding a "counter check".
duxup
It’s funny AI will happily follow my lead and “bounce too close to a supernova” and I really have to push it to offer something new.
magicmicah85
I’ve been prefacing every code related question with “Do not write code. Ask me clarifying questions and let’s talk this out first”. Seems to help especially with planning and organizing a design rather than monkeying with code fixing it later.
bredren
I incorporate this into system prompts at the start of conversations and still find I have to emphasize it again over course of convos.
magicmicah85
Yeah, they forget as the chat context gets too large. A good example I’ve had is where I’ve been using chartkick to create a lot of charts, and suddenly they want to use another ruby gem. I have to remind them. We’re using chart kick.
duxup
Thank you.
imoreno
It's possible to mitigate this with a conservative system prompt.
duxup
Do you have an example? I’m curious.
otabdeveloper4
> I find LLMs WANT TO ANSWER TOO MUCH.
That's easy to fix. You need to add something like "give a succinct answer in one phrase" to your prompts.
jon_richards
I can’t tell if this was intentional, but it’s a hilarious joke. OP was referring to the decision to provide an “answer”, not the length of the response.
otabdeveloper4
LLMs can't think. They're just fancy autocomplete with a lot of context.
This means you need to prompt them with a text that increases the probability of getting back what you want. Adding something about the length of the response will do that.
lukev
This is exceptionally useful advice, and precisely the way we should be talking about how to engage with LLMs when coding.
That said, I take issue with "Use Static Types".
I've actually had more success with Claude Code using Clojure than I have Typescript (the other thing I tried.)
Clojure emphasizes small, pure functions, to a high degree. Whereas (sometimes) fully understanding a strong type might involve reading several files. If I'm really good with my prompting to make sure that I have good example data for the entity types at each boundary point, it feels like it does a better job.
My intuition is that LLMs are fundamentally context-based, so they are naturally suited to an emphasis on functions over pure data, vs requiring understanding of a larger type/class hierarchy to perform well.
But it took me a while to figure out how to build these prompts and agent rules. A LLM programming in a dynamic language without a human supervising the high-level code structure and data model is a recipe for disaster.
torginus
I have one more - LLMs are terrible at counting and arithmetic - if your code gen relies on cutting off the first two words of a constant string - you better check if you need to cut off 12 characters like the LLM says. If it adds 2 numbers, it might be suspect. If you need it to decode a byte sequence, where getting the numbers from the exact right position is necessary.. you get the idea.
Took me a day to debug my LLM-generated code - and of course, like all fruitless and long debugging sessions, this one started with me assuming that it can't possibly get this wrong - yet it did.
datadrivenangel
Almost all of these are good things to consider with human coders as well. Product managers take note!
https://ezyang.github.io/ai-blindspots/requirements-not-solu...
teraflop
> I had some test cases with hard coded numbers that had wobbled and needed updating. I simply asked the LLM to keep rerunning the test and updating the numbers as necessary.
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
ezyang
In fact, the test framework I was using at the time (jest) did in fact support this. But the person who had originally written the tests hadn't had the foresight to use snapshot tests for this failing test!
diggan
I don't know if your message is a continuation of the sarcasm (I feel like maybe no?), but I'm pretty sure parent's joke is that if you just change the expected values whenever the code changes, you aren't really effectively "testing" anything as much as "recording" outputs.
akomtu
LLMs aren't AI. They are more like librarians with eidetic memory: they can discuss in depth any book in the library, but sooner or later you notice that they don't really understand what they are talking about.
One easy test for AI-ness is the optimization problem. Give it a relatively small, but complex program, e.g. a GPU shader on shadertoy.com, and tell it to optimize it. The output is clearly defined: it's an image or an animation. It's also easy to test how much it's improved the framerate. What's good is this task won't allow the typical LLM bullshitting: if it doesn't compile or doesn't draw a correct image, you'll see it.
The thing is, the current generation of LLMs will blunder at this task.
ezyang
The thing is that, as many junior engineers can attest, randomly blundering around can still give you something useful! So you definitely can get value out of AI coding with the current generation of models.
xigency
I can't wait to see the future of randomly blundered tech as we continue to sideline educated, curious, and discerning human engineers from any salaried opportunity to apply their skills.
I've been working as a computer programmer professionally since I was 14 years old and in the two decades since I've been able to get paid work about ~50% of the time.
Pretty gnarly field to be in I must say. I rather wish I had studied to be a dentist. Then I might have some savings and clout to my name and would know I am helping to spread more smiles.
And for the cult of matrix math if >50% of people are dissatisfied with the state of the something, don't be surprised if a highly intelligent and powerful entity becoming aware of this fact engages in rapid upheaval.
shihab
Today I came across an interesting case where 3 well-known LLMs (O1, sonnet 3.7 and Deepseek R1) found a "bug" that actually didn't exist.
Very briefly, in a fused cuda kernel, I was using thread i to do some stuff on locations i, i+N, i+2*N of an array. Later in the same kernel, same thread operated on i,i+1,i+2. All LLMs flagged the second part as bug. Not the most optimized code maybe, but definitely not a bug.
It wasn't a complicated kernel (~120 SLOC) either, and the distance between the two code blocks was about only 15 LOC.
This highlights a thing I've seen with LLM's generally: they make different mistakes than humans. This makes catching the errors much more difficult.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLM's, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLM's "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.