
Hallucinations in code are the least dangerous form of LLM mistakes

Terr_

[Recycled from an older dupe submission]

As much as I've agreed with the author's other posts/takes, I find myself resisting this one:

> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.

No, that does not follow.

1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from the same time reviewing anonymous contributions.

2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.

3. Motivation is important; for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.

However you do it, somebody else should still be reviewing the change afterwards.

notepad0x90

My fear is that LLM generated code will look great to me, I won't understand it fully but it will work. But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws. Especially if you consider coding as piecing together things instead of implementing a well designed plan. Lots of pieces making up the whole picture but a lot of those pieces are now put there by an algorithm making educated guesses.

Perhaps I'm just not that great of a coder, but I do have lots of code where if someone took a look at it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that, they won't take risks a human would or understand the implications of a block of code beyond its application in that specific context.

Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM generated code is neat but I still have to spend similar amounts of time, except now I'm doing more QA and clean up work instead of debugging and figuring out new solutions, which isn't fun at all.

noisy_boy

I do these things for this:

- keep the outline in my head: I don't give up the architect's seat. I decide which module does what and how it fits in the whole system, its contract with other modules etc.

- review the code: this can be construed as negating the point of LLMs as this is time consuming but I think it is important to go through line by line and understand every line. You will absorb some of the LLM generated code in the process which will form an imperfect map in your head. That's essential for beginning troubleshooting next time things go wrong.

- last mile connectivity: several times the LLM takes you there but can't complete the last mile; instead of wasting time chasing it, do the final wiring yourself. This is a great shortcut to achieve the previous point.

zahlman

The way you've written this comes across like the AI is influencing your writing style....

plxxyzs

Three bullet points, each with three sentences (ok last one has a semicolon instead) is a dead giveaway

happymellon

> This is a great shortcut to achieve the previous point.

How does doing the hard part provide a shortcut for reviewing all the LLM code?

If anything it's a long cut, because now you have to understand the code and write it yourself. This isn't great, it's terrible.

intended

I think this is a great line:

> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work

This is a degree of humility that makes the scenario we are in much clearer.

Our information environment got polluted by the lack of such humility. Rhetoric that sounds ‘right’ is used everywhere. If it looks like an Oxford Don and sounds like an Oxford Don, then it must be an academic. Thus it is believable, even if they are saying the Titanic isn’t sinking.

Verification is the heart of everything humanity does, our governance structures, our judicial systems, economic systems, academia, news, media - everything.

It’s a massive computational effort to figure out the best ways to allocate resources given current information, allowing humans to create surplus and survive.

This is why we dislike monopolies, or manipulations of these markets - they create bad goods, and screw up our ability to verify what is real.

sunami-ai

Worst part is that the patterns of implementation won't be consistent across the pieces. So debugging a whole codebase that was authored with LLM-generated code is like having to debug a codebase where every function was written by a different developer and no one followed any standards. I guess you can specify the coding standards in the prompt and ask it to use FP-style programming only, but I'm not sure how well it can follow them.

QuiDortDine

Not well, at least for ChatGPT. It can't follow my custom instructions which can be summed up as "follow PEP-8 and don't leave trailing whitespace".

eru

> But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws.

Alas, I don't share your optimism about code I wrote myself. In fact, it's often harder to find flaws in my own code than when reading someone else's code.

Especially if 'this is too complicated for me to review, please simplify' is allowed as a valid outcome of my review.

ajmurmann

To fight this I mostly do ping-pong pairing with LLMs. After we discuss the general goal and approach I usually write the first test. The LLM then makes it pass and writes the next test, which I'll make pass, and so on. It forces me to stay 100% in the loop and understand everything. Maybe it's not as fast as having the LLM write as much as possible but I think it's a worthwhile tradeoff.
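A minimal sketch of what one round of that ping-pong can look like, assuming pytest and a made-up slugify helper (the human serves a failing test, the LLM makes it pass and serves the next test back):

    import re

    # Round 1 -- human serves: a failing test that pins down the intent.
    def test_slugify_basic():
        assert slugify("Hello, World!") == "hello-world"

    # Round 2 -- LLM returns: make the test pass, then write the next test.
    def slugify(text: str) -> str:
        # Lowercase, collapse runs of non-alphanumerics into single hyphens,
        # strip hyphens from the ends.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    def test_slugify_collapses_whitespace():
        # The LLM's follow-up test; the human makes this one pass next round.
        assert slugify("  many   spaces  ") == "many-spaces"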

hakaneskici

When it comes to relying on code that you didn't write yourself, like an npm package, do you care if it's AI code or human code? Do you think your trust toward AI code may change over time?

sfink

Of course I care. Human-written code was written for a purpose, with a set of constraints in mind, and other related code will have been written for the same or a complementary purpose and set of constraints. There is intention in the code. It is predictable in a certain way, and divergences from the expected are either because I don't fully understand something about the context or requirements, or because there's a damn good reason. It is worthwhile to dig further until I do understand, since it will very probably have repercussions elsewhere and elsewhen.

For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.

Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower, you'll end up inventing an intent that doesn't exist.

gunian

what if you

- generate
- lint
- format
- fuzz
- test
- update

infinitely?
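Taken literally, that loop looks something like the sketch below (the tool choices are just examples, and generate_patch is a placeholder for whatever asks the LLM for the next revision); the hard part is that "all checks green" still isn't "correct", and you need a budget so it actually stops:

    import subprocess

    def run(*cmd: str) -> bool:
        # Run a tool and report whether it exited cleanly.
        return subprocess.run(cmd).returncode == 0

    def generate_patch(feedback: str) -> None:
        # Placeholder: ask the LLM for the next revision of the code.
        raise NotImplementedError

    feedback = "initial prompt"
    while True:  # "infinitely" -- in practice you want an iteration budget
        generate_patch(feedback)
        checks = [
            ("lint", run("ruff", "check", ".")),
            ("format", run("black", "--check", ".")),
            ("test/fuzz", run("pytest", "-q")),  # property-based tests can live here
        ]
        failed = [name for name, ok in checks if not ok]
        if not failed:
            break  # all green -- which still doesn't mean the code is right
        feedback = "these stages failed: " + ", ".join(failed)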

PessimalDecimal

Publicly available code with lots of prior usage seems less likely to be buggy than LLM-generated code produced on-demand and for use only by me.

fuzztester

>My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.

puzzled. if you don't understand it fully, how can you say that it will look great to you, and that it will work?

raincole

It happens all the time. Way before LLMs. There were countless times I implemented an algorithm from a paper or a book while not fully understanding it (in other words, I can't prove the correctness or time complexity without referencing the original paper).

Nevermark

> if you don't understand it fully, how can you say that it will look great to you, and that it will work?

Presumably, that simply reflects that a primary developer always has a more reliable understanding of a large code base - and of the insights into the problem that come about during development challenges - than a reviewer of such code.

A lot of important but subtle insights into a problem, many of them sub-verbal, come from going through the large and small challenges of creating something that solves it. Reviewers just don't get those insights as reliably.

Reviewers can't see all the subtle or non-obvious alternate paths or choices. They are less likely to independently identify subtle traps.

unclebucknasty

All of this. Could have saved me a comment [0] if I'd seen this earlier.

When people talk about 30% or 50% coding productivity gains with LLMs, I really want to know exactly what they're measuring.

[0] https://news.ycombinator.com/item?id=43236792

layer8

> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.

johnrob

I’m not sure it’s possible to have the full reasoning in your head without authoring the code yourself - or, spending a comparable amount of effort to mentally rewrite it.

skydhash

Which is why everyone is so keen on standards (conventions, formatting, architecture, ...): it's less of a burden when you're just comparing expected to actual than when you're learning unknowns.

layer8

I tend to agree, which is why I’m skeptical about large-scale LLM code generation, until AIs exhibit reliable diligence and more general attention and awareness, and probably also long-term memory about a code base and its application domain.

Snuggly73

Agree - case in point - dealing with race conditions. You have to reason thru the code.
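A toy illustration (made up, not anyone's real code): the check-then-act below is racy because nothing locks the window between the check and the subtraction, yet the test will essentially always stay green on CPython because the GIL's scheduling rarely, if ever, interleaves threads inside that window - the bug is only visible by reading the code:

    import threading

    balance = 100

    def withdraw(amount: int) -> None:
        global balance
        if balance >= amount:   # check ...
            balance -= amount   # ... then act, with no lock around the pair

    def test_no_overdraft():
        global balance
        balance = 100
        threads = [threading.Thread(target=withdraw, args=(100,)) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # With unlucky interleaving both withdrawals pass the check before either
        # subtracts, leaving -100. You will almost never observe that in a test
        # run, which is exactly the point.
        assert balance >= 0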

nnnnico

not sure that human reasoning actually beats testing when checking for correctness

ljm

The production of such tests presumably requires an element of human reasoning.

The requirements have to come from somewhere, after all.

layer8

Both are necessary, they complement each other.

fragmede

Human reason is fine, the problem is that human attention spans aren't great at checking for correctness. I want every corner case regression tested automatically because there's always going to be some weird configuration that a human's going to forget to regression test.

dmos62

Well, what if you run a complete test suite?

layer8

There is no complete test suite, unless your code is purely functional and has a small-ish finite input domain.
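For contrast, the rare case where a complete suite does exist looks something like this (made-up example): a pure function over a deliberately tiny domain, checked exhaustively against a reference - and almost nothing we ship looks like that:

    def sat_add8(a: int, b: int) -> int:
        # Saturating addition on unsigned 8-bit values.
        return min(a + b, 255)

    def test_sat_add8_exhaustive():
        # The entire input domain is 256 * 256 pairs, so literally test all of it.
        for a in range(256):
            for b in range(256):
                expected = a + b if a + b <= 255 else 255
                assert sat_add8(a, b) == expected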

suzzer99

And even then, your code could pass all tests but be a spaghetti mess that will be impossible to maintain and add features to.

MattSayar

Seems to be a bit of a catch 22. No LLM can write perfect code, and no test suite can catch all bugs. Obviously, no human can write perfect code either.

If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.

shakna

If the complete test suite were enough, then SQLite, which famously has one of the largest and most comprehensive test suites, would not encounter bugs. However, they still do.

If you employ AI, you're adding a remarkable amount of speed to a processing domain that is undecidable because most inputs are not finite. Eventually you will end up reconsidering the Gambler's Fallacy, because of the chances of things going wrong.

e12e

You mean, for example test that your sieve finds all primes, and only primes that fit in 4096 bits?

bandrami

Paging Dr. Turing. Dr. Turing, please report to the HN comment section.

atomic128

Last week, The Primeagen and Casey Muratori carefully reviewed the output of a state-of-the-art LLM code generator.

They provided a task well-represented in the LLM's training data, so development should have been easy. The task was presented as a cumulative series of modifications to a codebase:

https://www.youtube.com/watch?v=NW6PhVdq9R8

This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.

mercer

In my own experience, I have all sorts of ways that I try to 'drag' the LLM out of some line of 'thinking' by editing the conversation as a whole, or just restarting the whole prompt, and I've been kind of just doing this over time since GPT-3.

While I still think all this code generation is super cool, I've found that the 'density' of the code makes it even more noticeable - and often annoying - when the model latches on to, say, some part of the conversation that should essentially be pruned from the whole thinking process, or pursues some part of earlier code that makes no sense to me, and I have to 'coax' it again.

t_mann

Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk (in simple terms of probability times severity) I'd say is that chat bots can talk humans into harming themselves or others. Both of which have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.

[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...

[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...

tombert

I don't know if the model changed in the last six months, or maybe the wow factor has worn off a bit, but it also feels like ChatGPT has become a lot more "people-pleasy" than it was before.

I'll ask it opinionated questions, and it will just do stuff to reaffirm what I said, even when I give contrary opinions in the same chat.

I personally find it annoying (I don't really get along with human people pleasers either), but I could see someone using it as a tool to justify doing bad stuff, including self-harm; it doesn't really ever push back on what I say.

unclebucknasty

Yeah, I think it's coded to be super-conciliatory as some sort of apology for its hallucinations, but I find it annoying as well. Part of it is just like all automated prompts that try to be too human. When you know it's not human, it's almost patronizing and just annoying.

But, it's actually worse, because it's generally apologizing for something completely wrong that it told you just moments before with extreme confidence.

renewiltord

It's obvious, isn't it? The average Hacker News user, who has converged to the average Internet user, wants exactly that experience. LLMs are pretty good tools but perhaps they shouldn't be made available to others. People like me can use them but others seem to be killed when making contact. I think it's fine to restrict access to the elite. We don't let just anyone fly a fighter jet. Perhaps the average HN user should be protected from LLM interactions.

tombert

Is that really what you got from what I wrote? I wasn't suggesting that we restrict access to anyone, and I wasn't trying to imply that I'm somehow immune to the problems that were highlighted.

I mentioned that I don't like people-pleasers and I find it a bit obnoxious when ChatGPT does it. I'm sure that there might be other bits of subtle encouragement it gives me that I don't notice, but I can't elaborate on those parts because, you know, I didn't notice them.

I genuinely do not know how you got "we should restrict access" from my comment or the parent, you just extrapolated to make a pretty stupid joke.

hexaga

More generally - AI that is good at convincing people is very powerful, and powerful things are dangerous.

I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.

Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.

Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.

southernplaces7

>Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.

Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.

The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings anyhow.

Aside from demonstrating the persistent AI woo that permeates many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kids do bad things", in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI.

kjs3

Yeah...this. I'm not so concerned that AI is going to put me out of a job or become Skynet. I'm concerned that people are offloading decision making and critical thinking to the AI, accepting its response at face value and responding to concerns with "the AI said so...must be right". Companies are already maliciously exploiting this (e.g. the AI has denied your medical claim, and we can't tell you how it decided that because our models are trade secrets), but it will soon become de rigueur and people will think you're weird for questioning the wisdom of the AI.

Nevermark

The combination of blind faith in AI, and good-faith, highly informed understanding and agreement achieved with the help of AI, covers the full spectrum of the problem.

southernplaces7

In both of your linked examples, the people in question very likely had at least some sort of mental instability working in their minds.

I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.

The kinds of people who would be convinced by such "harm dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings, or by books, or movies or any other sort of excuse for a mind that had problems well before seeing X or Y.

By the logic of regulating AI for these supposed dangers, you could argue that literature, movie content, comic books, YouTube videos and that much loved boogeyman in previous years of violent video games should all be banned or regulated for the content they express.

Such notions have a strongly nannyish, prohibitionist streak that's much more dangerous than some algorithm and the bullshit it spews to a few suggestible individuals.

The media of course loves such narratives, because their breathless hysteria and contrived fear-mongering play right into more eyeballs. Seeing people again take seriously such nonsense after idiocies like the media frenzy around video games in the early 2000s and, prior to that, similar media fits about violent movies and even literature, is sort of sad.

We don't need our tools for expression, and sources of information "regulated for harm" because a small minority of others can't get an easy grip on their psychological state.

zahlman

Is this somehow worse than humans talking each other into it?

objectified

> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).

And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).
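A cheap first version of that layer (a sketch, not real supply-chain protection) could just flag any import in generated code that doesn't resolve to something already installed, so "pip install whatever the new name is" becomes a deliberate step instead of a reflex. Standard library only, Python 3.10+:

    import ast
    import sys
    from importlib.metadata import packages_distributions

    def unvetted_imports(source: str) -> set[str]:
        # Top-level modules imported by `source` that are neither stdlib
        # nor provided by an already-installed distribution.
        modules = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
        known = set(packages_distributions()) | set(sys.stdlib_module_names)
        return {m for m in modules if m not in known}

    generated = "import requests\nimport totally_real_http_utils\n"
    print(unvetted_imports(generated))  # {'totally_real_http_utils'} if requests is installed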

namaria

Prompt injection means that unless people using LLMs to generate code are willing to hunt down and inspect all dependencies, it will become extremely easy to spread malware.

verbify

An anecdote: I was working for a medical centre, and had some code that was supposed to find the 'main' clinic of a patient.

The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.

I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.

I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.

This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.
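For concreteness, a stripped-down version of that kind of sort (field names made up - the real column was just called "Clinical"), where whether the flag is negated in the key is exactly the detail a reader or an LLM can silently get backwards:

    from datetime import date

    appointments = [
        {"clinical": False, "when": date(2024, 3, 1)},
        {"clinical": True,  "when": date(2023, 1, 15)},
        {"clinical": True,  "when": date(2022, 6, 30)},
    ]

    # Spec: the "main" clinic comes from the most recent clinical appointment,
    # falling back to the most recent appointment of any kind if none is clinical.
    # max() compares the key tuples: True beats False, then later dates win.
    main = max(appointments, key=lambda a: (a["clinical"], a["when"]))
    assert main["when"] == date(2023, 1, 15)

    # The one-character trap: negate the flag (or describe the tuple backwards,
    # as the generated documentation did) and you now prefer non-clinical visits.
    wrong = max(appointments, key=lambda a: (not a["clinical"], a["when"]))
    assert wrong["when"] == date(2024, 3, 1)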

zahlman

To be clear, by "clinical-non-clinical", you mean a boolean flag for whether the appointment is clinical?

bigstrat2003

> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".

JusticeJuice

You'd be surprised. I know a few people who couldn't really code before LLMs, but now with LLMs they can just brute-force through problems. They seem pretty undeterred about 'trusting' the solution; if they ran it and it worked for them, it gets shipped.

tcoff91

Well I hope this isn’t backend code because the amount of vulnerabilities that are going to come from these practices will be staggering

nojs

> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.

zahlman

IMX, quite a few Python users - including ones who think they know what they're doing - run into that same confusion, because they haven't properly understood fundamentals e.g. about how virtual environments work, or how to read documentation effectively. Or sometimes just because they've been careless and don't have good habits for ensuring (or troubleshooting) the environment.
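The usual way to settle that confusion (nothing magic, just making the environment visible) is to ask the running interpreter what it is and whether it can actually see the package, before blaming the code or the library:

    import sys
    from importlib.metadata import version, PackageNotFoundError

    print(sys.executable)  # the python binary actually running this
    print(sys.prefix)      # the (virtual) environment it belongs to

    # Is the dependency visible to *this* environment, and at what version?
    try:
        print(version("requests"))  # any package name; "requests" is just an example
    except PackageNotFoundError:
        print("not installed here: wrong venv, missing install, or a hallucinated package")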

not2b

If the hallucinated code doesn't compile (or in an interpreted language, immediately throws exceptions), then yes, that isn't risky because that code won't be used. I'm more concerned about code that appears to work for some test cases but solves the wrong problem or inadequately solves the problem, and whether we have anyone on the team who can maintain that code long-term or document it well enough so others can.

wavemode

I once submitted some code for review, in which the AI had inserted a recursive call to the same function being defined. The recursive call was completely unnecessary and completely nonsensical, but also not wrong per se - it just caused the function to repeat what it was doing. The code typechecked, the tests passed, and the line of code was easy to miss while doing a cursory read through the logic. I missed it, the code reviewer missed it, and eventually it merged to production.

Unfortunately there was one particular edge case which caused that recursive call to become an infinite loop, and I was extremely embarrassed seeing that "stack overflow" server error alert come through Slack afterward.
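A distilled, made-up version of how a line like that slips through: the recursion below is pointless but harmless on every input the tests exercised, and only fails to make progress on the one edge case nobody tested:

    def first_nonempty_line(text: str) -> str:
        # Return the first non-empty line of `text`.
        for line in text.splitlines():
            if line.strip():
                return line
        # The unnecessary recursive call a cursory read slides right past.
        # Every test input had at least one non-empty line, so this never ran --
        # until production passed a whitespace-only string, where stripping
        # doesn't help and the call recurses forever: stack overflow.
        return first_nonempty_line(text.strip())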

t14n

fwiw this problem already exists with my more junior co-workers. and also my own code that I write when exhausted!

if you have trusted processes for review and aren't always rushing out changes without triple checking your work (plus a review from another set of eyes), then I think you catch a lot of the subtler bugs that are emitted from an LLM.

01100011

Timely article. I really, really want AI to be better at writing code, and hundreds of reports suggest it works great if you're a web dev or a python dev. Great! But I'm a C/C++ systems guy(working at a company making money off AI!) and the times I've tried to get AI to write the simplest of test applications against a popular API it mostly failed. The code was incorrect, both using the API incorrectly and writing invalid C++. Attempts to reason with the LLMs(grokv3, deepseek-r1) led further and further away from valid code. Eventually both systems stopped responding.

I've also tried Cursor with similar mixed results.

But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.

It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.

tombert

I use ChatGPT to generate code a lot, and it's certainly useful, but it has given me issues that are not obvious.

For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.

Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.

zahlman

>though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.

They used to do the same with Stack Overflow. But now it's more dangerous, because the code can be "subtly wrong in ways the user can't fathom" to order.

KoolKat23

And subtle bugs existed pre-2022; judging by how often my apps are updated for "minor bug fixes", this is par for the course.

tombert

Sure, it's possible that the code it gave me was based on some incorrectly written code it scraped from Gitlab or something.

I'm not a luddite, I'm perfectly fine with people using AI for writing code. The only thing that really concerns me is that it has the potential to generate a ton of shitty code that doesn't look shitty, creating a lot of surface area for debugging.

Prior to AI, the quantity of crappy code that could be generated was basically limited by the speed in which a human could write it, but now there's really no limit.

Again, just to reiterate, this isn't "old man yells at cloud". I think AI is pretty cool, I use it all the time, I don't even have a problem with people generating large quantities of code, it's just something we have to be a bit more wary of.