
Hallucinations in code are the least dangerous form of LLM mistakes

Terr_

[Recycled from an older dupe submission]

As much as I've agreed with the author's other posts/takes, I find myself resisting this one:

> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.

No, that does not follow.

1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from the same time reviewing anonymous contributions.

2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.

3. Motivation is important; for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.

However you do it, somebody else should still be reviewing the change afterwards.

notepad0x90

My fear is that LLM generated code will look great to me, I won't understand it fully but it will work. But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws. Especially if you consider coding as piecing together things instead of implementing a well designed plan. Lots of pieces making up the whole picture but a lot of those pieces are now put there by an algorithm making educated guesses.

Perhaps I'm just not that great of a coder, but I do have lots of code where if someone took a look at it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that, they won't take risks a human would or understand the implications of a block of code beyond its application in that specific context.

Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM generated code is neat but I still have to spend similar amounts of time, except now I'm doing more QA and clean up work instead of debugging and figuring out new solutions, which isn't fun at all.

noisy_boy

I do these things for this:

- keep the outline in my head: I don't give up the architect's seat. I decide which module does what and how it fits in the whole system, its contract with other modules etc.

- review the code: this can be construed as negating the point of LLMs as this is time consuming but I think it is important to go through line by line and understand every line. You will absorb some of the LLM generated code in the process which will form an imperfect map in your head. That's essential for beginning troubleshooting next time things go wrong.

- last mile connectivity: several times the LLM takes you there but can't complete the last mile; instead of wasting time chasing it, do the final wiring yourself. This is a great shortcut to achieve the previous point.

zahlman

The way you've written this comes across like the AI is influencing your writing style....

plxxyzs

Three bullet points, each with three sentences (ok last one has a semicolon instead) is a dead giveaway

happymellon

> This is a great shortcut to achieve the previous point.

How does doing the hard part provide a shortcut for reviewing all the LLM code?

If anything it's a long cut, because now you have to understand the code and write it yourself. This isn't great, it's terrible.

intended

I think this is a great line:

> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work

This is a degree of humility that makes the scenario we are in much clearer.

Our information environment got polluted by the lack of such humility. Rhetoric that sounds ‘right’ is used everywhere. If it looks like an Oxford Don and sounds like an Oxford Don, then it must be an academic. Thus it is believable, even if they are saying the Titanic isn’t sinking.

Verification is the heart of everything humanity does, our governance structures, our judicial systems, economic systems, academia, news, media - everything.

It’s a massive computational effort to figure out the best ways to allocate resources given current information, allowing humans to create surplus and survive.

This is why we dislike monopolies, or manipulations of these markets - they create bad goods, and screw up our ability to verify what is real.

sunami-ai

Worst part is that the patterns of implementation won't be consistent across the pieces. So debugging a whole codebase that was authored with LLM-generated code is like having to debug a codebase where every function was written by a different developer and no one followed any standards. I guess you can specify the coding standards in the prompt and ask it to use FP-style programming only, but I'm not sure how well it can follow them.

QuiDortDine

Not well, at least for ChatGPT. It can't follow my custom instructions which can be summed up as "follow PEP-8 and don't leave trailing whitespace".

eru

> But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws.

Alas, I don't share your optimism about code I wrote myself. In fact, it's often harder to find flaws in my own code than when reading someone else's code.

Especially if 'this is too complicated for me to review, please simplify' is allowed as a valid outcome of my review.

ajmurmann

To fight this I mostly do ping-pong pairing with LLMs. After we discuss the general goal and approach I usually write the first test. The LLM then makes it pass and writes the next test, which I'll make pass, and so on. It forces me to stay 100% in the loop and understand everything. Maybe it's not as fast as having the LLM write as much as possible but I think it's a worthwhile tradeoff.
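A minimal sketch of what one round of that ping-pong can look like, assuming pytest and a made-up slugify helper (the human serves a failing test, the LLM makes it pass and serves the next test back):

    import re

    # Round 1 -- human serves: a failing test that pins down the intent.
    def test_slugify_basic():
        assert slugify("Hello, World!") == "hello-world"

    # Round 2 -- LLM returns: make the test pass, then write the next test.
    def slugify(text: str) -> str:
        # Lowercase, collapse runs of non-alphanumerics into single hyphens,
        # strip hyphens from the ends.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    def test_slugify_collapses_whitespace():
        # The LLM's follow-up test; the human makes this one pass next round.
        assert slugify("  many   spaces  ") == "many-spaces"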

hakaneskici

When it comes to relying on code that you didn't write yourself, like an npm package, do you care if it's AI code or human code? Do you think your trust toward AI code may change over time?

sfink

Of course I care. Human-written code was written for a purpose, with a set of constraints in mind, and other related code will have been written for the same or a complementary purpose and set of constraints. There is intention in the code. It is predictable in a certain way, and divergences from the expected are either because I don't fully understand something about the context or requirements, or because there's a damn good reason. It is worthwhile to dig further until I do understand, since it will very probably have repercussions elsewhere and elsewhen.

For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.

Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower, you'll end up inventing an intent that doesn't exist.

gunian

what if you

- generate
- lint
- format
- fuzz
- test
- update

infinitely?
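Taken literally, that loop looks something like the sketch below (the tool choices are just examples, and generate_patch is a placeholder for whatever asks the LLM for the next revision); the hard part is that "all checks green" still isn't "correct", and you need a budget so it actually stops:

    import subprocess

    def run(*cmd: str) -> bool:
        # Run a tool and report whether it exited cleanly.
        return subprocess.run(cmd).returncode == 0

    def generate_patch(feedback: str) -> None:
        # Placeholder: ask the LLM for the next revision of the code.
        raise NotImplementedError

    feedback = "initial prompt"
    while True:  # "infinitely" -- in practice you want an iteration budget
        generate_patch(feedback)
        checks = [
            ("lint", run("ruff", "check", ".")),
            ("format", run("black", "--check", ".")),
            ("test/fuzz", run("pytest", "-q")),  # property-based tests can live here
        ]
        failed = [name for name, ok in checks if not ok]
        if not failed:
            break  # all green -- which still doesn't mean the code is right
        feedback = "these stages failed: " + ", ".join(failed)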

PessimalDecimal

Publicly available code with lots of prior usage seems less likely to be buggy than LLM-generated code produced on-demand and for use only by me.

fuzztester

>My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.

puzzled. if you don't understand it fully, how can you say that it will look great to you, and that it will work?

raincole

It happens all the time. Way before LLMs. There were countless times I implemented an algorithm from a paper or a book while not fully understanding it (in other words, I can't prove the correctness or time complexity without referencing the original paper).

Nevermark

> if you don't understand it fully, how can you say that it will look great to you, and that it will work?

Presumably, that simply reflects that a primary developer always has a more reliable understanding of a large code base - and of the insights into the problem that come about during development challenges - than a reviewer of such code.

A lot of important but subtle insights into a problem, many of them sub-verbal, come from going through the large and small challenges of creating something that solves it. Reviewers just don't get those insights as reliably.

Reviewers can't see all the subtle or non-obvious alternate paths or choices. They are less likely to independently identify subtle traps.

unclebucknasty

All of this. Could have saved me a comment [0] if I'd seen this earlier.

When people talk about 30% or 50% coding productivity gains with LLMs, I really want to know exactly what they're measuring.

[0] https://news.ycombinator.com/item?id=43236792

layer8

> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.

johnrob

I’m not sure it’s possible to have the full reasoning in your head without authoring the code yourself - or, spending a comparable amount of effort to mentally rewrite it.

skydhash

Which is why everyone is so keen on standards (conventions, formatting, architecture, ...): it's less of a burden when you're just comparing expected to actual than when you're learning unknowns.

layer8

I tend to agree, which is why I’m skeptical about large-scale LLM code generation, until AIs exhibit reliable diligence and more general attention and awareness, and probably also long-term memory about a code base and its application domain.

Snuggly73

Agree - case in point - dealing with race conditions. You have to reason thru the code.
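A toy illustration (made up, not anyone's real code): the check-then-act below is racy because nothing locks the window between the check and the subtraction, yet the test will essentially always stay green on CPython because the GIL's scheduling rarely, if ever, interleaves threads inside that window - the bug is only visible by reading the code:

    import threading

    balance = 100

    def withdraw(amount: int) -> None:
        global balance
        if balance >= amount:   # check ...
            balance -= amount   # ... then act, with no lock around the pair

    def test_no_overdraft():
        global balance
        balance = 100
        threads = [threading.Thread(target=withdraw, args=(100,)) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # With unlucky interleaving both withdrawals pass the check before either
        # subtracts, leaving -100. You will almost never observe that in a test
        # run, which is exactly the point.
        assert balance >= 0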

nnnnico

not sure that human reasoning actually beats testing when checking for correctness

ljm

The production of such tests presumably requires an element of human reasoning.

The requirements have to come from somewhere, after all.

layer8

Both are necessary, they complement each other.

fragmede

Human reason is fine, the problem is that human attention spans aren't great at checking for correctness. I want every corner case regression tested automatically because there's always going to be some weird configuration that a human's going to forget to regression test.

dmos62

Well, what if you run a complete test suite?

layer8

There is no complete test suite, unless your code is purely functional and has a small-ish finite input domain.
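For contrast, the rare case where a complete suite does exist looks something like this (made-up example): a pure function over a deliberately tiny domain, checked exhaustively against a reference - and almost nothing we ship looks like that:

    def sat_add8(a: int, b: int) -> int:
        # Saturating addition on unsigned 8-bit values.
        return min(a + b, 255)

    def test_sat_add8_exhaustive():
        # The entire input domain is 256 * 256 pairs, so literally test all of it.
        for a in range(256):
            for b in range(256):
                expected = a + b if a + b <= 255 else 255
                assert sat_add8(a, b) == expected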

suzzer99

And even then, your code could pass all tests but be a spaghetti mess that will be impossible to maintain and add features to.

MattSayar

Seems to be a bit of a catch 22. No LLM can write perfect code, and no test suite can catch all bugs. Obviously, no human can write perfect code either.

If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.

shakna

If the complete test suite were enough, then SQLite, which famously has one of the largest and most comprehensive test suites, would not encounter bugs. However, they still do.

If you employ AI, you're adding a remarkable amount of speed to a processing domain that is undecidable because most inputs are not finite. Eventually you will end up reconsidering the Gambler's Fallacy, because of the chances of things going wrong.

e12e

You mean, for example test that your sieve finds all primes, and only primes that fit in 4096 bits?

bandrami

Paging Dr. Turing. Dr. Turing, please report to the HN comment section.

atomic128

Last week, The Primeagen and Casey Muratori carefully reviewed the output of a state-of-the-art LLM code generator.

They provided a task well-represented in the LLM's training data, so development should have been easy. The task was presented as a cumulative series of modifications to a codebase:

https://www.youtube.com/watch?v=NW6PhVdq9R8

This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.

mercer

In my own experience, I have all sorts of ways that I try to 'drag' the LLM out of some line of 'thinking' by editing the conversation as a whole, or just restarting the whole prompt, and I've been kind of just doing this over time since GPT-3.

While I still think all this code generation is super cool, I've found that the 'density' of the code makes it even more noticeable - and often annoying - when the model latches on to, say, some part of the conversation that should essentially be pruned from the whole thinking process, or pursues some part of earlier code that makes no sense to me, and I have to 'coax' it again.

t_mann

Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk (in simple terms of probability times severity) I'd say is that chat bots can talk humans into harming themselves or others. Both of which have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.

[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...

[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...

tombert

I don't know if the model changed in the last six months, or maybe the wow factor has worn off a bit, but it also feels like ChatGPT has become a lot more "people-pleasy" than it was before.

I'll ask it opinionated questions, and it will just do stuff to reaffirm what I said, even when I give contrary opinions in the same chat.

I personally find it annoying (I don't really get along with human people pleasers either), but I could see someone using it as a tool to justify doing bad stuff, including self-harm; it doesn't really ever push back on what I say.

unclebucknasty

Yeah, I think it's coded to be super-conciliatory as some sort of apology for its hallucinations, but I find it annoying as well. Part of it is just like all automated prompts that try to be too human. When you know it's not human, it's almost patronizing and just annoying.

But, it's actually worse, because it's generally apologizing for something completely wrong that it told you just moments before with extreme confidence.

renewiltord

It's obvious, isn't it? The average Hacker News user, who has converged to the average Internet user, wants exactly that experience. LLMs are pretty good tools but perhaps they shouldn't be made available to others. People like me can use them but others seem to be killed when making contact. I think it's fine to restrict access to the elite. We don't let just anyone fly a fighter jet. Perhaps the average HN user should be protected from LLM interactions.

tombert

Is that really what you got from what I wrote? I wasn't suggesting that we restrict access to anyone, and I wasn't trying to imply that I'm somehow immune to the problems that were highlighted.

I mentioned that I don't like people-pleasers and I find it a bit obnoxious when ChatGPT does it. I'm sure that there might be other bits of subtle encouragement it gives me that I don't notice, but I can't elaborate on those parts because, you know, I didn't notice them.

I genuinely do not know how you got "we should restrict access" from my comment or the parent, you just extrapolated to make a pretty stupid joke.

hexaga

More generally - AI that is good at convincing people is very powerful, and powerful things are dangerous.

I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.

Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.

Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.

southernplaces7

>Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.

Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.

The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings anyhow.

Aside from demonstrating the persistent AI woo that permeates many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kids do bad things", in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI.

kjs3

Yeah...this. I'm not so concerned that AI is going to put me out of a job or become Skynet. I'm concerned that people are offloading decision making and critical thinking to the AI, accepting its response at face value and responding to concerns with "the AI said so...must be right". Companies are already maliciously exploiting this (e.g. the AI has denied your medical claim, and we can't tell you how it decided that because our models are trade secrets), but it will soon become de rigueur and people will think you're weird for questioning the wisdom of the AI.

Nevermark

The combination of blind faith in AI, and good-faith, highly informed understanding and agreement achieved with the help of AI, covers the full spectrum of the problem.

southernplaces7

In both of your linked examples, the people in question very likely had at least some sort of mental instability working in their minds.

I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.

The kinds of people who would be convinced by such "harm dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings, or by books, or movies or any other sort of excuse for a mind that had problems well before seeing X or Y.

By the logic of regulating AI for these supposed dangers, you could argue that literature, movie content, comic books, YouTube videos and that much loved boogeyman in previous years of violent video games should all be banned or regulated for the content they express.

Such notions have a strongly nannyish, prohibitionist streak that's much more dangerous than some algorithm and the bullshit it spews to a few suggestible individuals.

The media of course loves such narratives, because their breathless hysteria and contrived fear-mongering play right into more eyeballs. Seeing people again take seriously such nonsense after idiocies like the media frenzy around video games in the early 2000s and, prior to that, similar media fits about violent movies and even literature, is sort of sad.

We don't need our tools for expression, and sources of information "regulated for harm" because a small minority of others can't get an easy grip on their psychological state.

zahlman

Is this somehow worse than humans talking each other into it?

objectified

> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).

And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).
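A cheap first version of that layer (a sketch, not real supply-chain protection) could just flag any import in generated code that doesn't resolve to something already installed, so "pip install whatever the new name is" becomes a deliberate step instead of a reflex. Standard library only, Python 3.10+:

    import ast
    import sys
    from importlib.metadata import packages_distributions

    def unvetted_imports(source: str) -> set[str]:
        # Top-level modules imported by `source` that are neither stdlib
        # nor provided by an already-installed distribution.
        modules = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
        known = set(packages_distributions()) | set(sys.stdlib_module_names)
        return {m for m in modules if m not in known}

    generated = "import requests\nimport totally_real_http_utils\n"
    print(unvetted_imports(generated))  # {'totally_real_http_utils'} if requests is installed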

namaria

Prompt injection means that unless people using LLMs to generate code are willing to hunt down and inspect all dependencies, it will become extremely easy to spread malware.

verbify

An anecdote: I was working for a medical centre, and had some code that was supposed to find the 'main' clinic of a patient.

The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.

I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.

I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.

This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.
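For concreteness, a stripped-down version of that kind of sort (field names made up - the real column was just called "Clinical"), where whether the flag is negated in the key is exactly the detail a reader or an LLM can silently get backwards:

    from datetime import date

    appointments = [
        {"clinical": False, "when": date(2024, 3, 1)},
        {"clinical": True,  "when": date(2023, 1, 15)},
        {"clinical": True,  "when": date(2022, 6, 30)},
    ]

    # Spec: the "main" clinic comes from the most recent clinical appointment,
    # falling back to the most recent appointment of any kind if none is clinical.
    # max() compares the key tuples: True beats False, then later dates win.
    main = max(appointments, key=lambda a: (a["clinical"], a["when"]))
    assert main["when"] == date(2023, 1, 15)

    # The one-character trap: negate the flag (or describe the tuple backwards,
    # as the generated documentation did) and you now prefer non-clinical visits.
    wrong = max(appointments, key=lambda a: (not a["clinical"], a["when"]))
    assert wrong["when"] == date(2024, 3, 1)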

zahlman

To be clear, by "clinical-non-clinical", you mean a boolean flag for whether the appointment is clinical?

bigstrat2003

> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".

JusticeJuice

You'd be surprised. I know a few people who couldn't really code before LLMs, but now with LLMs they can just brute-force through problems. They seem pretty undeterred about 'trusting' the solution; if they ran it and it worked for them, it gets shipped.

tcoff91

Well I hope this isn’t backend code because the amount of vulnerabilities that are going to come from these practices will be staggering

nojs

> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.

zahlman

IMX, quite a few Python users - including ones who think they know what they're doing - run into that same confusion, because they haven't properly understood fundamentals e.g. about how virtual environments work, or how to read documentation effectively. Or sometimes just because they've been careless and don't have good habits for ensuring (or troubleshooting) the environment.
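The usual way to settle that confusion (nothing magic, just making the environment visible) is to ask the running interpreter what it is and whether it can actually see the package, before blaming the code or the library:

    import sys
    from importlib.metadata import version, PackageNotFoundError

    print(sys.executable)  # the python binary actually running this
    print(sys.prefix)      # the (virtual) environment it belongs to

    # Is the dependency visible to *this* environment, and at what version?
    try:
        print(version("requests"))  # any package name; "requests" is just an example
    except PackageNotFoundError:
        print("not installed here: wrong venv, missing install, or a hallucinated package")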

not2b

If the hallucinated code doesn't compile (or in an interpreted language, immediately throws exceptions), then yes, that isn't risky because that code won't be used. I'm more concerned about code that appears to work for some test cases but solves the wrong problem or inadequately solves the problem, and whether we have anyone on the team who can maintain that code long-term or document it well enough so others can.

wavemode

I once submitted some code for review, in which the AI had inserted a recursive call to the same function being defined. The recursive call was completely unnecessary and completely nonsensical, but also not wrong per se - it just caused the function to repeat what it was doing. The code typechecked, the tests passed, and the line of code was easy to miss while doing a cursory read through the logic. I missed it, the code reviewer missed it, and eventually it merged to production.

Unfortunately there was one particular edge case which caused that recursive call to become an infinite loop, and I was extremely embarrassed seeing that "stack overflow" server error alert come through Slack afterward.
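A distilled, made-up version of how a line like that slips through: the recursion below is pointless but harmless on every input the tests exercised, and only fails to make progress on the one edge case nobody tested:

    def first_nonempty_line(text: str) -> str:
        # Return the first non-empty line of `text`.
        for line in text.splitlines():
            if line.strip():
                return line
        # The unnecessary recursive call a cursory read slides right past.
        # Every test input had at least one non-empty line, so this never ran --
        # until production passed a whitespace-only string, where stripping
        # doesn't help and the call recurses forever: stack overflow.
        return first_nonempty_line(text.strip())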

t14n

fwiw this problem already exists with my more junior co-workers. and also my own code that I write when exhausted!

if you have trusted processes for review and aren't always rushing out changes without triple checking your work (plus a review from another set of eyes), then I think you catch a lot of the subtler bugs that are emitted from an LLM.

01100011

Timely article. I really, really want AI to be better at writing code, and hundreds of reports suggest it works great if you're a web dev or a python dev. Great! But I'm a C/C++ systems guy(working at a company making money off AI!) and the times I've tried to get AI to write the simplest of test applications against a popular API it mostly failed. The code was incorrect, both using the API incorrectly and writing invalid C++. Attempts to reason with the LLMs(grokv3, deepseek-r1) led further and further away from valid code. Eventually both systems stopped responding.

I've also tried Cursor with similar mixed results.

But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.

It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.

tombert

I use ChatGPT to generate code a lot, and it's certainly useful, but it has given me issues that are not obvious.

For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.

Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.

zahlman

>though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.

They used to do the same with Stack Overflow. But now it's more dangerous, because the code can be "subtly wrong in ways the user can't fathom" to order.

KoolKat23

And subtle bugs existed pre-2022; judging by how often my apps are updated for "minor bug fixes", this is par for the course.

tombert

Sure, it's possible that the code it gave me was based on some incorrectly written code it scraped from Gitlab or something.

I'm not a luddite, I'm perfectly fine with people using AI for writing code. The only thing that really concerns me is that it has the potential to generate a ton of shitty code that doesn't look shitty, creating a lot of surface area for debugging.

Prior to AI, the quantity of crappy code that could be generated was basically limited by the speed in which a human could write it, but now there's really no limit.

Again, just to reiterate, this isn't "old man yells at cloud". I think AI is pretty cool, I use it all the time, I don't even have a problem with people generating large quantities of code, it's just something we have to be a bit more wary of.