
LLMs are still surprisingly bad at some simple tasks

jstrieb

The point from the end of the post that AI produces output that sounds correct is exactly what I try to emphasize to friends and family when explaining appropriate uses of LLMs. AI is great at tasks where sounding correct is the essence of the task (for example "change the style of this text"). Not so great when details matter and sounding correct isn't enough, which is what the author here seems to have rediscovered.

The most effective analogy I have found is comparing LLMs to theater and film actors. Everyone understands that, and the analogy offers actual predictive power. I elaborated on the idea if you're curious to read more:

https://jstrieb.github.io/posts/llm-thespians/

lsecondario

I like this analogy a lot for non-technical...erm...audiences. I do hope that anyone using this analogy will pair it with loud disclaimers about not anthropomorphizing LLMs; they do not "lie" in any real sense, and I think framing things in those terms can give the impression that you should interpret their output in terms of "trust". The emergent usefulness of LLMs is (currently at least) fundamentally opaque to human understanding and we shouldn't lead people to believe otherwise.

mexicocitinluez

> When LLMs say something true, it’s a coincidence of the training data that the statement of fact is also a likely sequence of words;

Do you know what a "coincidence" actually is? The definition you're using is wrong.

It's not a coincidence that I train a model on healthcare regulations and it answers a question about healthcare regulations correctly.

None of that is coincidental.

If I trained it on healthcare regulations and asked it about recipes, it wouldn't get anything right. How is that coincidental?

jstrieb

LLMs are trained on text, only some of which includes facts. It's a coincidence when the output includes new facts not explicitly present in the training data.

anthonylevine

> It's a coincidence when the output includes facts,

That's not what a coincidence is.

A coincidence is: "a remarkable concurrence of events or circumstances without apparent causal connection."

Are you saying that training it on a subset of specific data and it responding with that data "does not have a causal connection"? Do you know how statistical pattern matching works?

delusional

> It's not a coincidence that I train a model on healthcare regulations and it answers a question about healthcare regulations

If you train a model on only healthcare regulations, it won't answer questions about healthcare regulation; it will produce text that looks like healthcare regulations.

mexicocitinluez

And that's not a coincidence. That's not what the word "coincidence" means. It's a complete misunderstanding of how these tools work.

tromp

I wanted to check the prime factors of 1966 the other day so I googled it and it led me to https://brightchamps.com/en-us/math/numbers/factors-of-1966 , a site that seems focussed on number facts. It confidently states that the prime factors of 1966 are 2, 3, 11, and 17. For fun I tried to multiply these numbers back in my head and concluded there's no way that 6 * 187 could reach 1966.

That's when I realized this site was making heavy use of AI. Sadly, lots of people are going to trust but not verify...
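(For the record, a quick trial-division check, added here and not from the site or the original comment, confirms the mismatch: 1966 = 2 × 983, while 2 × 3 × 11 × 17 = 1122.)

```
# Trial-division factorization; plenty for a four-digit number like 1966.
def prime_factors(n):
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(1966))   # [2, 983]
print(2 * 3 * 11 * 17)       # 1122, nowhere near 1966
```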

croes

This is also very wrong

> A factor of 1966 is a number that divides the number without remainder.

>The factors of 1966 are 1, 2, 3, 6, 11, 17, 22, 33, 34, 51, 66, 102, 187, 374, 589, 1178, 1966.

If I google for the factors of 1966 the Google AI gives the same wrong factors.

amelius

They're talking about prime factors, not that it changes much.

croes

The site also lists the factors, and besides 1, 2, and 1966 they are all wrong.

Google harvests its result from the same page

> The factors of 1966 are 1, 2, 3, 6, 11, 17, 22, 33, 34, 51, 66, 102, 187, 374, 589, 1178, and 1966. These are the whole numbers that divide 1966 evenly, leaving no remainder.

jw1224

> “To stave off some obvious comments:

> yoUr'E PRoMPTiNg IT WRoNg!

> Am I though?”

Yes. You’re complaining that Gemini “shits the bed”, despite using 2.5 Flash (not Pro), without search or reasoning.

It’s a fact that some models are smarter than others. This is a task that requires reasoning, so the article is hard to take seriously when the author uses a model optimised for speed (not intelligence) and doesn’t even turn reasoning on (nor suggests they’re even aware it’s a feature).

I asked the exact prompt to ChatGPT 5 Thinking and got an excellent answer with cited sources, all of which appears to be accurate.

softwaredoug

In my experience, reasoning and search come with their own set of tradeoffs. It works great when it works, but the variance can be wider than with a plain LLM.

Search and reasoning use up more context, leading to context rot and subtler, harder-to-detect hallucinations. Reasoning doesn’t always focus on evaluating the quality of evidence, just “problem solving” from some root set of axioms found in search.

I’ve had this happen in Claude Code, for example, where it hallucinated a few details about a library based on a badly written forum post.

dgfitz

> … all of which appears to be accurate.

Isn’t that the whole goddamn rub? You don’t _know_ if they’re accurate.

delusional

I just ran the same test on Gemini 2.5 Pro (I assume it enables search by default, because it added a bunch of "sources") and got the exact same result as the author. It claims ".bdi" is the ccTLD for Burundi, which is false; they have .bi [1]. It claims ".time" and ".article" are TLDs.

I think the author's point stands.

EDIT: I tried it with "Deep Research" too. Here it doesn't invent either TLDs or HTML elements, but the resulting list is incomplete.

[1]: https://en.wikipedia.org/wiki/.bi

edent

OP here. I literally opened up Gemini and used the defaults. If the defaults are shit, maybe don't offer them as the default?

Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

Either way, disappointing.

magicalhippo

> Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

That is indeed an area where LLMs don't shine.

That is, not only are they trained to always respond with an answer, they have no ability to accurately tell how confident they are in that answer. So you can't just filter out low confidence answers.

mathewsanders

Something I think would be interesting for model APIs and consumer apps to expose would be the probability of each individual token generated.

I’m presuming that one class of junk/low-quality output is when the model doesn’t have high-probability next tokens and works with whatever poor options it has.

Maybe low-probability tokens that cross some threshold could get a visual treatment, the same way word processors flag a spelling or grammatical error.

But maybe I’m making a mistake thinking that token probability is related to the accuracy of output?
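For what it's worth, some APIs already expose this. A minimal sketch of the idea, assuming the OpenAI Python SDK's chat-completions logprobs option (the 0.3 threshold and the model name are arbitrary choices, not anything from the comment above):

```
# Sketch: flag low-probability tokens in a completion.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment.
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Which HTML5 elements are also TLDs?"}],
    logprobs=True,
)

for tok in resp.choices[0].logprobs.content:
    p = math.exp(tok.logprob)                   # log-probability -> probability
    marker = "  <-- low confidence" if p < 0.3 else ""
    print(f"{tok.token!r}: {p:.2f}{marker}")
```

Whether those probabilities actually track factual accuracy is exactly the open question, but at least the signal is available.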

hobofan

Then criticize the providers on their defaults instead of claiming that they can't solve the problem?

> Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

That's literally what ChatGPT did for me [0], which is consistent with what they shared at the last keynote (a quick, low-reasoning answer by default, with reasoning/search only if explicitly prompted or as a follow-up). It did miss one match though, as it somehow didn't parse the `<search>` element from the MDN docs.

[0]: https://chatgpt.com/share/68cffb5c-fd14-8005-b175-ab77d1bf58...

pwnOrbitals

You are pointing out a maturity issue, not a capability problem. It's clear to everyone that LLM products are immature, but saying they are incapable is misleading.

delusional

In your mind, is there anything an LLM is _incapable_ of doing?

maddmann

“Defaults are shit” — is that really true though?! Just because it shits the bed on some tasks does not mean it is shit. For people integrating LLMs into any workflow that requires a modicum of precision or determinism, one must always evaluate output closely/have benchmarks. You must treat the LLM as an incompetent but overconfident intern, and thus have fast mechanisms for measuring output and giving feedback.

sieve

They are very good at some tasks and terrible at others.

I use LLMs for language-related work (translations, grammatical explanations, etc.) and they are top notch at that, as long as you do not ask for references to particular grammar rules. In that case they will invent non-existent references.

They are also good for tutor personas: give me jj/git/emacs commands for this situation.

But they are bad in other cases.

I started scanning books recently and wanted to crop the random stuff outside an orange sheet of paper on which the book was placed before I handed the images over to ScanTailor Advanced (STA can do this, but I wanted to keep the original images around instead of the low-quality STA version). I spent 3-5 hours with Gemini 2.5 Pro (AI Studio) trying to get it to give me a series of steps (and finally a shell script) to get this working.

And it could not do it. It mixed up GraphicsMagick and ImageMagick commands. It failed even with libvips. Finally I asked it to provide a simple shell script where I would provide four pixel distances to crop from the four edges as arguments. This one worked.

I am very surprised that people are able to write code that requires actual reasoning ability using modern LLMs.

noosphr

Just use Pillow and Python.

It is the only way to do real image work these days, and as a bonus LLMs suck a lot less at giving you nearly useful Python code.

The above is a bit of a lie, as OpenCV has more capabilities, but unless you are deep in the weeds of preparing images for neural networks, Pillow is plenty good enough.
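For reference, the fallback sieve eventually settled on (crop a fixed number of pixels from each edge) is only a few lines with Pillow. A rough sketch, with the four margins as hypothetical command-line arguments:

```
# crop_edges.py -- trim fixed pixel margins from each image, keeping originals.
# Hypothetical usage: python crop_edges.py LEFT TOP RIGHT BOTTOM page1.jpg page2.jpg ...
import sys
from PIL import Image

left, top, right, bottom = map(int, sys.argv[1:5])

for path in sys.argv[5:]:
    img = Image.open(path)
    w, h = img.size
    img.crop((left, top, w - right, h - bottom)).save(f"cropped_{path}")
```

Detecting the orange sheet automatically is the harder part; this only covers the fixed-margin workaround.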

jcupitt

pyvips (the libvips Python binding) is quite a bit better than pillow-simd: 3x faster, 10x less memory use, same quality. On this benchmark at least:

https://github.com/libvips/libvips/wiki/Speed-and-memory-use

jcupitt

I'm the libvips author, I should have said, so I'm not very neutral. But at least on that test it's usefully quicker and less memory hungry.

BOOSTERHIDROGEN

Would you share your system prompt for that grammatical checker?

sieve

There is no single prompt.

The languages I am learning have verb conjugations and noun declensions. So I write a prompt asking the LLM to break the given paragraphs down sentence by sentence, giving me the general sentence-level English translation plus word-by-word grammar and (contextual) meaning.

For the grammar, I ask for the verbal root/noun stem, the case/person/number, any information on indeclinables, the affix categories etc.

poszlem

I think Gemini is one of the best examples of an LLM that is in some cases the best and in some cases truly the worst.

I once asked it to read a postcard written by my late grandfather in Polish, as I was struggling to decipher it. It incorrectly identified the text as Romanian and kept insisting on that, even after I corrected it: "I understand you are insistent that the language is Polish. However, I have carefully analyzed the text again, and the linguistic evidence confirms it is Romanian. Because the vocabulary and alphabet are not Polish, I cannot read it as such." Eventually, after I continued to insist that it was indeed Polish, it got offended and told me it would not try again, accusing me of attempting to mislead it.

markasoftware

as soon as an LLM makes a significant mistake in a chat (in this case, when it identified the text as Romanian), throw away the chat (or delete/edit the LLM's response if your chat system allows this). The context is poisoned at this point.

sieve

I find that surprising, actually. Gemini is VERY good with Sanskrit and a few other Indian languages. I would expect it to have completely mastered European languages.

noosphr

>Eventually, after I continued to insist that it was indeed Polish, it got offended and told me it would not try again, accusing me of attempting to mislead it.

I once had Claude tell me to never talk to it again after it got upset when I kept giving it peer-reviewed papers explaining why it was wrong. I must have hit the Tumblr dataset, since I was told I was sealioning it, which took me aback for a while.

rsynnott

Not really what sealioning is, either. If it had been right about the correctness issue, you’d have been gaslighting it.

Dilettante_

>This is a pretty simple question to answer. Take two lists and compare them.

This continues a pattern as old as home computing: The author does not understand the task themselves, consequently "holds the computer wrong", and then blames the machine.

No "lists" were being compared. The LLM does not have a "list of TLDs" in its memory that it just refers to when you ask it. If you haven't grokked this very fundamental thing about how these LLMs work, then the problem is really, distinctly, on your end.

roxolotl

That’s the point the author is making. The LLMs don’t have the raw correct information required to accomplish the task, so all they can do is provide a plausible-sounding answer. And even if they did, the way they are architected can still only result in a plausible-sounding answer.

Dilettante_

They absolutely could have accomplished the task. The task was purposefully or ignorantly posed in a way that is known to be not suited to the LLM, and then the author concluded "the machine did not complete the task because it sucks."

Blahah

Not really. This works great in Claude Sonnet 4.1: 'Please could you research a list of valid TLDs and a list of valid HTML5 elements, then cross reference them to produce a list of HTML5 elements which are also valid TLDs. Use search to find URLs to the lists, then use the analysis tool to write a script that downloads the lists, normalises and intersects them.'

Ask a stupid question, get a stupid answer.
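For comparison, the deterministic script that prompt asks for is short. A sketch: the IANA list URL below is real, but the HTML element set is a hand-typed, deliberately incomplete sample (a full run would pull the WHATWG/MDN element index instead):

```
# Intersect the IANA TLD list with a sample of HTML element names.
from urllib.request import urlopen

TLD_URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

with urlopen(TLD_URL) as f:
    lines = f.read().decode("ascii").splitlines()
tlds = {line.strip().lower() for line in lines if not line.startswith("#")}

# Illustrative subset only -- nowhere near the full HTML element list.
html_elements = {
    "a", "audio", "data", "link", "map", "menu", "nav",
    "search", "section", "select", "style", "time", "video",
}

print(sorted(html_elements & tlds))
```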

Lapel2742

> This works great in Claude Sonnet 4.1: 'Please could you research a list of valid TLDs and a list of valid HTML5 elements, then cross reference them to produce a list of HTML5 elements which are also valid TLDs. Use search to find URLs to the lists, then use the analysis tool to write a script that downloads the lists, normalises and intersects them.'

Ok, I only have to:

1. Generally solve the problem for the AI

2. Make a step by step plan for the AI to execute

3. Debug the script I get back and check by hand if it uses reliable sources.

4. Run that script.

So what do I need the AI for?

Lapel2742

> No "lists" were being compared.

How would you solve that problem? You'd probably go to the internet, get the list of TLDs and the list of HTML5 elements, and then compare those lists.

The author compares three commercial large language models that have direct internet access, but none of them appears capable of performing this seemingly simple task. I think his conclusion is valid.

joak

More generally, LLMs are bad at exhaustivity: asking "give me all stuff matching a given property" almost always fails and provides at best a subset.

If possible in the context, the way to go is to ask for a piece of code that processes the data to provide exhaustivity. This method has at least some chance of succeeding.

unleaded

https://dubesor.de/WashingHands

This is my personal favourite example of LLMs being stupid. It's a bit old, but it's very funny that Grok is the only one that gets it.

StilesCrisis

Several others “get it” but answer the question in a general-hygiene sense, e.g.:

```
Claude 3.7 Sonnet Thinking (¢0.87)
The question contains an assumption - people without arms wouldn't have hands to wash in the traditional sense.
```

```
DeepSeek-R1 (¢0.47)
People without arms (and consequently without hands) adapt their handwashing routine using a variety of methods and tools tailored to their abilities and needs.
```

```
Claude Opus 4.1
People without arms typically don't need to wash their hands in the traditional sense, since they use their feet or assistive devices for daily tasks instead.
```

I think realistically it’s still a valid question, because people without arms still manipulate things in their environment (e.g. with their feet) and still need to be hygienic while prepping food, etc. But the AI pivots to answering “what it thinks you were asking about” instead of just telling the user that they are wrong.

K0balt

The training data is not automatically in the context scope, and on list tasks LLMs have nearly no way to ensure completeness due to their fundamental characteristics.

To do a task like this with LLMs, you need to use a document for your source lists or bring them directly into context; then a smart model with good prompting might zero-shot it.

But if you want any confidence in the answer, you need to use tools: “Here are two lists; write a Python script to find the exact matches, and return a new list with only the exact matches. Write a test dataset and verify that there are no errors, omissions, or duplicates.”
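A minimal sketch of what that prompt asks for (my wording, not an actual LLM transcript), with a tiny hand-made test dataset to catch omissions and duplicates:

```
# Exact matches between two lists: deduplicated, in the first list's order.
def exact_matches(a, b):
    b_set = set(b)
    seen = set()
    out = []
    for item in a:
        if item in b_set and item not in seen:
            seen.add(item)
            out.append(item)
    return out

# Test dataset with a known answer: the overlap is exactly ["audio", "menu"].
list_a = ["audio", "div", "menu", "audio", "span"]
list_b = ["menu", "audio", "zip", "app"]
assert exact_matches(list_a, list_b) == ["audio", "menu"]
print("test passed:", exact_matches(list_a, list_b))
```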

LLMs plus tools / code are amazing. LLMs on their own are a professor with an intermittent heroin problem.

chrsw

I see a big issue with these tools and services we call "AI".

On one hand you hear things like "AI is as smart as a college student", "AI won a math competition", "AI will replace white-collar workers". And so on. I'm not going to bother looking up actual references of people saying these exact things. But unless I'm completely delusional, this is the gist of what some people have been saying about AI over the past few years.

To the layperson, this sounds like a good deal. Use a free (for now) tool or pay for an advanced version to get stuff done. Simple.

But then you start scratching beneath the surface and you start hearing different stories. "No, you didn't ask it right", "No, that's a bad question because they tokenize your input", "Well, you still have to check the results", "You didn't use the right model".

Huh? How is a normal person supposed to take this stuff seriously? Now me personally, I don't have much of an issue with this stuff. I've been a developer for many, many years and I've been aware of the various developments in the field of machine learning for over 15 years. I have kind of an intuition about what I should use these systems for.

But I think the general public is misinformed about what exactly these systems are and why they're not actually intelligent. That's a problem.

vinc

The other day I found that they were struggling with "find me two synonyms of 'downloading' and 'extracting' that are the same length", because I was writing a script and wanted to see if I could align the next path parameter.

First there's the tokenization issue, the same old "how many R in STRAWBERRY" where they are often confidently wrong. But I also asked them not to mix tenses (-ing and -ed, for example) and that was very hard for them.

ozgung

Claude Opus 4.1 generated me a small web app in two minutes to find the correct answer: https://claude.ai/public/artifacts/ffbb642b-8883-4b4d-8699-d...

thewisenerd

> To be clear, I would expect a moderately intelligent teenager to be able to find two lists and compare them. If an intern gave me the same attention to detail as above, we'd be having a cosy little chat about their attitude to work.

sure, but when I expect this [1] from _any_ full-time hire, my "expectations of people are too high" and "everybody has their strengths"

[1] find a list of valid HTML5 elements, find a list of TLDs, have an understanding of ccTLDs and gTLDs