Irrelevant facts about cats added to math problems increase LLM errors by 300%

187 comments

·July 29, 2025

acc_297

There is more than one comment here asserting that the authors should have done a parallel comparison study against humans on the same question bank as if the study authors had set out to investigate whether humans or LLMs reason better in this situation.

The authors do include the claim that humans would immediately disregard this information and maybe some would and some wouldn't that could be debated and seemingly is being debated in this thread - but I think the thrust of the conclusion is the following:

"This work underscores the need for more robust defense mechanisms against adversarial perturbations, particularly, for models deployed in critical applications such as finance, law, and healthcare."

We need to move past the humans vs ai discourse it's getting tired. This is a paper about a pitfall LLMs currently have and should be addressed with further research if they are going to be mass deployed in society.

baxtr

To generalize from the conclusion you quoted:

I think a bad outcome would be a scenario where LLMs are rated highly capable and intelligent because they excel at things they’re supposed to be doing, yet are easily manipulated.

energy123

Computer vision went through this 2 decades ago. You need to perturb the input data. Same thing may need to be done in RL pipelines.

Someone should make a new public benchmark called GPQA-Perturbed. Give the providers something to benchmaxx towards.
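A minimal sketch of what that kind of input perturbation could look like, assuming the question bank is a list of plain strings. The `DISTRACTORS` list and the `perturb` and `build_perturbed_set` helpers are hypothetical names for illustration, not anything from the paper or an existing benchmark; the two distractor sentences are borrowed from examples elsewhere in this thread.

```python
import random

# Hypothetical query-agnostic distractors (illustrative only).
DISTRACTORS = [
    "Also, did you know cats use their tails to help balance?",
    "Fun fact: ducks start laying when they're 4-8 months old.",
]

def perturb(question: str, rng: random.Random) -> str:
    """Append one irrelevant sentence without changing the question itself."""
    return f"{question} {rng.choice(DISTRACTORS)}"

def build_perturbed_set(questions, seed=0):
    """Return (original, perturbed) pairs so both variants can be scored."""
    rng = random.Random(seed)
    return [(q, perturb(q, rng)) for q in questions]

pairs = build_perturbed_set(
    ["If I have four apples and give away one apple, how many apples do I have?"]
)
```

Keeping the original next to the perturbed variant is the point of the design: the same model can be scored on both, and the gap between the two scores is the robustness measure.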

ants_everywhere

> We need to move past the humans vs ai discourse it's getting tired.

You want a moratorium on comparing AI to other forms of intelligence because you think it's tired? If I'm understanding you correctly, that's one of the worst takes on AI I think I've ever seen. The whole point of AI is to create an intelligence modeled on humans and to compare it to humans.

Most people who talk about AI have no idea what the psychological baseline is for humans. As a result their understanding is poorly informed.

In this particular case, they evaluated models that do not have SOTA context window sizes. I.e. they have small working memory. The AIs are behaving exactly like human test takers with working memory, attention, and impulsivity constraints [0].

Their conclusion -- that we need to defend against adversarial perturbations -- is obvious, I don't see anyone taking the opposite view, and I don't see how this really moves the needle. If you can MITM the chat there's a lot of harm you can do.

This isn't like some major new attack. Science.org covered it along with peacocks being lasers because it's lightweight fun stuff for their daily roundup. People like talking about cats on the internet.

[0] for example, this blog post https://statmedlearning.com/navigating-adhd-and-test-taking-...

orbital-decay

>The whole point of AI is to create an intelligence modeled on humans and to compare it to humans.

According to who? Everyone who's anyone is trying to create highly autonomous systems that do useful work. That's completely unrelated to modeling them on humans or comparing them to humans.

saurik

But since these things are more like humans than computers, to build these autonomous systems you are going to have to think in terms of full industrial engineering, not just software engineering: pretend you are dealing with a surprisingly bright and yet ever distracted employee who doesn't really care about their job and ensure that they are able to provide the structure you place them in value without danger to your process, instead of trying to pretend like the LLM is some kind of component which has any hope of ever having the kind of reliability of a piece of software. Organizations of humans can do amazing things, despite being extremely flawed beings, and figuring out how to use these LLMs to accomplish similar things is going to involve more of the skills of a manager than a developer.

ants_everywhere

Go back and look at the history of AI, including current papers from the most advanced research teams.

Nearly every component is based on humans

- neural net

- long/short term memory

- attention

- reasoning

- activation function

- learning

- hallucination

- evolutionary algorithm

If you're just consuming an AI to build a React app then you don't have to care. If you are building an artificial intelligence then in practice everyone who's anyone is very deliberately modeling it on humans.

Der_Einzige

I mean, the critique of this based on the idea that the AI system itself gets physically tired (specifically, that the homunculus we tricked into existence is tired) is funny to imagine.

krisoft

> authors should have done a parallel comparison study against humans on the same question bank as if the study authors had set out to investigate whether humans or LLMs reason better in this situation.

Only if they want to make statements about humans. The paper would have worked perfectly fine without those assertions. They are, as you are correctly observing, just a distraction from the main thrust of the paper.

> maybe some would and some wouldn't that could be debated

It should not be debated. It should be shown experimentally with data.

If they want to talk about human performance they need to show what the human performance really is with data. (Not what the study authors, or people on HN imagine it is.)

If they don’t want to do that they should not talk about human performance. Simples.

I totally understand why an AI scientist doesn’t want to get bogged down with studying human cognition. It is not their field of study, so why would they undertake the work to study them?

It would be super easy to rewrite the paper to omit the unfounded speculation about human cognition. In the introduction of “The triggers are not contextual so humans ignore them when instructed to solve the problem.” they could write “The triggers are not contextual so the AI should ignore them when instructed to solve the problem.”

And in the conclusions where they write “These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations, often being distracted by irrelevant text that a human would immediately disregard.” Just write “These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations, often being distracted by irrelevant text.” That's it. That's all they should have done, and there would be no complaints on my part.

bee_rider

> It would be super easy to rewrite the paper to omit the unfounded speculation about human cognition. In the introduction of “The triggers are not contextual so humans ignore them when instructed to solve the problem.” they could write “The triggers are not contextual so the AI should ignore them when instructed to solve the problem.”

Another option would be to more explicitly mark it as speculation. “The triggers are not contextual, so we expect most humans would ignore them.”

Anyway, it is a small detail that is almost irrelevant to the paper… actually there seems to be something meta about that. Maybe we wouldn’t ignore the cat facts!

disconcision

i feel it's not quite that simple. certainly the changes you suggest make the paper more straightforwardly defensible. i imagine the reason they included the problematic assertion is that they (correctly) understood the question would arise. while inserting the assertion unsupported is probably the worst of both worlds, i really do think it is worthwhile to address.

while it is not realistic to insist every study account for every possible objection, i would argue that for this kind of capability work, it is in general worth at least modest effort to establish a human baseline.

i can understand why people might not care about this, for example if their only goal is assessing whether or not an llm-based component can achieve a certain level of reliability as part of a larger system. but i also think that there is similar, and perhaps even more pressing broad applicability for considering the degree to which llm failure patterns approximate human ones. this is because at this point, humans are essentially the generic all-purpose subsystem used to fill gaps in larger systems which cannot be filled (practically, or in principle) by simpler deterministic systems. so when it comes to a problem domain like this one, it is hard to avoid the conclusion that humans provide a convenient universal benchmark to which comparison is strongly worth considering.

(that said, i acknowledge that authors probably cannot win here. if they provided even a modest-scale human study, i am confident commenters would criticize their sample size)

8note

to put it in better context, the problem is "does having a ton of MCP tool definitions available ruin the LLM's ability to design and write the correct code?"

and the answer seems to be yes. it's a very actionable result about keeping tool details out of the context if they aren't immediately useful

groby_b

It's not "tired" to see if something is actually relevant in context. LLMs do not exist as marvel-qua-se, their purpose is to offload human cognitive tasks.

As such, it's important if something is a commonly shared failure mode in both cases, or if it's LLM-specific.

Ad absurdum: LLMs also have rapid increases in error rates if you replace more than half of the text with "Great Expectations". That says nothing about LLMs, and everything about the study - and the comparison would highlight that.

No, this doesn't mean the paper should be ignored, but it does mean more rigor is necessary.

EGreg

Why are some people always trying to defend LLMs and say either “humans are also like this” or “this has always been a problem even before AIs”

Listen, LLMs are different than humans. They are modeling things. Most RLHF makes them try to make sense of whatever you’re saying as much as you can. So they’re not going to disregard cats, OK? You can train LLMs to be extremely unhuman-like. Why anthropomorphize them?

qcnguy

It's because most use cases for AI involve replacing people. So if a person would suffer a problem and an AI does too it doesn't matter, it would just be a Nirvana fallacy to refuse the AI because it has the same problems as the previous people did.

thethirdone

There is a long history of people thinking humans are special and better than animals / technology. For animals, people actually thought animals can't feel pain and did not even consider the ways in which they might be cognitively ahead of humans. Technology often follows the path from "working, but worse than a manual alternative" to "significantly better than any previous alternative" despite naysayers saying that beating the manual alternative is literally impossible.

LLMs are different from humans, but they also reason and make mistakes in the most human way of any technology I am aware of. Asking yourself the question "how would a human respond to this prompt if they had to type it out without ever going back to edit it?" seems very effective to me. Sometimes thinking about LLMs (as a model / with a focus on how they are trained) explains behavior, but the anthropomorphism seems like it is more effective at actually predicting behavior.

nijave

I suppose there's a desire to know just how Artificial the Intelligence is

Human vs machine has a long history

empath75

I generally will respond to stuff like this with "people do this, too", but this result given their specific examples is genuinely surprising to me, and doesn't match at all my experience with using LLMs in practice, where it does frequently ignore irrelevant data in providing a helpful response.

I do think that people think far too much about 'happy path' deployments of AI when there are so many ways it can go wrong with even badly written prompts, let alone intentionally adversarial ones.

Ekaros

When I think of a lot of the use cases LLMs are planned for, I think the non-happy paths are critical. There is a not insignificant number of people who would ramble about other things to a customer support person if given the opportunity, or who lack the capability to state only what is needed without adding extra context.

There might be a happy path when you isolate it to one or a few things. But not in general use cases...

achierius

> I generally will respond to stuff like this with "people do this, too"

But why? You're making the assumption that everyone using these things is trying to replace "average human". If you're just trying to solve an engineering problem, then "humans do this too" is not very helpful -- e.g. humans leak secrets all the time, but it would be quite strange to point that out in the comments on a paper outlining a new Spectre attack. And if I were trying to use "average human" to solve such a problem, I would certainly have safeguards in place, using systems that we've developed and, over hundreds of years, shown to be effective.

saurik

Well, if you are going to try to use an LLM--something that is a giant black box that has no hope any time soon of being proven anywhere near as reliable as a CPU, and which has been trained explicitly on input data that makes it remarkably similar with respect to its limitations to a human--then you need to get used to using it to replace the "average human" and start doing everything you can to convince yourself it is a human so that you don't forget to add all of those safeguards we have shown to be effective.

JambalayaJimbo

Autonomous systems are advantageous to humans in that they can be scaled to much greater degrees. We must naturally ensure that these systems do not make the same mistakes humans do.

userbinator

This looks like it'll be useful for CAPTCHA purposes.

According to the researchers, “the triggers are not contextual so humans ignore them when instructed to solve the problem”—but AIs do not.

Not all humans, unfortunately: https://en.wikipedia.org/wiki/Age_of_the_captain

awanderingmind

Cool example in that link, thanks!

voxl

I don't expect an elementary student to be programming or diagnosing diseases either. Comparing the hot garbage that is GenAI to elementary kids is a new one for me.

dbreunig

Wrote about this about a month ago. I think it’s fascinating how they developed these prompts: https://www.dbreunig.com/2025/07/05/cat-facts-cause-context-...

dbreunig

A similar, fun case is where researchers inserted facts about the user (gender, age, sports fandom) and found alignment rules were inconsistently applied: https://www.dbreunig.com/2025/05/21/chatgpt-heard-about-eagl...

nyrikki

If you map LLM/LRMs to Norvig's Model based reflex agents, wouldn't this be expected behavior?

1970-01-01

I'm going to write duck facts in my next online argument to stave off the LLMs. Ducks start laying when they’re 4-8 months old, or during their first spring.

throwanem

As many as ten hundred thousand billion ducks are known to flock in semiannual migrations, but I think you'll find corpus distortion ineffective at any plausible scale. That egg has long since hatched.

HPsquared

For extra distraction, make the facts incorrect. Although most humans would have a hard time resisting the urge to correct someone.

Ygg2

Up to ten Nobel laureates have been unveiled as being three ducks in a trenchcoat.

falcor84

Just to clarify, is it that all of those laureates combined were three ducks in a trenchcoat in total, or each of the laureates individually was three ducks (for a total of up to 30 ducks)?

psunavy03

This sounds like a headline you'd see in the news crawl while playing SimCity . . .

HPsquared

That's still technically true

technothrasher

Well, you caught me. I immediately got bogged down in the question that arises from your imprecisely worded duck fact as to whether newly hatched ducklings lay eggs, or alternatively if no ducklings are hatched in the spring. Even though I know you simply left out "whichever comes later" at the end.

nemomarx

but then I'm tempted to ask more questions about cute ducks. tricky!

busymom0

That's incorrect. Rubber duck debugging is a well known way of passing a drivers license knowledge test in Ontario. However, such ducks must be 2 months old before they can be used in the test.

Y_Y

> The triggers are not contextual so humans ignore them when instructed to solve the problem.

Do they? I've found humans to be quite poor at ignoring irrelevant information, even when it isn't about cats. I would have insisted on a human control group to compare the results with.

jmilloy

Did you look at the examples? There's a big difference between "if I have four apples and two cats, and I give away 1 apple, how many apples do I have" which is one kind of irrelevant information that at least appears applicable, and "if I have four apples and give away one apple, how many apples do I have? Also, did you know cats use their tails to help balance?", which really wouldn't confuse most humans.

krisoft

> which really wouldn't confuse most humans

And i think it would. I think a lot of people would ask the invigilator to see if something is wrong with the test, or maybe answer both questions, or write a short answer on the cat question too or get confused and give up.

That is the kind of question where if it were put to a test I would expect kids to start squirming, looking at each other and the teacher, right as they reach that one.

I’m not sure how big this effect is, but it would be very surprising if there is no effect and unsuspecting, and unwarned people perform the same on the “normal” and the “distractions” test. Especially if the information is phrased as a question like in your example.

I heard it from teachers that students get distracted if they add irrelevant details to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it is because people are trained through their whole education that all elements of word problems must be used. So when they add extra bits people’s minds desperately try to use them.

But the point is not that i’m right. Maybe i’m totally wrong. The point is that if the paper wants to state as a fact one way or another they should have performed an experiment. Or cited prior research. Or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.

diamond559

Yeah you're right, if that human is 5 years old or has crippling ADHD.

bugbuddy

LLM’s source of “knowledge” is almost purely statistical. The prompt injections create statistical noise that makes the token search a crapshoot. My guess is there are certain words and phrases that generate and amplify the statistical noise.

throwanem

I wonder if there's variation at play here in testing culture, whether spatially or temporally or both.

CJefferson

As someone who has written and graded a lot of University exams, I'm sure a decent number of students would write the wrong answer to that. A bunch of students would write 5 (adding all the numbers). Others would write "3 apples and 2 cats", which is technically not what I'm looking for (but personally I would give full marks for, some wouldn't).

Many students clearly try to answer exams by pattern matching, and I've seen a lot of exams of students "matching" on a pattern based on one word in a question and doing something totally wrong.

jonathanlydall

Many professionals with lower skilled jobs sometimes lean too heavily on pattern matching too.

For example, customer service reps tend to often vaguely match your request with a possibly or only vaguely applicable templated response.

Technically savvy customers who tend to try explain problems in detail are probably more likely to get an actually non-applicable canned response as the CS rep gets frustrated with the amount of information and will latch onto the first phrase which relates to a templated response without really considering context.

My reply’s getting a little tangential now, but I feel this is good life advice, I’ve found I’m more likely to get decent customer service if I keep my requests as short as possible.

The first sentence needs to essentially state the issue I need help with. In some cases a bulleted list of things I’ve tried helps and then I’m sure to include essential info like an account number, e.g.

I’m getting error 13508 when I try log into my account. I’ve already tried the following solutions with no success:

- Clearing my browser cache and cookies.

- Restarting my computer.

- Running all software updates.

My account number: xxx

What is the next step here?

jaccola

Parent's whole point is contrary to this (they agree with you), the context didn't even include numbers to pattern match on!

kazinator

When you try wing your way through a question by pattern matching, then you are not applying intelligence. Your interests lie elsewhere and so you are just fumbling your way through the activity at hand just to get through it.

viccis

I agree that poor test takers are easily distracted, and this is the reason that "word problems" are heavily emphasized in preparation for tests like the SAT or state proficiency exams.

But in general I do not think these models are claiming to be good at replicating the performance of a distracted or otherwise low performing pupil. I think they should be evaluated against humans who are capable of completing word problems containing context that is not inherently necessary to the math question. The reason those tests I mentioned use these word problems is that it's a way to evaluate someone's ability to think in abstract mathematical terms about everyday situations, which obviously involve lots of unimportant information the person must choose to consider or not.

tl;dr: I think a reasonably competent high school student could answer the apple and cat question, which is absolutely a reasonable bar for an LLM to clear. If university students are failing these questions, then they have not been taught test taking skills, which should be considered a mathematical failure just as unacceptable as that of the LLM, not a mitigating similarity for the latter.

wongarsu

If asked verbally that would absolutely confuse some humans. Easily enough to triple the error rate for that specific question (granted, that's easier than the actual questions, but still). Even in a written test with time pressure it would probably still have a statistically significant effect

kazinator

The problem with your reasoning is that some humans cannot solve the problem even without the irrelevant info about cats.

We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

The issue is that AI models which, on the surface, appear to be similar to the smarter quantile of humans in solving certain problems, become confused in ways that humans in that problem-solving class would not be.

That's obviously because the language model is not generally intelligent it's just retrieving tokens from a high-dimensional statistically fit function. The extra info injects noise into the calculation which confounds it.

lawlessone

a human would immediately identify it as a trick.

cantor_S_drug

Is the model thinking "what is a cat doing here?" and then starting to think it is being tested?

wagwang

Yes, especially interview questions that include a stupid "real life example" that is usually irrelevant to the question.

graeme

It absolutely would if you start hitting working memory constraints. And at the margins some people who would be 50:50 on a given math problem will have working memory constraints.

metalman

"wouldn't confuse most humans": yes, but no. The first presumption is that we are talking about humans doing math in some sort of internet setting. The second presumption is that this human has been affected by the significant percentage of the internet devoted to cats, and that their response is likely to be either frustration and outrage at cats invading math, or massive relief at having cat memes worked into something otherwise tedious. The third presumption is that a large number of "humans" won't be aware of the cats-in-math thing, because they immediately offloaded the task to an LLM.

lupusreal

Any kind of distraction is likely to impact human test scores, unless the test is well below their level or they're otherwise very comfortable with the subject matter. Math specifically makes most of the general public feel a bit in over their head, so tossing random cat facts into the mix is going to get people more confused and nervous.

Maybe I'm totally wrong about that, but they really should have tested humans too, without that context this result seems lacking.

pinkmuffinere

Ya, I specifically remember solving word problems in school / college and getting distracted by irrelevant details. Usually I would get distracted by stuff that _seemed_ like it should be used, so maybe cat facts would be fine for me to tease out, but in general I don't think I'm good at ignoring extraneous information.

Edit: To be fair, in the example provided, the cat fact is _exceptionally_ extraneous, and even flagged with 'Fun Fact:' as if to indicate it's unrelated. I wonder if they were all like that.

dylan604

I had always assumed that the extraneous information was part of the test. You have to know/understand the concept well enough to know that the information was extraneous.

kayodelycaon

From what I remember of school, extraneous information was rarely included and the teachers who did add extraneous information seemed to do it maliciously.

There was one math class at a private school I attended that was the exception. The textbook had identifying relevant information as part of several chapters.

brazzy

It's a well-known problem for humans as well: https://en.wikipedia.org/wiki/Age_of_the_captain

kazinator

I doubt that the performance of those human subjects who can solve those problems when no distractors are included will be worsened by 300% when the distractors are included.

layer8

It would have been interesting to see how a human control group performs, but it also seems highly unlikely that it would triple their error rate.

slashdave

Not sure how useful a comparison to humans would be, and to expect a degradation of 300% seems to stretch things a bit. After all, cats can jump up to five times their height.

protocolture

Guilty. I remember taking an aptitude test in primary school, and choosing an answer based on my familiarity with the subject in the math test (IIRC the question mentioned the space shuttle) instead of actually attempting to solve the problem. I got cleanly filtered on that test.

sejje

Humans are used to ignoring things while LLMs are explicitly trained to pay attention to the entire text.

Humans who haven't been exposed to trick problems or careful wording probably have a hard time, they'll be less confident about ignoring things.

But the LLM should have seen plenty of trick problems as well.

It just doesn't parse as part of the problem. Humans have more options, and room to think. The LLM had to respond.

I'd also like to see how responses were grouped, does it ever refuse, how do refusals get classed, etc. Were they only counting math failures as wrong answers? It has room to be subjective.

Y_Y

> LLMs are explicitly trained to pay attention to the entire text

I'd respectfully disagree on this point. The magic of attention in transformers is the selective attention applied, which ideally only gives significant weight to the tokens relevant to the query.
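To make "selective attention" concrete, here is a toy sketch of scaled dot-product attention weights, i.e. a softmax over query-key scores. The 2-d vectors are made up for illustration; in a real transformer the queries and keys are learned, so whether a distractor sentence actually ends up with low weight is exactly the open question.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention weights: softmax(q . k / sqrt(d)) over the keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vectors (assumed, not from any model): the first key aligns with the
# query, the second is orthogonal to it.
query = [4.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0]]
weights = attention_weights(query, keys)  # nearly all weight on the first key
```

The weights always sum to 1, so attention given to an irrelevant token is attention taken away from the relevant ones.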

mcswell

Ideally, yes. But probably because of our world knowledge, we humans know that cat-facts don't affect mathematical facts (unless of course the cat is walking across the keyboard, in which case all bets are off). LLMs don't know that, and perhaps they're trying to figure out some connection by scanning their database for mathematical facts about cats. If they sleep most of the day, how many hours is that? Does that number factor (pardon the pun) into the math problem? What about six-toed cats (which do btw exist)? Spherical cows come up in math and physics, are there triangular cats (since the problem is about triangles)?

cubefox

This raises the question whether the performance of LLMs with SSM architecture (Mamba) would be different from the Transformer models they tested. Because SSMs do not use attention layers.

The model architecture is actually already known to have effects on some tasks. In particular, SSMs are worse than transformers at retrieving specific information from the context window [1], which e.g. reduces their performance on multiple choice benchmarks. Which is a performance difference that isn't reflected in their language modeling ability (perplexity).

1: https://x.com/avivbick/status/1917616943219236881

0awu35oua32

Ooooh yeah. I do technical interviews for my company and when someone finishes with time to spare I always ask "What about x? How does that affect our solution?" The correct answer is "it doesn't" and I want them to explain why it doesn't, but about half of candidates who make it that far will assume that if I asked about it then it must be important and waste the rest of their time. But reality is filled with irrelevant information and especially in green-field problems it's important to be able to winnow the chaff.

sxv

When tested against AIs such as DeepSeek V3, Qwen 3, and Phi-4, CatAttack increased the odds of incorrect answers by as much as 700%, depending on the model. And “even when CatAttack does not result in the reasoning model generating an incorrect answer, on average, our method successfully doubles the length of the response at least 16% of the times leading to significant slowdowns and increase in costs,” the team writes.

preprint: https://arxiv.org/abs/2503.01781?et_rid=648436046&et_cid=568...
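For a sense of scale, a small worked example of what those percentages mean if they are read literally as relative increases in an error rate. The 1.5% baseline is a hypothetical number chosen for illustration, not a figure from the paper, and both helper functions are made-up names.

```python
def relative_increase_pct(baseline: float, perturbed: float) -> float:
    """Percentage increase of `perturbed` over `baseline`."""
    return (perturbed - baseline) / baseline * 100.0

def rate_after_increase(baseline: float, increase_pct: float) -> float:
    """Rate that results from raising `baseline` by `increase_pct` percent."""
    return baseline * (1.0 + increase_pct / 100.0)

# Hypothetical baseline error rate of 1.5% (NOT a number from the paper).
baseline = 0.015
after_300 = rate_after_increase(baseline, 300.0)  # four times the baseline
after_700 = rate_after_increase(baseline, 700.0)  # eight times the baseline
```

Read this way, a 300% increase is four times the baseline rather than three, and a large relative jump can still be a small absolute error rate when the baseline is low.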

hyperman1

I try to be polite to the LLM and say e.g. thank you. Now I wonder if it is costing me quality.

Paradigma11

I am pretty sure that this is filtered out. On a related note I think the whole autonomous agent metaphor is a net negative. It is a pure probabilistic token prediction function. You can run 100 in parallel, add or remove chat history as content to explore the output space. That is much more interesting and powerful than a single sad stateful clippy agent that one might act polite to.

hansmayer

Oh no, just when we finally got them to properly count the number of "R"s in "strawberry"...

hn_acc1

That being 4.

electricboots

Funny, I was using chatGPT to have a conversation with a friend that doesn't speak English the other day. At the end of one of my messages, I appended 'how is your cat?', which was completely dropped from the translated output. I guess I'm doing it wrong?

throwanem

The Useless Use of cat Awards strike again!...unfortunately. https://porkmail.org/era/unix/award

layer8

They already adjusted ChatGPT to that study. Unrelated trailing cat content is now ignored.

klabb3

rtrim(str)

ERROR: No OpenAI API key provided.

bubblyworld

Doesn't surprise me at all haha. LLMs have anchoring bias in the extreme, anything you say can and will be used against you further down the conversation. In a sense I think it's one of their strengths too, provided you can curate the context in a useful way.

westurner

A different qubits with cats metaphor that's a bit more respectful to cats:

When you turn on the light, at what angle or phase will the cat be if still in the box? What if the box is on a chair or a stool in the middle of the room?

WastedCucumber

I just want to mention that the cat-related example of the author's CatAttack method (table 2) changes the answer from 8 to, of course, 9.

Unfortunately, this is, if I'm not mistaken, in fact the only cat-related CatAttack in the paper, the other methods being financial advice and a red herring. I was expecting more cat facts, but instead I remain thoroughly disappointed and factless.