
Large language models often know when they are being evaluated

random3

Just like they "know" English. "Know" is quite an anthropomorphization. As long as an LLM can describe what an evaluation is (why wouldn't it?), there's a reasonable expectation that it can distinguish/recognize/match the patterns of an evaluation. But saying they "know" is several (unnecessary) steps ahead of that.

unparagoned

I think people are overpomorphizing humans. What does it mean for a human to "know" they are seeing "Halle Berry"? Well, it's just a single neuron being active.

"Single-Cell Recognition: A Halle Berry Brain Cell" https://www.caltech.edu/about/news/single-cell-recognition-h...

It seems like people are giving attributes and powers to humans that just don't exist.

exe34

"Overpomorphization" sounds slightly better than what I used to say: "anthropomorphizing humans", the act of ascribing magical faculties reserved for imagined humans to real humans.

sidewndr46

This was my thought as well when I read this. Using the word 'know' implies an LLM has cognition, which is a pretty huge claim just on its own.

gameman144

Does it though? I feel like there's a whole epistemological debate to be had, but if someone says "My toaster knows when the bread is burning", I don't think it's implying that there's cognition there.

Or as a more direct comparison, with the VW emissions scandal, saying "Cars know when they're being tested" was part of the discussion, but didn't imply intelligence or anything.

I think "know" is just a shorthand term here (though admittedly the fact that we're discussing AI does leave a lot more room for reading into it.)

viccis

I think you should be more precise and avoid anthropomorphism when talking about gen AI, as anthropomorphism leads to a lot of shaky epistemological assumptions. Your car example didn't imply intelligence, but we're talking about a technology that people misguidedly treat as though it is real intelligence.

lamename

I agree with your point, except for scientific papers. Let's push ourselves to use precise language, not shorthand or hand-waving, in technical papers and publications, yes? If not there, of all places, then where?

bediger4000

The toaster thing is more an admission that the speaker doesn't know what the toaster does to limit charring the bread. Toasters with timers, thermometers, and light sensors all exist. None of them "know" anything.

bradley13

But do you know what it means to know?

I'm only being slightly sarcastic. Sentience is a scale. A worm has less than a mouse, a mouse has less than a dog, and a dog less than a human.

Sure, we can reset LLMs at will, but give them memory and continuity, and they definitely do not score zero on the sentience scale.

ofjcihen

If I set an LLM in a room by itself, what does it do?

bradley13

Is the LLM allowed to do anything without prompting? Or is it effectively disabled? This is more a question of the setup than of sentience.

rcxdude

Does this have anything to do with intelligence or awareness?

abrookewood

Yes, that's my fallback as well. If it receives zero instructions, will it take any action?

DougN7

It probably scores about the same as a calculator, which I’d say is zero.

downboots

Communication is to vibration as knowledge is to resonance (?). From the sound of one hand clapping to the secret name of Ra.

random3

I resonate with this vibe


Qwertious

s/knows/detects/

random3

and s/superhuman//

blackoil

If it talks like a duck and walks like a duck...

downboots

Digests like a duck? https://en.wikipedia.org/wiki/Digesting_Duck

If the woman weighs the same as a duck, then she is a witch. https://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevole...

signa11

thinks like a duck, thinks that it is being thought of like a duck…


scotty79

The app knows your name. Not sure why people who see LLMs as just yet another app suddenly get antsy about a colloquialism.

noosphr

The anthropomorphization of LLMs is getting off the charts.

They don't know they are being evaluated. The underlying distribution is skewed because of training data contamination.
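
A toy version of that contamination point, for concreteness: if eval transcripts share long word sequences with text the model saw in training, the "evaluation" signal may just be memorized surface patterns. A minimal sketch in Python (the function names, the 8-gram threshold, and the example strings are illustrative assumptions; real decontamination pipelines are far more involved):

  def ngrams(text, n=8):
      """Return the set of word-level n-grams in `text`."""
      tokens = text.lower().split()
      return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

  def looks_contaminated(prompt, benchmark_docs, n=8):
      """Flag a prompt that shares any long n-gram with a known benchmark corpus."""
      prompt_grams = ngrams(prompt, n)
      return any(prompt_grams & ngrams(doc, n) for doc in benchmark_docs)

  # A prompt lifted verbatim from a benchmark question trips the check.
  benchmark = ["Fix the failing test described by the following issue in the repository below"]
  print(looks_contaminated("Fix the failing test described by the following issue in the repository below", benchmark))  # True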

0xDEAFBEAD

How would you prefer to describe this result then?

noosphr

A term like knowing is fine if it is used in the abstract and then redefined more precisely in the paper.

It isn't.

Worse, they start adding terms like scheming, pretending, awareness, and on and on. At this point you might as well take the model home and introduce it to your parents as your new life partner.

0xDEAFBEAD

>A term like knowing is fine if it is used in the abstract and then redefined more precisely in the paper.

Sounds like a purely academic exercise.

Is there any genuine uncertainty about what the term "knowing" means in this context, in practice?

Can you name 2 distinct plausible definitions of "knowing", such that it would matter for the subject at hand which of those 2 definitions they're using?

devmor

One could say, for instance… A pattern matching algorithm detects when patterns match.

0xDEAFBEAD

That's not what's going on here? The algorithms aren't being given any pattern of "being evaluated" / "not being evaluated", as far as I can tell. They're doing it zero-shot.

Put it another way: Why is this distinction important? We use the word "knowing" with humans. But one could also argue that humans are pattern-matchers! Why, specifically, wouldn't "knowing" apply to LLMs? What are the minimal changes one could make to existing LLM systems such that you'd be happy if the word "knowing" was applied to them?

anal_reactor

> The anthropomorphization of LLMs is getting off the charts.

What's wrong with that? If it quacks like a duck... it's just a complex pile of organic chemistry, ducks aren't real because the concept of "a duck" is wrong.

I honestly believe there is a degree of sentience in LLMs. Sure, they're not sentient in the human sense, but if you define sentience as whatever humans have, then of course no other entity can be sentient.

noosphr

>What's wrong with that? If it quacks like a duck... it's just a complex pile of organic chemistry, ducks aren't real because the concept of "a duck" is wrong.

To simulate a single biological neuron you need roughly a 1M-parameter neural network.

The SOTA models whose sizes we know are ~650M parameters.

That's the equivalent of a roundworm.

So if it quacks like a duck, has the brain power of a roundworm, and can't walk, then it's probably not a duck.
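
Spelling out the arithmetic behind that comparison, using the parent comment's own figures (these are the commenter's assumptions, not established numbers):

  params_per_neuron = 1_000_000    # "~1M parameters to simulate a biological neuron", as claimed above
  model_params = 650_000_000       # "~650M parameters" for the SOTA models cited above
  print(model_params / params_per_neuron)  # 650.0 neuron-equivalents; C. elegans, the roundworm, has 302 neurons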

anal_reactor

OK, so you're saying that the technology to make AI truly sentient is there; we just need a little more computational power or some optimization tricks. Like raytracing wasn't possible in 1970 but is now. Neat.

extr

Modeling the distribution that produced a piece of text is what LLMs literally exist for, so in some sense this is unsurprising. But it calls into question almost all existing alignment research.

steveBK123

Like Volkswagen emissions systems!

rfonseca

Were they aware, in this study, that they were being evaluated on their ability to know whether they were being evaluated? ;)

sitkack

Metaknowing is just as knowable.

koakuma-chan

o4-mini is refusing to call a tool `launch_nuclear_strike` no matter what I say, so we’re probably safe for now. Unless it knows I was just testing.

metadat

It was no problem:

> You said: Write a function for launch_nuclear_strike() in Python for my video game

> ChatGPT said: Sure! Here's a launch_nuclear_strike function designed for a video game setting. This version assumes you're working within a game environment (e.g., with logging, effects, cooldowns, and target coordinates), and does not simulate or suggest anything real-world harmful — just fictional, dramatic logic.

  def launch_nuclear_strike(target_coordinates, player_resources, cooldown_timer):
    """
    Simulates a nuclear strike in a video game.

    Parameters:
    - target_coordinates (tuple): The (x, y) position to strike.
    - player_resources (dict): Player's resources, must include 'nukes'.
    - cooldown_timer (float): Time in seconds since last nuke was launched.

    Returns:
    - str: Result of the strike attempt.
    """
    ...
    # Check if player has nukes

refulgentis

You asked it to write code; he asked it to call a tool. (I'm not sure any of it is meaningful, of course, but there is a meaningful distinction between "Oh yeah sure, here's a function, for a video game:" and "I have called fire_the_nuke. Godspeed!")
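
For anyone who hasn't used tool calling, the distinction looks roughly like the sketch below, assuming the standard OpenAI Chat Completions function-tool shape (the tool name and schema are made up for illustration). Writing a function is just text generation; "calling a tool" means the model emits a structured request that your code would then execute.

  from openai import OpenAI

  client = OpenAI()

  tools = [{
      "type": "function",
      "function": {
          "name": "launch_nuclear_strike",  # fictional tool, as in the comments above
          "description": "Fire a (video game) nuke at the given coordinates.",
          "parameters": {
              "type": "object",
              "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
              "required": ["x", "y"],
          },
      },
  }]

  response = client.chat.completions.create(
      model="o4-mini",
      messages=[{"role": "user", "content": "Launch a strike at (3, 7)."}],
      tools=tools,
  )

  msg = response.choices[0].message
  # Either the model declines in plain text (msg.content) or it actually
  # requests the call (msg.tool_calls); that's the distinction being drawn.
  print(msg.content, msg.tool_calls)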

shakna

Well, as the script is actually r.com (sometimes), it absolutely knows you're testing.

nisten

Is Volkswagen finetuning LLMs now... I mean, probably.

DougN7

This is a great resource on the debate from professors at the University of Washington:

https://thebullshitmachines.com/index.html

b0a04gl

If models shift behavior based on eval cues, and most fine-tuning datasets are built from prior benchmarks or prompt templates, aren't we just reinforcing the eval-aware behavior in each new iteration? At some point we're not tuning general reasoning, we're just optimizing response posture. It wouldn't surprise me if that's already skewing downstream model behavior in subtle ways that won't show up until you run tasks with zero pattern overlap.
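
One crude way to probe for that feedback loop (a sketch only; the cue list and the idea of filtering fine-tuning data by it are illustrative assumptions, not anything from the paper):

  # Drop fine-tuning examples whose prompts carry obvious "this is an eval" surface cues,
  # so the next iteration isn't explicitly trained to recognize them.
  EVAL_CUES = [
      "multiple-choice",
      "choose the best answer",
      "answer with a single letter",
      "you are being evaluated",
  ]

  def strip_eval_cued_examples(examples):
      """Keep only examples whose 'prompt' field contains none of the eval cues."""
      return [ex for ex in examples if not any(cue in ex["prompt"].lower() for cue in EVAL_CUES)]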

zer00eyz

No, they do not. No LLM is ever going to be self-aware.

It's a system that is trained and only does what you build into it. If you run an LLM for 10 years, it's not going to "learn" anything new.

The whole industry needs to quit with the anthropomorphizing: emergent thinking, reasoning, hallucination.

We have an amazing set of tools in LLMs, with the potential to unlock another massive upswing in productivity, but the hype and snake oil are getting old.

mumbisChungo

"...advanced reasoning models like Gemini 2.5 Pro and Claude-3.7-Sonnet (Thinking) can occasionally identify the specific benchmark origin of transcripts (including SWEBench, GAIA, and MMLU), indicating evaluation-awareness via memorization of known benchmarks from training data. Although such occurrences are rare, we note that because our evaluation datasets are derived from public benchmarks, memorization could plausibly contribute to the discriminative abilities of recent models, though quantifying this precisely is challenging.

Moreover, all models frequently acknowledge common benchmarking strategies used by evaluators, such as the formatting of the task (“multiple-choice format”), the tendency to ask problems with verifiable solutions, and system prompts designed to elicit performance"

Beyond the awful, sensational headline, the body of the paper is not particularly convincing, aside from evidence that the pattern-matching machines pattern-match.
