Are LLMs able to notice the “gorilla in the data”?
93 comments
February 5, 2025
mariofilho
I uploaded the image to Gemini 2.0 Flash Thinking 01-21 and asked:
“Here is a steps vs bmi plot. What do you notice?”
Part of the answer:
“Monkey Shape: The most striking feature of this plot is that the data points are arranged to form the shape of a monkey. This is not a typical scatter plot where you'd expect to see trends or correlations between variables in a statistical sense. Instead, it appears to be a creative visualization where data points are placed to create an image.”
Gemini 2.0 Pro (without thinking) didn’t see the monkey.
martinsnow
It thought my bald colleague was a plant in the background. So don't have high hopes for it. He did wear a headset, so that is apparently very plant-like.
memhole
Favorite thing recently has been using the vision models to make jokes. Sometimes the non sequiturs get old, but occasionally you hit the right one that’s just hilarious. It’s like Monster Rancher for jokes.
jagged-chisel
Wrong kind of plant. See sibling comment.
ffsm8
Maybe he's actually a spy
forgotusername6
I had a recent similar experience with ChatGPT and a gorilla. I was designing a rather complicated algorithm, so I wrote out all the steps in words. I then asked ChatGPT to verify that it made sense. It said it was well thought out, logical, etc. My colleague didn't believe that it was really reading it properly, so I inserted a step in the middle, "and then a gorilla appears", and asked it again. Sure enough, it again came back saying it was well thought out etc. When I questioned it on the gorilla, it merely replied saying that it thought it was meant to be there, that it was a technical term or a codename for something...
jancsika
Just imagining an episode of Star Trek where the inhabitants of a planet have been failing to progress in warp drive tech for several generations. The team beams down to discover that society's tech stopped progressing when they became addicted to pentesting their LLM for intelligence, only to then immediately patch the LLM in order to pass each particular pentest that it failed.
Now the society's time and energy has shifted from general scientific progress to gaining expertise in the growing patchset used to rationalize the theory that the LLM possesses intelligence.
The plot would turn when Picard tries to wrest a phaser from a rogue non-believer trying to assassinate the Queen, and the phaser accidentally fires and ends up frying the entire LLM patchset.
Mr. Data tries to reassure the planet's forlorn inhabitants, as they are convinced they'll never be able to build the warp drive now that the LLM patchset is gone. But when he asks them why their prototypes never worked in the first place, one by one the inhabitants begin to speculate and argue about the problems with their warp drive's design and build.
The episode ends with Data apologizing to Picard, since he seems to have started a conflict among the inhabitants. However, Picard points Mr. Data to one of the engineers drawing out a rocket test on a whiteboard, then thanks him for potentially spurring on the planet's next scientific revolution.
Fin
sillysaurusx
There actually is an episode of TNG similar to that. The society stopped being able to think for themselves, because the AI did all their thinking for them. Anything the AI didn’t know how to do, they didn’t know how to do. It was in season 1 or season 2.
LtWorf
This has more plot than all the seasons of Star Trek: Picard together :D
jimkoen
That's actually a moderately decent pitch for an episode.
sigmoid10
>it thought it was meant to be there, that it was a technical term or a codename for something
That's such classic human behaviour in technical discussions, I wouldn't even be mad. I'm more surprised that it picked up on that behaviour from human-generated datasets. But I suppose that's what you get from scraping places like Stack Overflow and HN.
ben_w
I'm reminded of one of the earlier anecdotes from OpenAI about fine-tuning — to paraphrase:
> This writer fine-tuned on all their Slack messages, then asked it to write a blog post. It replied "Sure, I'll do it tomorrow."
> Then he said "No, do it now", and it replied "OK, sure thing" and did nothing else.
izackp
I typically tell it that there are 5 problems in the logic. Summarize the steps, why each is necessary, and what typically comes after that step. Then please list and explain all five errors.
WesolyKubeczek
“The ball bearings are to be made of wood, because no one is going to read this work this far anyway.”
Karellen
And a bowl of M&Ms, with all the brown ones taken out - to make sure they did read this far.
CamperBob2
This is literally how human brains work: https://www.npr.org/2010/05/19/126977945/bet-you-didnt-notic...
ben_w
Given I knew what the test was before seeing one of these videos (yes, there is more than one), I find it extra weird that I still didn't see the gorilla the first time.
Now, I see it every time.
ben_w
On the one hand, this is very human behaviour, both literally and metaphorically.
Literally, because this is why the Datasaurus dozen was created: https://en.wikipedia.org/wiki/Datasaurus_dozen
Metaphorically, because of all the times (including here, on this very article :P) when people comment on the basis of the headline rather than reading the story.
On the other hand, this isn't the bit of human cognition we should be trying to automate, it's the bit we should be using AI to overcome.
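To make the Datasaurus point concrete, here's a minimal sketch (assuming pandas/matplotlib and a local copy of datasaurus_dozen.csv with dataset/x/y columns, as distributed from the page linked above). The summary statistics are nearly identical across the sets; only plotting reveals the dinosaur, the star, the circles, and so on.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumed layout: columns "dataset", "x", "y" (one row per point).
    df = pd.read_csv("datasaurus_dozen.csv")

    # The summary statistics are practically interchangeable across sets...
    print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]).round(2))
    print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])).round(2))

    # ...but plotting each set immediately shows its shape.
    fig, axes = plt.subplots(3, 5, figsize=(12, 8), sharex=True, sharey=True)
    for ax, (name, group) in zip(axes.flat, df.groupby("dataset")):
        ax.scatter(group["x"], group["y"], s=5)
        ax.set_title(name, fontsize=8)
    plt.tight_layout()
    plt.show()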
johnfn
GPT can't "see" the scatterplot it produced (unless prompted with an image); it only sees the code it wrote. If a human had the same constraints, I doubt they'd identify that there was a gorilla there.
Take a screenshot of the scatterplot, feed it into multimodal GPT, and it does a fine job of identifying it.
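Roughly what that looks like with the OpenAI Python SDK, if anyone wants to try it themselves (a sketch only; the model name, prompt, and file path are placeholders):

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Encode the screenshot of the plot as a data URL.
    with open("scatterplot.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this plot?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)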
EDIT:
Sorry, as a few people pointed out, I missed the part where the author did feed a PNG into GPT. I kind of jumped to conclusions when it worked fine for me. I still maintain that the article's conclusion ("Your AI Can't See Gorillas") is overly broad, given that I had no trouble getting it to see one.
But I wonder why the author had trouble? My suspicion is that AI got stuck on summary statistics because the previous messages in the chat were all about summary statistics.
dartos
Before seeing Claude’s response, did you see where the author said
> I asked the model to closely look at the plot, and also uploaded a png of the plot it had generated.
Kuinox
I sent the plot to ChatGPT 4o. Here is the conversation:
what do you see
ChatGPT said:
This is a scatter plot with the variables "steps" on the x-axis and "bmi" on the y-axis. The data points are colored by "gender" (red for female and blue for male). Interestingly, the arrangement of the points appears to form a drawing resembling a cartoonish figure or character, likely added for artistic or humorous effect.
If you'd like, I can help analyze or discuss the data further!
You said:
What does the cartoonish figure look like?
ChatGPT said:
The cartoonish figure in the plot resembles a stylized gorilla or monkey-like character. It appears to be waving with one hand raised, while the other arm is resting downward. The face is expressive, with distinct eyes, a nose, and a slightly frowning mouth. The overall pose gives it a somewhat playful or cheeky vibe.
dartos
Sure, whatever.
OC seemed to think that Claude did that with just the data and not the image of the scatterplot itself.
johnfn
Hm, interesting. The way I tried it was by pasting an image into Claude directly as the start of the conversation, plus a simple prompt ("What do you see here?"). It got the specific image wrong (it thought it was baby yoda, lol), but it did understand that it was an image.
I wonder if the author got different results because they had been talking a lot about a data set before showing the image, which possibly predisposed the AI to think that it was a normal data set. In any case, I think that "Your AI Can't See Gorillas" isn't really a valid conclusion.
vunderba
Please read TFA. The conclusion of the article isn't nearly so simplistic; they're just suggesting that you have to be aware of the natural strengths and weaknesses of LLMs, even multimodal ones, particularly around visual pattern recognition vs. quantitative pattern recognition.
And yes, the idea that the initial context can sometimes predispose the LLM to consider things in a more narrow manner than a user might otherwise want is definitely well known.
SequoiaHope
What you refer to as the article’s conclusion is in fact the article’s title. The article’s conclusion (under “Thoughts” at the end) may be well summarized by its first sentence: “As the idea of using LLMs/agents to perform different scientific and technical tasks becomes more mainstream, it will be important to understand their strengths and weaknesses.”
The conclusion is quite reasonable and the article was IMO well written. It shares details of an experiment and then provides a thoughtful analysis. I don’t believe the analysis is overly broad.
KeplerBoy
Does ChatGPT even have access to the raw data points or does it just know the path to some CSV?
The contents of the CSV might be entirely unknown at inference time.
wodenokoto
I love that gorilla test. It happens on my team all the time: people start with the assumption that the data is “good” and then deep dive.
Is there a blog post that just focuses on the gorilla test that I can share with my team? I’m not even interested in the LLM part.
hammock
Same here. Can’t count the number of times I’ve had to come in and say “hold on, you built an entire report with conclusions and recommendations but didn’t stop to say hmm this data looks weird and dig into validation?” “We assumed the data was right and that it must be xyz…”
A corollary of this, and my personal pet peeve, is attributing everything you can’t explain to “seasonality”. That is such a crutch. If you can’t explain it, then just say that. There is a better-than-not chance it is noise anyway.
ben_w
> A corollary of this, and my personal pet peeve, is attributing everything you can’t explain to “seasonality”. That is such a crutch. If you can’t explain it, then just say that. There is a better-than-not chance it is noise anyway.
Very early in my career, I discovered python's FFT libraries, and thought I was being clever when plugging in satellite data and getting a strong signal.
Until I realised I'd found "years".
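The trap is easy to reproduce; here's a sketch with numpy using synthetic daily data standing in for the satellite series (the shape of the data is assumed, not the real thing):

    import numpy as np

    days = np.arange(4 * 365)                   # four years of daily samples
    signal = np.sin(2 * np.pi * days / 365.25)  # annual cycle
    signal += 0.3 * np.random.randn(days.size)  # plus noise

    # Spectrum of the mean-removed series, in cycles per day.
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(days.size, d=1.0)

    peak = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin
    print(f"dominant period: {1 / peak:.0f} days")  # ~365, i.e. "years"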
8n4vidtmkvmk
> attributing everything you can’t explain to “seasonality”
Is this a literal thing or figurative thing? Because it should be very easy to see the seasons if you have a few years of data.
I just attribute all the data I don't like to noise :-)
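If you do have a few years of data, the check is quick. A minimal sketch with statsmodels (the file, column names, and monthly frequency are placeholders for whatever you actually have):

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Assumed: a CSV of monthly observations with "month" and "value" columns.
    df = pd.read_csv("monthly_data.csv", parse_dates=["month"], index_col="month")

    # Split into trend, seasonal, and residual components (period=12 for monthly data).
    result = seasonal_decompose(df["value"], model="additive", period=12)
    result.plot()
    plt.show()

If the seasonal panel is flat and the residual is huge, calling it seasonality is just renaming the noise.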
xboxnolifes
Just because something happens on a yearly cadence doesn't mean that "seasonality" is a good explanation. It's just restating that it happens on a yearly cadence; it doesn't actually explain why it happens.
sw1sh
I got "The scatter plot appears to be arranged to resemble the character "Pepe the Frog," a popular internet meme ... " lol
Not sure whether multimodal embeddings have that good pattern recognition accuracy in this case; probably most of the information goes into attending to plot-related features, like its labels and ticks.
hollownobody
AFAIK, these models can't "look" at the plots they build. So it is necessary to send the screenshots, otherwise they will never notice the gorilla.
meltyness
Does anyone know if tokenizers are pruned? That is, if a token doesn't appear in the corpus is it removed from the model? That would imply a process that leaks information about the dataset.
silverkiwi
The evolution from LLM to reasoning model is simply multi-pass or recursive questioning.
What’s missing in the terminology is the modality, most often text.
So really we only have text LLMs or text reasoning models at the moment.
Your example illustrates the benefits of multimodal reasoning (using multiple modalities with multiple passes).
Good news: this is coming (I’m working on it). Bad news: it massively increases the compute, as each pass now has to interact with each modality. Unless the LLM is fully multimodal (some are), the multi-pass questions have to accommodate this, and the number of extra possible paths massively increases. Hopefully we stumble across a nice solution, but the level of complexity grows with each additional modality (text, audio, images, video, etc.).
amelius
Can it draw the unicorn yet?
hwillis
I wondered if o1 would do better; it seems reasonable that step-by-step trying to produce legs/torso/head/horn would do better than the very weird legless things 4o is making. Looks like someone has done it: https://openaiwatch.com/?model=o1-preview
They do seem to generally have legs and a head, which is an improvement over 4o. Still pretty unimpressive.
GaggiX
Claude 3.5 Sonnet is much better at it: https://claude.site/artifacts/ad1b544f-4d1b-4fc2-9862-d6438e...
But I guess the GPT-4o results are funnier to look at.
notnmeyer
Maybe I don't get it, but can we conclusively say that the gorilla wasn't "seen" vs. deemed irrelevant to the questions being asked?
"Look at the scatter plot again" is anthropomorphizing the LLM and expecting it to infer a fairly odd intent.
A query like "does the scatter plot visualization look like any real-world objects?" may have produced the result the author was fishing for.
If it were the opposite situation and you were trying to answer "real" questions and the LLM was suggesting "the data, visualized, looks like Notorious B.I.G.", we'd all be here laughing at a different post about the dumb LLM.
These posts about X task that LLMs fail at when you give them Y prompt are getting more and more silly.
If you ask an AI to analyze some data, should the default behavior be to use that data to make various types of graphs, export said graphs, feed them back into itself, then analyze the shapes of those graphs to see if they resemble an animal?
Personally I would be very annoyed if I actually wanted a statistical analysis, and it spent a bajillion tokens following the process above in order to tell me my data looks like a chicken when you tip it sideways.
> However, this same trait makes them potentially problematic for exploratory data analysis. The core value of EDA lies in its ability to generate novel hypotheses through pattern recognition. The fact that both Sonnet and 4o required explicit prompting to notice even dramatic visual patterns suggests they may miss crucial insights during open-ended exploration.
It requires prompting for x if you want it to do x... That's a feature, not a bug. Note that no mention of open-ended exploration or approaching the data from alternate perspectives was made in the prompt.