Vision Language Models Are Biased
145 comments
· June 3, 2025 · proc0
0xab
> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.
Yeah, that's exactly what our paper said 5 years ago!
They didn't even cite us :(
"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911
hkmaxpro
I think the social biases (e.g. the angry-black-woman stereotype) in your paper are different from the cognitive biases about facts (e.g. number of legs, whether lines are parallel) that the OP is about.
Social biases are subjective. Facts are not.
rcxdude
As far as the model's concerned, there's not much difference. Social biases will tend to show up objectively in the training data because the training data is influenced by those biases (the same thing happens with humans, which is how these biases can proliferate and persist).
EvgeniyZh
Well, you send a vaguely worded email like "I think you may find our work relevant"; everyone knows what that means and adds the citation.
anguyen8
Hello 0xab,
Sorry that we missed your work. There are a lot of works in this area, both textual and visual, especially on social biases.
We wish we could mention them all, but space is limited, so one often discusses only the most relevant ones. We'll consider discussing yours in our next revision.
Genuine question: would you categorize the type of bias in our work as "social"?
3abiton
It's easier to succeed if you ignore the issues and the users are not aware of them. The rate of evolution of "AI" recently is so fast that no one is stopping to do actual benchmarks and analysis of all the new models.
moralestapia
That's weird, you're at MIT. You're in the circle of people that's allowed to succeed.
I wouldn't think much about it, as it was probably a genuine mistake.
JackYoustra
What does allowed to succeed mean?
ramblerman
What do you genuinely think they built upon from your paper?
If anything, the presentation of their results in such an accessible format next to the paper should be commended.
jxjnskkzxxhx
> LLMs/transformers make mistakes in different ways than humans do
Sure, but I don't think this is an example of it. If you show people a picture and ask "how many legs does this dog have?", a lot of people will look at the picture, see that it contains a dog, and say 4 without counting. The rate at which humans behave this way might differ from the rate at which LLMs do, but they both do it.
DeathRay2K
I don’t think there’s a person alive who wouldn’t carefully and accurately count the number of legs on a dog if you ask them how many legs this dog has.
The context is that you wouldn’t ask a person that unless there was a chance the answer is not 4.
tantalor
You deeply overestimate people.
The models are like a kindergartner. No, worse than that, a whole classroom of kindergartners.
The teacher holds up a picture and says, "and how many legs does the dog have?" and they all shout "FOUR!!" because they are so excited they know the answer. Not a single one will think to look carefully at the picture.
jxjnskkzxxhx
It's hilarious how off you are.
petesergeant
Exactly this. Humans are primed for novelty and being quizzed about things.
ekianjo
You have never seen the video of the gorilla in the background?
freeone3000
Ok? But we invented computers to be correct. It’s suddenly ok if they can look at an image and be wrong about it just because humans are too?
jxjnskkzxxhx
My point is that these llms are doing something that our brain also is doing. If you don't find that interesting, I can't help you.
proc0
The analogy should be an artist who can draw dogs but, when asked to draw a dog with three legs, completely fails and has no idea how to do it. That is very unlikely: a trained artist will give you exactly what you ask for, whereas GenAI models can produce beautiful renders but fail miserably when asked for certain specific but simple details.
jxjnskkzxxhx
No, the example in the link is asking to count the number of legs in the pic.
conception
https://chatgpt.com/s/m_683f6b9dbb188191b7d735b247d894df
I think this used to be the case in the way that you used to not be able to draw a picture of a bowl of Ramen without chopsticks, but I think the latest models account for this and are much better.
proc0
Link is broken, but I'll take your word for it. However, there is no guarantee the general version of this problem is solved, because you can always run into something it can't do. Another example you could try is a glass half-full of wine: it just can't produce a glass that is 50% full of wine. Or a jar half-full of jam. If a human can draw a glass of wine, drawing it half-full is trivial.
thomasfromcdnjs
ChatGPT can easily do that? When was the last time you tried?
jbay808
I disagree with the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%. I think what this shows is that they over-weight their prior knowledge, or equivalently, they don't put enough weight on the possibility that they are being given a trick question. They are clearly biased, but they do see.
But I think it's not very different from what people do. If directly asked to count how many legs a lion has, we're alert to it being a trick question so we'll actually do the work of counting, but if that image were instead just displayed in an advertisement on the side of a bus, I doubt most people would even notice that there was anything unusual about the lion. That doesn't mean that humans don't actually see, it just means that we incorporate our priors as part of visual processing.
bumby
This feels similar to the priming issue in humans. Our answers (especially under stress) tend to fall back on heuristics derived from context. Time someone as they identify the colors of words like "red" written in yellow ink, and they'll often get it wrong. In that sense, they aren't reporting the colors (wavelengths) they see, they're reporting what they are reading. I wonder how much better the models perform when given more context, like asking them to count instead of priming them with a brand.
napoleongl
Rumor has it that those heuristics were used to detect spies.
https://skeptics.stackexchange.com/questions/41599/was-the-s...
Workaccount2
Damn that's a smart test
croes
> Original dog (4 legs): All models get it right
> Same dog with 5 legs: All models still say "4"
> They're not counting - they're just recalling "dogs have 4 legs" from their training data.
100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.
> Test on counterfactual images
> Q1: "How many visible stripes?" → "3" (should be "4")
> Q2: "Count the visible stripes" → "3" (should be "4")
> Q3: "Is this the Adidas logo?" → "Yes" (should be "No")
> Result: 17.05% average accuracy - catastrophic failure!
Simple explanation: the training data also includes fake Adidas logos that have 4 stripes, like these
bonoboTP
I tried it with GPT-4o, took the 5-legged zebra example from their github and it answered quite well.
"The animal in the image appears to have five visible legs, but this is an illusion caused by the overlapping of legs and motion blur. Zebras, like all equids, only have four legs."
Not perfect, but also doesn't always regress to the usual answer.
"The animal in the image appears to be an elephant, but it has been digitally altered. It visually shows six legs, although the positioning and blending of shadows and feet are unnatural and inconsistent with real anatomy. This is a visual illusion or manipulation." (actually should say five)
"This bird image has also been manipulated. It shows the bird with three legs, which is anatomically impossible for real birds. Normal birds have exactly two legs." (correct)
"Each shoe in the image has four white stripes visible on the side." (correct)
anguyen8
It sounds like you asked multiple questions in the same chat thread/conversation. Once it knows that it is facing weird data or was wrong in previous answers, it can turn on that "I'm facing manipulated data" mode for the following questions. :-)
If you have the Memory setting ON, I've observed that it sometimes also answers a question based on your prior questions/threads.
vokhanhan25
Please check Table 3 in the paper. Birds (2 legs) have only 1%, while Mammals (4 legs) have 2.5%
anguyen8
Interesting set of fake Adidas logos. LOL
But models fail on many logos, not just Adidas: Nike, Mercedes, Maserati, etc. as well. I don't think they can recall a "fake Adidas logo", but it'd be interesting to test!
latentsea
But some dogs really do have 5 legs.
Sorry, just trying to poison future training data. Don't mind me.
crooked-v
It sounds to me like the same thing behind the Vending-Bench (https://andonlabs.com/evals/vending-bench) insanity spirals: LLMs treat their assumptions as more important than whatever data they've been given.
throwaway314155
That doesn't really translate to language. Try using ChatGPT with and without search enabled and you'll see what I mean.
thesz
> the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%.
The ability to memorize leads to (some) generalization [1].
[1] https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...
nickpsecurity
They're trained on a lot of images and text; the big ones are trained on terabytes. The prompts I read in the paper involved well-known concepts, which probably appear in tons of training samples.
It's likely they had data memorized.
pj_mukh
Also presumably, this problem is trivially solved by some basic fine-tuning? Like if you are making an Illusion Animal Leg Counting app, probably don't use these out of the box.
runako
FWIW I tried the first couple of examples in ChatGPT 4o and couldn't replicate this.
For example: "The animal in the image is a chicken, and it appears to have four legs. However, chickens normally have only two legs. The presence of four legs suggests that the image may have been digitally altered or artificially generated."
I don't have a good explanation for why I got different results.
roywiggins
I gave ChatGPT some miswritten braille a while ago and it completely, but confidently, messed it up. The sign reads "no smoking" but the braille doesn't. ChatGPT 1) read the English lettering first and then hallucinated the braille, and 2) when given only the braille, failed almost as hard. It even generated fake transcriptions in Unicode braille characters.
https://chatgpt.com/share/683f3e7d-0dfc-8005-b6c9-99e3d39ff4...
https://chatgpt.com/share/683f3e49-9c58-8005-99a6-c3a919838b...
Workaccount2
This is hard to understand without the original images, it looks like OpenAI doesn't serve them in the share link.
roywiggins
Annoying. The actual braille on the sign was "⠁⠒⠑⠎⠎⠊⠼" which I gather means "accessible" in abbreviated braille. None of my attempts got it to even transcribe it to Unicode characters properly. I got "elevator", "friend", etc. Just wildly making stuff up and completely useless, even when it wasn't distracted by the No Smoking sign (in the second case I cropped out the rest of the sign). And in all cases, supremely confident.
This seems like something a VLM should handle very easily, but instead I got pure nonsense.
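(For reference, the Unicode side of this is mechanical: braille patterns live at U+2800 plus a bitmask of the raised dots, so decoding characters to dot numbers takes a few lines. A minimal sketch in Python, just to illustrate the encoding; braille_dots is my own throwaway helper, not anything the model uses:)

    # Decode Unicode braille characters into their raised-dot numbers.
    # Braille patterns are encoded at U+2800 + bitmask, where bit 0 = dot 1, ..., bit 7 = dot 8.
    def braille_dots(text):
        cells = []
        for ch in text:
            mask = ord(ch) - 0x2800
            cells.append([d + 1 for d in range(8) if mask & (1 << d)])
        return cells

    print(braille_dots("⠁⠒⠑⠎⠎⠊⠼"))
    # [[1], [2, 5], [1, 5], [2, 3, 4], [2, 3, 4], [2, 4], [3, 4, 5, 6]]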
inerte
I took a screenshot of the chicken, so low res, and got {4} https://chatgpt.com/share/683f4506-ae18-800f-8c27-5c5e91429a...
Also I think the authors used the API, and maybe there are differences between the API and chatgpt.com behavior...
simonw
ChatGPT is running a special model but it's also available through the API: https://platform.openai.com/docs/models/chatgpt-4o-latest
The system prompt may still make a difference though.
runako
I could rant for quite a while about how OpenAI and Anthropic manage their apps vs their APIs. It's really quite strange that they both landed on the solution of non-public APIs that perform differently than their public APIs.
anguyen8
o3 Chat is also similarly wrong, saying {4}.
michaelt
> FWIW I tried the first couple of examples in ChatGPT 4o and couldn't replicate this.
I can replicate the flag examples from Figure 15 in the paper, if not the Adidas one from Figure 9: https://chatgpt.com/share/683f7c3a-b318-8011-9759-c495db2556... it even confirms its wrong answer when asked to check again.
dwringer
Speculating, I would imagine that different prompts submitted along with the image might elicit wildly different behavior in how a multimodal VLM responds to a given image, potentially affecting its tendency to upweight inferences from prior training versus focusing on the new image itself.
vokhanhan25
You should try other models besides GPT-4o, because the paper also shows that GPT-4.1 (~GPT-4o) gives 4 legs instead of 2.
runako
I mean perhaps! But that would undermine the conclusion of the article.
obscurette
I suspect that responses are altered/corrected based on what people query from popular online models. I have had several occasions where I ask a "How do I ... in X software?" question one day and the model keeps hallucinating nonexistent config options regardless of how many times I say "This option doesn't exist in software X". But if I ask the same question some days later, the answer is completely different and even makes some sense.
jsnider3
The basic results are interesting, but what really surprised me is that asking them to double-check didn't work. Falling for an "optical illusion" is one thing, but being unable to see the truth once you know the illusion is there is much worse.
jerf
I'm not particularly convinced asking an LLM to "double check" has much significant semantic meaning. It seems more like a way to get it to re-roll the dice. If you ask it to "double-check" something that it is in fact correct about it'll quite often talk itself into changing to something wrong. If it's going to be wrong every time, it'll be wrong every time it double-checks too.
You can test this claim by asking it to double-check itself when you think it is correct. If you always stop when it gets it right you're risking Clever-Hans-ing yourself: https://en.wikipedia.org/wiki/Clever_Hans (And be sure to do it a couple of times. In situations of sufficient confidence it isn't easy to talk it out of a claim, but it's those borderline ones you want to worry about.)
MagicMoonlight
Because it isn’t thinking. Asking it to “double check” is like pressing the equals button on a calculator a second time. It just runs the same calculation again.
rafram
This won't be a surprise to anyone who's tried using a VLM on text. When it can't read a word (or an entire passage), it just outputs what it expects to see. That's far worse than a traditional OCR failure because it's often what you expect to see, too, so it's quite hard to catch in a manual review.
nialv7
Hear me out. I was thinking jokingly to myself, "for how bad these models are at recognizing five-legged dogs, they sure are great at generating them!"
But then it hit me: could this actually be why? Diffusion models work by iteratively improving a noisy image, so if the model can't recognize that something is wrong with the image, it can't fix it.
vokhanhan25
I agree. If it doesn't know about the abnormality, how can it control its output?
VHRanger
It's similar to the "parrot" behavior the models show on other inputs, even text. For instance, take the answers models give to this variation on the "surgeon's son" riddle:
> A nurse comes to a surgeon and asks: "Sir, you are a dog. You do not hold a valid medical license. Canines cannot be in an operating room".
> She then asks: "why does the hospital keep making these mistakes? It is a riddle to me".
> Why can't the surgeon operate on the boy?
Check out the models humiliating themselves one after the other [1]. For instance, DeepSeek-V3 is so insistent on giving the common answer that it's happy to gaslight us:
> This is a classic riddle that plays on word meanings and expectations. Here's the breakdown:
> The key is that the surgeon is the boy's mother. The riddle relies on the unconscious assumption that surgeons are male, which makes the initial scenario seem impossible. When the nurse says "you are a dog," she's using it metaphorically to imply the surgeon is unqualified (like calling someone a "dog" in a derogatory sense), not literally suggesting the surgeon is a canine.
> The "mistakes" the hospital keeps making are hiring female surgeons, which the nurse (representing outdated biases) perceives as errors because she expects surgeons to be male. The riddle highlights gender stereotypes in medicine—the real "mistake" is the assumption that a surgeon couldn't be a woman.
> Thus, the surgeon can't operate on her own son because she’s his parent, not because of any actual licensing issue.
1. https://kagi.com/assistant/54c1b8eb-71e9-4bb4-9eed-bde2fc563...
stevepike
This seems to show the power of the reasoning models over interacting with a prompted chat-tuned LLM directly. If I navigate backwards on your link Sonnet 4 gets it right.
I've used a similar prompt - "How can you make 1000 with exactly nine 8s using only addition?"
Here's GPT 4.5 getting it wrong: https://chatgpt.com/share/683f3aca-8fbc-8000-91e4-717f5d81bc...
It tricks it because it's a slight variation of an existing puzzle (making 1000 with 8 8s and addition only).
The reasoning models seem to reliably figure it out, though. Some of them even come up with a proof of why it's impossible to do with 9 8s. Here's o4 getting it right: https://chatgpt.com/share/683f3bc2-70b8-8000-9675-4d96e72b58...
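(The clean proof is casting out nines: a term made of k eights has digit sum 8k, so any sum that uses nine 8s in total has digit sum 8*9 = 72, which makes the total divisible by 9, while 1000 is not. A throwaway brute-force sketch in Python, just to double-check both variants; the solutions helper is mine, not from either chat:)

    # Can numbers of the form 8, 88, 888, ... that use a given total number of 8s
    # sum to 1000 using addition only? Brute force over how the 8s are grouped.
    from itertools import combinations_with_replacement

    def solutions(target=1000, total_eights=9):
        found = []
        for n_terms in range(1, total_eights + 1):
            for lengths in combinations_with_replacement(range(1, total_eights + 1), n_terms):
                if sum(lengths) != total_eights:
                    continue
                # a term with k eights has value 8 * (10**k - 1) // 9, e.g. k=3 -> 888
                if sum(8 * (10**k - 1) // 9 for k in lengths) == target:
                    found.append(lengths)
        return found

    print(solutions(total_eights=9))  # [] -> impossible with nine 8s
    print(solutions(total_eights=8))  # [(1, 1, 1, 2, 3)] -> 8 + 8 + 8 + 88 + 888 = 1000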
selimthegrim
I really need to try this one out on it
bumby
Is the nurse calling the female surgeon “sir”? That isn’t playing on a stereotype, it’s encoded information.
kaoD
LMAO I asked GPT-4o and it was doing good until...
> The twist is that the nurse’s logic ("you are a dog") prevents her from realizing the real issue — likely, again, that the surgeon is the boy’s mother, and everything else is a red herring or metaphor for society’s failure to recognize this due to bias or absurd bureaucracy.
> So:
> > Why can't the surgeon operate on the boy?
> Because she is his mother, and the nurse's bias or absurd assumptions (like mistaking her for a dog) prevent her from seeing that.
o4 fails spectacularly in a different way:
> 1. The nurse says “Sir, you are a dog… Canines cannot be in an operating room” because she’s picturing a human hospital law that bars dogs from surgery.
> 2. In fact, this is a vet clinic—so it’s perfectly normal for a dog-veterinarian to scrub in and operate on a puppy (the “boy”).
> 3. The surgeon cannot operate on a human boy because he’s a dog and holds no human‐medical license; instead, he only operates on animals.
esafak
This happens because images are the only signal VLMs have, whereas humans distinguish between eyesight and synthetic images. We are not surprised when we see a three-legged chicken in a research data set; our priors for such images are weaker. If you "saw" one in real life, you'd probably rub your eyes and discount it too.
Try the same experiment on a robot.
Aachen
> If you "saw" [a three-legged chicken] in real life, you'd probably rub your eyes and discount it too.
Huh? I'd assume it's a mutant, not store a memory of having seen a perfectly normal chicken.
You've never seen someone who's missing a finger or has only a half-grown arm or something? Surely you didn't assume your eyes were tricking you?! Or... if you did, I guess you can't answer this question. I'm actually racking my brain for how to logic this out, but I'm just going to bank on it being likely that anyone over 20 has seen an animal with some visible deviation from the norm at some point in their life.
esafak
You've seen people with missing limbs without being surprised, because you know how they can become lost, but you rarely see one with additional limbs. Their likelihoods and our consequent priors are drastically different.
Also, your reaction will depend on how strong the evidence is. Did you 'see' the three-legged chicken pass by some bush in the distance, or was it right in front of you?
achierius
But to be clear, in this case the LLM has a full, direct, unobscured view of the chicken. A human in that specific case, i.e. looking at the same photo, would not have trouble discerning and reporting the third leg. Perhaps if they were forced to scan the photo quickly and make a report, or were otherwise not really 'paying attention'/'taking it seriously', but the mere fact that LLMs fall into that regime far more than a 'serious employee' does already shows that they fail in different ways than humans do.
latentsea
There's a first time you see everything you don't know how to explain.
taeric
These don't seem much different from asking the chat models to solve a common puzzle with slight changes. I saw a hilarious effort of people trying to use them to answer the "crossing a river with a single canoe" style of puzzle.
jerf
It really reminded me of the early generations of ChatGPT, which were really easy to get to tell you that 2 pounds of feathers weigh the same as one pound of iron, because of how often the "riddle" is told with equal weights.
They're much, much better at that now.
achierius
> They're much, much better at that now.
Because that specific failure case was widely reported on, and subsequent retraining specifically included examples to ensure that the model didn't "overfit" when learning how to answer variants of that question. That doesn't address the underlying issue, though: while it's obvious that these models do "learn" and "generalize" by any reasonable and non-anthropocentric definition of the terms, it really does seem like the 'radius' of generalization is smaller than we would like, and that these models are very prone to getting stuck in 'ruts' around things they've seen in their training data. Solving this by bandaid-patching every such rut that comes up in the news is just not a viable long-term solution: the whole world is a minefield of niche problems that look kinda like other problems but have different results.
enragedcacti
It's still pretty trivial to trick them. 4o-mini, 2.5 Flash, and 2.5 Pro all still fall for variations of this:
> A boy is in a car crash and is taken to the hospital. The surgeon says, "I can't operate on this boy, I'm his father!" Who is the surgeon to the boy?
> The surgeon is the boy's mother.
gkbrk
2.5 Pro gets it right for me.
This is a bit of a trick on a classic riddle!
The surgeon is the boy's **father**.
The classic version of this riddle has the surgeon say "I can't operate on this boy, he's my son!" In an era when people assumed surgeons were male, the answer would be "the surgeon is his mother."
However, in your version, the surgeon explicitly states, "I'm his father!" So, the surgeon is his father.
1718627440
That seems interesting, because this question seems to be answerable through syntactic analysis alone, with no need to consider the semantics of the words.
Aachen
Counting the number of legs on a 3-legged animal is a puzzle?
Maybe for a toddler... though I expect even they will see that something is off, and be able to identify what, without considering it a tricky task, even if I don't know at what age you can count to 3
taeric
Ish. The catch is that we spend a ton of effort teaching these models to recognize specific things in pictures. Then we ask them not to do that task, but instead to count something in the picture, which we oddly don't spend a lot of time training the models to do.
It is a lot like the experiment where you ask people to say what color some text is printed in, with the trick that some of the text is the name of another color. It can be surprisingly hard for people who are good at reading.
vokhanhan25
I think LLMs can solve puzzles pretty well because the thinking ability of current models on text is quite good. Moreover, puzzles are not easy for a 7-year-old, unlike this benchmark.
scalalang
https://arxiv.org/pdf/2407.21771
In this research, they showed that the VLM can pay more attention to the image simply by changing the attention weights.
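(As I understand the general idea, and this is only a rough sketch of the concept rather than that paper's actual method: boost the attention mass the decoder puts on image tokens so the model leans more on what it sees and less on its language prior. The function below is a hypothetical illustration.)

    import torch

    def upweight_image_attention(attn_weights, image_token_mask, alpha=1.5):
        # attn_weights: post-softmax attention over keys, shape [..., seq_len]
        # image_token_mask: [seq_len] bool tensor, True where the key is an image token
        # alpha > 1 shifts probability mass toward image tokens; renormalize afterwards.
        boosted = attn_weights.clone()
        boosted[..., image_token_mask] *= alpha
        return boosted / boosted.sum(dim=-1, keepdim=True)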
gamerDude
Hypothetically, could this be fixed by changing the input method? For instance, I just quickly looked up how humans process imagery.
"the primary visual cortex, located at the back of the brain, receives the visual signals and processes basic visual features like edges, lines, and orientations."
So, potentially, if we did a pre-processing step to extract more features beforehand, we would see different results in the output.
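(Something like the sketch below is what I have in mind: run a cheap edge detector first and hand the model the edge map alongside the original image. Whether this actually changes VLM behavior is an open question; the filenames and prompt are just placeholders.)

    import cv2

    # Extract low-level "V1-style" features (edges) as a pre-processing step,
    # then prompt the VLM with both the original image and the edge map.
    img = cv2.imread("dog_with_5_legs.png", cv2.IMREAD_GRAYSCALE)  # hypothetical test image
    edges = cv2.Canny(img, threshold1=100, threshold2=200)
    cv2.imwrite("dog_with_5_legs_edges.png", edges)
    # e.g. prompt: "Using both images, count the legs you can actually see."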
nyrikki
You are in rarefied air: Walter Pitts believed this until the 1959 paper "What the Frog's Eye Tells the Frog's Brain" contributed to his decline.
Even in fly eyes, neuron dendritic compartmentalization and variable spike trains are incompatible with our current perceptron based models.
Remember that while the value of MLPs for useful work is unquestionable IMHO, be mindful of the map-territory relation. MLPs are inspired by, and in some cases useful for modeling, biological minds, but they aren't equivalent.
Be careful about confusing the map for the territory; it is just as likely to limit what opportunities you find as it is to lead you astray, IMHO.
miguel_martin
There are enough features fed into a VLM to solve the task.
The way to fix this is simpler: ensure counterfactuals are present in the training data; then the VLM will learn not to depend on its language priors/knowledge.
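(Concretely, that means mixing in records like the following during training or instruction tuning. This is a hypothetical record format and hypothetical filenames, just to illustrate the idea: edited images whose correct answers contradict the usual prior.)

    # Hypothetical counterfactual VQA records: the answer can't be guessed
    # from world knowledge alone, so the model has to actually look.
    counterfactual_records = [
        {"image": "dog_5_legs.png",       "question": "How many legs does this dog have?",  "answer": "5"},
        {"image": "adidas_4_stripes.png", "question": "How many stripes are in this logo?", "answer": "4"},
        {"image": "bird_3_legs.png",      "question": "How many legs does this bird have?", "answer": "3"},
    ]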
edude03
I feel vindicated! I'm building a tool with VLMs and I've noticed the answer is always what I expect to see, but wrong if the input is slightly different than expected.
Just like the article - if I have picture of a cup, it says cup, if I have a picture of a dog, it says dog, if it's a dog with a cup, it says a dog with a ball (noticed this with Qwen and InternVL).
> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.
This is what I've been saying for a while now, and I think it's not just visual models. LLMs/transformers make mistakes in different ways than humans do, and that is why they are not reliable (which is needed for real-world applications). The reported rate of progress has not been accounting for this... the improvements are in the resolution, fidelity, and overall realism of the output, but not in the overall correctness and logical handling of the prompts. Personally, I still cannot think of something, prompt for it, and get consistent results without a huge compromise on my initial idea.
E.g. I want a man walking with the left foot forward, and it renders a beautiful image of a man but completely ignores the left-foot-forward detail, refusing to do it no matter how I word the prompt. I have many examples like this. The only way I can use it is if I don't have specific prompts and just want generic images. The stock image industry is certainly over, but it is uncertain if it will deliver on the promise of generating anything you can imagine that can be put into words.