
Visual Reasoning Is Coming Soon

50 comments · April 9, 2025

AIPedant

This seems to ignore the mixed record of video generation models:

  For visual reasoning practice, we can do supervised fine-tuning on sequences similar to the marble example above. For instance, to understand more about the physical world, we can show the model sequential pictures of Slinkys going down stairs, or basketball players shooting 3-pointers, or people hammering birdhouses together.... 

  But where will we get all this training data? For spatial and physical reasoning tasks, we can leverage computer graphics to generate synthetic data. This approach is particularly valuable because simulations provide a controlled environment where we can create scenarios with known outcomes, making it easy to verify the model's predictions. But we'll also need real-world examples. Fortunately, there's an abundance of video content online that we can tap into. While initial datasets might require human annotation, soon models themselves will be able to process videos and their transcripts to extract training examples automatically. 
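To make that proposal concrete: the kind of simulator-generated training pair the article seems to have in mind would be something like this sketch (plain NumPy kinematics; the frame/label format is my guess, not anything the article specifies):

    import numpy as np

    def make_falling_ball_example(n_frames=10, dt=0.1, g=9.81, seed=0):
        """Toy synthetic 'video': 2D positions of a ball in free fall, plus a
        label with a known outcome that a model's prediction can be checked against."""
        rng = np.random.default_rng(seed)
        x, y = 0.0, rng.uniform(2.0, 8.0)        # initial position (metres)
        vx, vy = rng.uniform(0.5, 2.0), 0.0      # initial velocity (m/s)

        frames = []
        for _ in range(n_frames):
            frames.append((round(x, 3), round(y, 3)))
            vy -= g * dt
            x, y = x + vx * dt, max(y + vy * dt, 0.0)   # clamp at the ground

        label = "ball on the ground" if y == 0.0 else "ball still in the air"
        return frames, label

    frames, label = make_falling_ball_example()
    print(frames[-1], label)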
Almost every video generator makes constant "folk physics" errors and doesn't understand object permanence. DeepMind's Veo 2 is very impressive but still struggles with object permanence and produces qualitatively nonsensical physics: https://x.com/Norod78/status/1894438169061269750

Humans do not learn these things by pure observation (newborns understand object permanence; I suspect this is the case for all vertebrates). I doubt transformers are capable of learning it as robustly, even if trained on all of YouTube. There will always be "out of distribution" physical nonsense involving mistakes humans (or lizards) would never make, even if they've never seen the specific objects.

throwanem

> newborns understand object permanence

Is that why the peekaboo game is funny for babies? The violated expectation at the soul of the comedy?

broof

Yeah, I had thought that newborns famously didn't understand object permanence and that it developed sometime during their first year. And that was why peekaboo is fun: you're essentially popping in and out of existence.

AIPedant

This is a case where early 20th century psychology is wrong, yet still propagates as false folk knowledge:

https://en.wikipedia.org/wiki/Object_permanence#Contradictin...

andoando

babies pretty much laugh if you're laughing and being silly

moi2388

No, they understand object permanence just fine.

Peekaboo is fun because fun is fun. When doing peekaboo, the other person is paying attention to you, often smiling and relaxed.

They laugh just as much if you play ‘peekaboo’ without actually covering your face ;)

dinfinity

You provide no actual arguments as to why LLMs are fundamentally unable to learn this. Your doubt is as valuable as my confidence.

viccis

Because the nature of their operation (learning a probability distribution over a corpus of observed data) is not the same as creating synthetic a priori knowledge (object permanence is a case of cause and effect, which is synthetic a priori knowledge). All LLM knowledge is by definition a posteriori.

AstralStorm

That LLMs cannot synthesize it into a priori knowledge, including other rules of logic and mathematics, is a major failure of the technology...

AIPedant

Well, it's a good thing I didn't say "fundamentally unable to learn this"!

I said that learning visual reasoning from video is probably not enough: if you claim it is enough, you have to reconcile that with failures in Sora, Veo 2, etc. Veo 2's problems are especially serious since it was trained on an all-DeepMind-can-eat diet of YouTube videos. It seems like they need a stronger algorithm, not more Red Dead Redemption 2 footage.

dinfinity

> I said that learning visual reasoning from video is probably not enough

Fair enough; you did indeed say that.

> if you claim it is enough, you have to reconcile that with failures in Sora, Veo 2, etc.

This is flawed reasoning, though. The current state of video generating AI and the completeness of the training set does not reliably prove that the network used to perform the generation is incapable of physical modeling and/or object permanence. Those things are ultimately (the modeling of) relations between past and present tokens, so the transformer architecture does fit.

It might just be a matter of compute/network size (modeling four dimensional physical relations in high resolution is pretty hard, yo). If you look at the scaling results from the early Sora blogs, the natural increase of physical accuracy with more compute is visible: https://openai.com/index/video-generation-models-as-world-si...

It also might be a matter of fine-tuning training on (and optimizing for) four dimensional/physical accuracy rather than on "does this generated frame look like the actual frame?"
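As a toy illustration of that last point (entirely my own sketch, nothing from the Sora/Veo teams): a per-frame reconstruction loss barely notices jitter that an explicit temporal-consistency term punishes hard.

    import numpy as np

    def per_frame_loss(pred, target):
        """The 'does each generated frame look like the actual frame?' objective."""
        return float(np.mean((pred - target) ** 2))

    def trajectory_loss(pred, dt=0.1):
        """Toy physical-consistency term: penalize non-constant acceleration
        (jerk) of a tracked point across consecutive predicted frames."""
        vel = np.diff(pred) / dt
        acc = np.diff(vel) / dt
        jerk = np.diff(acc) / dt
        return float(np.mean(jerk ** 2))

    # 1D positions of one tracked object over 6 frames.
    target = np.array([0.0, 0.1, 0.4, 0.9, 1.6, 2.5])            # constant acceleration
    jittered = target + np.array([0, 0.05, -0.05, 0.05, -0.05, 0])

    print(per_frame_loss(jittered, target))   # small: each frame looks almost right
    print(trajectory_loss(jittered))          # large: the motion itself is jerky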

nonameiguess

"All of YouTube" brings the same problem as training on all of the text on the Internet. Much of that text is not factual, which is why RLHF and various other fine-tuning efforts need to happen in addition to just reading all the text on the Internet. All videos on YouTube are not unedited footage of the real world faithfully reproducing the same physics you'd get by watching the real world instead of YouTube.

As for object permanence, I don't know jack about animal cognitive development, but it seems important that all animals are themselves also objects. Whether or not they can see at all, they can feel their bodies and sense in some way or other its relation to the larger world of other objects. They know they don't blink in and out of existence or teleport, which seems like it would create a strong bias toward believing nothing else can do that, either. The same holds true with physics. As physical objects existing in the physical world, we are ourselves subject to physics and learn a model that is largely correct within the realm of energy densities and speeds we can directly experience. If we had to learn physics entirely from watching videos, I'm afraid Roadrunner cartoons and Fast and the Furious movies would muddy the waters a bit.

nkingsy

The example of the cat and the detective hat shows that even with the latest update, it isn't "editing" the image. The generated cat is younger, with bigger, brighter eyes and more "perfect" ears.

I found that when editing images of myself, the result looked weird, like a funky version of me. For the cat, it looks "more attractive" I guess, but for humans (and I'd imagine for a cat looking at the edited cat with a keen eye for cat faces), the features often don't work together when changed slightly.

porphyra

ChatGPT 4o's advanced image generation seems to have a low-resolution autoregressive part that generates tokens directly, and an upscaling/decoding step that turns the (perhaps 100 px wide) token image into the actual 1024 px wide final result. The former step is able to almost nail things perfectly, but the latter step will always change things slightly. That's why it is so good at, say, generating large text but still struggles with fine text, and why it will always introduce subtle variations when you ask it to edit an existing image.
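A loose analogy for that two-stage story (purely illustrative; the real decoder is a learned model, not a resize): draw coarse and fine detail at ~100 px, upscale to 1024 px, and watch the coarse structure survive while the fine detail smears.

    from PIL import Image, ImageDraw  # assumes a recent Pillow (>= 9.1)

    # Stand-in for the low-res autoregressive stage: a ~100 px "token image"
    # with one coarse shape and some 1-pixel fine detail.
    small = Image.new("L", (100, 100), 255)
    draw = ImageDraw.Draw(small)
    draw.rectangle([10, 10, 60, 40], fill=0)         # coarse structure
    for x in range(70, 95, 2):                       # fine 1 px lines
        draw.line([(x, 60), (x, 90)], fill=0)

    # Stand-in for the decoding/upscaling stage.
    large = small.resize((1024, 1024), Image.Resampling.BICUBIC)
    large.save("upscaled.png")  # the rectangle survives; the thin lines go mushy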

BriggyDwiggs42

Has anyone tried adding a model that selects the editing region before the generation step? Training data would probably be hard to come by, but maybe existing image recognition tech that draws bounding boxes would be a start.
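Something like the sketch below is what I mean (detect_box and inpaint_region are hypothetical stand-ins for a detector and a mask-conditioned generator, not real APIs):

    import numpy as np

    def edit_with_region_selection(image, instruction, detect_box, inpaint_region):
        """Hypothetical pipeline: pick the edit region first, then constrain the
        generative edit to that region so everything else stays byte-identical.

        detect_box(image, instruction) -> (x0, y0, x1, y1)
        inpaint_region(image, mask, instruction) -> np.ndarray (same shape as image)
        """
        x0, y0, x1, y1 = detect_box(image, instruction)

        mask = np.zeros(image.shape[:2], dtype=bool)
        mask[y0:y1, x0:x1] = True

        edited = inpaint_region(image, mask, instruction)

        # Hard guarantee: keep original pixels everywhere outside the mask.
        out = image.copy()
        out[mask] = edited[mask]
        return out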

ilynd

Genuine question - how would such a model "edit" the image, besides manipulating the binary? I.e. changing pixel values programmatically

Tiberium

It's sad that they used 4o's image generation feature for the cat example, which does some diffusion or something else that results in the whole image changing. They should've instead used Gemini 2.0 Flash's image generation feature (or at least mentioned it!), which, even if far lower quality and resolution (max of 1024x1024, though Gemini will try to match the resolution of the original image, so you can get something like 681x1024), is much, much better at leaving the untouched parts of the image actually "untouched".

Here's the best out of a few attempts with a really similar but more detailed prompt (since Flash is a much smaller model): "Give the cat a detective hat and a monocle over his right eye, properly integrate them into the photo." You can see how the rest of the image is practically untouched to the naked human eye: https://ibb.co/zVgDbqV3

Honestly, Google has been really good at catching up in the LLM race, and their modern models like 2.0 Flash and 2.5 Pro are among the best (or the best) in their respective areas. I hope that they'll scale up their image generation feature to base it on 2.5 Pro (or maybe 3 Pro by the time they do it) for higher quality and prompt adherence.

If you want, you can give 2.0 Flash image gen a try for free (with generous limits) on https://aistudio.google.com/prompts/new_chat, just select it in the model selector on the right.
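If you'd rather script it than use AI Studio, something along these lines should work with the google-genai Python SDK (the model name and exact config fields are from memory and may have changed, so double-check the current docs):

    # pip install google-genai pillow
    from google import genai
    from google.genai import types
    from PIL import Image

    client = genai.Client(api_key="YOUR_API_KEY")
    cat = Image.open("cat.jpg")

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp-image-generation",  # assumed model id
        contents=["Give the cat a detective hat and a monocle over his right eye, "
                  "properly integrate them into the photo.", cat],
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    # The edited image comes back as inline bytes alongside any text parts.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("cat_detective.png", "wb") as f:
                f.write(part.inline_data.data)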

blixt

I'm not sure I see the behavior in the Gemini 2.0 Flash model's image output as a strength. It seems to me it has multiple output modes, one indeed being masked edits. But it also seems to have convolutional matrix edits (e.g. "make this image grayscale" looks practically like it's applying a Photoshop filter) and true latent space edits ("show me this scene 1 minute later" or "move the camera so it is above this scene, pointing down"). And it almost seems to me these are actually distinct modes, which seems like it's been a bit too hand engineered.

On the other hand, OpenAI's model, while it does seem to have some upscaling magic happening (which makes the outputs look a lot nicer than the ones from Gemini, FWIW), also seems to perform all its edits entirely in latent space (hence it's easy to see things degrade at a conceptual level, such as texture, rotation, position, etc.). But this is a sign that its latent space mode is solid enough to always use, while with Gemini 2.0 Flash I get the feeling that when it is used, it's just not performing as well.
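To be concrete about the "Photoshop filter" mode mentioned above: a grayscale conversion really is just a fixed per-pixel matrix, the kind of edit that needs no generative pass at all (toy sketch):

    import numpy as np

    def to_grayscale(rgb):
        """Fixed per-pixel linear map (BT.601 luma weights): a filter-style edit,
        as opposed to regenerating the image through a latent space."""
        weights = np.array([0.299, 0.587, 0.114])
        gray = rgb @ weights                              # H x W x 3 -> H x W
        return np.repeat(gray[..., None], 3, axis=-1).astype(rgb.dtype)

    img = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
    print(to_grayscale(img).shape)  # (4, 4, 3): same layout, colour removed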

esperent

> You can see how the rest of the image is practically untouched to the naked human eye: https://ibb.co/zVgDbqV3

The cat's facial hair coloring is entirely different in your image, far more so than the OpenAI one. It has a half white nose instead of a half black nose, the ears are black instead of pink, the cheeks are solid colors instead of striped. Yours is the head of an entirely different cat grafted onto the original body.

The OpenAI one is like the original cat run through a beauty filter. The eyes are completely different. But its facial hair patterning is matched much better.

Neither one is great, though. They both output cats that are not recognizable as the original.

casey2

For me, Flash is much worse. Both are incapable of generating any kind of flowchart similar to the one I give them, but 4o does slightly better in that I can at least read its output. Neither makes sense.

uaas

> Rather watch than read? Hey, I get it - sometimes you just want to kick back and watch! Check out this quick video where I walk through everything in this post

Hm, no, I’ve never had this thought.

roguecoder

Pivot To Video will never die.

esperent

Me neither. But some people are dyslexic or non-native speakers (it's common to have listening comprehension better than reading comprehension when learning a language) and have to put more effort into reading than I do, and some people genuinely just prefer video.

I think it's decent of the author to provide a video for these people.

rel_ic

The inconsistency of an optimistic blog post ending with a picture of a Terminator robot makes me think this author isn't taking themself seriously enough. Or - the author is the Terminator robot?


porphyra

I think that one reason humans are so good at understanding images is that our eyes see video rather than still images. Video lets us see cause and effect, because we observe what happens after something occurs. It also lets us grasp the 3D structure of things, since we almost always see everything from multiple angles. So long as we just feed a big bunch of stills into training these models, they will struggle to understand how things affect one another.

throwanem

I have some bad news for you about how every digital video you've ever seen in your life is encoded.

District5524

The first caption of the cat picture may be a bit misleading for those who aren't sure how this works: "The best a traditional LLM can do when asked to give it a detective hat and monocle." The role of the traditional LLM in creating a picture is quite minimal (if an LLM is used at all); it might just tweak the prompt a bit for the diffusion model. It was definitely not the LLM that created the picture: https://platform.openai.com/docs/guides/image-generation. 4o image generation is surely a bit different, but I don't have more precise technical information about it (there must indeed be a specialized transformer model linking tokens to pixels: https://openai.com/index/introducing-4o-image-generation/).

CSMastermind

What's interesting to me is how many of these advancements are just obvious next steps for these tools. Chain of thought, tree of thought, mixture of experts etc. are things you'd come up with in the first 10 minutes of thinking about improving LLMs.

Of course the devil's always in the details and there have been real non-obvious advancements at the same time.

AstralStorm

The problem always was that the improvements worsened other cases.

Mixture of experts tends to lock models up. Chain of thought tends to turn models loopy, as in circling the drain even when asked not to. Tree of thought has a tendency to make models unstable, flipping between branches and otherwise becoming unpredictable, which complicates training...

anxoo

"I set a plate on a table, and glass next to it. I set a marble on the plate. Then I pick up the marble, drop it in the glass. Then I turn the glass upside down and set it on the plate. Then, I pick up the glass and put it in the microwave. Where is the marble?"

the author claims that visual reasoning will help the model solve this problem, noting that gpt-4o got the question right after making a mistake in the beginning of the response. i asked gpt-4o, claude 3.7, and gemini 2.5 pro experimental, who all answered 100% correctly.

the author also demonstrates trying to do "visual reasoning" with gpt-4o, notes that the model got it wrong, then handwaves it away by saying the model wasn't trained for visual reasoning.

"visual reasoning" is a tweet-worthy thought that the author completely fails to justify

KTibow

I've seen some speculate that o3 is already using visual reasoning and that's what made it a breakthrough model.
