Gemini Diffusion
93 comments
May 22, 2025 · airstrike
ManuelKiessling
"...what is not in a codebase, and there is meaningful signal in that negative space."
Man, I've been writing software for money for decades now, but this fundamental truth never occurred to me, at least not consciously and with such clarity.
So, thank you!
airstrike
My pleasure ;-) I borrowed the term from art: https://www.michaelalfano.com/tag/negative-space/?id=400
8n4vidtmkvmk
That's not been my experience so far. LLMs are good at mimicking existing code; they don't usually bring in new things when not asked. Sometimes I have to go out of my way to point to other bits of code in the project to copy from, because they haven't ingested enough of the codebase.
That said, a negative prompt like we have in stable diffusion would still be very cool.
Incipient
I'm in the camp of 'no good for existing'. I try to get ~1000-line files refactored to use different libraries, design paradigms, etc., and it usually outputs garbage: pulling DB logic into the UI, grabbing unrelated API/function calls, or entirely corrupting the output.
I'm sure there is a way to correctly use this tool, so I'm feeling like I'm "just holding it wrong".
fragmede
Which LLM are you using? What LLM tool are you using? What's the tech stack that you're generating code for? Without sharing anything you can't, what prompts are you using?
manmal
They could read the whole git history and have all issue tracker tickets in the context, and maybe even recordings from meetings. It remains to be seen though if such a large context will yield usable results.
eMPee584
This. Git (/ tig!) blame and log -p --stat -S SEARCHSTR are extremely powerful for understanding the what, why, and when of code.
ec109685
If you make models fast enough, you can onboard that expert developer instantly and let them reason their way to a solution, especially when giving it access to RAG too.
Over time, models will add more memory and institutional-knowledge capture rather than starting from a blank slate each time.
airstrike
I thought of that as I wrote my comment, but I think the infrastructure and glue to make that possible in a consistent, fast and scalable way is still a few years out.
shreezus
Is anyone else totally blown away by this? I feel like it's easily the biggest announcement out of I/O, however it's been overshadowed by Veo 3 etc.
Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.
theptip
Can someone help with the intuition here? My understanding from vision transformers is you start with noise and use a series of hierarchical models to iteratively refine the noise into the target. Each layer is trained to produce images at an increasing resolution, and by layering them you skip the problem of sparse gradients at the beginning to get from “noise” to “noise that kinda looks like a face”.
How does this work for coding? It would require you to be able to hierarchically structure the emitted artifacts. Maybe this sort of works: low-granularity concepts like "use Django for this problem", then "I need these endpoints", then "emit the code". But AIUI diffusion doesn't have a mechanism for backtracking, so you can't feed back signals from the detailed layers to the "higher abstraction" layers at the top if you need to change an aspect of the design in response to a low-level problem.
Whereas with transformers, you go through the whole model for each token and can therefore deploy all your smarts and logic at each step of the problem (if needed), including backtracking on key design decisions.
I’m sure my mental model has some big gaps, would appreciate any insights.
yorwba
You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome, downscaling is just a performance optimization, because processing an image at full resolution is expensive.
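For intuition, here's a toy version of that kind of pooling (the shapes are arbitrary assumptions, not anything a particular model uses):

    import numpy as np

    # 16 token embeddings of dimension 512 (both numbers picked arbitrarily).
    token_embeddings = np.random.randn(16, 512)

    # "Downscale" by 4x: average every group of 4 adjacent token embeddings,
    # the text analogue of average-pooling pixels in an image.
    pooled = token_embeddings.reshape(4, 4, 512).mean(axis=1)
    print(pooled.shape)  # (4, 512)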
pertymcpert
I have the exact same questions as you. I can barely understand how diffusion works for images, for sequential data like text it makes no sense to me.
janalsncm
Let’s suppose we have 10k possible tokens in the vocabulary.
Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.
For each column, exactly 1 pixel is white (corresponding to the word which is there) and the rest are black.
Then the diffusion process is the same. Repeatedly denoising.
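A minimal numpy sketch of that picture (the tiny vocabulary, sequence, and Gaussian noising here are toy assumptions; real text-diffusion models typically work on embeddings or discrete masks rather than literal one-hot pixels):

    import numpy as np

    VOCAB_SIZE = 10_000   # "image height": one row per token id
    SEQ_LEN = 8           # "image width": one column per position

    # A toy sequence of token ids (stand-ins for real tokenizer output).
    token_ids = np.array([17, 4242, 9, 100, 7, 9999, 3, 512])

    # One-hot "image": exactly one white pixel (1.0) per column.
    x0 = np.zeros((VOCAB_SIZE, SEQ_LEN), dtype=np.float32)
    x0[token_ids, np.arange(SEQ_LEN)] = 1.0

    # Forward (noising) process: blend the clean image toward Gaussian noise.
    t = 0.5  # noise level in [0, 1]
    noise = np.random.randn(VOCAB_SIZE, SEQ_LEN).astype(np.float32)
    xt = np.sqrt(1 - t) * x0 + np.sqrt(t) * noise

    # A trained denoiser would predict x0 back from (xt, t); decoding is then
    # just an argmax over each column to recover token ids.
    decoded = xt.argmax(axis=0)
    print(decoded)  # a noisy guess at the original token_ids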
NitpickLawyer
> Diffusion models for code generation are a big deal.
This is my intuition as well, as there are a lot of low-hanging fruits that a model like this could tackle in coding:
- you should be able to have a workflow where you constrain the generation with a function definition and its output, and "generate" the tokens in between. Kind of like constrained generation, but with the model able to attend to tokens both ways (a rough sketch of this idea follows after this list).
- you should also be able to use a 2-step workflow: first write a high-level description of the function layout (think "write the chapters for an article on x" from LLMs), then ping-pong between the actual implementations ("and now write chapter x"), using larger and larger context and using proxies like linters, code compilation, AST-derived info, etc. as signals of "completion". Lots of things to be tried here indeed.
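A rough, runnable sketch of the first idea, filling in tokens between a fixed prefix and suffix (this is a generic MaskGIT-style unmasking loop written for illustration, not Gemini Diffusion's actual algorithm; score_fn and the dummy version of it below are made-up placeholders for a real bidirectional denoiser):

    import random

    MASK = "<mask>"

    def infill(prefix, suffix, n_slots, score_fn, steps=4):
        # Start with every in-between slot masked.
        middle = [MASK] * n_slots
        for _ in range(steps):
            seq = prefix + middle + suffix
            # The model sees the whole sequence (attending both ways) and
            # proposes a (token, confidence) pair for each still-masked slot.
            proposals = {
                i: score_fn(seq, len(prefix) + i)
                for i, tok in enumerate(middle) if tok == MASK
            }
            if not proposals:
                break
            # Commit the most confident half of the proposals each step.
            k = max(1, len(proposals) // 2)
            best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:k]
            for i, (tok, _conf) in best:
                middle[i] = tok
        return prefix + middle + suffix

    # Dummy stand-in so the sketch runs: random token, random confidence.
    def dummy_score_fn(seq, pos):
        return random.choice(["x", "+", "1", "return"]), random.random()

    print(infill(["def", "inc", "(", "x", ")", ":"], ["\n"], 6, dummy_score_fn))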
bredren
> however it’s been overshadowed by Veo 3 etc.
Because it’s simple to understand the power and difference in capability of Veo 3.
Understanding important steps forward in text completion requires understanding the value of what we have already and potential implications. Many people are not yet convinced LLMs are valuable for coding at all.
spiderfarmer
Not really, but only because I saw it demoed before: https://www.inceptionlabs.ai
TeMPOraL
Right. It's not novel, but it's great to see this getting fully mainstream.
seydor
Just the idea of generating text by removing noise is so profound. Maybe each step is a level of hierarchy. Linguists must be astonished at the things happening these past years. I have to read more about it
heliophobicdude
I think the lede is being buried. This is a great and fast InstructGPT. This is absolutely going to be used in spell checks, codemods, and code editors.
The instant edits feature can surgically perform text edits fast, without all the extra fluff or unsolicited enhancements.
I copied shadertoys, asked it to rename all variables to be more descriptive and pasted the result to see it still working. I'm impressed.
KingMob
Spell check? Isn't that a well-solved problem at this point?
efitz
No. Spell check frequently still gets things wrong if the word is spelled correctly and the sentence is grammatically correct but the wrong word was used.
wenc
Can you give me an example? Spell check only checks whether a word is in the dictionary. It doesn't check grammar or context.
fragmede
Its knot.
dleeftink
Solved how? Language is always evolving
never_inline
Google Docs spellcheck has been really good for a few years, even before LLMs.
8n4vidtmkvmk
How does grammarly exist then? Must be some secret sauce in there.
hiimshort
I have been wondering about the use of diffusion techniques for text generation, it is nice to see Google release a model that, seemingly, validates some thoughts I had.
Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700 XT, and I am not able to upgrade from that yet. The limitation, though, has at least given me some significant insights into the shortcomings of current models.
Model sizes have gotten quite large and coherence seems to mostly have scaled with the density of a model, leaving the smaller models useful for only smaller tasks. Context size is also extremely important from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow for a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even if limited by context. From my viewpoint it seems it will. Mixed tool call + response outputs also have the potential to be better.
Speed is also another problem I, and everyone else, has had with modern LLMs. The nature of cycling around the input with a new additional output each time is time consuming. On an older GPU with no AI-specific hardware it is an eternity! Being able to at least track 0-100% progress state would be an improvement from the current solution. At the moment one must simply wait for the LLM to decide to stop (or hit the max number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform slightly better.
This does raise several questions. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or is it possible to have variable-length responses?
mountainriver
Diffusion is more than just speed. Early benchmarks show it to be better at reasoning and planning, pound for pound, compared to AR.
This is because it can edit and doesn’t suffer from early token bias.
hansvm
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
mdp2021
> AR in general is critical for learning the right distribution
Could you please clarify that?
hansvm
Assuming your goal is mimicking the training data, you need some mechanism for drawing from the same distribution. AR happens to provide that -- it's a particular factorization of conditional probabilities which yields the same distribution you started with, and it's one you're able to replicate in your training data.
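Spelled out, the factorization being referred to is just the chain rule over token positions, which an autoregressive model matches term by term:

    p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})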
AR is not the only possible solution, but many other techniques floating around do not have that property of actually learning the right thing. Moreover, since the proposed limitation (not being able to think a long time about your response before continuing) is a byproduct of current architectures rather than a fundamental flaw with AR, it's not as obvious as it might seem that you'd want to axe the technique.
martincsweiss
This is a super interesting claim - can you point to these benchmarks?
mdp2021
Try this one:
# d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
mountainriver
mdp2021
I.e.: https://arxiv.org/html/2410.14157v3
# Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
cubefox
https://deepmind.google/models/gemini-diffusion/#benchmarks
> Gemini Diffusion’s external benchmark performance is comparable to much larger models, whilst also being faster.
That doesn't necessarily mean that they scale as well as autoregressive models.
jimmyl02
I think there is no way to tell, and we can only see with more research and time. One nuanced part that might not be clear is that the transformer was a huge part of what made traditional LLMs scale.
With the diffusion transformer and newer architectures, it might be possible that transformers can now be applied to diffusion. Diffusion also has the benefit of being able to "think" with the number of diffusion steps instead of having to output tokens and then reason about them.
I think it's hard to tell exactly where we are headed but it's an interesting research direction especially now that it's somewhat more validated by Google.
vessenes
A claim I believe (or want to), but can you point to any papers about this? I haven't seen any papers at all, or demos showing a revise step for diffusion text. I'd really like to use one though.
findingMeaning
I have access to it and my god it is fast. One bad thing about this model is that it is easily susceptible to prompt injection. I asked for the recipe for a drug; it refused, then I asked it to roleplay as a child and it gave real results.
Other than that, I can see myself using this model. With that speed plus an agentic approach, this model can really shine.
odie5533
I'm sure these prompt injections aren't a sign of our ability to control smarter models.
renjimen
The speed this can build at makes me think software is soon to become a lot more fluid than our traditional iterative approach allows. Apps could ship minimal and build whatever else they need at the user's behest.
vFunct
The challenge for LLMs over the next year is to get them to operate on large data sets/code bases with millions/billions of tokens through some kind of distributed hierarchical framework, with each LLM operating on a local set of 20k or whatever subset of tokens.
nodja
This is insanely fast. My guess is that the tradeoff here is that the GPUs will always be working at max capacity and there will be minimal compute savings from batching, which I realize now is not really a tradeoff.
My only worry is that the diffusion objective will be worse than AR in terms of model capabilities; if that's the case, hopefully multi-token AR models will perform as well as diffusion, or we can use this as a draft model for speculative decoding.
mdp2021
Why do you suspect dLLMs should not match (or surpass) arLLMs in quality? The general idea is that it is easier to treat the output as a structured whole (idea, points, concepts, words, in a tree) that is iteratively refined; that should go in the direction of "proper" quality.
pama
Another intuition is simply that whenever the causal relationships in the training data are sequential, you have a lower probability of getting the correct token at a given position, because you have less of the causal information leading up to that position than you would with AR; thus during training you almost always have a worse model, with near certainty (think of the words in a function of source code, even if some of the functions are unsorted and thus form a tree at the high level). Imagine you somehow already have N tokens in a sequence: is it easier to predict token N+1 or token N+15? I do like the performance tradeoff for some use cases though, and I hope we see more models soon. For image tokens my argument does not hold, because causality is not as clear as for text, math, code, or time series.
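A toy Markov-chain illustration of that intuition (the 3-state "language" below is invented purely for illustration; real text is not Markov, but the direction of the effect is the same: uncertainty about a position grows the more of the intervening causal chain you are missing):

    import numpy as np

    # Toy 3-token "language"; rows are P(next token | current token).
    P = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    # Given the current token is 0, how uncertain are we about the token
    # k positions ahead? The N+15 distribution is much closer to uniform
    # (maximum uncertainty) than the N+1 distribution.
    current = np.array([1.0, 0.0, 0.0])
    for k in (1, 15):
        dist = current @ np.linalg.matrix_power(P, k)
        print(f"entropy of token N+{k}: {entropy(dist):.3f} bits")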
nodja
My intuition is that the harder it is for an LLM to do something during training, the more actual compression/learning will be encoded in its weights. With multi-token/diffusion it becomes much easier to "reward/loss hack" your way through; this won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.
manmal
This tradeoff will be great for self hosted LLMs, because they don’t need large scale batching usually, and less great for cloud providers that do.
huevosabio
I am so excited about diffusion language models. They may be the piece we need to make our voice-to-code game mechanic as smooth as we envision it.
Cerebras and Groq are amazing, but the fact that they use custom hardware really limits the ability to finetune or scale. The other route would be an MoE that has barely 0.5b parameters active, but that would be a major undertaking that we can't prioritize at the moment.
--- If anyone at Google/Deepmind reads this, please give us API access.
We are building generative sandbox games. First title is a monster trainer where you get to actually command your creature in realtime, here is an early prototype: https://youtu.be/BOwpLyj2Yqw
GistNoesis
Fast, you gotta go fast : Let me draw the roadmap of this line of thinking.
- Let's start with the traditional autoregressive LLM, where one token is generated at a time. It's a fundamentally sequential process which maps well to the sequential nature of writing as it goes.
- Then, to make the generation go faster, you try to generate multiple tokens in one pass, parallelizing more of the sequential process with things like "look-ahead decoding".
- (<-- We are here) Then you realize that if your model isn't writing as it goes but rather forming an idea and pushing it all out at once, you can instead use a diffusion model to generate the whole response, allowing it a number of diffusion-step edits to make all the errors that occurred during generation disappear. Conceptually, if the number of diffusion steps equals the length of the token sequence to generate, the diffusion process could generate tokens one at a time just like an autoregressive LLM does. Usually 100 diffusion steps is a good starting point.
- Now the goal is to reduce the number of diffusion steps to reduce computation cost. The diffusion literature is already rich here, and in the image/video domain it was shown that you can reduce the number of diffusion steps to one (albeit with a quality reduction) or two, with techniques like "consistency models".
- Now that you only have a single diffusion step, you realize that you need to get speed-ups elsewhere. You explore the literature and realize that you can apply the trick you have already applied once, one more time: compressing a few tokens into one, like you compressed multiple characters into one token. This reduces the length of the token sequence you need to generate by a factor of 4, at the price of an additional decoding step. That decoding step can be either some form of "latent" encoding or some form of "hierarchical" encoding. So now you are consistency-diffusing sentence vectors, which are then decoded into token sequences. Each step being smaller, and the transformer being quadratic, the total speed-up is roughly a factor of 10. But applying this trick multiple times gets you diminishing returns, which you can partially compensate for by increasing memory use (using a bigger "vocabulary" dictionary size).
- To make it go faster you now have to dig into the internals of the transformer itself. You suddenly realize it is just a residual network applied "number of layers" times. Being a residual network, the goal of this sequence of internal steps is to refine the input into the output progressively. But you realize that this is the same thing that lets you go from "number of diffusion steps" to a single diffusion step. So you can compress your stack of layers into a single (bigger, to keep capacity) layer, and let the diffusion correct the mistakes.
- Now that you have a single-layer transformer consistency model generating sentence vectors, you realize that transformers use multiple heads to explore the space more efficiently, but once training is done you can get by with a single head, gaining another 10x reduction in computation along the way.
- Taking a step up, you realize that your transformer is now just doing a nearest-neighbor search and mixing the outputs, but in a brute-force fashion. So you replace it with an approximate nearest-neighbor search like an HNSW vector database, decoupling computation from capacity and allowing you to scale up by trading space for time.
- But because Hierarchical Navigable Small Worlds are just graphs under the hood, you realize that you have reinvented the Good Old-Fashioned AI graph-database ontology, but in an emergent fashion, with the graph implicitly defined by a vector distance in a semantic space constructed to make it easy to generate text once decoded appropriately.
- So now you only need to make your database explainable by mapping it onto human-understandable labels, and you reach the grail: SQL.
djmips
Is this a Shaggy Dog Story?
sagarpatil
Why are you obsessed with Pelicans? What’s your story?
simonw
I'm from the UK originally. On one of my first trips to California I was up on the cliffs in Marin County and a squadron flew by and I was amazed by them - and the Californians I was with were like "yeah, you see them all the time".
Now I live in California and I still can't believe I get to see them here. They're absurd - they don't look like they should be able to fly at all. They're also incredibly pretty, especially in their breeding plumage.
I live in Half Moon Bay, just south of San Francisco, which turns out to be home to the second largest mega-roost of the California Brown Pelican (my favourite kind of pelican) in the world.
We've even rescued two of them (injured birds, we got them in a carrier and took them to the animal rescue place).
They make for a fun theme for all sorts of different AI experiments.
They're also very photogenic - I had a bunch of photos I've taken on my PyCon poster recently (you have to zoom in quite a bit to see them though): https://static.simonwillison.net/static/2025/poster-full-siz...
ggm
Visit Lake Eyre. In flood, it's home to a flock of thousands. I'm going in August.
simonw
In Australia? I just checked Google Image search and WOW. https://www.google.com/search?q=Lake+Eyre+pelicans&udm=2
turbonaut
> I'm from the UK originally.
No need to go as far as California for pelicans!
https://www.royalparks.org.uk/visit/parks/st-jamess-park/pel...
pama
Nice image of your poster!
airstrike
That's...ridiculously fast.
I still feel like the best uses of models we've seen to date is for brand new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.