Why I find diffusion models interesting?

95 comments

·March 6, 2025

mountainriver

The most interesting thing about diffusion LMs that tends to be missed, are their ability to edit early tokens.

We know that the early tokens in an autoregressive sequence disproportionately bias the outcome. I would go as far as to say this is some of the magic of reasoning models is they generate so much text they can kinda get around this.

However, diffusion seems like a much better way to solve this problem.

ithkuil

Yeah reasoning models are "self-doubt" models.

The model is trained to encourage re-evaluating the soundness of tokens produced during the "thinking phase".

The model state vector is kept in a state of open exploration. Influenced by the already emitted tokens but less strongly so.

The non-reasoning models were just trained with the goal of producing useful output on a first try and they did their best to maximize that fitness function.

kgeist

But how can test-time compute be implemented for diffusion models if they already operate on the whole text at once? Say it gets stuck—how does it proceed further? Autoregressive reasoning models would simply backtrack and try other approaches. It feels like denoising the whole text further wouldn't lead to good results, but I may be wrong.

spwa4

Diffusion LLMs are still residual networks. You can Google that, but it means that they don't generate the whole text at once. Every layer generates corrections to be made to the whole text at once.

Think of it like writing a text by forcing your teacher to write for you by entering in the assignment 100 times. You begin by generating completely inaccurate text, almost random, that leans perhaps a little bit towards the answer. Then you systematically begin to correct small parts of the text. The teacher that sees the text, and uses red the red pen to correct a bunch of things. Then the corrected text is copied onto a fresh page, and resubmitted to the teacher. And again. And again. And again. And again. 50 times. 100 times. That's how diffusion models work.

Technically, it adds your corrections to the text, but that's mathematical addition, not adding at the end. Also technically every layer is a teacher that's slightly different from the previous one. And and and ... but this is the basic principle. The big advantage is that this makes neural networks slowly lean towards the answer. First they decide to have 3 sections, one about X, Y and one about Z, then they decide on what sentences to put, then they start thinking about individual words, then they start worrying about things like grammar, and finally about spelling and pronouns and ...

So to answer your question: diffusion networks can at any time decide to send out a correction that effectively erases the text (in several ways). So they can always start over by just correcting everything all at once back to randomness.

kgeist

Yeah, but with autoregressive models, the state grows, whereas with diffusion models, it remains fixed. As a result, a diffusion model can't access its past thoughts (e.g., thoughts that rejected certain dead ends) and may start oscillating between the same subpar results if you keep denoising multiple times.

eru

Perhaps do a couple of independent runs, and then combine them afterwards?

vinkelhake

I don't get where the author is coming from with the idea that a diffusion based LLM would hallucinate less.

> dLLMs can generate certain important portions first, validate it, and then continue the rest of the generation.

If you pause the animation in the linked tweet (not the one on the page), you can see that the intermediate versions are full of, well, baloney.

(and anyone who has messed around with diffusion based image generation knows the models are perfectly happy to hallucinate).

gdiamos

Bidirectional seq2seq models are usually more accurate than unidirectional models.

However, autoregressive models that generate one token at a time are usually more accurate than parallel models that generate multiple tokens at a time.

In diffusion LLMs, both of these two effects interact. You can trade them off by determining how many tokens are generated at a time, and how many future tokens are used to predict the next set of tokens.

Legend2440

Hallucination is probably a feature of statistical prediction as a whole, not any particular architecture of neural network.

markisus

Regarding faulty intermediate versions, I think that’s the point. The diffusion process can correct wrong tokens with the global state implies it.

evrydayhustling

I think the discussion here is confusing the algorithm for the output. It's true that diffusion can rewrite tokens during generation, but it is doing so for consistency with the evolving output -- not "accuracy". I'm unaware of any research which shows that the final product, when iteration stops, is less likely to contain hallucinations than with autoregression.

With that said, I'm still excited about diffusion -- if it offers different cost points, and different interaction modes with generated text, it will be useful.

whoami_nr

The Llada paper: https://ml-gsai.github.io/LLaDA-demo/ here implied strong bidirectional reasoning capabilities and improved performance on reversal tasks (where the model needs to reason backwards).

I made a logical leap from there.

mitthrowaway2

I'm not sure about hallucination about facts, but it might be less prone to logically inconsistent statements of the form "the sky is red because[...] and that's why the sky is blue".

kelseyfrog

I'm personally happy to see effort in this space simply because I think it's an interesting set of tradeoffs (compute ∝ accuracy) - a departure from the fixed next token compute budget required now.

It brings up interesting questions, like what's the equivalency between smaller diffusion models which consume more compute because they have a greater number of diffusion steps compared to larger traditional LLMs which essentially have a single step. How effective is decoupling the context window size to the diffusion window size? Is there an optimum ratio?

machiaweliczny

I actually think that diffusion LLMs will be best for code generation

genewitch

Was it on Hacker News a few days ago? There was a diffusion language model that was actually running, but I think it was a paid service. I don't know if anybody mentioned that there was a open source one or one that you could run locally.

prometheus76

Why did the person who posted this change the headline of the article ("Diffusion models are interesting") into a nonsensical question?

whoami_nr

Author here. I just messed up while posting.

amclennon

Considering that the article links back to this post, the simplest explanation might be that the author changed the title at some point. If this were a larger publication, I would have probably assumed an A/B test

antirez

There is a disproportionate skepticism in autoregressive models and a disproportionate optimism in alternative paradigms because of the absolutely non verifiable idea that LLMs, when predicting the next token, don't already model, in the activation states, the gist of what they could going to say, similar to what humans do. That's funny because many times it can be observed in the output of truly high quality replies that the first tokens only made sense in the perspective of what comes later.

spr-alex

maybe i understand this a little differently, the argument i am most familiar with is this one from lecun, where the error accumulation in the prediction is the concern with autoregression https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMR...

antirez

The error accumulation thing is basically without any ground as regressive models correct what they are saying in the process of emitting tokens (trivial to test yourself: force a given continuation in the prompt and the LLMs will not follow at all). LeCun provided an incredible amount of wrong claims about LLMs, many of which he now no longer accepts: like the stochastic parrot claim. Now the idea that there is just a statistical relationship in the next token prediction is considered laughable, but even when it was formulated there were obvious empirical hints.

spr-alex

i think the opposite, the error accumulation thing is basically the daily experience of using LLMs.

As for the premise that models cant self correct that's not the argument i've ever seen, transformers have global attention across the context window. It's that their prediction abilities are increasingly poor as generation goes on. Is anyone having a different experience than that?

Everyone doing some form of "prompt engineering" whether with optimized ML tuning, whether with a human in the loop, or some kind of agentic fine tuning step, runs into perplexity errors that get worse with longer contexts in my opinion.

There's some "sweet spot" for how long of a prompt to use for many use cases, for example. It's clear to me that less is more a lot of the time

Now will diffusion fare significantly better on error is another question. Intuition would guide me to think more flexiblity with token-rewriting should enable much greater error correction capabilities. Ultimately as different approaches come online we'll get PPL comparables and the data will speak for itself

HeatrayEnjoyer

>force a given continuation in the prompt and the LLMs will not follow at all

They don't? That's not the case at all, unless I am misunderstanding.

kazinator

Interestingly, that animation at the end mainly proceeds from left to right, with just some occasional exceptions.

So I followed the link, and gave the model this bit of conversation starter:

> You still go mostly left to right.

The denoising animation it generated went like this:

> [Yes] [.] [MASK] [MASK] [MASK] ... [MASK]

and proceeded by deletion of the mask elements on the right one by one, leaving just the "Yes.".

gdiamos

I think these models would get interesting at extreme scale. Generate a novel in 40 iterations on a rack of GPUs.

At some point in the future, you will be able to autogen a 10M line codebase in a few seconds on a giant GPU cluster.

gdiamos

Diffusion LLMs also follow scaling laws - https://proceedings.neurips.cc/paper_files/paper/2023/file/3...

impossiblefork

Those aren't the modern type with discrete masking based diffusion though.

Of course, these too will have scaling laws.

esperent

Is it possible that combining multiple AIs will be able to somewhat bypass scaling laws, in a similar way that multicore CPUs can somewhat bypass the limitations of a single CPU core?

gdiamos

I’m sure there are ways of bypassing scaling laws, but I think we need more research to discover and validate them

nthingtohide

I read a wikipedia article of a person who was very intelligent but also suffered from a mental illness. He told people around him that his next novel will be of exactly N number of words and it will end with the sentence P.

I don't remember article. I read it a decade ago. It's like he was doing diffusion in his mind, subconsciously perhaps

eru

Seems pretty easy to achieve if you have text editing software that tells you the number of words written so far?

jacobn

The animation on the page looks an awful lot like autoregressive inference in that virtually all of the tokens are predicted in order? But I guess it doesn't have to do that in the general case?

creata

The example in the linked demo[0] seems less left-to-right.

Anyway, I think we'd expect it to usually be more-or-less left-to-right -- We usually decide what to write or speak left-to-right, too, and we don't seem to suffer much for it.

(Unrelated: it's funny that the example generated code has a variable "my array" with a space in it.)

[0]: https://ml-gsai.github.io/LLaDA-demo/

frotaur

Very related : https://arxiv.org/abs/2401.17505

whoami_nr

yeah but you can backtrack your thinking. You also have a mind voice to plan out the next couple words/reflect/self correct before uttering them.

whoami_nr

So, in practice there are some limitations here. Chat interfaces force you to feed the entire context to the model everytime you ping it. Even multi step tool calls have a similar thing going. So, yeah we may effectively turn all of this effectively into autoregressive models too.

DeathArrow

That got me thinking that it would be nice to have something like ComfyUi to work with diffusion based LLMs. Apply LORAs, use multiple inputs, have multiple outputs.

Something akin to ComfyUi but for LLMs would open up a world of possibilities.

terhechte

Check out Floneum, it's basically ComfyUI for LLM's, extendable via plugins

https://floneum.com/

Scroll down a bit on the website to see a screenshot.

DeathArrow

Thank you!

dragonwriter

ComfyUI already has nodes (mostly in extensions available through the built in manager) for working with LLMs, both remote LLMs accessed through APIs and local ones running under Comfy itself, the same as it runs other models.

hdjrudni

Maybe not even 'akin' but literally ComfyUI. Comfy already has a bunch of image-to-text nodes. I haven't seen txt2txt or Loras and such for them though. But I also haven't looked.

Philpax

It's complicated by the ComfyUI data model, which treats strings as immediate values/constants and not variables in their own right. This could ostensibly be fixed/worked around, but I imagine that it would come at a cost to backwards compatibility.

mistrial9

this is the huggingface page https://huggingface.co/papers/2502.09992

chw9e

This was a very cool paper about using diffusion language models and beam search: https://arxiv.org/html/2405.20519v1

Just looking at all of the amazing tools and workflows that people have made with ComfyUI and stuff makes me wonder what we could do with diffusion LMs. It seems diffusion models are much more easily hackable than LLMs.

inverted_flag

How do diffusion LLMs decide how long the output should be? Normal LLMs generate a stop token and then halt. Do diffusion LLMs just output a fixed block of tokens and truncate the output that comes after a stop token?

HN

Why I find diffusion models interesting?

Why I find diffusion models interesting?