
Beyond Diffusion: Inductive Moment Matching

goldemerald

I've been lightly following this type of research for a few years. I immediately recognized the broad idea as stemming from the lab of the ridiculously prolific Stefano Ermon. He's always taken a unique angle on generative models since the before times of GenAI. I was fortunate to get lunch with him in grad school after a talk he gave. Seeing the work from his lab in these modern days is compelling; I always figured his style of research would break out into the mainstream eventually. I'm hopeful the future of ML improvements comes from clever test-time algorithms like the one this article shows. I'm looking forward to when you can train a high-quality generative model without needing a super cluster or web-scale data.

imjonse

Some of their research has already broken out into the mainstream; DDIM at least was their paper, and probably others in the diffusion domain too.

xela79

This went over my head quickly; I read through it a few times, then asked GPT for a summary at my level of understanding, which does clear up the overall idea for me personally:

Alright, imagine you have a big box of LEGO bricks, and you're trying to build a really cool spaceship. There are two main ways people usually build things like this:

Step-by-step (Autoregressive Models) – Imagine you put one LEGO brick down at a time, making sure each piece fits perfectly before adding the next. It works, but it takes a long time.

Fix and refine (Diffusion Models) – Imagine you start by dumping all the LEGO bricks in a messy pile. Then, you slowly move pieces around, fixing mistakes until you get a spaceship. This is faster than the first method, but it still takes a lot of tiny adjustments.

What's the Problem? People have been using these two ways for a long time, and they’ve gotten really good at them. But no matter how big or smart your LEGO-building robot gets, these methods don’t get that much better. They’re kind of stuck.

The New Way: Inductive Moment Matching (IMM) IMM is like a magical LEGO helper that doesn’t just follow the usual slow steps. Instead, it looks at what the final spaceship should look like ahead of time and figures out how to jump closer to the final result in fewer steps.

Instead of moving one LEGO brick at a time or slowly fixing a messy pile, it’s like the helper knows where each piece should go ahead of time and moves big sections all at once. That makes it way faster and still super accurate!

Why is This Cool?

Faster – It builds things much more quickly than the old methods.

More efficient – It doesn't waste as much time adjusting tiny details.

Works with all kinds of problems – This method can be used for pictures, videos, and maybe even other things like 3D models.

Real-World Example: Imagine drawing a picture of a dog. Old way: you draw one tiny detail at a time, or you start with a blurry dog and keep fixing it. New way (IMM): you already kind of know what the dog should look like, so you make big strokes to get there quickly!

So basically, IMM is a super smart way to skip unnecessary steps and get amazing results much faster.

lukasb

"Inference can generally be scaled along two dimensions: extending sequence length (in autoregressive models), and augmenting the number of refinement steps (in diffusion models)."

Does this mean that diffusion models for text could scale inference compute to improve quality for a fixed-length output?

svachalek

Yes, although so far it seems the main advantage of text diffusion models is that they're really, really fast. Iterations reach an asymptote very quickly.

lukasb

Yeah, I guess progressive refinement's quality is limited by how good the first N iterations are, the ones that establish the broad outlines.

vessenes

FWIW I don't think we've seen nearly all the ideas for text diffusion yet. Why not 'jiggle the text around a bit' when things have stabilized, or add space to fill, or have a separate judging module identify space that needs more tokens? Lots of super interesting possibilities.

programjames

Anyone willing to give an intuitive summary of what they did mathwise? The math in the paper is super ugly to churn through.

bearobear

Last author here (I also did the DDIM paper, https://arxiv.org/abs/2010.02502). I know this is going to be very tricky math-wise (and in the paper we just wrote the most general thing to make reviewers happy), so I tried to explain the idea more easily under the blog post (https://lumalabs.ai/news/inductive-moment-matching).

If you look at how a single step of the DDIM sampler interacts with the target timestep, it is actually just a linear function. This is obviously quite inflexible if we want to use it to represent a flexible function where we can choose any target timestep. So we just add the target timestep as an argument to the neural network and then train it with a moment matching objective.
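A minimal sketch of that point, with hypothetical names and a toy architecture rather than the actual IMM implementation: for fixed (x_t, t), a DDIM step only moves along the span of two fixed vectors as the target timestep changes, while a network that also takes the target timestep as input can represent an arbitrary map for each (t, s) pair.

    # Hypothetical PyTorch sketch (not the paper's code): contrast a DDIM step,
    # whose dependence on the target timestep s is only through two scalar
    # coefficients, with a network that takes s as an extra input.
    import torch

    def ddim_step(x_t, t, s, alpha_bar, eps_model):
        """Deterministic DDIM step from integer timestep t to target timestep s.
        For fixed (x_t, t) the output is a_s * x0_hat + b_s * eps: as s varies,
        it only moves along two fixed vectors, i.e. limited capacity w.r.t. s."""
        eps = eps_model(x_t, t)                                  # predicted noise
        x0_hat = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        return alpha_bar[s].sqrt() * x0_hat + (1 - alpha_bar[s]).sqrt() * eps

    class TargetAwareNet(torch.nn.Module):
        """Toy network that also conditions on the target timestep s, so a single
        step from t to s can be an arbitrary learned map. It would then be trained
        with a moment matching objective rather than denoising score matching."""
        def __init__(self, dim, n_steps):
            super().__init__()
            self.t_emb = torch.nn.Embedding(n_steps, dim)
            self.s_emb = torch.nn.Embedding(n_steps, dim)   # the extra input
            self.net = torch.nn.Sequential(
                torch.nn.Linear(3 * dim, 4 * dim), torch.nn.SiLU(),
                torch.nn.Linear(4 * dim, dim),
            )

        def forward(self, x_t, t, s):
            # t and s are integer (Long) timestep indices of shape [batch]
            h = torch.cat([x_t, self.t_emb(t), self.s_emb(s)], dim=-1)
            return self.net(h)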

In general, I feel that analyzing a method's inference-time properties before training it can be helpful not only for diffusion models but also for LLMs, including various recent diffusion LLMs. That prompted me to write a position paper in the hopes that others develop cool new ideas (https://arxiv.org/abs/2503.07154).

niemandhier

Just as a counter perspective: I think your paper is great!

Please don't let people ever discourage you from writing proper papers. Ever since Meta etc. started asking for "2 papers in relevant fields", we've seen a flood of papers that should be tweets.

nmca

The author's own summary from the position paper is:

In particular, we examine the one-step iterative process of DDIM [39, 19, 21] and show that it has limited capacity with respect to the target timestep under the current denoising network design. This can be addressed by adding the target timestep to the inputs of the denoising network [15].

Interestingly, this one fix, plus a proper moment matching objective [5] leads to a stable, single-stage algorithm that surpasses diffusion models in sample quality while being over an order of magnitude more efficient at inference [50]. Notably, these ideas do not rely on denoising score matching [46] or the score-based stochastic differential equations [41] on which the foundations of diffusion models are built.

oofbey

In normal diffusion you train a model to take lots of tiny steps, all the same small size. e.g. "You're gonna take 20 steps, at times [1.0, 0.95, 0.90, 0.85...]" and each time the model takes that small fixed-size step to make the image look better.

Here they train a model to say "I'm gonna ask you to take a step from time B to A - might be a small step, might be a big step - but whatever size it is, make the image that much better." So you might ask the model to improve the image from t=1.0 to t=0.25 and be almost done. It gets a side variable telling it how much improvement to make in each step.

I'm not sure this is right, but that's what I got out of it by skimming the blog & paper.
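A rough sketch of that reading, with made-up model signatures and schedules rather than anything from the paper:

    # Rough sketch of the two sampling styles described above; the model
    # signatures and schedules are hypothetical, for illustration only.
    import torch

    @torch.no_grad()
    def fixed_step_sampling(model, x, n_steps=20):
        """Standard diffusion-style loop: many small steps of equal size;
        the model only sees the current time t."""
        times = torch.linspace(1.0, 0.0, n_steps + 1)
        for i in range(n_steps):
            x = model(x, times[i])          # improve by one small fixed notch
        return x

    @torch.no_grad()
    def variable_step_sampling(model, x, schedule=(1.0, 0.25, 0.0)):
        """IMM-style loop: the model is told both where it is (t) and where it
        should land (s), so one call can cover a big jump or a small one."""
        for t, s in zip(schedule[:-1], schedule[1:]):
            x = model(x, t, s)              # "take me from t all the way to s"
        return x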

richard___

Reminds me of the Kevin Frans shortcut networks paper?

bbminner

Can anyone share insight into how this is different from consistency models? The insight seems quite similar?

bearobear

Consistency models are a special case of IMM where you do moment matching with 1 sample from each distribution (i.e., you cannot match distributions properly). See Fig. 5 for an ablation study; of course, adding more samples when you are doing moment matching makes it more stable during training :)
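For intuition, here is a minimal kernel moment-matching (MMD) estimator; it is not the exact IMM objective, just an illustration of the role the number of samples per distribution plays (kernel and bandwidth are arbitrary choices):

    # Hypothetical kernel MMD estimator (not the exact IMM objective), just to
    # show why several samples per distribution matter.
    import torch

    def rbf_kernel(a, b, bandwidth=1.0):
        # a: [n, d], b: [m, d] -> [n, m] Gram matrix
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * bandwidth ** 2))

    def mmd2(x, y, bandwidth=1.0):
        """Biased estimate of squared MMD between samples x ~ p and y ~ q.
        With several samples per group, the within-group terms k(x, x') and
        k(y, y') are informative and the loss compares distributions. With a
        single sample per group those terms are constant for an RBF kernel, so
        the objective collapses to a plain pairwise distance between two points,
        the consistency-model-like special case mentioned above."""
        k_xx = rbf_kernel(x, x, bandwidth).mean()
        k_yy = rbf_kernel(y, y, bandwidth).mean()
        k_xy = rbf_kernel(x, y, bandwidth).mean()
        return k_xx + k_yy - 2 * k_xy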

bbminner

Makes sense. How can you even approximately estimate higher order differences in conditional moments in such a high dim space? Seems statistically impossible to get a reasonable estimate for a gradient. Moment matching in sample space has always been very hard.

throwaway2562

I’m trying to understand what the ‘spectral’ interpretation of IMM is: but perhaps I shouldn’t

https://sander.ai/2024/09/02/spectral-autoregression.html

echelon

Does this mean high quality images and video will be possible in one or a few sampling steps?

Fast, real time video generation? (One second of compute per one second of output.)

Does this mean more efficient and more generalizable training and fine tuning?

brcmthrowaway

This is a gamechanger
