
Mercury: Ultra-fast language models based on diffusion

mike_hearn

A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing than we are today, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

Mercury is crazy fast and in a few quick tests I did, created good and correct code. How will we make test execution keep up with it?

kccqzy

> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green.

I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs such as a missing synchronization or fence causing flakiness; and it was common to just launch 10000 copies of the same test on 10000 machines to find perhaps a single digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M loc codebase in no more than five minutes so that you get quick feedback on your changes.
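A minimal sketch of that brute-force flake hunt; the test command and run count below are placeholders for illustration, not Google's tooling:

    # Run one (hypothetical) flaky test many times in parallel and count failures.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    TEST_CMD = ["./run_test", "--target=//sync:fence_test"]  # placeholder command
    RUNS = 1000  # scaled down from the 10,000-machine example

    def run_once(_):
        # Each invocation is independent; a non-zero exit code counts as a failure.
        return subprocess.run(TEST_CMD, capture_output=True).returncode != 0

    with ThreadPoolExecutor(max_workers=64) as pool:
        failures = sum(pool.map(run_once, range(RUNS)))

    print(f"{failures}/{RUNS} runs failed")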

> make builds fully hermetic (so no inter-run caching)

These are orthogonal. You want maximum deterministic CI steps so that you make builds fully hermetic and cache every single thing.

mike_hearn

I was also at Google for years. Places like that are not even close to representative. They can afford to just-throw-more-resources, they get bulk discounts on hardware and they pay top dollar for engineers.

In the more common scenarios that represent 95% of the software industry, CI budgets are fixed, clusters are sized to be busy most of the time, and you cannot simply launch 10,000 copies of the same test on 10,000 machines. And even then, these CI clusters can easily burn through the equivalent of several SWE salaries.

> These are orthogonal. You want maximum deterministic CI steps so that you make builds fully hermetic and cache every single thing.

Again, that's how companies like Google do it. In normal companies, build caching isn't always perfectly reliable, and if CI runs suffer flakes due to caching then eventually some engineer is gonna get mad and convince someone else to turn the caching off. Blaze goes to extreme lengths to ensure this doesn't happen, and Google spends extreme sums of money on helping it do that (e.g. porting third party libraries to use Blaze instead of their own build system).

In companies without money printing machines, they sacrifice caching to get determinism and everything ends up slow.

kridsdale1

I’m at Google today and even with all the resources, I am absolutely most bottlenecked by the Presubmit TAP and human review latency. Making CLs in the editor takes me a few hours. Getting them in the system takes days and sometimes weeks.

PaulHoule

Most of my experience writing concurrent/parallel code in (mainly) Java has been rewriting half-baked stuff that would need a lot of testing into straightforward, reliable, and reasonably performant code that uses sound and easy-to-use primitives such as Executors (watch out for teardown though), database transactions, atomic database operations, etc. Drink the Kool-Aid and mess around with synchronized or actors or Streams or something and you're looking at a world of hurt.
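A rough Python analogue of that "boring primitives" style, purely for illustration (the Java version with ExecutorService has the same shape):

    # Bounded worker pool, explicit shutdown via the context manager (the
    # teardown to watch out for), and no shared mutable state outside the pool.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def work(item):
        # Stand-in for real work; returns a result instead of mutating shared state.
        return item * 2

    def process_all(items):
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = [pool.submit(work, it) for it in items]
            return [f.result() for f in as_completed(futures)]  # completion order

    print(process_all(range(10)))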

I've written a limited number of systems that needed tests that probe for race conditions by doing something like having 3000 threads run a random workload for 40 seconds. I'm proud of that "SuperHammer" test on a certain level but boy did I hate having to run it with every build.

IshKebab

Developer time is more expensive than machine time, but at most companies it isn't 10000x more expensive. Google is likely an exception because it pays extremely well and has access to very cheap machines.

Even then, there are other factors:

* You might need commercial licenses. It may be very cheap to run open source code 10000x, but guess how much 10000 Questa licenses cost.

* Moore's law is dead; Amdahl's law very much isn't. Not everything is embarrassingly parallel.

* Some people care about the environment. I worked at a company that spent 200 CPU hours on every single PR (even to fix typos; I failed to convince them they were insane for not using Bazel or similar). That's a not insignificant amount of CO2.

hyperpape

> Moore's law is dead; Amdahl's law

Yes, but the OP specifically is talking about CI for large numbers of pull requests, which should be very parallelizable (I can imagine exceptions, but only with anti-patterns, e.g. if your test pipeline makes some kind of request to something that itself isn't scalable).

underdeserver

That's solvable with modern cloud offerings: provision spot instances for a few minutes and shut them down afterwards. Let the cloud provider deal with demand balancing.
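A rough sketch of that flow with boto3; the AMI, instance type, and region are placeholders, and in practice most teams would lean on their CI provider's autoscaler rather than raw EC2 calls:

    # Launch a spot instance for a CI job, then terminate it when the job is done.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder CI worker image
        InstanceType="c6i.4xlarge",        # placeholder size
        MinCount=1, MaxCount=1,
        InstanceMarketOptions={"MarketType": "spot"},  # request spot pricing
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # ... hand the instance to the CI queue and wait for the job to finish ...

    ec2.terminate_instances(InstanceIds=[instance_id])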

I think the real issue is that developers waiting for PRs to go green are taking a coffee break between tasks, not sitting idly getting annoyed. If that's the case you're cutting into rest time and won't get much value out of optimizing this.

mark_undoio

> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem.

I'd personally agree. But this sounds like the kind of thing that, at many companies, could be a real challenge.

Ultimately, you can measure dollars spent on CI workers. It's much harder and less direct to quantify the cost of not having them (until, for instance, people start taking shortcuts with testing and a regression escapes to production).

Unless somebody has a strong overriding vision of where the value really comes from, that kind of asymmetry tends to result in penny-pinching on the wrong things.

mike_hearn

It's more than that. You can measure salaries too; measurement isn't the issue.

The problem is that if you let people spend the company's money without any checks or balances, they'll just blow through unlimited amounts of it. That's why companies always have lots of procedures and policies around expense reporting. There's no upper limit to how much money developers will spend on cloud hardware given the chance, as the example above of casually running a test 10,000 times in parallel demonstrates nicely.

CI doesn't require you to fill out an expense report every time you run a PR, thank goodness, but there still has to be a way to limit financial liability. Usually companies do start out by doubling cluster sizes a few times, but each time it buys a few months and then the complaints return. After a few rounds of this, managers realize that demand is unlimited and start pushing back on always increasing the budget. Devs get annoyed and spend an afternoon on optimizations, and suddenly times are good again.

The meme on HN is that developer time is always more expensive than machine time, but I've been on both sides of this and seen how the budgets work out. It's often not true, especially if you use clouds like Azure which are overloaded and expensive, or if you have plenty of junior devs and/or teams outside the US where salaries are lower. There's often a lot of low-hanging fruit in test times, so it can make sense to optimize; even so, huge waste is still the order of the day.

socalgal2

Even Google cannot buy more old Intel Macs or Pixel 6s or Samsung S20s to increase their testing on those devices (as an example).

Maybe that affects fewer devs who don't need to test on actual hardware, but plenty of apps do. Pretty much anything that touches a GPU driver, for example, like a game.

ronbenton

>Do companies not just double their CI workers after hearing people complain?

They do not.

I don't know if it's a matter of justifying management levels, but these discussions are often drawn out and belabored in my experience. By the time you get approval, or even worse a rejection, for asking for more compute (or whatever the ask is), you've spent way more money on human time than you would ever spend on the requested resources.

mysteria

This is exactly my experience with asking for more compute at work. We have to prepare loads of written justification, come up with alternatives or optimizations (which we already know won't work), etc. and in the end we choose the slow compute and reduced productivity over the bureaucracy.

And when we manage to make a proper request it ends up being rejected anyways as many other teams are asking for the same thing and "the company has limited resources". Duh.

kccqzy

I have never once been refused by a manager or director when I am explicitly asking for cost approval. The only kind of long and drawn out discussions are unproductive technical decision making. Example: the ask of "let's spend an extra $50,000 worth of compute on CI" is quickly approved but "let's locate the newly approved CI resource to a different data center so that we have CI in multiple DCs" solicits debates that can last weeks.

wavemode

You're confusing throughput and latency. Lengthy CI runs increase the latency of developer output, but they don't significantly reduce overall throughput, given a developer will typically be working on multiple things at once, and can just switch tasks while CI is running. The productivity cost of CI is not zero, but it's way, way less than the raw wallclock time spent per run.

Then also factor in that most developer tasks are not even bottlenecked by CI. They are bottlenecked primarily by code review, and secondarily by deployment.

anp

I’m currently at google (opinions not representative of my employer’s etc) and this is true for things that run in a data center but it’s a lot harder for things that need to be tested on physical hardware like parts of Android or CrOS.

wbl

No it is not. Senior management often has a barely disguised contempt for engineering and for spending money to do a better job. They listen much more when sales complains.

kridsdale1

That depends on the company.

daxfohl

There are a couple of mitigating considerations:

1. As implementation phase gets faster, the bottleneck could actually switch to PM. In which case, changes will be more serial, so a lot fewer conflicts to worry about.

2. I think we could see a resurrection of specs like TLA+. Most engineers don't bother with them, but I imagine code agents could quickly create them, verify the code is consistent with them, and then require fewer full integration tests.

3. When background agents are cleaning up redundant code, they can also clean up redundant tests.

4. Unlike human engineering teams, I expect AIs to work more efficiently on monoliths than with distributed microservices. This could lead to better coverage on locally runnable tests, reducing flakes and CI load.

5. It's interesting that even as AI increases efficiency, that increased velocity and sheer amount of code it'll write and execute for new use cases will create its own problems that we'll have to solve. I think we'll continue to have new problems for human engineers to solve for quite some time.

TechDebtDevin

An LLM making a quick edit, <100 lines... sure. Asking an LLM to rubber-duck your code, sure. But integrating an LLM into your CI is going to end up costing you hundreds of hours of productivity on any large project. That, or you spend half the time you should be spending learning to write your own code on dialing in context sizes and prompt accuracy instead.

I really really don't understand the hubris around llm tooling, and don't see it catching on outside of personal projects and small web apps. These things don't handle complex systems well at all, you would have to put a gun in my mouth to let one of these things work on an important repo of mine without any supervision... And if I'm supervising the LLM I might as well do it myself, because I'm going to end up redoing 50% of its work anyways..

kraftman

I keep seeing this argument over and over again, and I have to wonder, at what point do you accept that maybe LLMs are useful? Like how many people need to say that they find it makes them more productive before you'll shift your perspective?

dragonwriter

> I keep seeing this argument over and over again, and I have to wonder, at what point do you accept that maybe LLMs are useful?

The post you are responding to literally acknowledges that LLMs are useful in certain roles in coding in the first sentence.

> Like how many people need to say that they find it makes them more productive before you'll shift your perspective?

Argumentum ad populum is not a good way of establishing fact claims beyond the fact of a belief being popular.

psychoslave

It's a tool, and it depends on what you need to do. If it fits someone's need and makes them more productive, or even simply makes the activity more enjoyable, good.

Just because two people are both fixing something to the wall doesn't mean the same tool will do the job for both. Gum, pushpin, nail, screw, bolts?

The parent comment did mention they use LLMs successfully in small side projects.

MangoToupe

> at what point do you accept that maybe LLMs are useful?

LLMs are useful, just not for every task and price point.

candiddevmike

People say they are more productive using visual basic, but that will never shift my perspective on it.

Code is a liability. Code you didn't write is a ticking time bomb.

ninetyninenine

They say it's only effective for personal projects, but there's literally evidence of LLMs being used for the things he says they can't be used for. Actual physical evidence.

It's self-delusion. And the pace of AI is so fast that he may not be aware of how quickly LLMs are being integrated into our coding environments. A year ago what he said could have been somewhat true, but right now it's clearly not true at all.

mike_hearn

I've used Claude with a large, mature codebase and it did fine. Not for every possible task, but for many.

Probably, Mercury isn't as good at coding as Claude is. But even if it's not, there's lots of small tasks that LLMs can do without needing senior engineer level skills. Adding test coverage, fixing low priority bugs, adding nice animations to the UI etc. Stuff that maybe isn't critical so if a PR turns up and it's DOA you just close it, but which otherwise works.

Note that many projects already use this approach with bots like Renovate. Such bots also consume a ton of CI time, but it's generally worth it.

airstrike

IMHO LLMs are notoriously bad at test coverage. They usually hard code a value to have the test pass, since they lack the reasoning required to understand why the test exists or the concept of assertion, really

flir

Don't want to put words in the parent commenter's mouth, but I think the key word is "unsupervised". Claude doesn't know what it doesn't know, and will keep going round the loop until the tests go green, or until the heat death of the universe.

blitzar

Do the opposite - integrate your CI into your LLM.

Make it run the tests after it changes your code, and either confirm it didn't break anything or go back and try again.
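A minimal sketch of that loop; `ask_model_for_patch` and `apply_patch` are hypothetical helpers standing in for whatever agent harness you use:

    # Let the model edit, run the test suite, and feed failures back until green.
    import subprocess

    def run_tests():
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    def agent_loop(task, max_attempts=5):
        feedback = ""
        for _ in range(max_attempts):
            patch = ask_model_for_patch(task, feedback)  # hypothetical LLM call
            apply_patch(patch)                           # hypothetical: writes files
            ok, output = run_tests()
            if ok:
                return True
            feedback = output  # hand the failures back to the model and retry
        return False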

DSingularity

He is simply observing that if PR numbers and launch rates increase dramatically CI cost will become untenable.

mrkeen

> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers

No, this is common. The devs just haven't grokked dependency inversion. And I think the rate of new devs entering the workforce will keep it that way forever.

Here's how to make it slow:

* Always refer to "the database". You're not just storing and retrieving objects from anywhere - you're always using the database.

* Work with statements, not expressions. Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance. This will force you to sequentialise the tests (simultaneous tests would otherwise race and cause flakiness), plus you get to write a bunch of setup and teardown and wipe state between tests. (The sketch after this list shows the expression-oriented alternative.)

* If you've done the above, you'll probably need to wait for state changes before running an assertion. Use a thread sleep, and if the test is ever flaky, bump up the sleep time and commit it if the test goes green again.
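And for contrast, the "expression" version of the balance example from the second bullet: the rule is a pure function, so the test needs no database, no teardown, and no sleeps. A minimal sketch:

    # Balance is an expression over transactions, not state read back from a database.
    def balance(transactions):
        return sum(transactions)

    def test_balance_is_sum_of_transactions():
        assert balance([100, -30, 5]) == 75

    def test_empty_account_has_zero_balance():
        assert balance([]) == 0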

grogenaut

Before cars, people spent little on petroleum products, motor oil, gasoline, or mechanics. Now they do. That's how systems work. If you wanna go faster, well, you need better roads, traffic lights, on-ramps, etc., but you're still going faster.

Use AI to solve the CI bottlenecks, or to build more features that earn more revenue that buys more CI boxes. It's the same as if you added 10 devs, which you effectively are with AI, so why wouldn't some of the dev support costs go up?

Are you not in a place where you can make an efficiency argument to get more CI capacity or to optimize? What does a CI box cost?

pamelafox

For Python apps, I've gotten good CI speedups by moving over to the astral.sh toolchain, using uv for package installation with caching. Once I move to their type checker instead of mypy, that'll speed CI up even more. The Playwright test runs will then probably be the slowest part, and that's only in apps with frontends.

(Also, Hi Mike, pretty sure I worked with you at Google Maps back in early 2000s, you were my favorite SRE so I trust your opinion on this!)

SoftTalker

Wow, your story gives me flashbacks to the 1990s when I worked in a mainframe environment. Compile jobs submitted by developers were among the lowest priorities. I could make a change to a program, submit a compile job, and wait literally half a day for it to complete. Then I could run my testing, which again might have to wait for hours. I generally had other stuff I could work on during those delays but not always.

rafaelmn

> If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

I am guesstimating (based on previous experience self-hosting the runner for macOS builds) that the project I am working on could get something like 2-5x pipeline performance at 1/2 the cost just by using self-hosted runners on bare metal rented machines like Hetzner. Maybe I am naive, and I am not the person who would be responsible for it, but having a few bare metal machines you can use in the off hours to run regression tests, for less than you are paying the existing CI runner just for builds, and that speed everything up massively, seems like a pure win for relatively low effort.

Sure, everyone already has stuff on their plate and would rather pay an external service to do it, but TBH once you have this kind of compute handy you will find uses for it anyway, just by doing things efficiently. And knowing how to deal with bare metal and utilize this kind of compute sounds like a generally useful skill, but I rarely encounter people enthusiastic about making this kind of move. It's usually: hey, let's move to this other service that has slightly cheaper instances and a proprietary caching layer, so that we can get locked into their CI crap.

It's not like these services have zero downtime, are bug-free, or require no integration effort. I just don't see why going bare metal is always such a taboo topic, even for simple stuff like builds.

mike_hearn

Yep. For my own company I used a bare metal machine in Hetzner running Linux and a Windows VM along with a bunch of old MacBook Pros wired up in the home office for CI.

It works, and it's cheap. A full CI run still takes half an hour on the Linux machine (the product [1] is a kind of build system for shipping desktop apps cross platform, so there's lots of file IO and cryptography involved). The Macs are by far the fastest. The M1 Mac is embarrassingly fast. It can complete the same run in five minutes despite the Hetzner box having way more hardware. In fairness, it's running both a Linux and Windows build simultaneously.

I'm convinced the quickest way to improve CI times in most shops is to just build an in-office cluster of M4 Macs in an air conditioned room. They don't have to be HA. The hardware is more expensive but you don't rent per month, and CI is often bottlenecked on serial execution speed so the higher single threaded performance of Apple Silicon is worth it. Also, pay for a decent CI system like TeamCity. It helps reduce egregious waste from problems like not caching things or not re-using checkout directories. In several years of doing this I haven't had build caching related failures.

[1] https://hydraulic.dev/

adamcharnock

> 2-5x pipeline performance at 1/2 cost just by using self-hosted runners on bare metal rented machines like Hetzner

This is absolutely the case. It's a combination of having dedicated CPU cores, dedicated memory bandwidth, and (perhaps most of all) dedicated local NVMe drives. We see a 2x speedup running _within VMs_ on bare metal.

> And knowing how to deal with bare metal/utilize this kind of compute sounds generally useful skill - but I rarely encounter people enthusiastic about making this kind of move

We started our current company for this reason [0]. A lot of people know this makes sense on some level, but not many people want to do it. So we say we'll do it for you, give you the engineering time needed to support it, and you'll still save money.

> I just don't see why going bare metal is always such a taboo topic even for simple stuff like builds.

It is decreasingly so from what I see. Enough people have been variously burned by public cloud providers to know they are not a panacea. But they just need a little assistance in making the jump.

[0] - https://lithus.eu

azeirah

At the last place I worked at, which was just a small startup with 5 developers, I calculated that a server workstation in the office would be both cheaper and more performant than renting a similar machine in the cloud.

Bare metal makes such a big difference for test and CI scenarios. It even has an integrated GPU to speed up webdev tests. Good luck finding an affordable machine in the cloud that has a proper GPU for this kind of use case.

rafaelmn

Is it a startup or a small business? In my book a startup expects to scale, and hosting bare metal HW in an office with 5 people means you have to figure everything out again when you get to 20/50/100 people. IMO it's not worth the effort, and hosting hardware has zero transferable skills for your product.

Running on managed bare metal servers is theoretically the same as running on any other infra provider, except you are on the hook for a bit more maintenance; if you scale to 20 people you just rent a few more machines. I really do not see many downsides for the build server/test runner scenario.

mxs_

In their tech report, they say this is based on:

> "Our methods extend [28] through careful modifications to the data and computation to scale up learning."

[28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion" (SEDD) model (https://arxiv.org/abs/2310.16834).

I wrote the first (as far as I can tell) independent from-scratch reimplementation of SEDD:

https://github.com/mstarodub/dllm

My goal was making it as clean and readable as possible. I also implemented the more complex denoising strategy they described (but didn't implement).

It runs on a single GPU in a few hours on a toy dataset.

true_blue

I tried the playground and got a strange response. I asked for a regex pattern, and the model gave itself a little game-plan, then it wrote the pattern and started to write tests for it. But it never stopped writing tests. It continued to write tests of increasing size until I guess it reached a context limit and the answer was canceled. Also, for each test it wrote, it added a comment about if the test should pass or fail, but after about the 30th test, it started giving the wrong answer for those too, saying that a test should fail when actually it should pass if the pattern is correct. And after about the 120th test, the tests started to not even make sense anymore. They were just nonsense characters until the answer got cut off.

The pattern it made was also wrong, but I think the first issue is more interesting.

beders

I think that's a prime example showing that token prediction simply isn't good enough for correctness. It never will be. LLMs are not designed to reason about code.

ianbicking

FWIW, I remember regular models doing this not that long ago, sometimes getting stuck in something like an infinite loop where they keep producing output that is only a slight variation on previous output.

data-ottawa

If you shrink the context window on most models you'll get this type of behaviour. If you go too small you end up with basically gibberish, even on modern models like Gemini 2.5.

Mercury has a 32k context window according to the paper, which could be why it does that.

fiatjaf

This is too funny to be true.

fastball

ICYMI, DeepMind also has a Gemini model that is diffusion-based[1]. I've tested it a bit and while (like with this model) the speed is indeed impressive, the quality of responses was much worse than other Gemini models in my testing.

[1] https://deepmind.google/models/gemini-diffusion/

Powdering7082

From my minor testing I agree that it's crazy fast and not that good at being correct

tripplyons

Is the Gemini Diffusion demo free? I've been on the waitlist for it for a few weeks now.

M4v3R

I am personally very excited about this development. Recently I AI-coded a simple game for a game jam, and half the time was spent waiting for the AI agent to finish its work so I could test it. If instead of waiting 1-2 minutes for every prompt to be executed and implemented I could wait 10 seconds, that would be literally game changing. I could test 5-10 different versions of the same idea in the time it took me to test one with the current tech.

Of course this model is not as advanced yet for this to be feasible, but so was Claude 3.0 just over a year ago. This will only get better over time I’m sure. Exciting times ahead of us.

chc4

I'm using the free playground link, and it is in fact extremely fast. The "diffusion mode" toggle is also pretty neat as a visualization, although I'm not sure how accurate it is - it renders as line noise and then refines, while in reality those are presumably tokens from an imprecise vector in some state space that become more precise until only a definite word remains, right?

icyfox

Some text diffusion models use a continuous latent space, but they historically haven't done that well. Most of the ones we're seeing now are typically trained to predict actual token output that's fed forward into the next timestep. The diffusion property comes from their ability to modify previous timesteps to converge on the final output.

I have an explanation about one of these recent architectures that seems similar to what Mercury is doing under the hood here: https://pierce.dev/notes/how-text-diffusion-works/
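For a sense of what that looks like mechanically, here is a toy sketch of iterative unmasking (one common flavour of text diffusion, not Mercury's published algorithm); `model` is just a stand-in returning per-position logits:

    import torch

    def diffusion_decode(model, length, mask_id, steps=8):
        tokens = torch.full((1, length), mask_id)      # start fully masked
        for step in range(steps):
            logits = model(tokens)                     # re-predict every position each pass
            confidence, guesses = logits.softmax(-1).max(-1)
            still_masked = tokens == mask_id
            # Commit the most confident masked positions this pass. In this
            # simplified version committed tokens stay fixed; fancier samplers
            # can also re-mask and revise earlier choices.
            k = max(1, int(still_masked.sum() / (steps - step)))
            scores = torch.where(still_masked, confidence, torch.tensor(-1.0))
            idx = scores.topk(k, dim=-1).indices[0]
            tokens[0, idx] = guesses[0, idx]
        return tokens

    # Smoke test with a random stand-in "model" over a 100-token vocabulary.
    dummy = lambda toks: torch.randn(1, toks.shape[1], 100)
    print(diffusion_decode(dummy, length=16, mask_id=100))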

chc4

Oh neat, thanks! The OP is surprisingly light on details on how it actually works and is mostly benchmarks, so this is very appreciated :)

PaulHoule

It's insane how fast that thing is!

mtillman

There's a ton of performance upside in most GPU-adjacent code right now.

However, is this what arXiv is for? It seems more like marketing their links than research. Please correct me if I'm wrong/naive on this topic.

ricopags

not wrong, per se, but it's far from the first time

mseri

Sounds all cool and interesting, however:

> By submitting User Submissions through the Services, you hereby do and shall grant Inception a worldwide, non-exclusive, perpetual, royalty-free, fully paid, sublicensable and transferable license to use, edit, modify, truncate, aggregate, reproduce, distribute, prepare derivative works of, display, perform, and otherwise fully exploit the User Submissions in connection with this site, the Services and our (and our successors’ and assigns’) businesses, including without limitation for promoting and redistributing part or all of this site or the Services (and derivative works thereof) in any media formats and through any media channels (including, without limitation, third party websites and feeds), and including after your termination of your account or the Services. For clarity, Inception may use User Submissions to train artificial intelligence models. (However, we will not train models using submissions from users accessing our Services via OpenRouter.)

armcat

I've been looking at the code on their chat playground, https://chat.inceptionlabs.ai/, and they have a helper function `const convertOpenAIMessages = (convo) => { ... }`, which also contains `models: ['gpt-3.5-turbo']`. I also see in API response: `"openai": true`. Is it actually using OpenAI, or is it actually calling its dLLM? Does anyone know?

Also: you can turn on "Diffusion Effect" in the top-right corner, but this just seems to be an "animation gimmick" right?

Alifatisk

The speed of the response is waaay too quick to be using OpenAI as a backend; it's almost instant!

armcat

I've been asking bespoke questions and the timing is >2 seconds, and slower than what I get for the same questions to ChatGPT (using gpt-4.1-mini). I am looking at their call stack and what I see: "verifyOpenAIConnection()", "generateOpenAIChatCompletion()", "getOpenAIModels()", etc. Maybe it's just for compatibility with the OpenAI API?

martinald

Check the bottom, I think it's just some off the shelf chat UI that uses OpenAI compatible API behind the scenes.

JimDabell

Pricing:

US$0.000001 per output token ($1/M tokens)

US$0.00000025 per input token ($0.25/M tokens)

https://platform.inceptionlabs.ai/docs#models
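A quick back-of-the-envelope at those rates, for a hypothetical request with 2,000 input tokens and 500 output tokens:

    input_rate = 0.25 / 1_000_000   # $ per input token
    output_rate = 1.00 / 1_000_000  # $ per output token

    cost = 2_000 * input_rate + 500 * output_rate
    print(f"${cost:.6f} per request")  # $0.001000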

asaddhamani

The pricing is a little on the higher side. Working on a performance-sensitive application, I tried Mercury and Groq (Llama 3.1 8b, Llama 4 Scout) and the performance was neck-and-neck but the pricing was way better for Groq.

But I'll be following diffusion models closely, and I hope we get some good open source ones soon. Excited about their potential.

tripplyons

Good to know. I didn't realize how good the pricing is on Groq!

tlack

If your application is pricing sensitive, check out DeepInfra.com - they have a variety of models in the pennies-per-mil range. Not quite as fast as Mercury, Groq or Samba Nova though.

(I have no affiliation with this company aside from being a happy customer the last few years)

gdiamos

I think the LLM dev community is underestimating these models. E.g. there is no LLM inference framework that supports them today.

Yes, the diffusion foundation models have higher cross-entropy. But diffusion LLMs can also be post-trained and aligned, which cuts the gap.

IMO, investing in post training and data is easier than forcing GPU vendors to invest in DRAM to handle large batch sizes and forcing users to figure out how to batch their requests by 100-1000x. It is also purely in the hands of LLM providers.

mathiaspoint

You can absolutely tune causal LLMs. In fact the original idea with GPTs was that you had to tune them before they'd be useful for anything.

gdiamos

Yes I agree you can tune autoregressive LLMs

You can also tune diffusion LLMs

After doing so, the diffusion LLM will be able to generate more tokens/sec during inference

Alifatisk

Love the ui in the playground, it reminds me of Qwen chat.

We have reached a point where the bottleneck in genAI is not knowledge or accuracy; it is the context window and speed.

Luckily, Google (and Meta?) have pushed the limits of the context window to about 1 million tokens, which is incredible. But I feel like today's options are still stuck at about a ~128k-token window per chat, and after that they start to forget.

Another issue is the time it takes for inference AND reasoning. dLLMs are an interesting approach to this. I know we have Groq's hardware as well.

I do wonder, can this be combined with Groq's hardware? Would the response be instant then?

How many tokens can each chat handle in the playground? I couldn't find so much info about it.

Which model is it using for inference?

Also, is the training the same for dLLMs as for the standard autoregressive LLMs? Or are the weights and models completely different?

martinald

I agree entirely with you. While Claude Code is amazing, it is also slow as hell and the context issue keeps coming up (usually at what feels like the worst possible time for me).

Honestly, most LLMs feel like dialup (apart from this!).

AFAIK, with traditional models context size is very memory-intensive (though I know there are a lot of things trying to 'optimize' this). I believe memory usage grows with the square of context length, so even 10xing context length requires 100x the memory.

(Image) diffusion does not grow like that; it is much more linear. But I have no idea (yet!) about text diffusion models, if someone wants to chip in :).
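For standard attention the quadratic term is the score matrix (the KV cache itself grows linearly with context). A rough illustration, assuming fp16 and 32 heads purely to put numbers on it:

    # Memory for the naively materialized attention score matrix, per layer.
    def attn_matrix_gib(context_len, heads=32, bytes_per_el=2):
        return context_len ** 2 * heads * bytes_per_el / 2**30

    for n in (8_000, 32_000, 128_000):
        print(f"{n:>7} tokens: {attn_matrix_gib(n):8.1f} GiB per layer")
    # ~3.8 GiB at 8k, ~61 GiB at 32k, ~977 GiB at 128k -- hence FlashAttention-style
    # kernels that avoid materializing the full matrix.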

kadushka

> We have reached a point where the bottleneck in genAI is not knowledge or accuracy; it is the context window and speed.

You’re joking, right? I’m using o3 and it can’t do half of the coding tasks I tried.