Cerebras Code now supports GLM 4.6 at 1000 tokens/sec
16 comments
November 8, 2025
lordofgibbons
At what quantization? And if it is in fact quantized below fp8, how is the performance impacted on all the various benchmarks?
behnamoh
If they don't quantize the model, how do they achieve these speeds? Groq also says they don't quantize models (and I want to believe them) but we literally have no way to prove they're right.
This is important because their $50 premium plan (as opposed to $20 for Claude Pro or ChatGPT Plus) should be justified by the speed. GLM 4.6 is fine, but I still don't think it's at the GPT-5/Claude Sonnet 4.5 level, so if I'm paying $50 for it on Cerebras it should be mainly because of the speed.
What kind of workflow justifies this? I'm genuinely curious.
cschneid
So apparently they have custom hardware built as truly gigantic chips - each one spanning an entire wafer. Presumably they keep the whole model on chip, in what is effectively a huge L3 cache (on-wafer SRAM). So the memory bandwidth is absurdly high, which allows very fast inference.
It's more expensive than a cluster of Nvidia chips for the same raw compute, but the Nvidia cluster can't match that per-stream throughput.
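A rough back-of-envelope (every number below is an assumption for illustration, not a spec for this GLM deployment): decode speed on a memory-bound model is roughly bandwidth divided by the bytes read per generated token.

    # Back-of-envelope: decode tokens/sec ~= memory bandwidth / bytes read per token.
    # All numbers here are assumptions for illustration, not Cerebras specs.
    sram_bandwidth = 20e15     # assume ~20 PB/s of on-wafer SRAM bandwidth
    active_params = 32e9       # assume ~32B active parameters per token (MoE)
    bytes_per_param = 2        # assume 16-bit weights

    bytes_per_token = active_params * bytes_per_param
    ceiling = sram_bandwidth / bytes_per_token
    print(f"memory-bound ceiling: ~{ceiling:,.0f} tokens/sec per stream")

Real throughput sits far below that ceiling once compute, interconnect, and batching come into play, but it shows why keeping the weights in on-chip SRAM moves the bottleneck away from memory bandwidth.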
As far as price goes, as a coder I'm giving the $50 plan a shot for a month. I haven't figured out how to adapt my workflow to the faster speeds yet (I'm also still learning and setting up opencode).
bigyabai
For $50/month, it's a non-starter. I hope they can find a way to use all this excess bandwidth to put out a $10 equivalent to Claude Code instead of a 1000 tok/s party trick I can't use properly.
alyxya
It would be nice if there was more information provided on that page. I assume this is just the output token generation speed. Is it using speculative decoding to get to 1000 tokens/sec? Is there lossy quantization being used to speed things up? I tend to think the number of tokens per second a model can generate is relatively low on the list of things I care about, when things like model/inference quality and the harness play a much bigger role in how I feel about using a coding agent.
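For anyone unfamiliar, speculative decoding (if that's even what they're doing - I don't know) works roughly like this: a cheap draft model proposes a few tokens, the big target model verifies the whole block in one pass, and you keep the prefix it agrees with plus one token of its own. A toy sketch where both "models" are stand-in deterministic functions, purely hypothetical:

    # Toy sketch of speculative decoding over integer "tokens".
    # draft_model/target_model are hypothetical stand-ins, not a real inference stack.
    def target_next(ctx):
        # Pretend ground-truth next token from the expensive model.
        return (ctx[-1] * 3 + 1) % 100

    def draft_model(ctx, k):
        # Cheap approximation: right most of the time, wrong now and then.
        out, c = [], list(ctx)
        for _ in range(k):
            guess = target_next(c) if c[-1] % 5 else (c[-1] + 1) % 100
            out.append(guess)
            c.append(guess)
        return out

    def target_model(ctx, proposed):
        # One "big" pass: return length of the agreeing prefix plus a correction token.
        c = list(ctx)
        for i, tok in enumerate(proposed):
            if tok != target_next(c):
                return i, target_next(c)
            c.append(tok)
        return len(proposed), target_next(c)

    def speculative_decode(ctx, steps, k=4):
        for _ in range(steps):
            proposed = draft_model(ctx, k)                  # k cheap guesses
            n_ok, correction = target_model(ctx, proposed)  # one verify pass
            ctx = ctx + proposed[:n_ok] + [correction]      # accept prefix + 1 target token
        return ctx

    print(speculative_decode([7], steps=5))

The win is that each expensive verify pass can emit several tokens instead of one, which is one way to push tokens/sec up without touching the weights.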
cschneid
Yes this is the output speed. Code just flashes onto the page, it's pretty impressive.
They've claimed repeatedly in their discord that they don't quantize models.
The speed of things does change how you interact with it, I think. I had this new GLM model hooked up to opencode as the harness with their $50/mo subscription plan. It was seriously fast to answer questions, although there are still big pauses in the workflow when the per-minute request cap is hit.
I got a meaningful refactor done, maybe a touch faster than I would have in claude code + sonnet? But my human interaction with it felt like the slow part.
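If anyone wants to reproduce the setup: Cerebras exposes an OpenAI-compatible API, so any OpenAI-style client (or a harness like opencode) can be pointed at it. A minimal sketch with the openai Python SDK - the base URL is the one their docs advertise, and the "glm-4.6" model id is my guess, so check their current model list:

    # Minimal sketch: stream a chat completion from Cerebras's OpenAI-compatible endpoint.
    # Model id "glm-4.6" is an assumption; use whatever id the Cerebras docs list.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",
        api_key=os.environ["CEREBRAS_API_KEY"],
    )

    stream = client.chat.completions.create(
        model="glm-4.6",
        messages=[{"role": "user", "content": "Rewrite this loop as a list comprehension."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)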
alyxya
The human interaction part is one of the main limitations to speed, where the more autonomous a model can be, the faster it is for me.
niklassheth
This is more evidence that Cognition's SWE-1.5 is a GLM-4.6 finetune
nl
Not at all. Any model with a somewhat similar architecture and roughly similar size should run at about the same speed on Cerebras.
It's like saying Llama 3.2 3B and Gemma 4B are fine-tunes of each other because they run at similar speeds on Nvidia hardware.
prodigycorp
Can you provide more context for this? (eg Was SWE-1.5 released recently? Is it considered good? Is it considered fast? Was there speculation about what the underlying model was? How does this prove that it's a GLM finetune?)
mhuffman
I suspect they are referencing the 950 tok/s claim on Cognition's page.
prodigycorp
Ah. Thx. Blogpost for others: https://cognition.ai/blog/swe-1-5
Takeaway is that this is a Sonnet-ish model at 10x the speed.
renewiltord
Unfortunately for me, the models on Cerebras weren't as good as Claude Code. Speedy, but I needed to iterate more. Codex is trustworthy and slow. Claude is better at iterating. But none of the Cerebras models at the $50 tier were worth anything for me. They would have been something if they'd come out earlier, but we have these alternatives now.
gatienboquet
Vibe Slopping at 1000 tokens per second
Was able to sign up for the Max plan & start using it via opencode. It does a way better job than Qwen3 Coder in my opinion. Still extremely fast, but in less than 30 minutes I was able to use 1M input tokens, so with multiple agents running I should be able to hit that 120M daily token limit. The speed difference compared to Claude Code is significant though - to the point where I'm not waiting for generation most of the time, I'm waiting for my tests to run.
For reference, each new request needs to send all previous messages - tool calls force new requests too. So it's essentially cumulative when you're chatting with an agent: my opencode agent's context window is only 50% used at 72k tokens, but Cerebras's usage tracking online shows that I've already used 1M input tokens and 10k output tokens.
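A toy illustration of that accounting, with made-up per-turn sizes (not measured from opencode): the visible context grows linearly, but billed input grows roughly quadratically because the whole history is resent on every request.

    # Why billed input tokens balloon past the visible context window:
    # every turn (including each tool call) resends the entire conversation so far.
    # Turn sizes below are made-up numbers for illustration.
    turn_tokens = [2_000] + [1_500] * 40   # initial prompt, then ~40 agent/tool turns

    context = 0
    billed_input = 0
    for t in turn_tokens:
        billed_input += context + t   # full history resent, plus the new turn
        context += t                  # the window itself only grows linearly

    print(f"context in window: {context:,} tokens")            # ~62k
    print(f"cumulative billed input: {billed_input:,} tokens")  # ~1.3M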