Cerebras Code
128 comments · August 1, 2025 · Flux159
sysmax
Adding entire files into the context window and letting the AI sift through it is a very wasteful approach.
It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slice the files on the symbol level.
So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:
class MyClass {
    void foo();
    void bar() {
        // code
    }
};
Any AI-suggested changes are then easy to merge back (renamings are the only notable exception), so it works really fast.

I am currently working on an editor that combines this approach with the ability to step back and forth between the edits, and it works really well. I absolutely love the Cerebras platform (they have a free tier directly and a pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds from single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. It's also great for things like applying known algorithms to spread-out data structures, where including all the files would kill the context window, but pulling in individual types works just fine with a fraction of the tokens.
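For illustration, here's a minimal sketch of the collapse step in Python, using only the stdlib ast module. The helper is hypothetical (my editor actually works on C/C++-style symbols), but the idea is the same:

import ast

def collapse(source: str, keep_bodies: set[str]) -> str:
    """Stub out every function body except those named in keep_bodies,
    so only signatures plus the kept definitions reach the model."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) \
                and node.name not in keep_bodies:
            node.body = [ast.Expr(ast.Constant(...))]  # body becomes '...'
    return ast.unparse(tree)

src = '''
class MyClass:
    def foo(self):
        return 1
    def bar(self):
        return self.foo() + 1
'''
print(collapse(src, keep_bodies={"bar"}))  # foo collapses to '...', bar stays intact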
If you don't mind the shameless plug, there's more explanation of how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...
postalcoder
This works if your code is exceptionally well composed. Anything less can lead to Looney Tunes levels of goofiness in behavior, especially if there's as little as one or two lines of crucial context elsewhere in the file.

This approach saves tokens in theory, but I find it can lead to wastefulness as the model tries to figure out why things aren't working, when loading the full file would have solved the problem in a single step.
sysmax
It greatly depends on the type of work you are trying to delegate to the AI. If you ask it to add one entire feature at a time, file level could work better. But the time and costs go up very fast, and it's harder to review.
What works for me (adding features to huge interconnected projects) is to think about what classes, algorithms, and interfaces I want to add, and then give very brief prompts like "split class into abstract base + child like this" and "add another child supporting x, y and z".

So, I still make all the key decisions myself, but I get to skip typing the most annoying and repetitive parts. Also, the code doesn't look much different from what I could have written by hand; it just gets done about 5x faster.
DrBenCarson
Yep, and it collapses in the enterprise. The code you're referencing might well be from some niche vendor's bloated library with multiple incoherent abstractions, etc. Context is necessarily big.
BenGosub
If they say it costs $50 per month, why do you need to make additional payments?
Havoc
This seems to be rate-limited by message, not by token, so the lack of cache may matter less.
andhuman
No it’s by token. The FAQ says this:
> Actual number of messages per day depends on token usage per request. Estimates based on average requests of ~8k tokens each for a median user.
https://cerebras-inference.help.usepylon.com/articles/346886...
NitpickLawyer
Yes, but the new "thing" now is "agentic", where the driver is tool use. At every point where the LLM decides to use a tool, a new request gets sent. So for a simple task where the model needs to edit one function down the tree, there might be 10 calls: the 1st with the task, 2-5 for "read_file", then the model starts writing code, 6-7 trying to run it, 8 fixing something, and so on...
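A toy simulation of the fan-out (the turn script below is made up; a real agent decides tool use dynamically):

# One user prompt becomes many billed requests as the agent loops on tools.
messages = [{"role": "user", "content": "edit one function down the tree"}]
turns = ["read_file"] * 4 + ["write_code"] + ["run_tests"] * 2 + ["fix_code"] + [None]
requests = 0
for tool in turns:
    requests += 1                    # every turn is a fresh API request
    messages.append({"role": "assistant", "tool": tool})
    if tool is None:                 # model finished; no more tool calls
        break
    messages.append({"role": "tool", "content": f"<{tool} output>"})
print(requests, "requests,", len(messages), "messages in the transcript")  # 9 requests, 18 messages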
itsafarqueue
Yup. If you’ve ever watched a 60+ minute agent loop spawning sub agents, your “one message” prompt leaves you several hundred messages in the hole.
Flux159
The lack of caching causes the price to increase with each message or tool call in a chat, because you need to send the entire history back after every tool call. Since there isn't any discount for cached tokens, you're looking at very expensive chat threads.
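A back-of-envelope sketch of how that compounds (the ~2k tokens per round trip is a made-up figure; $2/1M is the API rate cited elsewhere in the thread):

# Without caching, turn n re-sends everything from turns 1..n as input tokens.
tokens_per_turn = 2_000              # assumed new tokens per tool-call round trip
turns = 30
total_input = sum(tokens_per_turn * n for n in range(1, turns + 1))
print(total_input)                   # 930,000 input tokens billed
print(f"${total_input * 2.00 / 1e6:.2f}")  # ~$1.86 per thread, vs ~$0.12 if only new tokens were billed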
thanhhaimai
> running at speeds of up to 2,000 tokens per second, with a 131k-token context window, no proprietary IDE lock-in, and no weekly limits!
I was excited, then I read this:
> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.
I don't mind paying for services I use. But it's hard to take this seriously when the first paragraph's claims contradict the fine print.
superasn
Pretty sure this is there to prevent this[1] from happening to them
bravesoul2
That's a CO2 emissions leaderboard!
LudwigNagasena
That’s almost no CO2 emissions at all. Here is a CO2 emissions leaderboard (need to sort by the correct column): https://celebrityprivatejettracker.com/leaderboard/
echelon
Oh my god. That's insane.
The anti-AI people would be pulling their pitchforks out against these people.
Would there be any way of compiling this without people's consent? Looking at GitHub public repos, etc.?
I imagine a future where we're all automatically profiled like this. Kind of like perverse employee tracking software.
attentive
To put this into perspective, a GitHub Copilot Business license is 300 "premium" requests a MONTH.
sneilan1
1,000 messages per day should be plenty as a daily development driver. I use Claude Code with Sonnet 4 exclusively, and I do not send more than 1,000 messages per day. At least, that is my current understanding; I am certainly not pressing enter 1,000 times! Maybe there are more messages being sent under the hood than I realize?
thanhhaimai
The issue is not about whether the limit is too high or too low. What turned me away was that they claimed "no weekly limits" as a selling feature, without mentioning that they changed it to a daily limit.

I understand it's a sales tactic. But it seems less than forthcoming, and it makes it hard for me to trust the rest of the claims.
twothreeone
I don't see what's hard to understand about this... other providers have both weekly and daily limits. If you max out your daily limit every day, you might still hit your weekly limit after 3-4 days of usage, meaning you cannot send more for the rest of the week. This is saying that no such weekly limit exists on top of the daily one. E.g. see https://techcrunch.com/2025/07/28/anthropic-unveils-new-rate...
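With made-up numbers for the weekly cap (Cerebras's daily cap of 1,000 messages is real; the weekly figure here is purely illustrative):

daily_cap = 1_000     # messages/day, as advertised
weekly_cap = 3_500    # hypothetical weekly cap of the kind other providers impose
print(weekly_cap / daily_cap)  # 3.5 -> maxing out daily locks you out mid-week
# Cerebras's pitch is that no weekly_cap exists on top of daily_cap.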
itsafarqueue
Your “one enter” press might generate dozens or even hundreds of messages in an agent. Every file read, re-read, read a bit more, edit, whoops re-edit, ls, grep, etc etc counts as a message.
SamDc73
Still not sure if it's 1,000 messages or 1,000 calls, though. If it's messages, that's good.
kristjansson
It’s a true statement - no weekly limits, just a daily limit. Easier to work with when you can only get locked out of your tool for 23h59m
handfuloflight
You're going to send 1,000 messages in 1 minute?
brandall10
The CC weekly limits are in place to thwart abuse. This bit of marketing isn't particularly useful, as that limit primarily impacts those who run it at all hours.

OTOH, 5-hour limits are far superior to daily limits when both can realistically be hit.
bongodongobob
The weekly limit is the daily limit x 7.
bluelightning2k
What a productive second you must have had
amirhirsch
The distinction is from the weekly limits of Claude Code.
sneilan1
Claude Code weekly limits are hard to pin down; it's not easy to understand their usage limits. I've found that when I run into too much Opus usage, I switch to Sonnet, but I've never run into a usage limit with Sonnet 4 yet.
crawshaw
If you would like to try this in a coding agent (we find the qwen3-coder model works really well in agents!), we have been experimenting with Cerebras Code in Sketch. We just pushed support, so you can run it with the latest version, 0.0.33:
brew install boldsoftware/tap/sketch
CEREBRAS_API_KEY=...
sketch --model=qwen3-coder-cerebras -skaband-addr=
Our experience is it seems overloaded right now, to the point where we have better results with our usual hosted version: sketch --model=qwen
unraveller
Some users who signed up for Pro ($50 p.m.) are reporting stricter limitations than those advertised.
>While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit. [1]
That assumes an average of 7.5k tokens per request, whereas in their marketing videos they show API requests ballooning by ~24k tokens per request. Still cheaper than the raw API price.
[1] https://old.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
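Working out the effective request budget from those figures:

daily_tokens = 7_500_000
print(daily_tokens // 7_500)    # 1,000 requests/day at the assumed ~7.5k-token average
print(daily_tokens // 24_000)   # 312 requests/day at the ~24k seen in the demos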
itsafarqueue
Bait-and-switched their FAQ after the fact, too. Come on Cerebras, it's only VC money you're burning here in the first place; let's see some commitment to winning market share. :money: :fire:
exclipy
Windsurf also has Cerebras/Qwen3-Coder. 1000 user messages per month for $15
alfalfasprout
2k tokens/second is insane. While I'm very much against vibe coding, such performance essentially means you can get near-GitHub-Copilot speed with drastically better quality.

For in-editor use, that's game-changing.
itsafarqueue
At full pace that means 62 mins until you hit the daily cap.
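The arithmetic, assuming the 7.5M-token daily cap reported elsewhere in the thread:

print(7_500_000 / 2_000 / 60)   # 62.5 minutes of nonstop 2,000-tok/s generation per day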
namanyayg
I was waiting for more subscription-based services to pop up to compete with the inference providers at the commodity level.

I think a lot more companies will follow suit, and the competition will make pricing much better for the end user.
congrats on the launch Cerebras team!
ktsakas
Does it work with claude-code-router? I was getting API errors this week trying to use Qwen3 on Cerebras through OpenRouter with claude-code-router.
d4rkp4ttern
I really wish the Qwen3 folks would put up an Anthropic-compatible API like the Kimi and GLM/Zai folks cleverly did; that makes their models trivially usable in Claude Code, via this dead-simple setup:
https://github.com/pchalasani/claude-code-tools?tab=readme-o...
amirhirsch
API Error: 422 {"error":{"message":"Error from provider: {\"message\":\"body.messages.0.system.content: Input should be a valid string\",\"type\":\"invalid_request_error\",\"param\":\"validation_error\",\"code\":\"wrong_api_format\"}
amirhirsch
I ended up getting it working by copying the transformer in this issue: https://github.com/musistudio/claude-code-router/issues/407

It hits the requests-per-minute limit instantly, and then you wait a minute.
nubela
Did you make a payment? I also found it unusable due to rate limits. Not sure if that was because I was on the free trial.
lvl155
Their hardware is incredible. Why aren’t more investors lining up for this in this environment?
no_flaks_given
This model is heavily quantized and the quality isn't great, but that's necessary because, just like everyone else except Nvidia and AMD, they shat the bed. They went for super crazy fast compute and not much memory, assuming that models would plateau at a few billion parameters.

Last year 70B parameters was considered huge, and a good place to standardize around. Today we have 1T-parameter models, and we know it still scales linearly with parameters. So next year we might have 10T-parameter LLMs, and these guys will still be playing catch-up.

All that matters for inference right now is how many HBM chips you can stack, and that's it.
dmitrygr
Contradictions do not exist. Whenever you think that you are facing a contradiction, check your premises. You will find that one of them is wrong.
thfuran
Neither do perfectly efficient, perfectly rational markets.
sejje
A perfectly efficient market would be a bad premise, sure.
arisAlexis
Or just bad marketing vs the Goliath (Nvidia)
segmondy
FYI, you are probably going to use up your tokens, because there's a total limit of tokens per day; in about 300 requests it's feasible to use it all up. See https://www.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
sneilan1
I'm so excited to see a real competitor to Claude Code! Gemini CLI, while decent, does not have a $200/month pricing model and charges per API usage; Codex is the same. I'm trying to get into https://cloud.cerebras.ai/ to try the $50/month plan, but I can't even get in.
bangaladore
Unless I'm misunderstanding something, Cerebras Code is not equivalent to Claude Code or Gemini CLI. It's a strange name for a subscription to access an API endpoint.
You take your Cerebras Code endpoint and configure XYZ CLI tool or IDE plugin to point at it.
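For example, assuming Cerebras exposes an OpenAI-compatible endpoint (the base URL and model id below are assumptions; check their docs):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_CEREBRAS_KEY",
)
resp = client.chat.completions.create(
    model="qwen-3-coder-480b",              # hypothetical model id
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(resp.choices[0].message.content)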
sneilan1
Oh, so this is not an integrated command-line tool like Claude Code? I assumed it was something where Cerebras released a decent prompt and command-line agent setup. A lot of the value of Claude Code is how polished it is and how much work went into the prompt design.
unshavedyak
There is, I believe, a forked Gemini Code that works like Claude Code, or so it looks on YouTube.
wordofx
This doesn’t feel like a competitor. Amp does tho.
flashblaze
I don't hear about Amp often. Have you tried it? How does it compare to Claude Code?
wordofx
It's really good. I was discussing it with a friend recently who said he thinks it works out cheaper because it takes fewer loops to get things right. I've been having better success with it, so I'd recommend it over Claude Code for now.
attentive
Attn: Cerebras
Any attempt to deal with "<think>" in the code gets it replaced with "<tool_call>".
Both in inference.cerebras.ai chat and API.
Same model on chat.qwen.ai doesn't do it.
Tried this out with Cline using my own API key (Cerebras is also available as a provider for Qwen3 Coder via OpenRouter here: https://openrouter.ai/qwen/qwen3-coder) and realized that without caching, this becomes very expensive very quickly. Specifically, after each new tool call, you're sending the entire previous message history as input tokens, which are priced at $2/1M via the API just like output tokens.

The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching and reduced token pricing for cache hits, I think I would run this more, but right now it's too expensive per agent run.