Llama 4 Now Live on Groq
56 comments
April 5, 2025
Game_Ender
IAmNotACellist
I deeply crave prosumer hardware that can sit on my shelf and handle massive models, like 200-400B at a reasonable quant. Something like Groq or Digits but at the cost of a high-end gaming PC, like $3k. This has to be a massive market, considering that even ancient Pascal-series GPUs that were once $50 are going for $500.
darksaints
I have that irresistible urge too, but I have to keep reminding myself that I could spend $2000 in credits over the course of a year, and get the performance and utility of a $40k server, with scalable capacity, and without any risk that that investment will be obsolete when Llama5 comes out.
zozbot234
> I deeply crave prosumer hardware that can sit on my shelf and handle massive models, like 200-400B at a reasonable quant.
So, an Apple Mac Studio?
numa7numa7
Nvidia's working on it. 200B at $3k
https://www.nvidia.com/en-us/products/workstations/dgx-spark...
almostgotcaught
> This has to be a massive market
It's not - it's absolutely a vanishingly small market.
sofixa
The Framework Desktop is one not-absurdly-expensive option. The memory speed isn't great (~256 GB/s), but anything faster with those memory requirements at least doubles the price (e.g. a Mac Studio; only the highest-tier M chips have faster memory).
renewiltord
At home people would rather use the cloud.
latchkey
> So they want to sell you their custom machines
They stopped selling the hardware to the public, and it takes an extraordinary amount of it to run these larger models due to limited RAM.
ozenhati
hi! i work @ groq and just made an account here to answer any questions for anyone who might be confused. groq has been around since 2016 and although we do offer hardware for enterprises in the form of dedicated instances, our goal is to make the models that we host easily accessible via groqcloud and groq api (openai compatible) so you can instantly get access to fast inference. :)
we have a pretty generous free tier and a dev tier you can upgrade to for higher rate limits. also, we deeply value privacy and don't retain your data. you can read more about that here: https://groq.com/privacy-policy/
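For anyone wanting to try it, here is a minimal sketch of calling the OpenAI-compatible endpoint from Python. The endpoint path and payload shape are the standard chat-completions format mentioned above; the model name is one that appears elsewhere in this thread and may change.

```python
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_chat_request(prompt, model="meta-llama/llama-4-scout-17b-16e-instruct"):
    """Assemble the JSON body for an OpenAI-style chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload, api_key):
    """POST the payload with a bearer token (requires a real API key)."""
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Say hello in one word.")

# Only hit the network when a key is actually configured.
if os.environ.get("GROQ_API_KEY"):
    print(send(payload, os.environ["GROQ_API_KEY"]))
```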
ronsor
Groq was suing Grok at some point, but Elon Musk is basically untouchable now.
minimaxir
Groq's blog post about the issue was a shitpost, not an actual legal document.
sejje
Suing for what? The name?
simonw
It's live on Groq, Together and Fireworks now.
All three of those can also be accessed via OpenRouter - with both a chat interface and an API:
- Scout: https://openrouter.ai/meta-llama/llama-4-scout
- Maverick: https://openrouter.ai/meta-llama/llama-4-maverick
Scout claims a 10 million token input context, but the available providers currently seem to limit it to 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full-sized 10 million token window running?
Maverick claims 1 million; Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.
spmurrayzzz
I'd pump the brakes a bit on the 10M context expectations. It's just another linear attention mechanism with RoPE scaling [1]. They're doing something similar to what Cohere did recently, using a global attention mask and a local chunked attention mask.
Notably, the max sequence length in training was 256k, but the native short context is still just 8k. I'd expect the retrieval performance to be all over the place here. Looking forward to seeing some topic-modeling benchmarks run against it (I'll be doing so with some of my local/private datasets).
[1] https://github.com/meta-llama/llama-models/blob/eececc27d275...
EDIT: to be fair/complete, I should note that they do claim perfect NIAH text-retrieval performance across all 10M tokens for the Scout model in their blog post: https://ai.meta.com/blog/llama-4-multimodal-intelligence/. There are some serious limitations and caveats to that particular flavor of test, though.
theGnuMe
> using a global attention mask and a local chunked attention mask.
Would you mind expanding on this? Or point to a reference or two? Thanks! I am trying to understand it.
spmurrayzzz
The source file I linked in my initial comment is honestly the most succinct way to understand how this works, but the TL;DR is that there is a NoPE layer interval parameter passed to the transformer block implementation. That defines how frequent a "no positional encoding" layer is used. The NoPE layers use the global attention mask, which is a traditional application of attention (attends to all tokens in the context window). The other layers use RoPE (rotary positional encodings) and a chunked local attention mask, which only attends to a fixed set of tokens in each chunk.
There is a wealth of literature to catch up on to understand the performance motivations behind those choices, but you can think of it as essentially a balancing act. They want to extend the context length, which is limited by conventional attention compute scaling. RoPE on the other hand is a trick that helps you to scale attention to longer context, but at the cost of poor retrieval across the entire context window. This approach is a hybrid of those two things. The recent Cohere models employ a similar methodology.
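A toy sketch of the layer schedule and masks described above. The interval value and chunk size here are illustrative, not Meta's actual configuration; see the linked source file for the real parameters.

```python
def layer_schedule(num_layers, nope_interval=4):
    """Every `nope_interval`-th layer is a NoPE/global-attention layer;
    the rest use RoPE with chunked local attention."""
    return [
        "nope_global" if (i + 1) % nope_interval == 0 else "rope_local"
        for i in range(num_layers)
    ]

def chunked_local_mask(seq_len, chunk_size):
    """Causal mask where each token attends only within its own chunk."""
    return [
        [j <= i and (i // chunk_size) == (j // chunk_size)
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

def global_causal_mask(seq_len):
    """Ordinary causal mask: attend to every earlier token."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

print(layer_schedule(8))          # which layers are global vs local
print(chunked_local_mask(6, 3))   # tokens 0-2 and 3-5 form separate chunks
```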
parhamn
I might be biased by the products I'm building, but it feels to me that function support is table stakes now? Are open-source models just missing the dataset to fine-tune one?
Very few of the models supported on Groq/Together/Fireworks support function calling, and rarely the interesting ones (DeepSeek V3, large Llamas, etc.).
ozenhati
100%. we've found that llama-3.3-70b-versatile and qwen-qwq-32b perform exceptionally well with reliable function calling. we had recognized the need for this and our engineers partnered with glaive ai to create fine tunes of llama 3.0 specifically for better function calling performance until the llama 3.3 models came along and performed even better.
i'd actually love to hear your experience with llama scout and maverick for function calling. i'm going to dig into it with our resident function calling expert rick lamers this week.
garfij
Thank you for saying this out loud. I've been losing my mind wondering where the discussion on this was. LLMs without Tool Use/Function Calling is basically a non starter for anything I want to do.
jstanley
When I was working with LLMs without function calling I made the scaffold put some information in the system prompt that tells it some JSON-ish syntax it can use to invoke function calls.
It places more of a "mental burden" on the model to output tool calls in your custom format, but it worked enough to be useful.
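A minimal sketch of that approach: instruct the model (via the system prompt) to emit tool calls in a custom JSON format, then parse them out of the raw completion. The TOOL_CALL marker syntax here is invented for illustration.

```python
import json
import re

SYSTEM_PROMPT = """You can call tools by emitting a line of the form:
TOOL_CALL {"tool": "<name>", "args": {...}}
Available tools: get_weather(city), search(query)."""

# One tool call per line, JSON object after the marker.
TOOL_CALL_RE = re.compile(r'^TOOL_CALL\s+(\{.*\})\s*$', re.MULTILINE)

def extract_tool_calls(completion):
    """Return every well-formed tool call found in the model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # models sometimes emit malformed JSON; skip those lines
    return calls

output = 'Let me check.\nTOOL_CALL {"tool": "get_weather", "args": {"city": "Paris"}}'
print(extract_tool_calls(output))
```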
minimaxir
Although Llama 4 is too big for mere mortals to run without many caveats, the economics of calling a dedicated-hosted Llama 4 are more interesting than expected.
$0.11 per 1M tokens, a 10 million token context window (not yet implemented in Groq), and faster inference due to fewer activated parameters allow for some specific applications that were not cost-feasible with GPT-4o/Claude 3.7 Sonnet. That's all dependent on whether the quality of Llama 4 is as advertised, of course, particularly around that 10M context window.
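The back-of-the-envelope arithmetic at that price point:

```python
# $0.11 per 1M tokens, the rate quoted above.
PRICE_PER_MILLION = 0.11

def cost_usd(tokens, price_per_million=PRICE_PER_MILLION):
    return tokens * price_per_million / 1_000_000

# Filling the full 10M-token context once costs about $1.10:
print(cost_usd(10_000_000))
# A corpus of 10,000 documents at ~2,000 tokens each:
print(cost_usd(10_000 * 2_000))
```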
zozbot234
It's possible that we'll see smaller Llama 4-based models in the future, though. Similar to Llama 3.2 1B, which was released later than other Llama 3.x models.
sroussey
Yeah, I too am looking forward to their small text only models at 3B and 1B.
latchkey
> Llama 4 is too big for mere mortals to run without many caveats
AMD MI300x has day zero support to run it using vLLM. Easy enough to rent them for decent pricing.
sinab
I got an error when passing a prompt with about 20k tokens to the Llama 4 Scout model on groq (despite Llama 4 supporting up to 10M token context). groq responds with a POST https://api.groq.com/openai/v1/chat/completions 413 (Payload Too Large) error.
Is there some technical limitation on the context window size with LPUs or is this a temporary stop-gap measure to avoid overloading groq's resources? Or something else?
greeneggs
FYI, the last sentence, "Start building today on GroqCloud – sign up for free access here…" links to https://conosle.groq.com/ (instead of "console")
snikch
Fixed. Thanks for the report.
vessenes
Just tried this thank you. Couple qs - looked like just scout access for now, do you have plans for larger model access? Also, seems like context length is always fairly short with you guys, is that architectural or cost-based decisions?
ozenhati
amazing! and yes, we'll have maverick available today. the reason we limit ctx window is because demand > capacity. we're pretty busy with building out more capacity so we can get to a state where we give everyone access to larger context windows without melting our currently available lpus, haha.
vessenes
cool. I would so happily pay you guys for a long-context API that aider could point at -- the speed is just game-changing. I know your arch is different, so I understand it's an engineering lift. But I bet you'd find some pareto-optimal point in the curve where you could charge a lot more for the speed you guys can do, if it's long enough for coding.
jasonjmcghee
Seems to be about 500 tk/s. That's actually significantly less than I expected / hoped for, but fantastic compared to nearly anything else. (specdec when?)
Out of curiosity, the console is letting me set max output tokens to 131k but it errors above 8192. What's the max intended to be? (8192 max output tokens would be rough after getting spoiled with the 128K output of Claude 3.7 Sonnet and 64K of the Gemini models.)
ozenhati
do you happen to be trying this out on free tier right now? because our rate limits are at 6k tokens per minute on free tier for this model, which might be what you're running into.
jasonjmcghee
When I tried llama4 scout and tried to set the max output tokens above 8192 it told me the max was 8192. Once I set it below, it worked. This was in the console
growdark
Would it be realistic to buy and self-host the hardware to run, for example, the latest Llama 4 models, assuming a budget of less than $500,000?
mrajcok
Yes - I'm able to run Llama 3.1 405B on 3x A6000 + 3x 4090.
Will have Llama 4 Maverick running in 4bit quantization (typically results in only minor quality degradation) once llama.cpp support is merged.
Total hardware cost well under $50,000.
The 2T Behemoth model is tougher, but enough Blackwell 6000 Pro cards (16) should be able to run it for under $200k.
briandw
Llama Scout is a 17B x 16 MoE, so that's 17B active parameters, which makes it faster to run. But the memory requirements are still large. They claim it fits on an H100, so under 80GB. A Mac Studio at 96GB could run this - by "run" I mean inference; Ollama is easy to use for this. 4x 3090 Nvidia cards would also work, but it's not the easiest PC build. The tinybox https://tinygrad.org/#tinybox is $15k and you can do LoRA fine-tuning. Could also do a regular PC with 128GB of RAM, but it would be quite slow.
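Rough weight-memory math behind the "fits on an H100" claim, assuming ~109B total parameters for Scout (the commonly cited figure; total is well under 17B x 16 because the experts share layers) and ignoring KV cache and activations:

```python
def weight_gib(params_billion, bits_per_param):
    """Approximate weight storage in GiB, weights only."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gib(109, bits):.0f} GiB")
# Only the 4-bit figure (~51 GiB) fits under an H100's 80 GB,
# consistent with the single-H100 claim being for int4.
```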
latchkey
A box of AMD MI300x (1.5TB of memory) is much less than $500k and AMD made sure to have day zero support with vLLM.
That said, I'm obviously biased but you're probably better off renting it.
geor9e
I'm glad I saw this because llama-3.3-70b-versatile just stopped working in my app. I switched it to meta-llama/llama-4-scout-17b-16e-instruct and it started working again. Maybe groq stopped supporting the old one?
imcritic
All I get is {"error":{"message":"Not Found"}}
ozenhati
can you reach out to us via live chat on console.groq.com with your organization id?
Game_Ender
To help those who got a bit confused (like me): this is Groq, the company making accelerators designed specifically for LLMs that they call LPUs (Language Processing Units) [0]. So they want to sell you their custom machines that, while expensive, will be much more efficient at running LLMs for you. Meanwhile there is also Grok [1], which is xAI's series of LLMs and competes with ChatGPT and other models like Claude and DeepSeek.
EDIT - Seems that Groq has stopped selling their chips and now will only partner to fund large build outs of their cloud [2].
0 - https://groq.com/the-groq-lpu-explained/
1 - https://grok.com/
2 - https://www.eetimes.com/groq-ceo-we-no-longer-sell-hardware