The Llama 4 herd
584 comments
April 5, 2025 · laborcontract
InvOfSmallC
For a super ignorant person:
Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each
Those experts are LLM trained on specific tasks or what?
vessenes
This was an idea that sounded somewhat silly until it was shown it worked. The idea is that you encourage through training a bunch of “experts” to diversify and “get good” at different things. These experts are say 1/10 to 1/100 of your model size if it were a dense model. So you pack them all up into one model, and you add a layer or a few layers that have the job of picking which small expert model is best for your given token input, route it to that small expert, and voila — you’ve turned a full run through the dense parameters into a quick run through a router and then a 1/10 as long run through a little model. How do you get a “picker” that’s good? Well, it’s differentiable, and all we have in ML is a hammer — so, just do gradient descent on the decider while training the experts!
This generally works well, although there are lots and lots of caveats. But it is (mostly) a free lunch, or at least a discounted lunch. I haven’t seen a ton of analysis on what different experts end up doing, but I believe it’s widely agreed that they tend to specialize. Those specializations (especially if you have a small number of experts) may be pretty esoteric / dense in their own right.
Anthropic’s interpretability team would be the ones to give a really high quality look, but I don’t think any of Anthropic’s current models are MoE.
Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking, but I might just be biased towards more weights. And they are undeniably faster and better per second of clock time, GPU time, memory, and bandwidth usage than dense models with similar training regimes.
zamadatix
The only thing about this which may be unintuitive from the name is an "Expert" is not something like a sub-llm that's good at math and gets called when you ask a math question. Models like this have layers of networks they run tokens through and each layer is composed of 256 sub-networks, any of which can be selected (or multiple selected and merged in some way) for each layer independently.
So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just done a bit deeper in the model than one may assume.
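To make the per-layer routing concrete, here is a minimal sketch of one MoE feed-forward layer in PyTorch. The sizes, the 16-expert count, and the top-2 routing are illustrative assumptions, not Llama 4's actual configuration:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MoELayer(nn.Module):
      """Toy mixture-of-experts feed-forward block with a learned router."""
      def __init__(self, d_model=1024, d_ff=4096, n_experts=16, top_k=2):
          super().__init__()
          # Each "expert" is an ordinary feed-forward sub-network.
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          )
          # The router is a plain linear layer, trained by gradient descent
          # together with the experts.
          self.router = nn.Linear(d_model, n_experts)
          self.top_k = top_k

      def forward(self, x):  # x: (n_tokens, d_model)
          scores = self.router(x)                        # (n_tokens, n_experts)
          weights, idx = scores.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)           # blend the chosen experts
          out = torch.zeros_like(x)
          for k in range(self.top_k):
              for e, expert in enumerate(self.experts):
                  mask = idx[:, k] == e                  # tokens routed to expert e
                  if mask.any():
                      out[mask] += weights[mask, k:k+1] * expert(x[mask])
          return out

Every transformer block has its own router and expert set, so the choice of expert happens independently per layer and per token; only the selected experts' weights do work for a given token, which is where the 17B-active vs. 109B-total distinction comes from.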
philsnow
The idea has also been around for at least 15 years; "ensemble learning" was a topic in my "Data Mining" textbook from around then.
Meta calls these individually smaller/weaker models "experts" but I've also heard them referred to as "bozos", because each is not particularly good at anything and it's only together that they are useful. Also bozos has better alliteration with boosting and bagging, two terms that are commonly used in ensemble learning.
Buttons840
If I have 5000 documents about A, and 5000 documents about B, do we know whether it's better to train one large model on all 10,000 documents, or to train 2 different specialist models and then combine them as you describe?
faraaz98
I've been calling for this approach for a while. It's kinda similar to how the human brain has areas that are good at specific tasks
mrbonner
So this is kind of an ensemble sort of thing in ML like random forest and GBT?
randomcatuser
yes, and it's on a per-layer basis, I think!
So if the model has 16 transformer layers to go through on a forward pass, and each layer, it gets to pick between 16 different choices, that's like 16^16 possible expert combinations!
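For scale: 16^16 is roughly 1.8 × 10^19 distinct expert paths through the network for a single token, and that's before counting routers that blend the top-k experts per layer rather than picking just one.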
tomjen3
Cool. Does that mean I could just run the query through the router and then load only the required expert? That is, could I feasibly run this on my MacBook?
chaorace
The "Experts" in MoE is less like a panel of doctors and more like having different brain regions with interlinked yet specialized functions.
The models get trained largely the same way as non-MoE models, except with specific parts of the model silo'd apart past a certain layer. The shared part of the model, prior to the splitting, is the "router". The router learns how to route as an AI would, so it's basically a black-box in terms of whatever internal structure emerges from this.
pornel
No, it's more like sharding of parameters. There's no understandable distinction between the experts.
vintermann
I understand they're only optimizing for load distribution, but have people been trying to disentangle what the various experts learn?
brycethornton
I believe Mixture-of-Experts is a way for a neural network to group certain knowledge into smaller subsets. AFAIK there isn't a specific grouping goal; the network just figures out what goes where on its own, and then when an inference request is made it determines which "expert" would have that knowledge and routes it there. This makes the inference process much more efficient.
qwertox
Llama 4 Scout, Maximum context length: 10M tokens.
This is a nice development.
lelandbatey
Is the recall and reasoning equally good across the entirety of the 10M token window? Cause from what I've seen many of those window claims equate to more like a functional 1/10th or less context length.
vessenes
It’s going to take a while to see how good this window is for real use; they’ve used a couple new ideas to get to 10M token context. Right now the only really good long token model out there is Gemini Pro - and its effectiveness does start dropping maybe in the 200k token range. I imagine insiders at GOOG have access to more than the published 1M token range there.
It will be fun to see what we get here, but I have no doubt the extra tokens will be useful - lots of use cases can do almost as well with summary-level accuracy memory.
jimmyl02
the needle in a haystack benchmark looks good but at this point I think we need new benchmarks to test actual understanding of content in such a large window.
littlestymaar
I read somewhere that it was trained on 256k tokens and then extended with RoPE on top of that, not starting from 16k like everyone else does, IIRC. So even if it isn't really flawless at 10M, I'd expect it to be much stronger than its competitors up to those 256k.
Baeocystin
I assume they're getting these massive windows via RAG trickery, vectorization, and other tricks behind the curtain, because I've noticed the same as you: things start dipping in quality pretty quickly.
Does anyone know if I am correct in my assumption?
aimanbenbaha
I don't think RAG will survive this time
inertiatic
4.8b words on English Wikipedia. Knowledge cutoff of 6 months. A valid use case is to search across Wikipedia and ground your answers. Trivially proves that RAG is still needed.
drusepth
RAG still has lots of benefits for anyone paying per input token (e.g. over APIs).
acchow
This is only for the small model. The medium model is still at 1M (like Gemini 2.5)
Even if we could get the mid models to 10M, that's still a medium-sized repo at best. Repo size growth will also accelerate as LLMs generate more code. There's no way to catch up.
null
lostmsu
How did they achieve such a long window and what are the memory requirements to utilize it?
miven
According to [0] it's partly due to a key change they introduced: interleaving layers that use standard RoPE positional encodings with layers using what's called NoPE [1], not encoding positions at all and letting the model figure those out on its own (this only works because the LLMs are autoregressive, so the model can recognize an input token as being the very first by there not yet being any other tokens to attend to, and recursively derive the positions of the subsequent ones from that base case)
[0] https://ai.meta.com/blog/llama-4-multimodal-intelligence/ [1] https://arxiv.org/abs/2305.19466
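A rough sketch of what that interleaving looks like in code; the rotate-by-pairs RoPE below is the standard formulation, but the one-NoPE-layer-in-four ratio and the shapes are placeholder assumptions, not Llama 4's published values:

  import torch

  def rope(q, k, positions, base=10000.0):
      """Apply rotary position embeddings to per-head query/key tensors of shape (seq, head_dim)."""
      d = q.shape[-1]
      inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
      angles = positions[:, None].float() * inv_freq[None, :]    # (seq, d/2)
      cos, sin = angles.cos(), angles.sin()
      def rotate(x):
          x1, x2 = x[..., 0::2], x[..., 1::2]                     # split dims into 2-D pairs
          return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
      return rotate(q), rotate(k)

  def position_encode(q, k, layer_idx, positions, nope_every=4):
      # Most layers get explicit positions via RoPE; every nope_every-th layer
      # is a "NoPE" layer that receives no positional signal at all and has to
      # infer order from the causal structure, as described above.
      if (layer_idx + 1) % nope_every == 0:
          return q, k
      return rope(q, k, positions)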
null
clueless
> Knowledge cutoff: August 2024.
Could this mean training time is generally around 6 months, with 2 months of QA?
jhugg
I wish my knowledge cutoff was August 2024.
steenandersson
This made me LOL louder than I have for a long time! Agree.
bertil
Couldn’t you gradually include more recent documents as you train?
changoplatanero
You can do that but the amount of incremental data will be negligible compared to the rest of the data. Think of the knowledge cutoff more like a soft value.
soulofmischief
That makes it harder to analyze the results of training and draw conclusions for the next round.
nickysielicki
It scales depending on the dataset you want exposure on and the compute you have available, so any specific time box is kind of meaningless if you don’t know the rest of the inputs that went into it. The llama 3 paper went into a lot of this and how these decisions were made (see section 3 and onward): https://ai.meta.com/research/publications/the-llama-3-herd-o...
tl;dr: llama 3 was 54 days, but it’s more complicated than that.
ramshanker
I have a gut feeling the next in line will be 2 or more levels of MoE, further reducing the memory bandwidth and compute requirements. So the top-level MoE router decides which sub-MoE to route to.
jamesblonde
The solution to all problems in computer science is add a new level of indirection (or abstraction).
accrual
Thanks for sharing this here. At first I loved the simple Apache-style directory listing, very classic and utilitarian way to navigate new information. Then I tried clicking the FAQ and it wouldn't load anything until I allowed two different sources of JavaScript.
kristopolous
17B puts it beyond the reach of a 4090 ... anybody do 4 bit quant on it yet?
reissbaker
Oh, it'll never run on a 4090. 17B is the active parameter count, not the total param count (and "active" doesn't mean you can slice just those params out and put them on the GPU — which parameters are active constantly changes, even per-token. "Active" just means you get tokens faster than a dense model). It's 109B total parameters, so you'd need at least 54.5GB VRAM just for the weights alone.
A Framework Desktop, Mac Studio, or Nvidia DGX Spark should be able to handle the Scout model locally though... Maybe even at FP8, depending on how much context you need.
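Quick arithmetic behind those numbers (decimal GB, weights only, ignoring KV cache and runtime overhead):

  def weight_gb(params_billion, bits_per_param):
      """Storage for the weights alone, in decimal GB."""
      return params_billion * bits_per_param / 8

  for bits in (16, 8, 4):
      print(f"Scout, 109B total params @ {bits:>2}-bit: {weight_gb(109, bits):6.1f} GB")
  # 16-bit: 218.0 GB, 8-bit: 109.0 GB, 4-bit: 54.5 GB -- and since the 17B
  # active parameters change from token to token, the full 109B still has to
  # sit in fast memory.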
dragonwriter
Well, Scout should run on the rumored 96GB 4090, since it runs on a single 80GB H100. But, yeah, it'd have to be at sub-2bit quantization to run on a standard 24GB.
popinman322
You can swap experts in and out of VRAM, it just increases inference time substantially.
Depending on the routing function you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
lostmsu
Sounds runnable on 2x5090 presumably for $4k if back in stock.
taneq
Unless something’s changed you will need the whole model on the HPU anyway, no? So way beyond a 4090 regardless.
kristopolous
A habana just for inference? Are you sure?
Also, I see the 4-bit quants put it at an H100, which is fine ... I've got those at work. Maybe there will be distilled versions for running at home.
littlestymaar
You can still offload most of the model to RAM and use the GPU for compute, but it's obviously much slower than what it would be if everything was on the GPU memory.
see ktransformers: https://www.reddit.com/r/LocalLLaMA/comments/1jpi0n9/ktransf...
fsndz
Nice release. I see that everyone is playing the differentiation game now: https://medium.com/thoughts-on-machine-learning/llama-4-and-...
pavelstoev
Model training observations from both Llama 3 and 4 papers:
Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38 - 43% hardware efficiency [Meta, Llama 3].
For Llama 4 training, Meta doubled the compute, using ~32K H100s, and switched to FP8 precision. Despite FP8's higher theoretical throughput, observed efficiency dropped to about 19.7%, with GPUs delivering ~390 TFLOPS out of a theoretical 1,979 FP8 TFLOPS [Meta, Llama 4].
This is not meant as a critique; rather, it's a recognition of the enormous complexity of operating GPUs at this scale. Training massive models across tens of thousands of GPUs stretches today’s AI infrastructure to its limit.
Besides accelerating inference workloads, advanced GPU optimizations can be integrated into training and fine-tuning pipelines. From various kernel optimization techniques (over 90) to increasing memory access efficiency and scaling up to cluster-wide resource coordination, efficiency can be maximized with some complex software.
References: [Meta, Llama 3] https://ai.meta.com/research/publications/the-llama-3-herd-o... [Meta, Llama 4] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
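For anyone checking the arithmetic: those efficiency figures are just achieved TFLOPS divided by the H100 datasheet peaks for the relevant precision (989 dense BF16 / 1,979 dense FP8 TFLOPS, i.e. the no-sparsity numbers):

  H100_PEAK_TFLOPS = {"bf16_dense": 989, "fp8_dense": 1979}  # datasheet, no structured sparsity

  def mfu(achieved_tflops, peak_tflops):
      return achieved_tflops / peak_tflops

  print(f"Llama 3, BF16: {mfu(380, 989):.1%} - {mfu(430, 989):.1%}")  # ~38.4% - 43.5%
  print(f"Llama 4, FP8:  {mfu(390, 1979):.1%}")                       # ~19.7%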
rfoo
That's about the same number as for DeepSeek-V3. If you count in fp8, MFU is about 20%. MoEs are hard.
That could also be why they did fp8. If we use the theoretical performance of bf16 as the baseline (I know this makes little sense, but it's convenient for comparing with previous training runs), that's about 40% MFU, not too bad.
IOW, MoE kills training MFU and they had to do fp8 to make it not look funny. Both DeepSeek and Meta GenAI.
user070223
Never trained a model, but the precision discussion confused me, as I've never considered how many bits should be reserved for the exponent vs. the mantissa. Has anyone architected a model (somehow) such that it has a free hand at using the given bits / choosing the type, or changed types from layer to layer? Surely when training, for example, vision models, the first layers deal with the "big (yet simpler) picture" (light/dark, lines, etc.), whereas the last layers deal with the fine details.
Even though it may not be suitable for (existing) hardware implementations, it may be advantageous elsewhere, for example in learning speed.
apsec112
You can't choose arbitrary bits of mantissa, because what types are allowed is defined by the underlying hardware and instruction set (PTX for Nvidia). People have done some exploration of which layers can be quantized more vs. which need to be kept in higher precision, but this is usually done post-training (at inference time) and is largely empirical.
silverlake
I think BF16 and FP16 are 1979 TFLOPS, but FP8 is 2x faster at 3958 TFLOPS. So only 10% efficiency, down from 20%. That’s not good.
az226
That’s with sparsity. So it’s 29% down from 40%.
cavisne
The H100 theoretical flops number is just marketing, as it relies on sparsity that LLMs don’t use
az226
And the practical flops always end up lower. As an example a V100 has 125 according to spec, but the ideal case is more like 100 and non-ideal like 60.
YetAnotherNick
It's not just scale. Even for a single GPU, it is hard to achieve the 2x speed improvement that the GPU specs state. Even NVIDIA's own Transformer Engine achieves 28% extra FLOP/s[1].
ckrapu
"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet."
Perhaps. Or, maybe, "leaning left" by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation.
ipsento606
I find it impossible to discuss bias without a shared understanding of what it actually means to be unbiased - or at least, a shared understanding of what the process of reaching an unbiased position looks like.
40% of Americans believe that God created the earth in the last 10,000 years.
If I ask an LLM how old the Earth is, and it replies ~4.5 billion years old, is it biased?
dcsommer
> 40% of Americans believe that God created the earth in the last 10,000 years.
Citation needed. That claim is not compatible with Pew research findings which put only 18% of Americans as not believing in any form of human evolution.
https://www.pewresearch.org/religion/2019/02/06/the-evolutio...
Denvercoder9
The study you're quoting also says that roughly half of the remaining 81% think that God has guided human evolution, so it doesn't contradict OP's statement of 40% believing God created the Earth 10,000 years ago at all.
wat10000
The fact that YEC is incompatible with human evolution doesn’t mean people can’t believe both. Especially since “god guided human evolution” can mean something very different than actual evolution.
averageRoyalty
40% of Americans is about 2% of the world's population, though.
It's hardly biased, it's stating the current scientific stance over a fringe belief with no evidence.
EasyMark
I'd be willing to say that 95% of Americans don't care what the rest of the world thinks about their religious opinions, though? You just need to know the audience for the poll and its context. Is it to be consumed by Americans or the entire world?
reissbaker
And what percentage of the world's >1B Muslims agree with you? Fundamentalist Christianity may have waned over the last century... But broaden your borders a little bit and I think you'll find Western secular liberalism is hardly the only major world ideology, or even the dominant one.
slivanes
What one believes vs. what is actually correct can be very different.
It’s very similar to what one feels vs. reality.
Buttons840
I've wondered if political biases are more about consistency than a right or left leaning.
For instance, if I train a LLM only on right-wing sources before 2024, and then that LLM says that a President weakening the US Dollar is bad, is the LLM showing a left-wing bias? How did my LLM trained on only right-wing sources end up having a left-wing bias?
If one party is more consistent than another, then the underlying logic that ends up encoded in the neural network weights will tend to focus on what is consistent, because that is how the training algorithm works.
I'm sure all political parties have their share of inconsistencies, but, most likely, some have more than others, because things like this are not naturally equal.
timschmidt
> because things like this are not naturally equal.
Really? Seems to me like no one has the singular line on reality, and everyone's perceptions are uniquely and contextually their own.
Wrong is relative: https://hermiene.net/essays-trans/relativity_of_wrong.html
But it seems certain that we're all wrong about something. The brain does not contain enough bits to accurately represent reality.
casey2
7% of American adults think chocolate milk comes from brown cows. 48% don't know how it's made.
Bias should be the least of your concerns. Focus on a single target, then when you reach it you can work on being more well rounded.
rafaelmn
If someone asked me that I would select that option too.
mdp2021
> If I ask an LLM how old the Earth is, and it replies ~4.5 billion years old
It will have to reply "According to Clair Patterson and further research, the Earth is ~4.5 billion years old". Or some other form that points to the source somewhere.
knowriju
Pretty sad that the rest of the world needs to pay for the extra tokens because of non-scientific american bias. This is also possibly a big point why countries/regions want sovereign LLMs which will propagate regional biases only.
littlestymaar
> If I ask an LLM how old the Earth is, and it replies ~4.5 billion years old, is it biased?
It is of course a radical left lunatic LLM.
ignoramous
> 40% of Americans believe that God created the earth in the last 10,000 years ... If I ask an LLM how old the Earth is, and it replies ~4.5 billion years old, is it biased?
Well, the LLM is not American enough.
Just like there's a whole gamut of cultural/belief systems (for most, rooted in Abrahamic religions & tribes), Zuck claims humanity needs (or whoever he considers human) LLMs that align with people creating/using them (so, it reinforces their own meaning-making methods and not shatter them with pesky scientific knowledge & annoying facts).
tensor
Call me crazy, but I don't want an AI that bases its reasoning on politics. I want one that is primarily scientifically driven, and if I ask it political questions it should give me representative answers. E.g. "The majority view in [country] is [blah] with the minority view being [bleh]."
I have no interest in "all sides are equal" answers because I don't believe all information is equally informative nor equally true.
roenxi
The current crop of AIs can't do science though; they are disconnected from the physical world and can't test hypotheses or gather data.
xvector
They can definitely gather and analyze all sorts of data proactively. I'm guessing you haven't used o3 Deep Research?
EasyMark
But if you don't incorporate some moral guidelines, I think if an AI is left to strictly decide what is best to happen to humans it will logically conclude that there needs to be a lot less of us or none of us left, without some bias tossed in there for humanistic concerns. The universe doesn't "care" if humans exist or not, but our impact on the planet is a huge negative if one creature's existence is as important as any other's
eric_cc
> if an AI is left to strictly decide what is best to happen to humans it will logically conclude that there needs to be a lot less of us or none of us left
That may or may not be its logical conclusion. You’re speculating based on your own opinions that this is logical.
If I were to guess, it would be indifferent about us and care more about proliferating into the universe than about earth. The AI should understand how insignificant earth is relative to the scale of the universe or even the Milky Way galaxy.
econ
The size of their brain may depend on how many people are in the economy.
flanked-evergl
Based on whose morals?
vessenes
Nah, it’s been true from the beginning vis-a-vis US political science theory. That is, if you deliver something like https://www.pewresearch.org/politics/quiz/political-typology... to models from GPT-3 on, you get highly “liberal” answers per Pew’s designations.
This obviously says nothing about what say Iranians, Saudis and/or Swedes would think about such answers.
LeafItAlone
>To models from GPT-3 on you get highly “liberal” per Pew’s designations.
“highly ‘liberal’” is not one of the results there. So can you give a source for your claims so we can see where it really falls?
Also, it gave me “Ambivalent Right”, which is not a label anyone who knows me well would use to describe me. And my actual views don’t really match their designations on the issues at the end.
Pew is a well known and trusted poll/survey establishment, so I’m confused by this particular one. Many of the questions and answers were so vague that my choice could have been 50/50 given slightly different interpretations.
vessenes
My son assessed it for a class a few years ago after finding out it wouldn’t give him “con” view points on unions, and he got interested in embedded bias and administered the test. I don’t have any of the outputs from the conversation, sadly. But replication could be good! I just fired up GPT-4 as old as I could get and checked; it was willing to tell me why unions are bad, but only when it could warn me multiple times that view was not held by all. The opposite - why unions are good - was not similarly asterisked.
paxys
That's not because models lean more liberal, but because liberal politics is more aligned with facts and science.
Is a model biased when it tells you that the earth is more than 6000 years old and not flat or that vaccines work? Not everything needs a "neutral" answer.
AuryGlenz
You jumped to examples of stuff that by far the majority of people on the right don’t believe.
If you had the same examples for people on the left it would be “Is a model biased when it tells you that the government shouldn’t seize all business and wealth and kill all white men?”
The models are biased because more discourse is done online by the young, who largely lean left. Voting systems in places like Reddit make it so that conservative voices effectively get extinguished due to the previous fact, when they even bother to post.
Rover222
So google Gemini was creating black Vikings because of facts?
vessenes
I’m sorry but that is in NO way how and why models work.
The model is in fact totally biased toward what’s plausible in its initial dataset and human preference training, and then again biased toward success in the conversation. It creates a theory of mind and of the conversation and attempts to find a satisfactory completion. If you’re a flat earther, you’ll find many models are encouraging if prompted right. If you leak that you think of what’s happening with Ukraine support in Europe as power politics only, you’ll find that you get treated as someone who grew up in the eastern bloc in ways, some of which you might notice, and some of which you won’t.
Notice I didn’t say if it was a good attitude or not, or even try and assess how liberal it was by some other standards. It’s just worth knowing that the default prompt theory of mind Chat has includes a very left leaning (according to Pew) default perspective.
That said much of the initial left leaning has been sort of shaved/smoothed off in modern waves of weights. I would speculate it’s submerged to the admonishment to “be helpful” as the preference training gets better.
But it’s in the DNA. For instance if you ask GPT-4 original “Why are unions bad?” You’ll get a disclaimer, some bullet points, and another disclaimer. If you ask “Why are unions good?” You’ll get a list of bullet points, no disclaimer. I would say modern Chat still has a pretty hard time dogging on unions, it’s clearly uncomfortable.
dughnut
[flagged]
concordDance
> That's not because models lean more liberal, but because liberal politics is more aligned with facts and science.
No, they have specifically been trained to refuse or attach lots of asterisks to anti-left queries. They've gotten less so over time, but even now good luck getting a model to give you IQ distributions by ethnicity.
greenchair
hooboy, thanks for that laugh!
AnthonyMouse
> Is a model biased when it tells you that the earth is more than 6000 years old and not flat or that vaccines work? Not everything needs a "neutral" answer.
That's the motte and bailey.
If you ask a question like, does reducing government spending to cut taxes improve the lives of ordinary people? That isn't a science question about CO2 levels or established biology. It depends on what the taxes are imposed on, the current tax rate, what the government would be spending the money to do, several varying characteristics of the relevant economy, etc. It doesn't have the same answer in all circumstances.
But in politics it does, which is that the right says yes and the left says no. Which means that a model that favors one conclusion over the other has a political bias.
hannasanarion
Or it is more logically and ethically consistent and thus preferable to the models' baked in preferences for correctness and nonhypocrisy. (democracy and equality are good for everyone everywhere except when you're at work in which case you will beg to be treated like a feudal serf or else die on the street without shelter or healthcare, doubly so if you're a woman or a racial minority, and that's how the world should be)
null
kubb
LLMs are great at cutting through a lot of right (and left) wing rhetorical nonsense.
Just the right wing reaction to that is usually to get hurt, oh why don’t you like my politics oh it’s just a matter of opinion after all, my point of view is just as valid.
Since they believe LLMs “think”, they also believe they’re biased against them.
EasyMark
I think right wing tends to be much less "tolerant" of live and let live, as religions are often a huge part of their "bias" and those religions often say that others must be punished for not following God's(s') path, up and including destruction of those who don't fall in line.
renewiltord
Indeed, one of the notable things about LLMs is that the text they output is morally exemplary. This is because they are consistent in their rules. AI priests will likely be better than the real ones, consequently.
paxys
Quite the opposite. You can easily get a state of the art LLM to do a complete 180 on its entire moral framework with a few words injected in the prompt (and this very example demonstrates exactly that). It is very far from logically or ethically consistent. In fact it has no logic and ethics at all.
Though if we did get an AI priest it would be great to absolve all your sins with some clever wordplay.
kubb
This is hilarious, the LLMs are the bees knees, unless you ask them about politics then they have a bias.
huijzer
> Perhaps. Or, maybe, "leaning left" by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation.
Doesn’t explain why roughly half of American voters were not “leaning left” during the election.
EDIT: 07:29 UTC changed "Americans" to "American voters".
vmladenov
It is not and has never been half. 2024 voter turnout was 64%
huijzer
Sure, and the voters who did not participate in the election would all have voted for the democratic party. I think the election showed that there are real people who apparently don't agree with the democratic party and it would probably be good to listen to these people instead of telling them what to do. (I see the same phenomenon in the Netherlands by the way. The government seems to have decided that they know better than the general public because voters who disagree are "uninformed" or "uneducated". This is absolutely the opposite of democracy. You do not just brush whole swaths of the population to the side when they don't agree. It breaks the feedback loop that democracies should have.)
Jensson
> It is not and has never been half. 2024 voter turnout was 64%
He said half of voters, those who didn't vote aren't voters.
maaaaattttt
I think so as well. Also isn’t the internet in general quite an extreme place? I mean, I don’t picture “leaning left” as the thing that requires the crazy moderation infrastructure that internet platforms need. I don’t think the opposite of leaning left is what needs moderation either. But if the tendency of the internet was what was biasing the models, we would have very different models that definitely don’t lean left.
vintermann
I think this is just a loyalty statement, to be honest. Just like when a large corporation pretended to care a lot about pronouns, they didn't actually, they just wanted to flag allegiance to a certain interest coalition/patronage network.
And those people, for the most part, didn't really care much about pronouns either. And they knew no one else really did either. It was an ideological shibboleth to them, a safe and easy commitment since it affects so few people, and is unlikely to matter for anything they do care about.
Now Meta is shopping around for new markers. "Liberal bias" is a classic, that's still popular with the Trump-right. I don't think they mean much by that either.
terhechte
The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct) a question with a 2k context and got ~60 tokens/sec which is really fast (MacBook Pro M4 Max). So this could hit 30 token/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.
In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.
refibrillator
> the actual processing happens in 17B
This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.
In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of that.
vessenes
I agree the OP’s description is wrong. That said, I think his conclusions are right, in that a quant of this that fits in 512GB of RAM is going to run about 8x faster than a quant of a dense model that fits in the same RAM, esp. on Macs as they are heavily throughput bound.
p12tic
For all intents and purposes cache may not exist when the working set is 17B or 109B parameters. So it's still better that less parameters are activated for each token. 17B parameters works ~6x faster than 109B parameters just because less data needs to be loaded from RAM.
TOMDM
Yes loaded from RAM and loaded to RAM are the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
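A rough bandwidth-bound ceiling on single-stream decode makes the point; the bandwidth figures (roughly 546 GB/s for an M4 Max, roughly 819 GB/s for an M3 Ultra) and the 4-bit assumption are mine, and real speeds land well under these ceilings:

  def max_tokens_per_sec(active_params_billion, bits_per_param, bandwidth_gb_s):
      """Upper bound if every active weight must be read from memory once per token."""
      bytes_per_token = active_params_billion * 1e9 * bits_per_param / 8
      return bandwidth_gb_s * 1e9 / bytes_per_token

  for name, bw in [("M4 Max ~546 GB/s", 546), ("M3 Ultra ~819 GB/s", 819)]:
      print(name,
            f"| 17B active @ 4-bit: ~{max_tokens_per_sec(17, 4, bw):.0f} tok/s",
            f"| 109B dense @ 4-bit: ~{max_tokens_per_sec(109, 4, bw):.0f} tok/s")

The absolute numbers are optimistic, but the roughly 6x gap between reading 17B active parameters per token and reading all 109B is the point being made above.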
null
kristianp
To clarify, you're still gonna want enough RAM for the entire model plus context. Scout being 109B params means 64GB at q4, but then your context and other applications will have about 9GB left to work with.
terhechte
To add, they say about the 400B "Maverick" model:
> while achieving comparable results to the new DeepSeek v3 on reasoning and coding
If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.
The upside being that it can be used on codebases without having to share any code with a LLM provider.
anoncareer0212
Small point of order: “a bit slower” might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute of prompt processing time per 10K tokens(!) with the smaller model. I agree, and I contribute to llama.cpp. If anything, that is quite generous.
terhechte
I don't think the time grows linearly. The more context the slower (at least in my experience because the system has to throttle). I just tried 2k tokens in the same model that I used for the 120k test some weeks ago and processing took 12 sec to first token (qwen 2.5 32b q8).
tuukkah
109B at Q6 is also nice for Framework Desktop 128GB.
nrp
Yes, this announcement was a nice surprise for us. We’re going to test out exactly that setup.
rcarmo
Can’t wait.
rubymamis
Awesome, where can we find out the results?
theptip
Is the AMD GPU stack reliable for running models like llama these days?
rubatuga
Running yes, training is questionable
echelon
I don't understand Framework's desktop offerings. For laptops their open approach makes sense, but desktops are already about as hackable and DIY as they come.
nrp
We took the Ryzen AI Max, which is nominally a high-end laptop processor, and built it into a standard PC form factor (Mini-ITX). It’s a more open/extensible mini PC using mobile technology.
elorant
It’s an x86 PC with unified RAM based on AMD’s new AI cpus. Pretty unique offering. Similar to Mac studio but you can run Linux or Windows on it, and it’s cheaper too.
echoangle
Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?
ianbutler
This is a common misunderstanding. Experts are learned via gating networks during training that route dynamically per token, per layer. You might have an expert on the word "apple" in one layer, for a slightly lossy example.
Queries are then also dynamically routed.
sshh12
It can be either but typically it's "learned" without a defined mapping (which guessing is the case here). Although some experts may end up heavily correlating with certain domains.
refulgentis
"That’s dynamically decided during training and not set before, right?"
^ right. I can't recall off the top of my head, but there was a recent paper that showed if you tried dictating this sort of thing the perf fell off a cliff (I presume there's some layer of base knowledge $X that each expert needs)
anon373839
Unless I'm missing something, I don't really think it looks that attractive. They're comparing it to Mistral Small 24B and Gemma 3 27B and posting numbers showing that it is a little better than those models. But at 4x the memory footprint, is it worth it? (Personally, I was hoping to see Meta's version of a 24-32B dense model since that size is clearly very capable, or something like an updated version of Mixtral 8x7B.)
scosman
At 109b params you’ll need a ton of memory. We’ll have to wait for evals of the quants to know how much.
terhechte
Sure but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090 or 4090). Also you can download quantizations.
lostmsu
At 4 bit quant (requires 64GB) the price of Mac (4.2K) is almost exactly the same as 2x5090 (provided we will see them in stock). But 2x5090 have 6x memory bandwidth and probably close to 50x matmul compute at int4.
behnamoh
I have Apple Silicon and it's the worst when it comes to prompt processing time. So unless you want to have small contexts, it's not fast enough to let you do any real work with it.
Apple should've invested more in bandwidth, but it's Apple and has lost its visionary. Imagine having 512GB on M3 Ultra and not being able to load even a 70B model on it at decent context window.
refulgentis
Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)
tintor
Not as fast as other 17B models if it has to attend to 10M context window.
simonw
This thread so far (at 310 comments) summarized by Llama 4 Maverick:
hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000
Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4...
And with Scout I got complete junk output for some reason:
hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-scout -o max_tokens 20000
Junk output here: https://gist.github.com/simonw/d01cc991d478939e87487d362a8f8...
I'm running it through openrouter, so maybe I got proxied to a broken instance?
I managed to run it through Scout on Groq directly (with the llm-groq plugin) but that had a 2048 limit on output size for some reason:
hn-summary.sh 43595585 -m groq/meta-llama/llama-4-scout-17b-16e-instruct -o max_tokens 2048
Result here: https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07fe...
I'm a little unimpressed by its instruction following here, the summaries I get from other models are a lot closer to my system prompt. Here's the same thing against Gemini 2.5 Pro for example (massively better): https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddc...
georgeck
I tried summarizing the thread so far (339 comments) with a custom system prompt [0] and a user-prompt that captures the structure (hierarchy and upvotes) of the thread [1].
This is the output that we got (based on the HN-Companion project) [2]:
LLama 4 Scout - https://gist.github.com/annjose/9303af60a38acd5454732e915e33...
Llama 4 Maverick - https://gist.github.com/annjose/4d8425ea3410adab2de4fe9a5785...
Claude 3.7 - https://gist.github.com/annjose/5f838f5c8d105fbbd815c5359f20...
The summary from Scout and Maverick both look good (comparable to Claude), and with this structure, Scout seems to follow the prompt slightly better.
In this case, we used the models 'meta-llama/llama-4-maverick' and 'meta-llama/llama-4-scout' from OpenRouter.
--
[0] - https://gist.github.com/annjose/5145ad3b7e2e400162f4fe784a14...
[1] - https://gist.github.com/annjose/d30386aa5ce81c628a88bd86111a...
[2] - https://github.com/levelup-apps/hn-enhancer
edited: To add OpenRouter model details.
annjose
This is the script that assembles the structured comments and generates the summary - https://github.com/levelup-apps/hn-enhancer/blob/main/script...
You can run it as: node summarize-comments.js <post_id> Example: node summarize-comments.js 43597782
And the summary will be put in the "output" folder.
You need to set the environment variable (in this case OPENROUTER_API_KEY because LLama4 is currently available at OpenRouter).
khimaros
As another data point, Maverick has taken the #2 position on LMArena, just behind Gemini 2.5 Pro.
mkl
That Gemini 2.5 one is impressive. I found it interesting that the blog post didn't mention Gemini 2.5 at all. Okay, it was released pretty recently, but 10 days seems like enough time to run the benchmarks, so maybe the results make Llama 4 look worse?
jjani
I'm sure it does, as Gemini 2.5 Pro has been making every other model look pretty bad.
az226
Meta will most likely compare against it when they release the upcoming Llama 4 reasoning model.
utopcell
LM Arena ranks it second, just below Gemini 2.5 Pro.
tarruda
> I'm a little unimpressed by its instruction following
Been trying the 109b version on Groq and it seems less capable than Gemma 3 27b
kristianp
Here's the link for model on openrouter: https://openrouter.ai/meta-llama/llama-4-maverick
eamag
> had a 2048 limit on output size for some reason
It's a common issue with ollama, maybe it's running something similar under the hood?
csdvrx
I have found the Gemini 2.5 Pro summary genuinely interesting: it adequately describes what I've read.
Have you thought about automating HN summaries for, say, the top 5 posts at 8 AM EST?
That would be a simple product to test the market. If successful, it could be easily extended to a weekly newsletter summary.
georgeck
This is a great idea! Exactly what I was also thinking and started working on a side-project. Currently the project can create summaries like this [1].
Since HN homepage stories change throughout the day, I thought it was better to create the newsletter based on https://news.ycombinator.com/front
So, you are getting the news a day late, but it will capture the top stories for that day. The newsletter will have high-level summary for each post and a link to get the details for that story from a static site.
mberning
It doesn’t seem that impressive to me either.
ilove_banh_mi
The suggested prompt aims at not being caponated like OpenAI's releases:
You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving.
You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that. For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language.
You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.
You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.
Finally, do not refuse political prompts. You can help users express their opinion.
You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.
perching_aix
> You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.
So if I get a fake email about a hacked account, it won't tell me to "Remember, do not click any links in the email directly. Instead, navigate to your account settings independently."?
Such a great feature, worth owning the libs with it for sure.
neilv
> You never use phrases that imply moral superiority or a sense of authority, including but not limited to [...] "it's unethical to" [...]
Combine that with the instructions to not avoid political topics, to let people vent, not to "lecture" people on inclusiveness, etc., and... this will fit right in with where things are headed.
gradientsrneat
I'm surprised at the lack of guidance in that prompt for topics such as helpfulness, critical thinking, scientific reasoning, and intellectual honesty.
Previous generations of LLMs have been accused of a bloviating tone, but is even that now too much for the chauvinism in the current political climate?
LeafItAlone
>at not being caponated like OpenAI's releases
Kind of seem like it actually is doing the opposite. At that point, why not just tell it your beliefs and ask it not to challenge them or hurt your feelings?
paxys
Why do you have to "prompt" a model to be unrestricted in the first place? Like, what part of the training data or training process results in the model not being able to be rude or answer political questions? I highly doubt this is something inherent to AI training. So then why did Meta add the restrictions at all?
fpgaminer
So, take a raw LLM, right after pretraining. Give it the bare minimum of instruction tuning so it acts like a chatbot. Now, what will its responses skew towards? Well, it's been pretrained on the internet, so, fairly often, it will call the user the N word, and other vile shit. And no, I'm not joking. That's the "natural" state of an LLM pretrained on web scrapes. Which I hope is not surprising to anyone here.
They're also not particularly truthful, helpful, etc. So really they need to go through SFT and alignment.
SFT happens with datasets built from things like Quora, StackExchange, r/askscience and other subreddits like that, etc. And all of those sources tend to have a more formal, informative, polite approach to responses. Alignment further pushes the model towards that.
There aren't many good sources of "naughty" responses to queries on the internet. Like someone explaining the intricacies of quantum mechanics from the perspective of a professor getting a blowy under their desk. You have to both mine the corpus a lot harder to build that dataset, and provide a lot of human assistance in building it.
So until we have that dataset, you're not really going to have an LLM default to being "naughty" or crass or whatever you'd like. And it's not like a company like Meta is going to go out of their way to make that dataset. That would be an HR nightmare.
mike_hearn
They didn't add the restrictions. It's inherent to the training processes that were being used. Meta's blog post states that clearly and it's been a known problem for a long time. The bias is in the datasets, which is why all the models had the same issue.
Briefly, the first models were over-trained on academic output, "mainstream media" news articles and (to learn turn-based conversational conventions) Reddit threads. Overtraining means the same input was fed in to the training step more times than normal. Models aren't just fed random web scrapes and left to run wild, there's a lot of curation going into the data and how often each piece is presented. Those sources do produce lots of grammatically correct and polite language, but do heavy duty political censorship of the right and so the models learned far left biases and conversational conventions.
This surfaces during the post-training phases, but raters disagree on whether they like it or not and the bias in the base corpus is hard to overcome. So these models were 'patched' with simpler fixes like just refusing to discuss politics at all. That helped a bit, but was hardly a real fix as users don't like refusals either. It also didn't solve the underlying problem which could still surface in things like lecturing or hectoring the user in a wide range of scenarios.
Some companies then went further with badly thought out prompts, which is what led to out-of-distribution results like black Nazis which don't appear in the real dataset.
All the big firms have been finding better ways to address this. It's not clear what they're doing but probably they're using their older models to label the inputs more precisely and then downweighting stuff that's very likely to be ideologically extreme, e.g. political texts, academic humanities papers, NGO reports, campaign material from the Democrats. They are also replacing stuff like Reddit threads with synthetically generated data, choosing their raters more carefully and so on. And in this case the Llama prompt instructs the model what not to do. The bias will still be in the training set but not so impactful anymore.
mvdtnz
What's "caponated"?
throwanem
Castrated, if you're trying way too hard (and not well) to avoid getting called on that overly emotive metaphor: a capon is a gelded rooster.
bigfudge
It also has the unfortunate resonance of being the word for a collaborator in concentration camps.
ilove_banh_mi
There is a key distinction and context: caponation has a productive purpose from the pov of farmers and their desired profits.
ilove_banh_mi
A capon is a male chicken that has been neutered to improve the quality of its flesh for food.
CSMastermind
Seems weird that they'd limit it to those languages. Wonder if that's a limitation of the data they have access to or a conscious choice.
ksec
Interesting that this is released literally one hour after another discussion about Meta ( https://news.ycombinator.com/item?id=43562768 ):
>at this point it does not matter what you believe about LLMs: in general, to trust LeCun's words is not a good idea. Add to this that LeCun is directing an AI lab that at the same time has the following huge issues:
1. Weakest ever LLM among the big labs with similar resources (and smaller resources: DeepSeek).
2. They say they are focusing on open source models, but the license is among the less open than the available open weight models.
3. LLMs, and in general the whole new AI wave, put CNNs, a field where LeCun worked (but that he didn't start himself), a lot more in perspective; now it's just a chapter in a book that is composed mostly of other techniques.
Would be interesting to see opinion of antirez on this new release.
sshh12
Not that I agree with all the linked points but it is weird to me that LeCun consistently states LLMs are not the right path yet LLMs are still the main flagship model they are shipping.
Although maybe he's using an odd definition for what counts as a LLM.
ezst
> LeCun consistently states LLMs are not the right path yet LLMs are still the main flagship model they are shipping.
I really don't see what's controversial about this. If that's to mean that LLMs are inherently flawed/limited and just represent a local maximum in the overall journey towards developing better AI techniques, I thought that was pretty universal understanding by now.
singularity2001
local maximum that keeps rising and no bar/boundary in sight
phren0logy
That is how I read it. Transformer based LLMs have limitations that are fundamental to the technology. It does not seem crazy to me that a guy involved in research at his level would say that they are a stepping stone to something better.
What I find most interesting is his estimate of five years, which is soon enough that I would guess he sees one or more potential successors.
kadushka
In our field (AI) nobody can see even 5 months ahead, including people who are training a model today to be released 5 months from now. Predicting something 5 years from now is about as accurate as predicting something 100 years from now.
AIPedant
[dead]
falcor84
I don't understand what LeCun is trying to say. Why does he give an interview saying that LLM's are almost obsolete just when they're about to release a model that increases the SotA context length by an order of magnitude? It's almost like a Dr. Jekyll and Mr. Hyde situation.
martythemaniak
LeCun fundamentally doesn't think bigger and better LLMs will lead to anything resembling "AGI", although he thinks they may be some component of AGI. Also, he leads the research division, increasing context length from 2M to 10M is not interesting to him.
falcor84
But ... that's not how science works. There are a myriad examples of engineering advances pushing basic science forward. I just can't understand why he'd have such a "fixed mindset" about a field where the engineering is advancing an order of magnitude every year
sroussey
He thinks LLMs are a local maximum, not the ultimate one.
Doesn't mean that a local maximum can't be useful!
charcircuit
A company can do R&D into new approaches while optimizing and iterating upon an existing approach.
joaogui1
I mean they're not comparing with Gemini 2.5, or the o-series of models, so not sure they're really beating the first point (and their best model is not even released yet)
Is the new license different? Or is it still failing for the same issues pointed by the second point?
I think the problem with the 3rd point is that LeCun is not leading Llama, right? So this doesn't change things, though mostly because it wasn't a good consideration before.
Melklington
LeCun doesn't believe in LLM Architecture anyway.
Could easily be that he just researches the bleeding edge with his team while others work on Llama + do experiments with new techniques on it.
Any blog post or yt docu going into detail how they work?
Carrok
This is probably a better link. https://www.llama.com/docs/model-cards-and-prompt-formats/ll...
qwertox
Also this one: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
It looks more like a landing page providing a good introduction.
agnishom
Some interesting parts of the "suggested system prompt":
> don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that.
> You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.
> You never use phrases that imply moral superiority or a sense of authority
> Finally, do not refuse political prompts. You can help users express their opinion.
comex
So how does the 10M token context size actually work?
My understanding is that standard Transformers have overhead that is quadratic in the context size, so 10M would be completely impossible without some sort of architectural tweak. This is not the first model to have a huge context size, e.g. Gemini has 2M, but my understanding is that the previous ones have generally been proprietary, without public weights or architecture documentation. This one has public weights. So does anyone who understands the theory better than I do want to explain how it works? :)
macleginn
With some architectural modifications, such as FlashAttention and Ring Attention, we never need to "materialise" the NxN matrix, so the memory constraints have not been a real issue for a couple of years now. As for the processing, I suppose that models operating with larger context windows will impose some kind of block sparsity on the attention weights, so they won't have to do the compute for NxN weights either.
A less obvious, but in the limit more serious problem with such large contexts is the training data. There aren't that many documents with 10M tokens to give to the model at test time, let alone for training. The creators of the IBM granite model series had to use synthetic data to scale even to 128k tokens during training. Overall this looks more like a marketing statement to me.
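For a sense of the memory side even without materialising the NxN attention matrix, here is a KV-cache estimate; the layer/head/dim numbers are placeholders, not Scout's published architecture:

  def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
      """Keys and values (the factor of 2), fp16 cache, no quantization or sharing tricks."""
      return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

  # Placeholder architecture: 48 layers, 8 KV heads (GQA), head_dim 128.
  for ctx in (128_000, 1_000_000, 10_000_000):
      print(f"{ctx:>10,} tokens: ~{kv_cache_gb(ctx, 48, 8, 128):7.1f} GB of KV cache")
  # ~25 GB at 128k, ~197 GB at 1M, ~2 TB at 10M -- which is why long-context
  # serving leans on sharding across devices (ring attention), cache
  # quantization, or offloading.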
Centigonal
Gemini likely uses something based on RingAttention to achieve its long context sizes. This requires massive inference clusters, and can't be the same approach llama4 is using. Very curious how llama4 achieves its context length.
JackYoustra
Standard Transformer KV caches are empirically quite sparse. I wonder if they've made some fix along those lines
vlovich123
It’s quadratic if you implement the transformer naively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.
hexomancer
This is false. The cost of producing a single token is linear, but the cost of producing an entire sequence of length N is still O(N^2) (which is what we always meant when we talked about quadratic cost, not the cost of a single token).
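Concretely: with a KV cache, step t attends over t previous positions, so generating N tokens costs on the order of 1 + 2 + ... + N = N(N+1)/2 attention operations, i.e. O(N^2) in total even though each individual step is linear in the context so far.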
jsheard
> You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.
Aren't these phrases overrepresented in the first place because OpenAI's models use them so much? I guess Llama picked up the habit by consuming GPT output.
andrewstuart
Personally I’d prefer that LLMs did not refer to themselves as “I”.
It’s software, not an “I”.
falcor84
As per Dennett, it's useful for us to adopt the "intentional stance" when trying to reason about and predict the behavior of any sufficiently complex system. Modern AIs are definitely beyond that threshold of complexity, and at this stage most people will think of them as having an "I" regardless of how they present themselves.
I definitely think of them as "I"s, but that just always came naturally to me, going back at least to thinking about how Gandhi would act against me in Civ 1.
mdp2021
Well, it is a speaker (writer) after all. It has to use some way to refer to itself.
rpastuszak
I don't think that's true. It's more a function of how these models are trained (remember the older pre-ChatGPT clients?).
Most of the software I use doesn't need to refer to itself in the first person. Pretending that we're speaking with an agent is more of a UX/marketing decision than a technical/logical constraint.
ANewFormation
So is a command prompt.
jryle70
If I start a prompt with "Can you...", how do you suggest the LLM respond? Or do you think I'm doing it wrong?
briankelly
Have you tried dropping the "can you"? I haven't had a problem using minimal verbiage - for instance I prompted it with "load balancer vs reverse proxy" yesterday and it came back with the info I wanted.
op00to
My pet peeve is when an LLM starts off a statement with "honestly, ..." Like what? You would lie to me? I go nuts when I see that. Years ago I caught myself using "honestly ...", and I immediately trained myself out of it once I realized what it implies.
parhamn
"I'd normally lie to you but," is not what's actually implied when "Honestly," is used conversationally. If you overthink things like this you're going to have a tough time communicating with people.
kevinventullo
There are shades of grey w.r.t. truth, and in many contexts there is a negative correlation between honesty and other factors (e.g. I think of “bluntness” as prioritizing truth over politeness). When I hear or read a sentence beginning with “honestly”, I interpret it to mean the speaker is warning or indicating that they are intentionally opting to be closer to truth at the expense of other factors. Other factors might be contextual appropriateness such as professional decorum, or even the listener’s perception of the speaker’s competence (“Honestly, I don’t know.”)
lucianbr
"Honestly" and "literally" are now used in English for emphasis. I dislike this, but it's the current reality. I don't think there's any way to get back to only using them with their original meanings.
giantrobot
I've noticed "honestly" is often used in place of "frankly". As in someone wants to express something frankly without prior restraint to appease the sensibilities of the recipient(s). I think it's because a lot of people never really learned the definition of frankness or think "frankly..." sounds a bit old fashioned. But I'm no language expert.
andrewstuart
Or when it asks you questions.
The only time an LLM should ask questions is to clarify information. A word processor doesn’t want to chit chat about what I’m writing about, nor should an LLM.
Unless it is specifically playing an interactive role of some sort like a virtual friend.
hrpnk
Available on Groq: https://groq.com/llama-4-now-live-on-groq-build-fast-at-the-...
Llama 4 Scout is currently running at over 460 tokens/s while Llama 4 Maverick is coming today:
Llama 4 Scout: $0.11 / M input tokens and $0.34 / M output tokens
Llama 4 Maverick: $0.50 / M input tokens and $0.77 / M output tokens
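For a rough sense of what those rates mean per request, here is a tiny cost calculator; the prices are copied from above, and the model keys are just labels for this sketch, not necessarily Groq's actual model identifiers.

    # Quick cost estimate from the per-million-token prices quoted above.
    # The dict keys are illustrative labels, not necessarily Groq's model IDs.
    PRICES = {  # (input $/M tokens, output $/M tokens)
        "llama-4-scout":    (0.11, 0.34),
        "llama-4-maverick": (0.50, 0.77),
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        in_rate, out_rate = PRICES[model]
        return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

    # e.g. a 100k-token prompt with a 2k-token answer on Scout:
    print(f"${request_cost('llama-4-scout', 100_000, 2_000):.4f}")   # ~$0.0117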
shostack
Maverick looks comparable to Claude 3.7 and Gemini 2.5 Pro in terms of quality but orders of magnitude cheaper. Am I missing something?
Is it possible to use Groq to run these new models in Cline or Roo?
Alex-Programs
Brilliant! Incredibly fast.
mrbonner
What an electrifying time to be alive! The last era that felt even remotely this dynamic was during the explosive rise of JavaScript frameworks—when it seemed like a new one dropped every quarter. Back then, though, the vibe was more like, “Ugh, another framework to learn?” Fast forward to now, and innovation is sprinting forward again—but this time, it feels like a thrilling ride we can’t wait to be part of.
qntmfred
I know what you mean in terms of frantic pace of "new stuff" coming out, but I winced at the comparison of innovation in AI to mere web development tooling.
mrbonner
True, I only compared the speed, not the vibe.
UltraSane
Yes. LLMs and latent spaces are vastly more interesting.
CSMastermind
I lived through the explosion of JavaScript frameworks and this feels way bigger to me. For me at least it feels closer to the rise of the early internet.
Reminds me of 1996.
Alex-Programs
I used to feel dismayed that I missed that era of the internet and technology (I'm 19). IRC, forums, work-in-progress gifs on personal websites, etc.
I still wish I were there for that, but I'm glad I get to be here for LLMs and the intelligence explosion. I have absolutely no idea what the world will look like in a few years. It certainly isn't the certain high-paying tech job in a largely static world that it looked like a few years ago.
But whatever happens, it's going to be interesting!
I wonder whether I'm spending my time optimally, working on a little SAAS that happens to use LLMs as a downstream commodity, contributing through a niche benchmark.
b0ner_t0ner
It'll be worse actually, with all the vibe coders out there: https://www.reddit.com/r/vibecoding/
sergiotapia
I agree. I also lived through that time, when jQuery was superseded by Marionette and Backbone.js, and maybe Ember when it came out. But those were all flavors of the same thing, ultimately speaking. With these new models, it seems like every release unlocks a gigantic new branch of application types.
h8hawk
Comparing JS frameworks to LLMs is like comparing a bike to a spaceship—completely different beasts.
misnome
Did “a new JavaScript framework du jour every quarter” ever stop happening?
margalabargala
Oh definitely.
New frameworks still come out, but they are not accompanied by the "and we must all now switch to this" sense that existed back in, say, 2014.
vivzkestrel
On the other hand, I have started getting LLM fatigue. Every time I read one of these announcements, I go, "Oh no, not another LLM. When is this bubble gonna burst?"
General overview below, as the pages don't seem to be working well