The End of Moore's Law for AI? Gemini Flash Offers a Warning
57 comments
· July 3, 2025
simonw
sethkim
Both great points, but they more or less speak to the same root cause - customer usage patterns are becoming more of a driver for pricing than underlying technology improvements. If so, we've likely hit a "soft" floor on pricing for now. Do you not see it this way?
simonw
Even given how much prices have decreased over the past 3 years I think there's still room for them to keep going down. I expect there remain a whole lot of optimizations that have not yet been discovered, in both software and hardware.
That 80% drop in o3 was only a few weeks ago!
sethkim
No doubt prices will continue to drop! We just don't think it will be anything like the orders-of-magnitude YoY improvements we're used to seeing. Consequently, developers shouldn't expect the cost of building and scaling AI applications to be anything close to "free" in the near future as many suspect.
vfvthunter
I do not see it this way. Google is a publicly traded company responsible for creating value for their shareholders. When they became dicks about ad blockers on youtube last year or so, was it because they hit a bandwidth Moore's law? No. It was a money grab.
ChatGPT is simply what Google should've been 5-7 years ago, but Google was more interested in presenting me with ads to click on instead of helping me find what I was looking for. ChatGPT is at least 50% of my searches now. And they're losing revenue because of that.
mathiaspoint
I really hate the thinking. I do my best to disable it but don't always remember. So often it just gets into a loop second guessing itself until it hits the token limit. It's rare it figures anything out while it's thinking too but maybe that's because I'm better at writing prompts.
thomashop
I have the impression that the thinking helps even if the actual content of the thinking output is nonsense. It gives the model more cycles to work on the problem.
wat10000
That would be strange. There's no hidden memory or data channel, the "thinking" output is all the model receives afterwards. If it's all nonsense, then nonsense is all it gets. I wouldn't be completely surprised if a context with a bunch of apparent nonsense still helps somehow, LLMs are weird, but it would be odd.
impure
> In a move that at first went unnoticed
Oh, I noticed. I've also complained how Gemini 2.0 Flash is 50% more expensive than Gemini 1.5 Flash for small requests.
Also I'm sure if Google wanted to price Gemini 2.5 Flash cheaper they could. The reason they won't is because there is almost zero competition at the <10 cents per million input token area. Google's answer to the 10 cents per million input token area is 2.5 Flash Lite which they say is equivalent to 2.0 Flash at the same cost. Might be a bit cheaper if you factor in automatic context caching.
Also, the quadratic increase is valid, but it's not as simple as the article states due to caching. And if it were a big issue, Google would impose tiered pricing like they do for Gemini 2.5 Pro.
And for what it's worth I've been playing around with Gemma E4B on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and it sucks at multilingual. But other than that it seems to produce acceptable results and is way cheaper.
sharkjacobs
> If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help.
I don't have any reason to doubt the reasoning this article is doing or the conclusions it reaches, but it's important to recognize that this article is part of a sales pitch.
sethkim
Yes, we're a startup! And LLM inference is a major component of what we do - more importantly, we're working on making these models accessible as analytical processing tools, so we have a strong focus on making them cost-effective at scale.
sharkjacobs
I see your prices page lists the average cost per million tokens. Is that because you are using the formula you describe, which depends on hardware time and throughput?
> API Price ≈ (Hourly Hardware Cost / Throughput in Tokens per Hour) + Margin
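As a minimal sketch of how that formula plays out (the hardware cost, throughput, and margin numbers below are made-up placeholders, not Sutro's actual figures):

    # Sketch of: API Price ≈ (Hourly Hardware Cost / Throughput) + Margin,
    # expressed per million tokens. All numbers are illustrative assumptions.

    def implied_price_per_million_tokens(
        hourly_hardware_cost_usd: float,
        throughput_tokens_per_hour: float,
        margin_usd_per_million: float,
    ) -> float:
        cost_per_token = hourly_hardware_cost_usd / throughput_tokens_per_hour
        return cost_per_token * 1_000_000 + margin_usd_per_million

    # e.g. a $10/hour GPU node pushing 2M tokens/hour with a $0.50/M margin
    print(implied_price_per_million_tokens(10.0, 2_000_000, 0.50))  # -> 5.5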
samtheprogram
There’s absolutely nothing wrong with putting a small plug at the end of an article.
sharkjacobs
Of course not.
But the thrust of the article is that, contrary to conventional wisdom, we shouldn't expect LLMs to continue getting more efficient, and so it's worthwhile to explore other options for cost savings in inference, such as batch processing.
The conclusion they reach is one which directly serves what they're selling.
I'll repeat: I'm not disputing anything in this article. I'm really not; I'm not even trying to be coy and make allusions without directly saying anything. If I thought this was bullshit, I wouldn't be afraid to semi-anonymously post a comment saying so.
But this is advertising, just like Backblaze's hard drive reliability blog posts are advertising.
apstroll
Extremely doubtful that it boils down to quadratic scaling of attention. That whole issue is a leftover from the days of small BERT models with very few parameters.
For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops
Compute per token = 2 × (P + L × W × D)

where P = total parameters, L = number of layers, W = context (window) size, D = embedding dimension.
For Llama 8b, the window size starts dominating compute cost per token only at 61k tokens.
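As a quick back-of-the-envelope check of that crossover (the Llama 3 8B shape numbers here - roughly 8B parameters, 32 layers, 4096 embedding dimension - are approximations):

    # Per-token FLOPs from the formula above: 2 * (P + L * W * D).
    # The attention term starts to dominate once L * W * D exceeds P.

    P = 8.0e9   # total parameters (approximate)
    L = 32      # number of layers
    D = 4096    # embedding dimension

    def flops_per_token(context_len: int) -> float:
        return 2 * (P + L * context_len * D)

    crossover = P / (L * D)
    print(f"crossover ≈ {crossover:,.0f} tokens")                 # ≈ 61,000
    print(f"FLOPs/token at 8k context:   {flops_per_token(8_000):.2e}")
    print(f"FLOPs/token at 128k context: {flops_per_token(128_000):.2e}")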
cmogni1
The article does a great job of highlighting the core disconnect in the LLM API economy: linear pricing for a service with non-linear, quadratic compute costs. The traffic analogy is an excellent framing.
One addition: the O(n^2) compute cost is most acute during the one-time prefill of the input prompt. I think the real bottleneck, however, is the KV cache during the decode phase.
For each new token generated, the model must access the intermediate state of all previous tokens. This state is held in the KV Cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. The speed of generating a response is therefore more limited by memory bandwidth.
Viewed this way, Google's 2x price hike on input tokens is probably related to the KV Cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short.
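To give a rough sense of scale (the model shape below is an assumption, loosely a Llama-3-8B-style config with grouped-query attention in fp16; real deployments differ):

    # KV cache memory per request grows linearly with context length:
    # 2 tensors (K and V) per layer, one entry per token per KV head.

    N_LAYERS = 32
    N_KV_HEADS = 8
    HEAD_DIM = 128
    BYTES_PER_ELEM = 2  # fp16/bf16

    def kv_cache_bytes(seq_len: int) -> int:
        return 2 * N_LAYERS * seq_len * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

    for ctx in (8_000, 32_000, 128_000, 1_000_000):
        print(f"{ctx:>9,} tokens -> {kv_cache_bytes(ctx) / 2**30:6.1f} GiB per request")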
trhway
That obviously should and will be fixed architecturally.
>For each new token generated, the model must access the intermediate state of all previous tokens.
Not all previous tokens are equal; not all deserve the same attention, so to speak. The farther away the tokens, the more opportunity for many of them to be pruned and/or collapsed with other similarly distant and less meaningful tokens in a given context. So instead of O(n^2) it would be more like O(n log n).
I mean, you'd expect that, for example, "knowledge worker" models (vs. say "poetry" models) would possess some perturbative stability wrt. changes to/pruning of the remote previous tokens, at least for those tokens which are less meaningful in the current context.
Personally, I feel the situation is good - performance engineering work becomes somewhat valuable again, as we're reaching the N where O(n^2) forces management to throw some money at engineers instead of at the hardware :)
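To put that asymptotic difference in perspective, a purely illustrative comparison of operation counts (saying nothing about whether such pruning would preserve quality):

    import math

    # Full attention does roughly n^2 pairwise work; the pruning/collapsing
    # idea above would bring it closer to n * log(n).
    for n in (8_000, 64_000, 512_000):
        quadratic = n * n
        nlogn = n * math.log2(n)
        print(f"n={n:>7,}  n^2={quadratic:.2e}  n*log2(n)={nlogn:.2e}  "
              f"ratio={quadratic / nlogn:,.0f}x")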
sharkjacobs
> This is the first time a major provider has backtracked on the price of an established model
Arguably that was Haiku 3.5 in October 2024.
I think the same hypothesis could apply though, that you price your model expecting a certain average input size, and then adjust price up to accommodate the reality that people use that cheapest model when they want to throw as much as they can into the context.
ryao
I had the same thought about haiku 3.5. They claimed it was due to the model being more capable, which basically means that they raised the price because they could.
Then there is Poe with its pricing games. Prices at Poe have been going up over time: they started out extremely aggressive to gain market share, presumably under the assumption that LLM prices would keep falling, and that reduced pricing did not materialize.
simonw
Haiku 3.5 was a completely different model from Haiku 3, and part of a new model generation.
Gemini Flash 2.5 and Gemini 2.5 Flash Preview were presumably a whole lot more similar to each other.
mossTechnician
Is there a consumer expectation that Haiku 3.5 is completely different? Even leaving semantic versioning aside, even if the .5 symbolizes a "halfway point" between major releases, it still suggests a non-major release to me.
simonw
Consumers have no idea what Haiku is.
Engineers who work with LLM APIs are hopefully paying enough attention that they understand the difference between Claude 3, Claude 3.5 and Claude 4.
fusionadvocate
What is holding back AI is this business necessity that models must perform everything. Nobody can push for a smaller model that learns a few simple tasks and then build upon that, similar to the best known intelligent machine: the human.
If these corporations had to build a car they would make the largest possible engine, because "MORE ENGINE MORE SPEED", just like they think that bigger models mean bigger intelligence, but they forget to add steering, or even a chassis.
dehugger
I agree. I want to be able to get smaller models which are complete, contained products that we can run on-prem for our organization.
I'll take a model specialized in web scraping. Give me one trained on generating report and documentation templates (I'd commit felonies for one which could spit out a near-complete report for SSRS).
Models trained for specific helpdesk tasks ("install a printer", "grant this user access to these services with this permission level").
A model for analyzing network traffic and identifying specific patterns.
None of these things should require titanic models nearing trillions of parameters.
furyofantares
This is extremely theorycrafted but I see this as an excellent thing driving AI forward, not holding it back.
I suspect a large part of the reason we've had many decades of exponential improvements in compute is the general purpose nature of computers. It's a narrow set of technologies that are universally applicable and each time they get better/cheaper they find more demand, so we've put an exponentially increasing amount of economical force behind it to match. There needed to be "plenty of room at the bottom" in terms of physics and plenty of room at the top in terms of software eating the world, but if we'd built special purpose hardware for each application I don't think we'd have seen such incredible sustained growth.
I see neural networks and even LLMs as being potentially similar. They're general purpose, a small set of technologies that are broadly applicable and, as long as we can keep making them better/faster/cheaper, they will find more demand, and so benefit from concentrated economic investment.
fnord123
They aren't arguing against LLMs. They are arguing against their toaster's LLM, which only needs to make the perfect toast, being trained on the tax policies of the Chang Dynasty.
furyofantares
I'm aware! And I'm personally excited about small models but my intuition is that maybe pouring more and more money into giant general purpose models will have payoff as long as it keeps working at producing better general purpose results (which maybe it won't).
cruffle_duffle
That’s just machine learning though!
jasonthorsness
Unfounded extrapolation from a minor pricing update. I am sure every generation of chips also came with “end of Moore’s law” articles for the actual Moore’s law.
FWIW Gemini 2.5 Flash Lite is still very good; I used it in my latest side project to generate entire web sites and it outputs great content and markup every single time.
mpalmer
Is this overthinking it? Google had a huge incentive to outprice Anthropic and OAI to join the "conversation". I was certainly attracted to the low price initially, but I'm staying because it's still affordable and I still think the Gemini 2.5 options are the best simple mix of models available.
fathermarz
Google is raising prices for most of their services. I do not agree that this is due to the cost of compute or that this is the end of Moore’s Law. I don’t think we have scratched the surface.
checker659
> cost of compute
DRAM scaling + interconnect bandwidth stagnation
flakiness
It could just be Google trying to capitalize on Gemini's increasing popularity. Until 2.5, Gemini was a total underdog. Less so since 2.5.
georgeburdell
Is there math backing up the "quadratic" statement with respect to LLM input size? At least in the traffic analogy, I imagine it's exponential, but for small amounts exceeding some critical threshold a quadratic term is sufficient.
"In a move that at first went unnoticed, Google significantly increased the price of its popular Gemini 2.5 Flash model"
It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on if you enabled "thinking" mode or not. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode.
Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m output to $0.30/m input (more expensive) and $2.50/m output (less expensive).
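Whether that nets out as an increase or a decrease depends on the shape of your workload; a quick sketch using the prices above with arbitrary example token counts:

    # Old vs. new Gemini 2.5 Flash thinking prices (USD per million tokens),
    # as quoted above: $0.15 in / $3.50 out before, $0.30 in / $2.50 out after.

    def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
        return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

    workloads = [
        ("long input, short output", 100_000, 1_000),
        ("short input, long output", 2_000, 20_000),
    ]

    for name, inp, out in workloads:
        old = cost_usd(inp, out, 0.15, 3.50)
        new = cost_usd(inp, out, 0.30, 2.50)
        print(f"{name}: ${old:.4f} -> ${new:.4f}")

Input-heavy calls get noticeably more expensive, while output-heavy (long thinking) calls get somewhat cheaper.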
Another minor nit-pick:
> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.
That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result I treat those as separate models on my pricing table on https://www.llm-prices.com
One last one:
> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.
OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316