
Tokens are getting more expensive


142 comments

August 3, 2025

mystraline

From the article:

> consumers hate metered billing. they'd rather overpay for unlimited than get surprised by a bill.

Yes and no.

Take Amazon. You think your costs are known and WHAMMO surprise bill. Why do you get a surprise bill? Because you cannot say 'Turn shit off at X money per month'. Can't do it. Not an option.

All of these 'Surprise Net 30' offerings are the same. You think you're getting a stable price until GOTAHCA.

Now, metered billing can actually be good, when the user knows exactly where they stand on the metering AND can set maximums so their budget doesn't go over.

Taken realistically, as an AI company, you provide a 'used tokens/total tokens' bar graph, tokens per response, and estimated amount of responses before exceeding.
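For illustration, a minimal sketch of the kind of meter described above; the bar width, function name, and numbers are all made up:

```python
# Sketch of the usage display described above: a used/total bar,
# per-response cost, and an estimate of responses remaining.
# All names and numbers here are invented for illustration.

def usage_meter(used_tokens, total_tokens, avg_tokens_per_response):
    remaining = total_tokens - used_tokens
    bar_width = 20
    filled = round(bar_width * used_tokens / total_tokens)
    bar = "#" * filled + "-" * (bar_width - filled)
    responses_left = remaining // avg_tokens_per_response
    return f"[{bar}] {used_tokens}/{total_tokens} tokens, ~{responses_left} responses left"

print(usage_meter(750_000, 1_000_000, 2_500))
# [###############-----] 750000/1000000 tokens, ~100 responses left
```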

Again, don't surprise the user. But that's an anathema to companies who want to hide tokens to dollars, the same way gambling companies obfuscate 'corporate bux' to USD.

mhitza

> Again, don't surprise the user. But that's an anathema to companies who want to hide tokens to dollars, the same way gambling companies obfuscate 'corporate bux' to USD.

This is the exact same thing that frustrates me with GitHub's AI rollout. I've been trialing the new Copilot agent, and its cost is fully opaque. There are multiple references to "premium requests" that don't show up in real time in my dashboard, it's not clear how many I have in total or have left, and when these premium requests are referenced in the UI they link to documentation that also doesn't talk about limits (instead of linking to the associated billing dashboard).

saratogacx

They don't make it easy to figure out but after researching it for my Co. this is what I came to.

    * One chat message -> one premium credit (most models are 1 credit, but some are less and some, like Opus, are 10x)
    * Edit mode is the same as Ask/chat
    * One agent session (meaning you start a new agent chat) counts as one "request", so it can contain multiple messages while costing the same as a single chat message.
Microsoft's Copilot offerings are essentially a masterclass in cost opaqueness. Nothing in any offering is spelled out and they always seem to be just short of the expectation they are selling.
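Assuming the rules above are accurate (they are one person's observations, not official GitHub numbers), the billing model can be sketched as a small estimator; the model names and multipliers are placeholders:

```python
# Rough premium-credit estimator based on the rules above. The
# multipliers are the parent commenter's observations, not official
# GitHub numbers, and the model names are placeholders.
CREDIT_MULTIPLIER = {"gpt-4.1": 0, "sonnet": 1, "opus": 10}

def credits_used(events):
    """events: list of (kind, model), kind is 'chat' or 'agent_session'."""
    total = 0
    for kind, model in events:
        # One chat/edit message costs its model's multiplier; a whole
        # agent session is billed like a single chat message.
        total += CREDIT_MULTIPLIER[model]
    return total

print(credits_used([("chat", "sonnet"), ("agent_session", "opus"), ("chat", "opus")]))  # 21
```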

9dev

But how much is one premium request in real currency, and how many do I have per month?

llbbdd

Highly recommend getting the $20/month OpenAI sub and letting Copilot use that. Quality-wise I feel like I'm getting the same results, but OpenAI's limits are a little more sane.

mhitza

I'm talking about this new agent mode https://github.blog/news-insights/product-news/github-copilo... for which as far as I'm aware there's no option to switch the underlying model used.

debian3

How do you link the openai sub to Gh copilot? I thought you needed to use OpenAI api

Spooky23

Metering is great for defined processes. I love AWS because I can align cost with business. In the old days it was often hard and an internal political process. Some saleschick would shake the assets at a director and now I’m eating the cost for some network gear i don’t need.

But for users, that fine grained cost is not good, because you’re forcing a user to be accountable with metrics that aren’t tied to their productivity. When I was an intern in the 90s, I was at a company that required approval to make long distance phone calls. Some bureaucrat would assess whether my 20 minute phone call was justified and could charge me if my monthly expense was over some limit. Not fun.

Flat rate is the way to go for user ai, until you understand the value in the business and the providers start looking for margin. If I make a $40/hr analyst 20% more productive, that’s worth $16k of value - the $200/mo ChatGPT Pro is a steal.
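The arithmetic behind that figure, assuming a standard 2,000-hour work year:

```python
hourly_rate = 40          # $/hr analyst
hours_per_year = 2000     # standard full-time year
productivity_gain = 0.20  # 20% more productive
value = hourly_rate * hours_per_year * productivity_gain
subscription = 200 * 12   # ChatGPT Pro at $200/mo
print(value, subscription)  # 16000.0 2400
```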

ajb

Amazon is worse than this, though. The AWS bait and switch is that you are supposed to save over the alternatives. So it should be worth switching if you would save more than the dev time you would invest in doing so, right? But your company isn't going to do that, because of opportunity cost: your company expects to get back some multiple of the cost of any dev time it invests in its own business. And because of various uncertainties - in return, in the time taken to develop, in competition, etc. - they will only invest dev time when that multiple is not small. I'm not a business manager, but I'd guess a factor of 5.

But that means that if you were conned into using infrastructure that actually costs more than the alternative, making your cost structure worse, you're still going to eat the loss because it's not worth taking your devs time to switch back.

But tokens don't quite have this problem - yet. Most of us can still do development the old way, and it's not a project to turn it off. Expect this to change though.

ikari_pl

I often find Amazon pricing to be vague and cryptic; sometimes there's literally no way to tell why, for example, your database cost is fluctuating all the time.

joseda-hg

Amazon pricing is nice if you compare it to Azure...

crinkly

Yeah, that. We moved to AWS using their best practices and enterprise cost estimation tooling and got a 6x cost increase on something that was supposed to be cheaper, and now we're fucked because we can't get out.

It’s nearly impossible to tell what the hell is going where and we are mostly surviving on enterprise discounts from negotiations.

The worst thing is they worked out you can blend costs in using AWS marketplace without having to raise due diligence on a new vendor or PO. So up it goes even more.

Not my department or funeral fortunately. Our AWS account is about $15 a month.

ajb

Are you using separate accounts per use case? That's the only real way to get a cost breakdown, otherwise you have no idea what piece of infrastructure is for what. They provide a tagging system but it's only informative if someone spends several hours a month tracking down the stuff that didn't get tagged properly.

AtheistOfFail

> The worst thing is they worked out you can blend costs in using AWS marketplace without having to raise due diligence on a new vendor or PO. So up it goes even more.

Not a bug, a feature.

UltraSane

AWS pricing is actually extremely clearly specified but it is hard to predict your costs unless you have a good understanding of your expected usage.

graemep

If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you.

I am not saying this is desirable, but it is necessary IFF you choose to use these services. They are complex by design, and intended primarily for large scale users who do have the expertise to handle the complexity.

Shank

> If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you.

The point where you get sticker shock from AWS is often significantly lower than the point where you have enough money to hire in either of those roles. AWS is obviously the infrastructure of choice if you plan to scale. The problem is that scaling on expertise isn’t instant and that’s where you’re more likely to make a careless mistake and deploy something relatively costly.

motorest

> If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you.

What a baffling comment. Is it normal to even consider hiring someone to figure out how you are being billed by a service? You started with one problem and now you have at least two? And what kind of perverse incentive are you creating? Don't you think your "finops" person has a vested interest in preserving their job by ensuring billing complexity will always be there?

lelanthran

> If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you

At that point wouldn't it simply be cheaper to do VMs?

ajsnigrutin

But they're also simple and cheap if you're a "one man band" trying out some personal idea that might or might not take off. Those people have no budgets for specialists.

Pricing schemes like these just make them move back to virtual machines with "unlimited" shared cpu usage and setting up services (db,...) manually.

siva7

it's surprising that YC has a gazillion companies doing some ai infrastructure observability product, yet i have yet to see a product that really presents me and the user token usage and pricing estimations easily, which for me is the #1 criteria to use one. make billing and pricing easier for me and the user. instead they run their heads into evals and niche features.

chrisweekly

GOTAHCA?

arcanemachiner

Maybe GOTCHA?

scoreandmore

You can set billing alerts and write a Lambda function to respond and disable resources. Of course they don't make it easy, but if you don't learn how to use limits, what do you expect? This argument amazes me. Cloud services require some degree of responsibility on the user's side.
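As a sketch of what such a Lambda might do: the alert payload shape and field names below are invented for illustration; a real setup would wire AWS Budgets to an SNS topic to Lambda, and the actual stop call (e.g. boto3's `ec2.stop_instances`) needs credentials, so it is left as a comment:

```python
# Sketch of a budget-alert Lambda. The SNS message shape and the
# "actual_spend" field are hypothetical; check the real payload your
# alert source emits before relying on this.
import json

BUDGET_LIMIT = 100.0  # dollars per month, your cap

def handler(event, context=None):
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    spent = float(msg["actual_spend"])
    if spent >= BUDGET_LIMIT:
        # boto3.client("ec2").stop_instances(InstanceIds=[...])  # real action here
        return {"action": "stop", "spent": spent}
    return {"action": "none", "spent": spent}

# Simulated SNS event:
event = {"Records": [{"Sns": {"Message": json.dumps({"actual_spend": "120.50"})}}]}
print(handler(event))  # {'action': 'stop', 'spent': 120.5}
```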

mystraline

This is complete utter hogwash.

Up until recently, you could hit somebody else's S3 endpoint, no auth, and the 403s would still charge them tens of thousands of dollars. Couldn't even firewall it. And there was no way to see it, or anything - just the number going up every 15-30 minutes in the cost dashboard.

Real responsibility is 'I have $100 a month for cloud compute'. Give me an easy way to view it, and shut down if I exceed that. That's real responsibility, that Scamazon, Azure, Google - none of them 'permit'.

They (and well, you) instead say "you can build some shitty clone of the functionality we should have provided, but we would make less money".

Oh, and your Lambda job? That too costs money. It should not cost more money to detect and stop stuff on a 'too much cost' report.

This should be a default feature of cloud: uncapped costs, or stop services

HelloImSteven

Lambda has 1mil free requests per month, so there’s a chance it would be free depending on your usage. But still, it’s not straightforward at all, so I get it.

Perhaps requiring support for bill capping is the right way to go, but honestly I don’t see why providers don’t compete at all here. Customers would flock to any platform with something like “You set a budget and uptime requirements, we’ll figure out what needs to be done”, with some sort of managed auto-adjustment and a guarantee of no overage charges.

Ah well, one can only dream.

scoreandmore

[flagged]

gray_-_wolf

Last time I looked into this, wasn't there up to an hour of delay on the billing alerts? It did not seem possible to ensure you don't run over your budget.

esafak

So you're okay with turning your site off...

mystraline

This is the logical fallacy of false dilemma.

I made it clear that you ask the user to choose between 'accept the risk of overrun and keep everything running', 'shut down everything on exceeding $X', or even 'shut down these specific services on exceeding $X', or other possible ways to limit and control costs.

The cloud companies do not want to permit this because they would lose money over surprise billing.

verbify

Isn't that the definition of metered billing?

dd36

Cats doing tricks has a limited budget.

furyofantares

> claude code has had to roll back their original unlimited $200/mo tier this week

The article repeats this throughout but isn't it a straight lie? The plan was named 20x because it's 20x usage limits, it always had enforced 5 hour session limits, it always had (unenforced? soft?) 50 session per month limits.

It was limited, but not enough and very very probably still isn't, judging by my own usage. So I don't think the argument would even suffer from telling the truth.

Aurornis

You’re right, the Max plan was never advertised as unlimited.

I can’t believe how many comments and articles I’ve read that assume it was unlimited.

It’s like it has been repeated so many times that it’s assumed to be true.

michaelbuckbee

A major current problem is that we're smashing gnats with sledgehammers via undifferentiated model use.

Not every problem needs a SOTA generalist model, and as we get systems/services that are more "bundles" of different models with specific purposes I think we will see better usage graphs.

empiko

Yeah, but the juiciest tasks are still far from solved. The number of tasks where people are willing to accept low-accuracy answers is not that high. It is maybe true for some text processing pipelines, but all the user facing use cases require good performance.

mustyoshi

Yeah, this is the thing people miss a lot. 7B and 32B models work perfectly fine for a lot of things, and run on previously high-end consumer hardware.

But we're still in the hype phase, people will come to their senses once the large model performance starts to plateau

_heimdall

I expect people to come to their senses when LLM companies stop subsidizing cost and start charging customers what it actually costs them to train and run these models.

simonjgreen

Completely agree. It's worth spending time to experiment, too. A reasonably simple chat support system I built recently uses 5 different models depending on the function it's performing. Swapping out different models for different things makes a huge difference to cost, user experience, and quality.

alecco

If there was an option to have Claude Opus guide Sonnet I'd use it for most interactions. Doing it manually is a hassle and breaks the flow, so I end up using Opus too often.

This shouldn't be that expensive even for large prompts since input is cheaper due to parallel processing.

isoprophlex

You can define subagents that are forced to run on e.g. Sonnet, and call these from your main Opus-backed agent. /agent in CC for more info...
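For reference, a subagent pinned to a cheaper model is just a Markdown file with YAML frontmatter under `.claude/agents/`; the keys below (`name`, `description`, `model`) are from my reading of the Claude Code docs, so check the current documentation, and the agent itself is a made-up example:

```markdown
---
name: commit-writer
description: Writes commit messages for routine changes. Use proactively for git commits.
model: sonnet
---

You write concise, conventional-commit-style messages from a staged diff.
```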

danielbln

That's what I do. I used to use Opus for the dumbest stuff, writing commits and such, but now that's all subagent business running on Sonnet (or even Haiku sometimes). Same for running tests, executing services, Docker, etc. All Sonnet subagents. Positive side effect: my Opus allotment lasts a lot longer.

nateburke

generalist = fungible?

In the food industry is it more profitable to sell whole cakes or just the sweetener?

The article makes a great point about Replit and legacy ERP systems. The generative in generative AI will not replace storage; storage is where the margins live.

Unless the C in CRUD can eventually replace the R and U, with the D a no-op.

marcosdumay

> In the food industry is it more profitable to sell whole cakes or just the sweetener?

I really don't understand what you are trying to get at. But on that example: cakes have a higher profit margin, and sweeteners have larger scale.

jsnell

> now look at the actual pricing history of frontier models, the ones that 99% of the demand is for at any given time:

The meaningful frontier isn't scalar on capability alone; it's capability for a given cost. The highest-capability models are not where 99% of the demand is. Actually, the opposite.

To get an idea of what point on the frontier people prefer, have a look at the OpenRouter statistics (https://openrouter.ai/rankings). Claude Opus 4 has about 1% of their total usage, not 99%. Claude Sonnet 4 is the single most popular model at about 18%. The runners up in volume are Gemini Flash 2.0 and 2.5, which are in turn significantly cheaper than Sonnet 4.

ej88

The article just isn't that coherent for me.

> when a new model is released as the SOTA, 99% of the demand immediately shifts over to it

99% is in the wrong ballpark. Lots of users use Sonnet 4 over Opus 4, despite Opus being 'more' SOTA. Lots of users use 4o over o3 or Gemini over Claude. In fact it's never been a closer race on who is the 'best': https://openrouter.ai/rankings

>switch from opus ($75/m tokens) to sonnet ($15/m) when things get heavy. optimize with haiku for reading. like aws autoscaling, but for brains.

they almost certainly built this behavior directly into the model weights

???

Overall, the article seems to argue that companies are running into issues with usage-based pricing because consumers don't accept, or aren't used to, paying by usage, and it's difficult to be the first to crack and switch to usage-based pricing.

I don't think it's as big of an issue as the author makes it out to be. We've seen this play out before in cloud hosting.

- Lots of consumers are OK with a flat fee per month and using an inferior model. 4o is objectively inferior to o3 but millions of people use it (or don't know any better). The free ChatGPT is even worse than 4o and the vast majority of chatgpt visitors use it!

- Heavy users or businesses consume via API and usage based pricing (see cloud). This is almost certainly profitable.

- Fundamentally most of these startups are B2B, not B2C

motorest

> In fact it's never been a closer race on who is the 'best'

Thank you for pointing out that fact. Sometimes it's very hard to keep perspective.

Sometimes I use Mistral as my main LLM. I know it's not lauded as the top-performing LLM, but the truth of the matter is that its results are just as useful as what the best ChatGPT/Gemini/Claude models output, and it is way faster.

There are indeed diminishing returns on the current crop of commercial LLMs. DeepSeek already proved that cost can be a major factor and quality can even improve. I think we're very close to seeing competition based on price, which might be why there is so much talk about mixture-of-experts approaches and how specialized models can drive down cost while improving targeted output.

torginus

Yeah, my biggest problem with CC is that it's slow, prone to generating tons of bullshit exposition, and often goes down paths that I can tell almost immediately will yield no useful result.

It's great if you can leave it unattended, but personally, coding's an active thing for me, and watching it go is really frustrating.

djhworld

Over the past year or two I've just been paying for the API access and using open source frontends like LibreChat to access these models.

This has been working great for the occasional use, I'd probably top up my account by $10 every few months. I figured the amount of tokens I use is vastly smaller than the packaged plans so it made sense to go with the cheaper, pay-as-you-go approach.

But since I've started dabbling in tooling like Claude Code, hoo-boy, those tokens burn _fast_, like really fast. Yesterday I somehow burned through $5 of tokens in the space of about 15 minutes. Sure, the Code tool is vastly different from asking an LLM about a certain topic, but I wasn't expecting such a huge leap. A lot of the token usage is masked from you, I guess - wrapped up in the ever-increasing context plus back-and-forth tool orchestration - but still.
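For scale, $5 in 15 minutes is plausible once an agent loop resends a large context on every tool call. A rough back-of-envelope, using assumed Sonnet-class API prices (roughly $3/M input and $15/M output; check current pricing) and made-up session numbers:

```python
# All numbers here are assumptions for a back-of-envelope estimate.
input_price = 3 / 1_000_000    # assumed $/input token
output_price = 15 / 1_000_000  # assumed $/output token
context_tokens = 50_000        # context resent on each agent turn
output_tokens = 1_000          # generated per turn
turns = 30                     # tool-call round trips in the session
cost = turns * (context_tokens * input_price + output_tokens * output_price)
print(round(cost, 2))  # 4.95
```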

zurfer

The simple reason for this is that Claude Code uses way more context and repetitions than what you would use in a typical chat.

TechDebtDevin

$20.00 via DeepSeek's API (yes, China can have my code, idc) has lasted me almost a year. It's slow, but better quality output than any of the independently hosted DeepSeek models (ime). I don't really use agents or anything tho.

farkin88

Even though tokens are getting cheaper, I think the real killer of "unlimited" LLM plans isn't token costs themselves, it's the shape of the usage curve that's unsustainable. These products see a Zipf-like distribution: thousands of casual users nibble a few hundred tokens a day while a tiny group of power automations devour tens of millions. Flat pricing works fine until one of those whales drops a repo-wide refactor or a 100 MB PDF into chat and instantly torpedoes the margin. Unless vendors turn those extreme loops into cheaper, purpose-built primitives (search, static analyzers, local quantized models, etc.), every "all-you-can-eat" AI subscription is just a slow-motion implosion waiting for its next whale.
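A toy version of that Zipf math, with every parameter invented, shows how heavily the top of the curve dominates:

```python
# Toy model of the Zipf-shaped usage described above: the rank-k user
# consumes ~1/k of the heaviest user's tokens. All numbers invented.
N_USERS = 10_000
WHALE_TOKENS = 50_000_000  # monthly tokens of the heaviest user
usage = [WHALE_TOKENS / rank for rank in range(1, N_USERS + 1)]

total = sum(usage)
top_1pct = sum(usage[: N_USERS // 100])  # heaviest 100 of 10,000 users
print(f"top 1% of users consume {top_1pct / total:.0%} of all tokens")
# top 1% of users consume 53% of all tokens
```

Under flat pricing, that top 1% sets the margin for the whole cohort.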

torginus

First of all, do they shoot you in San Francisco, if you use capital letters and punctuation?

Second, why are SV people obsessed with fake exponentials? It's very clear that AI progress has only been exponential in the sense that people are throwing a lot more resources at AI than they did a couple of years ago.

369548684892826

> First of all, do they shoot you in San Francisco, if you use capital letters and punctuation?

Is it done like this just to show it wasn't written by a LLM?

luqtas

oh no! i can't deal with the natural morphing a lingua franca has! /j

Thou needst to live in the archaic.

Waterluvian

On the topic of cost per token: is it accurate to represent a token as, ideally, a composable atomic unit of information? Because we're (often) using English as the encoding format, it can only be as efficient as English can encode the data.

Does this mean that other languages might offer better information density per token? And does this mean that we could invent a language that’s more efficient for these purposes, and something humans (perhaps only those who want a job as a prompt engineer) could be taught?

Kevin speak good? https://youtu.be/_K-L9uhsBLM?si=t3zuEAmspuvmefwz

fy20

English often has a lot of redundancy, you could rewrite your comment to this and still have it convey the original meaning:

Regarding cost per token: is a token ideally a composable, atomic unit of information? Since English is often used as an encoding format, efficiency is limited by English's encoding capacity.

Could other languages offer higher information density per token? Could a more efficient language be invented for this purpose, one teachable to humans, especially aspiring prompt engineers?

67 tokens vs 106 for the original.

Many languages don't have articles, you could probably strip them from this and still understand what it's saying.

deegles

Human speech has a bit rate of around 39 bits per second, no matter how quickly you speak. Assuming reading is similar, I guess more "dense" tokens would just take longer for humans to read.

https://www.science.org/content/article/human-speech-may-hav...

__s

Sure, but that link has Japanese at 5 bits per syllable and Vietnamese at 8 bits per syllable, so if billing were based on syllables per prompt, you'd want Vietnamese prompts.

Granted English is probably going to have better quality output based on training data size
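Taking those per-syllable figures at face value, and the ~39 bits/s rate from the linked study, the speaking-rate arithmetic works out like this:

```python
# Figures taken from the linked study: speech converges near ~39 bits/s,
# with different information density per syllable by language.
BIT_RATE = 39  # bits/second
bits_per_syllable = {"japanese": 5, "vietnamese": 8}

# Syllables per second needed to sustain the same bit rate:
syllables_per_sec = {lang: BIT_RATE / b for lang, b in bits_per_syllable.items()}
print(syllables_per_sec)  # {'japanese': 7.8, 'vietnamese': 4.875}
```

So a denser language carries the same information in roughly 40% fewer syllables, which is the whole appeal if you were billed per syllable.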

joseda-hg

IIRC, in linguistics there's a hypothesis of "uniform information density" that languages seem to follow at a human level (denser languages slow down, sparser languages speed up), so you might have to go for an artificial encoding that maps efficiently to English.

English (and any of the dominant languages that you could use in its place) works significantly better than other languages purely by having significantly larger bodies of work for the LLM to draw from.

Waterluvian

Yeah I was wondering about it basically being a dialect or the CoffeeScript of English.

Maybe even something anyone can read and maybe write… so… Kevin English.

Job applications will ask for how well one can read and write Kevin.

r_lee

Sure. For example, Korean is Unicode-heavy: e.g. 경찰 = police, but it's just 2 Unicode code points. Not too familiar with how things are encoded, but it could be more efficient.
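A quick check of that example: 경찰 is 2 code points but 6 UTF-8 bytes, the same byte count as the 6-letter ASCII word "police". So whether it tokenizes more cheaply depends on the tokenizer's vocabulary, not the character count alone:

```python
# Code points vs. UTF-8 bytes for the example above.
word = "경찰"  # "police" in Korean
print(len(word))                      # 2 code points
print(len(word.encode("utf-8")))      # 6 bytes (3 per Hangul syllable)
print(len("police".encode("utf-8")))  # 6 bytes
```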

mensetmanusman

They will soon be subsidized by ads or people will run their own.

GiorgioG

I tried Gemini CLI and in 2 hours somehow spent $22 just messing around with a very small codebase. I didn’t find out until the next day from Google’s billing system. That was enough for me - I won’t touch it again.

adrianbooth17

Isn't Gemini CLI free? Or did you BYOK?

mark_l_watson

I have already thought a lot about the large packaged inference companies hitting a financial brick wall, but I was surprised by material near the end of the article: the discussions of lock-in for companies that can't switch, and of Replit making money on the whole stack. Really interesting.

I managed a deep learning team at Capital One, and the lock-in thing is real. Replit is an interesting case study for me: after a one-week free agent trial I signed up for a one-year subscription, had fun with their LLM-based coding agent for a few weeks, and almost never used it after that. But I still have fun with Replit as an easy way to spin up Nix-based coding environments. Replit seems to offer something for everyone.