Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?

137 comments

·August 8, 2025

Sam said yesterday that ChatGPT handles ~700M weekly users. Meanwhile, I can't even run a single GPT-4-class model locally without insane VRAM or painfully slow speeds.

Sure, they have huge GPU clusters, but there must be more going on - model optimizations, sharding, custom hardware, clever load balancing, etc.

What engineering tricks make this possible at such massive scale while keeping latency low?

Curious to hear insights from people who've built large-scale ML systems.

canyon289

I work at Google on these systems every day (caveat: these are my own words, not my employer's). So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.

However I can share this written by my colleagues! You'll find great explanations about accelerator architectures and the considerations made to make things fast.

https://jax-ml.github.io/scaling-book/

In particular your questions are around inference which is the focus of this chapter https://jax-ml.github.io/scaling-book/inference/

Edit: Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.

https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...

KaiserPro

Same explanation but with less mysticism:

Inference is (mostly) stateless. So unlike training where you need to have memory coherence over something like 100k machines and somehow avoid the certainty of machine failure, you just need to route mostly small amounts of data to a bunch of big machines.

I don't know what the specs of their inference machines are, but where I worked the machines research used were all 8-GPU monsters. So long as your model fit in (combined) VRAM, your job was a good 'un.

To scale, the secret ingredient was industrial amounts of cash. Sure, we had DGXs (fun fact: Nvidia sent literal gold-plated DGX machines), but they weren't dense, and were very expensive.

Most large companies have robust RPC and orchestration, which means the hard part isn't routing the message, it's making the model fit in the boxes you have. (That's not my area of expertise though.)

blibble

> So I simultaneously can tell you that its smart people really thinking about every facet of the problem, and I can't tell you much more than that.

"we do timesharing"

there, that was easy

tough

Doesn't Google have TPUs that make inference of their own models much more profitable than, say, having to rent NVIDIA cards?

Doesn't OpenAI depend mostly on its relationship/partnership with Microsoft to get GPUs to inference on?

Thanks for the links, interesting book!

ActorNightly

Yes. Google is probably gonna win the LLM game tbh. They had a massive head start with TPUs which are very energy efficient compared to Nvidia Cards.

baxtr

The only one who can stop Google is Google.

They’ll definitely have the best model, but there is a chance they will f*up the product / integration into their products.

stogot

Hasn’t the Inferentia chip been around long enough to make the same argument? AWS and Google probably have the same order of magnitude of their own custom chips

davedx

But they’re ASICs so any big architecture changes will be painful for them right?

fakedang

Yeah honestly. They could just try selling solutions and SLAs combining their TPU hardware with on-prem SOTA models and practically dominate enterprise. From what I understand, that's GCP's gameplay too for most regulated enterprise clients.

canyon289

I'm a research person building models, so I can't answer your questions well (save for one part).

That is, as a research person using our GPUs and TPUs I see first hand how choices from the high-level Python level, through JAX, down to the TPU architecture all work together to make training and inference efficient. You can see a bit of that in the gif on the front page of the book. https://jax-ml.github.io/scaling-book/

I also see how sometimes bad choices by me can make things inefficient. Luckily for me if my code/models are running slow I can ping colleagues who are able to debug at both a depth and speed that is quite incredible.

And because we're on HN I want to preemptively call out my positive bias for Google! It's a privilege to be able to see all this technology first hand, work with great people, and do my best to ship this at scale across the globe.

jackhalford

Why does the unsloth guide for gemma 3n say:

> llama.cpp an other inference engines auto add a <bos> - DO NOT add TWO <bos> tokens! You should ignore the <bos> when prompting the model!

That makes me want to try exactly that. Weird.

ignoramous

> Another great resource to look at is the unsloth guides.

And folks at LMSys: https://lmsys.org/blog/

  Large Model Systems (LMSYS Corp.) is a 501(c)(3) non-profit focused on incubating open-source projects and research. Our mission is to make large AI models accessible to everyone by co-developing open models, datasets, systems, and evaluation tools. We conduct cutting-edge machine learning research, develop open-source software, train large language models for broad accessibility, and build distributed systems to optimize their training and inference.

airhangerf15

An H100 is a $20k USD card and has 80GB of vRAM. Imagine a 2U rack server with $100k of these cards in it. Now imagine an entire rack of these things, plus all the other components (CPUs, RAM, passive cooling or water cooling) and you're talking $1 million per rack, not including the costs to run them or the engineers needed to maintain them. Even the "cheaper"

I don't think people realize the size of these compute units.

When the AI bubble pops is when you're likely to be able to realistically run good local models. I imagine some of these $100k servers going for $3k on eBay in 10 years, and a lot of electricians being asked to install new 240v connectors in makeshift server rooms or garages.

semi-extrinsic

What do you mean 10 years?

You can pick up a DGX-1 on eBay right now for less than $10k. 256 GB VRAM (HBM2, no less), NVLink capability, 512 GB RAM, 40 CPU cores, 8 TB SSD, 100 Gbit HBAs. Equivalent non-Nvidia branded machines are around $6k.

They are heavy, noisy like you would not believe, and a single one just about maxes out a 16A 240V circuit. Which also means it produces 13 000 BTU/hr of waste heat.

kj4ips

Fair warning: the BMCs on those suck so bad, and the firmware bundles are painful, since you need a working nvidia-specific container runtime to apply them, which you might not be able to get up and running because of a firmware bug causing almost all the ram to be presented as nonvolatile.

ksherlock

It's not waste heat if you only run it in the winter.

hdgvhicv

Only if you ignore that both gas furnaces and heat pumps are more efficient than resistive loads.

eulgro

> 13 000 BTU/hr

In sane units: 3.8 kW

andy99

You mean 1.083 tons of refrigeration

Skunkleton

> In sane units: 3.8 kW

5.1 Horsepower
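For anyone checking the unit conversions in this subthread, here is a small sketch. It assumes the 16 A / 240 V circuit figure quoted above and standard conversion factors; the variable names are just for illustration.

```python
# Unit conversions for the "16A 240V circuit" figure quoted in the thread.
WATTS_PER_BTU_PER_HR = 0.29307107   # 1 BTU/hr expressed in watts
BTU_PER_HR_PER_TON = 12_000         # 1 ton of refrigeration
WATTS_PER_HP = 745.7                # 1 mechanical horsepower

watts = 16 * 240                             # circuit limit in watts
btu_per_hr = watts / WATTS_PER_BTU_PER_HR    # heat output in BTU/hr
tons = btu_per_hr / BTU_PER_HR_PER_TON       # tons of refrigeration
horsepower = watts / WATTS_PER_HP

print(f"{watts / 1000:.2f} kW, {btu_per_hr:,.0f} BTU/hr, "
      f"{tons:.2f} tons, {horsepower:.1f} hp")
```

The results land close to the figures traded back and forth above (roughly 3.8 kW, 13,000 BTU/hr, a bit over 1 ton, about 5.1 hp).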

quickthrowman

You’ll need (2) 240V 20A 2P breakers, one for the server and one for the 1-ton mini-split to remove the heat ;)

Dylan16807

Matching AC would only need 1/4 the power, right? If you don't already have a method to remove heat.

kelnos

Well, get a heat pump with a good COP of 3 or more, and you won't need quite as much power ;)

Scoundreller

Just air freight them from 60 degrees North to 60 degrees South and vice versa every 6 months.

CamperBob2

Are you talking about the guy in Temecula running two different auctions with some of the same photos (356878140643 and 357146508609, both showing a missing heat sink?) Interesting, but seems sketchy.

How useful is this Tesla-era hardware on current workloads? If you tried to run the full DeepSeek R1 model on it at (say) 4-bit quantization, any idea what kind of TTFT and TPS figures might be expected?

invaliduser

Even if the AI bubble does not pop, your prediction about those servers being available on eBay in 10 years will likely be true, because some datacenters will simply upgrade their hardware and resell their old machines to third parties.

potatolicious

Would anybody buy the hardware though?

Sure, datacenters will get rid of the hardware - but only because it's no longer commercially profitable to run them, presumably because compute demands have eclipsed their abilities.

It's kind of like buying a used GeForce 980Ti in 2025. Would anyone buy them and run them besides out of nostalgia or curiosity? Just the power draw makes them uneconomical to run.

Much more likely every single H100 that exists today becomes e-waste in a few years. If you have need for H100-level compute you'd be able to buy it in the form of new hardware for way less money and consuming way less power.

For example if you actually wanted 980Ti-level compute in a desktop today you can just buy a RTX5050, which is ~50% faster, consumes half the power, and can be had for $250 brand new. Oh, and is well-supported by modern software stacks.

CBarkleyU

Off topic, but I bought my (still in active use) 980ti literally 9 years ago for that price. I know, I know, inflation and stuff, but I really expected more than 50% bang for my buck after 9 whole years…

belter

Except their insane electricity demands will still be the same, meaning nobody will buy them. You have plenty of SPARC servers on Ebay.

cicloid

There is also a community of users known for not making sane financial decisions and keeping older technologies working in their basements.

mattmanser

Someone's take on AI was that we're collectively investing billions in data centers that will be utterly worthless in 10 years.

Unlike the investments in railways or telephone cables or roads or any other sort of architecture, this investment has a very short lifespan.

Their point was that whatever your take on AI, the present investment in data centres is a ridiculous waste and will always end up as a huge net loss compared to most other investments our societies could spend it on.

Maybe we'll invent AGI and they'll be proven wrong as the data centres pay for themselves many times over, but I suspect they'll ultimately be proved right and it'll all end up as landfill.

bespokedevelopr

If it is all a waste and a bubble, I wonder what the long term impact will be of the infrastructure upgrades around these dcs. A lot of new HV wires and substations are being built out. Cities are expanding around clusters of dcs. Are they setting themselves up for a new rust belt?

toast0

The servers may well be worthless (or at least worth a lot less), but that's pretty much true for a long time. Not many people want to run on 10 year old servers (although I pay $30/month for a dedicated server that's dual Xeon L5640 or something like that, which is about 15 years old).

The servers will be replaced, the networking equipment will be replaced. The building will still be useful, the fiber that was pulled to internet exchanges/etc will still be useful, the wiring to the electric utility will still be useful (although I've certainly heard stories of datacenters where much of the floor space is unusable, because power density of racks has increased and the power distribution is maxed out)

jonplackett

They probably are right, but a counter argument could be how people thought going to the moon was pointless and insanely expensive, but the technology to put stuff in space and have GPS and comms satellites probably paid that back 100x

dortlick

Sure, but what about the collective investment in smartphones, digital cameras, laptops, even cars. Not much modern technology is useful and practical after 10 years, let alone 20. AI is probably moving a little faster than normal, but technology depreciation is not limited to AI.

mensetmanusman

Utterly? Moore's law per power requirement is dead, lower power units can run electric heating for small towns!

torginus

My personal sneaking suspicion is that publicly offered models are using way less compute than thought. In modern mixture-of-experts models, you can do top-k routing, where only some experts are evaluated per token, meaning even SOTA models aren't using much more compute than a 70-80B non-MoE model.
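A minimal sketch of the "only some experts are evaluated" idea, assuming a standard top-k router. The experts here are toy scalar functions and the router logits are given directly; in a real model both would be learned layers, and all names below are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_logits, experts, k):
    """Evaluate only the k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]                       # the other experts never run
    weights = softmax([router_logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen

# Four toy experts that just scale their input (illustrative only).
experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]         # assumed router scores

output, chosen = moe_layer(1.0, router_logits, experts, k=2)
print(f"ran experts {chosen} out of {len(experts)}, output={output}")
```

With k=2 out of 4 experts, only half of the expert weights do any arithmetic for this token, although all of them still have to be resident in memory.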

eitally

What I wonder is what this means for Coreweave, Lambda and the rest, who are essentially just renting out fleets of racks like this. Does it ultimately result in acquisition by a larger player? Severe loss of demand? Can they even sell enough to cover the capex costs?

adw

These are also depreciating assets.

ActorNightly

To piggyback on this, at the enterprise level these days, the question is really not "how are we going to serve all these users". It comes down to the fact that investors believe they will eventually see a return on investment, and so they pay whatever is needed to get the infra.

Even if you didn't have optimizations involved in terms of job scheduling, they would just build as many warehouses as necessary filled with as many racks as necessary to serve the required user base.

torginus

I wonder if it's feasible to hook up NAND flash with a high bandwidth link necessary for inference.

Each of these NAND chips has hundreds of dies of flash stacked inside, and they are hooked up to the same data line, so just 1 of them can talk at a time, and they still achieve >1GB/s bandwidth. If you could hook them up in parallel, you could have 100s of GB/s of bandwidth per chip.

potatolicious

NAND is very, very slow relative to RAM, so you'd pay a huge performance penalty there. But maybe more importantly my impression is that memory contents mutate pretty heavily during inference (you're not just storing the fixed weights), so I'd be pretty concerned about NAND wear. Mutating a single bit on a NAND chip a million times over just results in a large pile of dead NAND chips.

torginus

No, it's not slow: a single NAND chip in SSDs offers >1GB/s of bandwidth. Inside the chip there are 100+ dies actually holding the data, but in SSDs only one of them is active when reading/writing.

You could probably make special NAND chips where all of them can be active at the same time, which means you could get 100GB/s+ bandwidth out of a single chip.

This would be useless for data storage scenarios, but very useful when you have huge amounts of static data you need to read quickly.

neko_ranger

Four H100 in a 2U rack didn't sound impressive, but that is accurate:

>A typical 1U or 2U server can accommodate 2-4 H100 PCIe GPUs, depending on the chassis design.

>In a 42U rack with 20x 2U servers (allowing space for switches and PDU), you could fit approximately 40-80 H100 PCIe GPUs.

michaelt

Why stop at 80 H100s for a mere 6.4 terabytes of GPU memory?

Supermicro will sell you a full rack loaded with servers [1] providing 13.4 TB of GPU memory.

And with 132kW of power output, you can heat an olympic-sized swimming pool by 1°C every day with that rack alone. That's almost as much power consumption as 10 mid-sized cars cruising at 50 mph.

[1] https://www.supermicro.com/en/products/system/gpu/48u/srs-gb...
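The numbers in this comment can be sanity-checked with a few lines of arithmetic. The pool volume below is an assumption (a nominal 50 m x 25 m x 2 m Olympic pool), and the calculation ignores heat loss to the surroundings.

```python
# GPU memory: 80 H100s at 80 GB each.
total_gpu_memory_tb = 80 * 80 / 1000

# Heating an Olympic pool with a 132 kW rack for one day.
RACK_WATTS = 132_000
SECONDS_PER_DAY = 86_400
POOL_VOLUME_M3 = 2_500          # assumption: 50 m x 25 m x 2 m
WATER_KG_PER_M3 = 1_000
SPECIFIC_HEAT_J_PER_KG_K = 4_186

energy_per_day_j = RACK_WATTS * SECONDS_PER_DAY
joules_per_degree = POOL_VOLUME_M3 * WATER_KG_PER_M3 * SPECIFIC_HEAT_J_PER_KG_K
degrees_per_day = energy_per_day_j / joules_per_degree

print(f"{total_gpu_memory_tb} TB of GPU memory")
print(f"pool warms by about {degrees_per_day:.2f} degrees C per day")
```

Under those assumptions the "1°C every day" figure holds up to within about ten percent.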

jzymbaluk

And the big hyperscaler cloud providers are building city-block sized data centers stuffed to the gills with these racks as far as the eye can see

piyh

You have thousands of dollars, they have tens of billions. $1,000 vs $10,000,000,000. They have 7 more zeros than you, which is one less zero than the scale difference in users: 1 user (you) vs 700,000,000 users (openai). They managed to squeak out at least one or two zeros worth of efficiency at scale vs what you're doing.

Also, you CAN run local models that are as good as GPT 4 was on launch on a macbook with 24 gigs of ram.

https://artificialanalysis.ai/?models=gpt-oss-20b%2Cgemma-3-...

cornholio

You can knock off a zero or two just by time shifting the 700 million distinct users across a day/week and accounting for the mere minutes of compute time they will actually use in each interaction. So they might not see peaks higher than 10 million active inference sessions at the same time.
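A back-of-the-envelope version of that argument, as a sketch. The 20 minutes of active generation per user per week and the 3x peak-to-average ratio are made-up assumptions, not reported figures.

```python
MINUTES_PER_WEEK = 7 * 24 * 60

def average_concurrent_sessions(weekly_users, active_minutes_per_user):
    """Average number of sessions generating at any instant (Little's law)."""
    return weekly_users * active_minutes_per_user / MINUTES_PER_WEEK

# Assumptions for illustration only.
weekly_users = 700_000_000
active_minutes_per_user = 20
peak_to_average = 3

average = average_concurrent_sessions(weekly_users, active_minutes_per_user)
peak = average * peak_to_average

print(f"average concurrent sessions: {average:,.0f}")
print(f"estimated peak:              {peak:,.0f}")
```

Even with generous inputs, the number of sessions that need a GPU at the same instant comes out two orders of magnitude below the headline user count.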

Conversely, you can't do the same thing as a self hosted user, you can't really bank your idle compute for a week and consume it all in a single serving, hence the much more expensive local hardware to reach the peak generation rate you need.

0cf8612b2e1e

During times of high utilization, how do they handle more requests than they have hardware? Is the software granular enough that they can round robin the hardware per token generated? UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO processing the entire request before switching to the next user?

I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.

cornholio

Inference is essentially a very complex matrix algorithm run repeatedly on its own output: each time, the input matrix (context window) is shifted and the newly generated tokens are appended to the end. So it's easy to multiplex all active sessions over limited hardware: a typical server can hold hundreds of thousands of active contexts in main system RAM, each less than 500KB, and ferry them to the GPU nearly instantaneously as required.
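Here is a toy sketch of the per-token round robin asked about above (UserA token, then UserB, then back to UserA). The `toy_next_token` function is a hypothetical stand-in for a model forward pass; real serving stacks generally go further and run all active sessions through one batched forward pass per step rather than taking turns.

```python
from collections import deque

def toy_next_token(context):
    """Hypothetical stand-in for one forward pass of a model."""
    return f"t{len(context)}"

def round_robin_decode(requests, max_new_tokens):
    """Generate one token per user per turn until every request is done.

    requests maps a user name to that user's prompt tokens.
    Returns the (user, token) pairs in the order they were produced.
    """
    queue = deque((user, list(prompt), 0) for user, prompt in requests.items())
    produced_order = []
    while queue:
        user, context, produced = queue.popleft()
        token = toy_next_token(context)
        context.append(token)                 # the context is the only state
        produced_order.append((user, token))
        if produced + 1 < max_new_tokens:
            queue.append((user, context, produced + 1))
    return produced_order

order = round_robin_decode({"A": ["hi"], "B": ["hello", "there"]},
                           max_new_tokens=2)
print(order)
```

Because each session's state is just its context, a session can be parked in the queue between tokens and resumed on whatever hardware is free.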

the8472

During peaks they can kick out background jobs like model training or API users doing batch jobs.

abathologist

One clever ingredient in OpenAI's secret sauce is billions of dollars of losses. About $5 billion lost in 2024. https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...

fergal_reid

I think the most direct answer is that at scale, inference can be batched, so that processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup).

If you want a survey of intermediate level engineering tricks, this post we wrote on the Fin AI blog might be interesting. (There's probably a level of proprietary techniques OpenAI etc have again beyond these): https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/

jp57

700M weekly users doesn't say much about how much load they have.

I think the thing to remember is that the majority of chatGPT users, even those who use it every day, are idle 99.9% of the time. Even someone who has it actively processing for an hour a day, seven days a week, is idle 96% of the time. On top of that, many are using less-intensive models. The fact that they chose to mention weekly users implies that there is a significant tail of their user distribution who don't even use it once a day.
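The idle percentages are easy to verify; a quick sketch of the arithmetic:

```python
def idle_fraction(active_hours_per_day):
    """Fraction of the day a user is NOT generating tokens."""
    return 1 - active_hours_per_day / 24

heavy_user_idle = idle_fraction(1)           # one hour of processing per day
minutes_active_at_999 = (1 - 0.999) * 24 * 60  # what 99.9% idle allows per day

print(f"1 hour/day user is idle {heavy_user_idle:.1%} of the time")
print(f"99.9% idle still allows {minutes_active_at_999:.2f} active minutes/day")
```

So even a user who is "idle 99.9% of the time" still gets roughly a minute and a half of model time per day.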

So your question factors into a few easier-but-still-not-trivial problems:

- Making individual hosts that can fit their models in memory and run them at acceptable toks/sec.

- Making enough of them to handle the combined demand, as measured in peak aggregate toks/sec.

- Multiplexing all the requests onto the hosts efficiently.

Of course there are nuances, but honestly, from a high level the last problem does not seem so different from running a search engine. All the state is in the chat transcript, so I don't think there is any particular reason that successive interactions on the same chat need be handled by the same server. They could just be load-balanced to whatever server is free.

We don't know, for example, when the chat says "Thinking..." whether the model is running or if it's just queued waiting for a free server.

roadside_picnic

I'm sure there are countless tricks, but one that can be implemented at home, and that I know plays a major part in Cerebras' performance, is speculative decoding.

Speculative decoding uses a smaller draft model to generate tokens with much less compute and memory required. Then the main model will accept those tokens based on the probability it would have generated them. In practice this can easily result in a 3x speedup in inference.

Another trick for structured outputs that I know of is "fast forwarding", where you can skip tokens if you know they are going to be the only acceptable outputs. For example, you know that when generating JSON you need to start with `{ "<first key>": ` etc. This can also lead to a ~3x speedup when responding in JSON.
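A minimal sketch of the "fast forwarding" idea, assuming the output schema is known in advance. The template format and `toy_model` are hypothetical, and each free slot is filled with a single call here, whereas a real model would spend several token-level calls per slot.

```python
import json

def generate_with_fast_forward(template, model_fill):
    """Emit forced text directly and call the model only for free slots.

    template is a list where a str is text that must appear verbatim
    and None marks a slot the model has to fill in.
    """
    pieces = []
    model_calls = 0
    for part in template:
        if part is None:
            pieces.append(model_fill("".join(pieces)))
            model_calls += 1
        else:
            pieces.append(part)               # no model call needed
    return "".join(pieces), model_calls

def toy_model(prefix):
    """Hypothetical model: picks a value based on the preceding key."""
    return '"Ada"' if prefix.endswith('"name": ') else "36"

template = ['{"name": ', None, ', "age": ', None, '}']
text, model_calls = generate_with_fast_forward(template, toy_model)
parsed = json.loads(text)                     # the result is valid JSON

print(text, "with", model_calls, "model calls")
```

All of the braces, keys, and separators are copied straight from the template, so the model only does work for the values.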

tough

gpt-oss-120b can be used with gpt-oss-20b as speculative drafting on LM Studio

I'm not sure it improved the speed much

roadside_picnic

To measure the performance gains on a local machine (or even a standard cloud GPU setup), since you can't run this in parallel with the same efficiency you could in a high-end data center, you need to compare the number of calls made to each model.

In my experience I've seen the calls to the target model reduced to a third of what they would have been without using a draft model.

You'll still get some gains on a local model, but they won't be near what they could be theoretically if everything is properly tuned for performance.

It also depends on the type of task. I was working with pretty structured data with lots of easy to predict tokens.

vrm

a 6:1 parameter ratio is too small for specdec to have that much of an effect. You'd really want to see 10:1 or even more for this to start to matter

ritz_labringue

The short answer is "batch size". These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

If you try to run GPT4 at home, you'll still need enough VRAM to load the entire model, which means you'll need several H100s (each one costs like $40k). But you will be under-utilizing those cards by a huge amount for personal use.

It's a bit like saying "How come Apple can make iphones for billions of people but I can't even build a single one in my garage"

robotnikman

I wonder then if it's possible to load the unused parts into main memory, and the more used parts into VRAM.

ryao

At the heart of inference is matrix-vector multiplication. If you have many of these operations to do and only the vector part differs (which is the case when you have multiple queries), you can do matrix-matrix multiplication by stuffing the vectors into a matrix. Computing hardware can do one such matrix-matrix multiplication, the equivalent of dozens of matrix-vector multiplications, in about the time it takes to do a single matrix-vector multiplication. This is called batching. That is the main trick.
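A pure-Python sketch of the batching trick. It shows only that stacking the query vectors gives the same answers while each weight row is fetched once per batch instead of once per query; it does not model actual GPU timing, and the tiny matrices are illustrative.

```python
def matvec(weights, vector):
    """One query: every weight row is read for this single vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def batched_matmul(weights, vectors):
    """Many queries: each weight row is read once and reused for all vectors."""
    outputs = [[0] * len(weights) for _ in vectors]
    for r, row in enumerate(weights):         # fetch the row once...
        for q, vector in enumerate(vectors):  # ...reuse it for every query
            outputs[q][r] = sum(w * v for w, v in zip(row, vector))
    return outputs

weights = [[1, 2], [3, 4]]
queries = [[1, 0], [0, 1], [1, 1]]            # three users' vectors

separate = [matvec(weights, q) for q in queries]
batched = batched_matmul(weights, queries)

print(batched)
```

The arithmetic is identical either way; what changes is how many times the (very large) weights have to be pulled from memory.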

A second trick is to implement something called speculative decoding. Inference has two phases. One is prompt processing and another is token generation. They actually work the same way using what is called a forward pass, except prompt processing can do them in parallel by switching from matrix-vector to matrix-matrix multiplication and dumping the prompt’s tokens into each forward pass in parallel. Each forward pass will create a new token, but it can be discarded unless it is from the last forward pass, as that will be the first new token generated as part of token generation. Now, you put that token into the next forward pass to get the token after it, and so on.

It would be nice if all of the forward passes could be done in parallel, but you do not know the future, so you ordinarily cannot. However, if you make a draft model, a very fast model that runs in a fraction of the time and guesses the next token correctly most of the time, then you can sequentially run the forward pass for that instead N times. Now, you can take the N tokens and put them into the prompt processing routine, which does N forward passes in parallel. Instead of discarding all tokens except the last one like in prompt processing, we compare them to the input tokens. All tokens that come out of the parallel forward pass, up to and including the first one that differs, are valid tokens for the output of the main model.

This is guaranteed to always produce at least 1 valid token, since in the worst case the first token does not match, but the output for the first token will be equal to the output of running the forward pass without having done speculative decoding. You can get a 2x to 4x performance increase from this if done right.
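The greedy variant described above can be sketched as follows. Both "models" are hypothetical toy functions from a token list to the next token, and the verification step is written as a loop even though on real hardware it would be a single batched forward pass, which is why it is counted as one target pass.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target forward pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, n_draft, max_new_tokens):
    """Greedy speculative decoding; returns (new tokens, target passes used)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model guesses n_draft tokens sequentially.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks every position (one parallel pass).
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(n_draft)]
        # 3. Keep target outputs up to and including the first mismatch.
        for guess, truth in zip(draft, verified):
            tokens.append(truth)
            if guess != truth:
                break
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, target_passes

def toy_target(context):
    """Hypothetical target model: next token depends on context length."""
    return len(context) % 5

def toy_draft(context):
    """Hypothetical draft model: agrees with the target most of the time."""
    return -1 if len(context) % 4 == 3 else len(context) % 5

baseline = greedy_decode(toy_target, [0], max_new_tokens=8)
speculated, target_passes = speculative_decode(
    toy_target, toy_draft, [0], n_draft=4, max_new_tokens=8)

print(speculated, "using", target_passes, "target passes instead of 8")
```

The output is identical to plain greedy decoding; the only thing that changes is how many times the expensive model has to run.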

Now, I do not work on any of this professionally, but I am willing to guess that beyond these techniques, they have groups of machines handling queries of similar length in parallel (since doing a batch where 1 query is much longer than the others is inefficient) and some sort of dynamic load balancing so that machines do not get stuck with a query size that is not actively being utilized.

simne

It is not just engineering. There are also huge, very huge, investments into infrastructure.

As already answered, AI companies use extremely expensive setups (servers with professional cards) in large numbers, and all these things are concentrated in big datacenters with powerful networking and huge power consumption.

Imagine: the last time investments were this huge (~1.2% of GDP, and it's unknown whether investments will grow or not) was in telecom infrastructure: mostly wired telephones, but also cable TV, and later the Internet, cell communications, and clouds (in some countries wired phones never covered the whole country, and they jumped directly to wireless communications).

Even larger investments went into railroads: ~6% of GDP (and I'm also not sure; some people say AI will surpass them, as the share of tasks AI can handle constantly grows).

So to conclude, right now the AI boom looks like the main consumer of telecom (Internet) and cloud infrastructure. If you've seen old mainframes in datacenters, extremely thick core network cables (with hundreds of wires or fibers in just one cable), and huge satellite dishes, you can imagine what I'm talking about.

And yes, I'm not sure whether this boom will end like the dot-coms (Y2K) or whether such huge usage of resources will be sustained. It isn't obvious, because for telecoms (the Internet) it was also unknown whether people would use phones and other p2p communications for leisure as they do now, or would keep phones just for work. What's more, if AI agents become ordinary things, a possible scenario is that the number of AI agents will surpass the number of people.

mquander

I'm pretty much an AI layperson but my basic understanding of how LLMs usually run on my or your box is:

1. You load all the weights of the model into GPU VRAM, plus the context.

2. You construct a data structure called the "KV cache" representing the context, and it hopefully stays in the GPU cache.

3. For each token in the response, for each layer of the model, you read the weights of that layer out of VRAM and use them plus the KV cache to compute the inputs to the next layer. After all the layers you output a new token and update the KV cache with it.

Furthermore, my understanding is that the bottleneck of this process is usually in step 3 where you read the weights of the layer from VRAM.

As a result, this process is very parallelizable if you have lots of different people doing independent queries at the same time, because you can have all their contexts in cache at once, and then process them through each layer at the same time, reading the weights from VRAM only once.

So once you've got the VRAM it's much more efficient for you to serve lots of people's different queries than for you to be one guy doing one query at a time.
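A rough way to put numbers on that, as a sketch: if reading the weights from VRAM is the bottleneck, each full read produces one token per query in the batch. The model size and bandwidth below are assumptions for illustration, and the estimate ignores KV-cache reads and the point where compute becomes the limit.

```python
def tokens_per_second(weight_bytes, bandwidth_bytes_per_s, batch_size=1):
    """Upper bound on decode speed when weight reads are the bottleneck.

    One pass over the weights yields one new token for every query in
    the batch, so throughput scales with batch size.
    """
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * batch_size

# Assumptions: a 70B-parameter model at 2 bytes per weight,
# and 3 TB/s of memory bandwidth.
weight_bytes = 70e9 * 2
bandwidth = 3e12

single_user = tokens_per_second(weight_bytes, bandwidth)
batch_of_64 = tokens_per_second(weight_bytes, bandwidth, batch_size=64)

print(f"one query at a time: ~{single_user:.0f} tokens/s")
print(f"batch of 64 queries: ~{batch_of_64:.0f} tokens/s in total")
```

Each user in the batch still sees roughly the single-user speed, but the same hardware and the same weight reads now serve 64 of them.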

valbaca

How can Google serve 3B users when I can't do one internet search locally? [2001]