
Ironwood: The first Google TPU for the age of inference

nharada

The first TPU specifically designed for inference? Wasn't the original TPU inference-only?

dgacmu

Yup. (Source: was at brain at the time.)

Also holy cow that was 10 years ago already? Dang.

Amusing bit: The first TPU design was based on fully connected networks; the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.

So maybe it's reasonable to say that this is the first TPU designed for inference in the world where you have both a matrix multiply unit and an embedding processor.

(Also, the first gen was purely a co-processor, whereas the later generations included their own network fabric, a trait shared by this most recent one. So it's not totally crazy to think of the first one as a very different beast.)

miki123211

Wow, you guys needed a custom ASIC for inference before CNNs were even invented?

What were the use cases like back then?

huijzer

According to a Google blog post from 2016 [1], the use cases were RankBrain (to improve the relevancy of search results) and Street View. They also used it for AlphaGo. And from what I remember from my MSc thesis, they were probably also starting to use it for Translate. I can't find any TPU reference in Attention Is All You Need or in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, but I was fine-tuning BERT on a TPU at the time, in Oct 2018 [2]. If I remember correctly, the BERT example repository showed how to fit a model with a TPU inside a Colab. So I would guess that natural language research was mostly not on TPUs around 2016-2018, but then moved over to TPUs in production. I could be wrong though, and dgacmu probably knows more.

[1]: https://cloud.google.com/blog/products/ai-machine-learning/g...

[2]: https://github.com/rikhuijzer/improv/blob/master/runs/2018-1...

dekhn

As an aside, Google used CPU-based machine learning (using enormous numbers of CPUs) for a long time before custom ASICs or TensorFlow even existed.

The big ones were SmartASS (ads serving) and Sibyl (everything else serving). There was an internal debate over the value of GPUs, with a prominent engineer writing an influential doc that caused Google to continue with fat CPU nodes when it was clear that accelerators were a good alternative. This was around the time ImageNet blew up, and some engineers were stuffing multiple GPUs in their dev boxes to demonstrate training speeds on tasks like voice recognition.

Sibyl was a heavy user of embeddings before there was any real custom ASIC support for that, and there was an add-on for TPUs called barnacore that gave limited embedding support (embeddings are very useful for maximizing profit through ranking).

refulgentis

https://research.google/blog/the-google-brain-team-looking-b... is a good overview

I wasn't on Brain, but got obsessed with the Kremlinology of ML internally at Google because I wanted to know why leadership was so gung ho on it.

The general sense in the early days was that these things can learn anything, and they'll replace fundamental units of computing. This thought process is best exhibited externally by e.g. https://research.google/pubs/the-case-for-learned-index-stru...

It was also a different Google, the "3 different teams working on 3 different chips" bit reminds me of lore re: how many teams were working on Android wearables until upper management settled it.

FWIW it's a very, very different company now. Back then it was more entrepreneurial. A better version of Wave-era, where things launch themselves. An MBA would find this top-down company in 2025 even better; I find it less so - it's perfectly tuned to do what Apple or OpenAI did 6-12 months ago, but not to lead. Almost certainly a better investment, but a worse version of an average workplace, because it hasn't developed antibodies against BSing. (disclaimer: worked on Android)

kleiba

> the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.

Certainly, RNNs are much older than TPUs?!

woodson

So are CNNs, but I guess their popularity heavily increased at that time, to the point where it made sense to optimize the hardware for them.

hyhjtgh

RNNs were of course well known at the time, but they weren't putting out state-of-the-art numbers back then.

theptip

The phrasing is very precise here, it’s the first TPU for _the age of inference_, which is a novel marketing term they have defined to refer to CoT and Deep Research.

yencabulator

As a previous boss liked to say, this car is the cheapest in its price range, the roomiest in its size category, and the fastest in its speed group.

dang

Ugh. We should have caught that.

Can anyone suggest a better (i.e. more accurate and neutral) title, devoid of marketing tropes?

shitpostbot

They didn't though?

> first designed specifically for inference. For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads...

What do they think serving is? I think this marketing copy was written by someone with no idea what they are talking about, and not reviewed by anyone who did.

Also funny enough it kinda looks like they've scrubbed all their references to v4i, where the i stands for inference. https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf

jeffbee

Yeah that made me chuckle, too. The original was indeed inference-only.

m00x

The first one was designed as a proof of concept that it would work at all, not really to be optimal for inference workloads. It just turns out that inference is easier.

no_wizard

Some honest competition in the chip space in the machine learning race! Genuinely interested to see how this ends up playing out. Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.

I know they aren't selling the TPU as boxed units, but still, even as hardware that backs GCP services and whatnot, it's interesting to see how it'll shake out!

epolanski

> Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.

Did it?

Both Mistral's Le Chat (running on Cerebras) and Google's Gemini (running on TPUs) clearly showed ages ago that Nvidia had no advantage at all in inference.

The hundreds of billions spent on hardware so far have focused on training, but inference is in the long run going to get the lion's share of the work.

wyager

> but inference is in the long run going to get the lion's share of the work.

I'm not sure - might not the equilibrium state be that we are constantly fine-tuning models with the latest data (e.g. social media firehose)?

NoahZuniga

The head of Groq said that in his experience at Google, training was less than 10% of compute.

throwaway48476

It's hard to be excited about hardware that will only exist in the cloud before shredding.

crazygringo

You can't get excited about lower prices for your cloud GPU workloads thanks to the competition it brings to Nvidia?

This benefits everyone, even if you don't use Google Cloud, because of the competition it introduces.

01HNNWZ0MV43FF

I like owning things

sodality2

Cloud providers will buy fewer NVDA chips, and since they're related goods, prices will drop.

xadhominemx

You own any GB200s?

karmasimida

Oh, you own the generator for your GPU as well?

null

[deleted]

baobabKoodaa

You will own nothing and you will be happy.

throwaway48476

[flagged]

maxrmk

I love to hate on Google, but I suspect this is strategic enough that they won't kill it.

Like Graviton at AWS, it's as much a negotiation tool as it is a technical solution, letting them push harder on pricing with NVIDIA because they have a backup option.

joshuamorton

Google's been doing custom ML accelerators for 10 years now, and (depending on how much you're willing to stretch the definition) has been doing them in consumer hardware for soon to be five years (the Google Tensor chips in Pixel phones).

null

[deleted]

foota

Personally, I have a (non-functional) TPU sitting on my desk at home :-)

prometheon1

You don't find news about quantum computers exciting at all? I personally disagree

justanotheratom

Exactly. I wish Groq would start selling the cards they use internally.

xadhominemx

They would lose money on every sale

p_j_w

I think this article is for Wall Street, not Silicon Valley.

mycall

Bad timing as I think Wall Street is preoccupied at the moment.

SquareWheel

Must be part of that Preoccupy Wall Street movement.

asdfman123

Oh, believe me, they are very much paying attention to tech stocks right now.

killerstorm

It might also be for people who consider working for Google...

noitpmeder

What's their use case?

fennokin

As in for investor sentiment, not literally finance companies.

amelius

Gambling^H^H^H^H Making markets more "efficient".

jeffbee

[flagged]

CursedSilicon

Please don't make low-effort bait comments. This isn't Reddit

varelse

[dead]

nehalem

Not knowing much about special-purpose chips, I would like to understand whether chips like this would give Google a significant cost advantage over the likes of Anthropic or OpenAI when offering LLM services. Is similar technology available to Google's competitors?

heymijo

GPUs: very good for pretraining, inefficient for inference.

Why?

For each new word a transformer generates, it has to move the entire set of model weights from memory to the compute units. For a 70-billion-parameter model with 16-bit weights, that means moving approximately 140 gigabytes of data to generate a single word.

GPUs have off-chip memory. That means a GPU has to push data across a chip-memory bridge for every single word it generates. This architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.
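
To make the bandwidth point concrete, here is a rough back-of-envelope sketch in Python (the 70B/16-bit figures come from the paragraph above; the ~3.35 TB/s HBM bandwidth is an assumed value for a current high-end GPU, not a quoted spec):

    params = 70e9          # 70B-parameter model
    bytes_per_weight = 2   # 16-bit weights

    bytes_per_token = params * bytes_per_weight  # weight traffic per decode step at batch size 1
    hbm_bandwidth = 3.35e12                      # assumed ~3.35 TB/s for a high-end GPU

    print(f"{bytes_per_token / 1e9:.0f} GB per token")                     # 140 GB
    print(f"~{hbm_bandwidth / bytes_per_token:.0f} tokens/s upper bound")  # ~24 tokens/s, bandwidth-bound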

Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat: he is the founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.

[0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...

latchkey

Cerebras (and Groq) has the problem of using too much die for compute and not enough for memory. Their method of scaling is to fan out the compute across more physical space. This takes more data center space, power, and cooling, which is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are for training, not inference. They just market it as an inference product, which is even more confusing to me.

I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.

usatie

Thank you for sharing this perspective — really insightful. I’ve been reading up on Groq’s architecture and was under the impression that their chips dedicate a significant portion of die area to on-chip SRAM (around 220MiB per chip, if I recall correctly), which struck me as quite generous compared to typical accelerators.

From die shots and materials I’ve seen, it even looks like ~40% of the die might be allocated to memory [1]. Given that, I’m curious about your point on “not enough die for memory” — is it a matter of absolute capacity still being insufficient for current model sizes, or more about the area-bandwidth tradeoff being unbalanced for inference workloads? Or perhaps something else entirely?

I’d love to understand this design tension more deeply, especially from someone with a high-level view of real-world deployments. Thanks again.

[1] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads — Fig. 5. Die photo of 14nm ASIC implementation of the Groq TSP. https://groq.com/wp-content/uploads/2024/02/2020-Isca.pdf

heymijo

> they told me their largest customers are for training, not inference

That is curious. Things are moving so quickly right now. I typed out a few speculative sentences then went ahead and asked an LLM.

Looks like Cerebras is responding to the market and pivoting towards a perceived strength of their product combined with the growth in inference, especially with the advent of reasoning models.

ein0p

Several incorrect assumptions in this take. For one thing, 16-bit is not necessary. For another, 140GB/token holds only if your batch size is 1 and your sequence length is 1 (no speculative decoding). Nobody runs LLMs like that on those GPUs - if you do, compute utilization becomes ridiculously low. With a batch size greater than 1 and speculative decoding, the arithmetic intensity of the kernels is much higher, and having weights "off chip" is not that much of a concern.
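
A tiny sketch of the batching point (illustrative numbers only; the 8-bit weight size is an assumption): the weights are streamed from memory once per decode step and reused across every sequence in the batch, so per-token weight traffic shrinks roughly linearly with batch size.

    weight_bytes = 70e9 * 1  # 70B params at 8 bits per weight (assumption)

    for batch in (1, 8, 64):
        gb_per_token = weight_bytes / batch / 1e9
        print(f"batch {batch:>2}: ~{gb_per_token:.1f} GB of weight reads per generated token")
    # batch  1: ~70.0 GB, batch  8: ~8.8 GB, batch 64: ~1.1 GB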

hanska

The Groq interview was good too. Seems that the thought process is that companies like Groq/Cerebras can run the inference, and companies like Nvidia can keep/focus on their highly lucrative pretraining business.

https://www.youtube.com/watch?v=xBMRL_7msjY

pkaye

Anthropic is using Google TPUs. Also jointly working with Amazon on a data center using Amazon's custom AI chips. Also Google and Amazon are both investors in Anthropic.

https://www.datacenterknowledge.com/data-center-chips/ai-sta...

https://www.semafor.com/article/12/03/2024/amazon-announces-...

avrionov

NVIDIA operates at around a 70% margin right now. Not paying that premium and having an alternative to NVIDIA is beneficial. We just don't know by how much.

kccqzy

I might be misremembering here, but Google's own AI models (Gemini) don't use NVIDIA hardware in any way, training or inference. Google bought a large number of NVIDIA hardware only for Google Cloud customers, not themselves.

xnx

Google has a significant advantage over other hyperscalers because Google's AI data centers are much more compute cost efficient (capex and opex).

claytonjy

Because of the TPUs, or due to other factors?

What even is an AI data center? are the GPU/TPU boxes in a different building than the others?

summerlight

Lots of other factors. I suspect this is one of the reasons why Google cannot offer the TPU hardware itself outside of their cloud service. A significant chunk of TPU efficiency can be attributed to external factors which customers cannot easily replicate.

xnx

> Because of the TPUs, or due to other factors?

Google does many pieces of the data center better. Google TPUs use 3D torus networking and are liquid cooled.

> What even is an AI data center?

Being newer, AI installations have more variations/innovation than traditional data centers. Google's competitors have not yet adopted all of Google's advances.

> are the GPU/TPU boxes in a different building than the others?

Not that I've read. They are definitely bringing on new data centers, but I don't know if they are initially designed for pure-AI workloads.

cavisne

Nvidia has ~60% margins on their datacenter chips, so TPUs have quite a bit of headroom to save Google money without being as good as Nvidia GPUs.

No one else has access to anything similar; Amazon is just starting to scale their Trainium chip.

buildbot

Microsoft has the MAIA 100 as well. No comment on their scale/plans though.

null

[deleted]

baby_souffle

There are other AI/LLM 'specific' chips out there, yes. But the thing about ASICs is that you need one for each *specific* task. Eventually we'll hit an equilibrium, but for now the stuff that Cerebras is best at is not what TPUs are best at is not what GPUs are best at…

monocasa

I don't even know if eventually we'll hit an equilibrium.

The end of Moore's law pretty much dictates specialization; it just shows up first in fields without as much ossification.

fancyfredbot

It looks amazing, but I wish we could stop playing silly games with benchmarks. Why compare fp8 performance in Ironwood to architectures which don't support fp8 in hardware? Why leave TPUv6 out of the comparison?

Why compare fp64 flops in the El Capitan supercomputer to fp8 flops in the TPU pod when you know full well these are not comparable?

[Edit: it turns out that El Capitan is actually faster when compared like for like and the statement below underestimated how much slower fp64 is, my original comment in italics below is not accurate] (The TPU would still be faster even allowing for the fact fp64 is ~8x harder than fp8. Is it worthwhile to misleadingly claim it's 24x faster instead of honestly saying it's 3x faster? Really?)

It comes across as a bit cheap. Using misleading statements is a tactic for snake oil salesmen. This isn't snake oil so why lower yourself?

fancyfredbot

It's even worse than I thought. El Capitan has 43,808 MI300A APUs. According to AMD, each MI300A can do 3922TF of sparse FP8, for a total of 171EF of sparse FP8 performance, or ~86EF non-sparse.

In other words El Capitan is between 2 and 4 times as fast as one of these pods, yet they claim the pod is 24x faster than El Capitan.
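
Checking that arithmetic (using the figures quoted above, plus the usual convention that dense FP8 throughput is half the sparse number):

    apus = 43_808              # MI300A count quoted above for El Capitan
    sparse_fp8_tflops = 3_922  # per-APU sparse FP8, per the AMD figure quoted above

    sparse_total_ef = apus * sparse_fp8_tflops / 1e6     # TFLOPS -> EFLOPS
    print(f"sparse FP8: ~{sparse_total_ef:.0f} EF")      # ~172 EF
    print(f"dense FP8:  ~{sparse_total_ef / 2:.0f} EF")  # ~86 EF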

dekhn

Google shouldn't do that comparison. When I worked there, I strongly emphasized to the TPU leadership not to compare their systems to supercomputers - not only were the comparisons misleading, but Google absolutely does not want supercomputer users to switch to TPUs. SC users are demanding and require huge support.

meta_ai_x

Google needs to sell to enterprise customers, and this is a Google Cloud event. Of course they have incentives to hype, because once long-term contracts are signed you lose that customer forever. So hype is a necessity.

shihab

I went through the article and it seems you're right about the comparison with El Capitan. These performance figures are so bafflingly misleading.

And so unnecessary too - nobody shopping for an AI inference server cares at all about its relative performance vs. an fp64 machine. This language seems designed solely to wow tech-illiterate C-suites.

imtringued

Also, there is no such thing as a "El Capitan pod". The quoted number is for the entire supercomputer.

My impression from this is that they are too scared to say that their TPU pod is equivalent to 60 GB200 NVL72 racks in terms of fp8 flops.

I can only assume that they need way more than 60 racks and they want to hide this fact.

jeffbee

A max-spec v5p deployment, at least the biggest one they'll let you rent, occupies 140 racks, for reference.

aaronax

8960 chips in those 140 racks. $4.20/hour/chip or $4,066/month/chip

So $68k per hour or $27 million per month.

Get 55% off with 3 year commitment.

adrian_b

FP64 is more like 64 times harder than FP8.

Actually the cost is even higher than that, because the cost ratio is not much less than the square of the ratio between the sizes of the significands, which in this case is 52 bits / 4 bits = 13, and 13 squared is 169.
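
A quick check of that heuristic (the 52-bit and 4-bit significand widths are the ones used in the comment above; treating multiplier cost as roughly quadratic in significand width is a rule of thumb, not a measured figure):

    fp64_significand = 52  # explicit FP64 mantissa bits, as above
    fp8_significand = 4    # FP8 significand width used above

    ratio = fp64_significand / fp8_significand
    print(ratio, ratio ** 2)  # 13.0 169.0 -- roughly two orders of magnitude per multiply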

christkv

Memory size and bandwidth go up a lot, right?

zipy124

Because it is a public company that aims to maximise shareholder value and thus the value of its stock. Since value is largely a matter of perception, if you can convince people your product is better than it is, your stock valuation, at least in the short term, will be higher.

Hence Tesla saying FSD and robo-taxis are 1 year away, the fusion companies saying fusion is closer than it is etc....

Nvidia, AMD, Apple, and Intel have all been publishing misleading graphs for decades, and even under constant criticism they continue to.

fancyfredbot

I understand the value of perception.

A big part of my issue here is that they've really messed up the misleading benchmarks.

They've failed to compare to the most obvious alternative, which is Nvidia GPUs. They look like they've got something to hide, not like they're ahead.

They've needlessly made their own current products look bad in comparison to this one, understating the long-standing advantage TPUs have given Google.

Then they've gone and produced a misleading comparison to the wrong product (who cares about El Capitan? I can't rent that!). This is a waste of credibility. If you are going to go with misleading benchmarks then at least compare to something people care about.

zipy124

That's fair enough, I agree with all your points :)

segmondy

Why not? If we line up to race, you can't say "why compare a V8 to a V6 turbo or an electric engine?" It's a race; the drivetrain doesn't matter. Who gets to the finish line first?

No one is shopping for GPUs by fp8, fp16, fp32, or fp64. It's all about the cost/performance factor. 8 bits is as good as 32 bits, and great performance is even being pulled out of 4 bits...

fancyfredbot

This is like saying I'm faster because I ran (a mile) in 8 minutes whereas it took you 15 minutes (to run two miles).

scottlamb

I think it's more like saying I ran a mile in 8 minutes whereas it took you 15 minutes to run the same distance, but you weigh twice what I do and also can squat 600 lbs. Like, that's impressive, but it's sure not helping your running time.

Dropping the analogy: f64 multiplication is a lot harder than f8 multiplication, but for ML tasks it's just not needed. f8 multiplication hardware is the right tool for the job.

charcircuit

> Why compare fp8 performance in Ironwood to architectures which don't support fp8 in hardware?

Because end users want to use fp8. Why should architectural differences matter when the speed is what matters at the end of the day?

bobim

GP bikes are faster than dirt bikes, but not on dirt. The context has some influence here.

MegaAgenticAI

More details on Ironwood, Cloud TPUs and insights from Jeff Dean: https://www.youtube.com/watch?v=fNjH5izFeyw

gigel82

I was hoping they'd launch a Coral-style device that can run locally and cheaply, with updated specs.

It would be awesome for things like homelabs (to run Frigate NVR, Immich ML tasks or the Home Assistant LLM).

_hark

Can anyone comment on where efficiency gains come from these days at the arch level? I.e. not process-node improvements.

Are there a few big things, many small things...? I'm curious what fruit are left hanging for fast SIMD matrix multiplication.

vessenes

One big area over the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use f64 for everything, or did. Most training was done at f32 four years ago. As algo teams have shown that fp8 can be used for training and inference, hardware has updated to accommodate, yielding big gains.

NB: Hobbyist, take all with a grain of salt

jmalicki

Unlike a lot of supercomputer algorithms, where floating-point error accumulates as you go, gradient-descent-based algorithms don't need as much precision, since any fp errors will still show up at the next loss calculation and get corrected there. That lets you make do with much lower precision.
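
A toy illustration of that self-correcting behavior (my own sketch, not from the thread): fit a one-parameter model with gradient descent while crudely rounding every update to a coarse grid as a stand-in for low-precision arithmetic. The fit still lands near the true value because each step's rounding error is visible to the next gradient.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 3.0 * x + 0.1 * rng.normal(size=1000)  # true weight is 3.0

    def quantize(v, step=0.05):
        # crude stand-in for low-precision arithmetic: snap to a coarse grid
        return np.round(v / step) * step

    w, lr = 0.0, 0.1
    for _ in range(200):
        grad = np.mean(2 * (w * x - y) * x)    # gradient of the MSE loss
        w = quantize(w - lr * quantize(grad))  # coarsely rounded update

    print(w)  # ends up near 3.0 despite rounding at every step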

cubefox

Much lower indeed. Even Boolean functions (e.g. AND) are differentiable (though not exactly in the Newton/Leibniz sense), which can be used for backpropagation. They allow for an optimizer similar to stochastic gradient descent. There is a paper on it: https://arxiv.org/abs/2405.16339

It seems to me that floating point math (matrix multiplication) will over time mostly disappear from ML chips, as Boolean operations are much faster in both training and inference. But currently chips are still optimized for FP rather than Boolean operations.

muxamilian

In-memory computing (analog or digital). Still doing SIMD matrix multiplication but using more efficient hardware: https://arxiv.org/html/2401.14428v1 https://www.nature.com/articles/s41565-020-0655-z

gautamcgoel

This is very interesting, but not what the Ironwood TPU is doing. The blog post says that the TPU uses conventional HBM RAM.

nsteel

There's been some talk/rumour of next-gen HBMs having some compute capability on the base die. But again, that's not what they're doing here; this is regular HBM3/HBM3e.

https://semiengineering.com/speeding-down-memory-lane-with-c...

yeahwhatever10

Specialization, i.e. specialized for inference.

tuna74

How is the API story for these devices? Are the drivers mainlined in Linux? Is there a specific API you use to code for them? What does the instance you rent on Google Cloud look like, and what software does it come with?

cbarrick

XLA (Accelerated Linear Algebra) [1] is likely the library that you'll want to use to code for these machines.

TensorFlow, PyTorch, and JAX all support XLA on the backend.

[1]: https://openxla.org/
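
A minimal JAX sketch of what that looks like in practice (standard public JAX API, nothing specific to this TPU generation): jax.jit hands the traced computation to XLA, which compiles it for whatever backend is attached, whether CPU, GPU, or a Cloud TPU slice.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def layer(x, w):
        # XLA fuses the matmul and the activation into one compiled kernel
        return jax.nn.relu(x @ w)

    x = jnp.ones((8, 1024))
    w = jnp.ones((1024, 1024))

    print(jax.devices())      # lists TpuDevice entries on a TPU VM, otherwise CPU/GPU
    print(layer(x, w).shape)  # (8, 1024)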

ndesaulniers

They have out-of-tree drivers. If they don't ship the hardware to end users, it's not clear that upstream (the Linux kernel) would want them.

DisjointedHunt

Cloud resources are trending towards consumer technology adoption numbers rather than being reserved mostly for Enterprise. This is the most exciting thing in decades!

There is going to be a GPU/accelerator shortage for the foreseeable future to run the most advanced models; Gemini 2.5 Pro is a good example. It is probably the first model on which many developers I'd considered skeptics of extended agent use have started to saturate free token thresholds.

Grok is honestly the same, but the lack of an API is suggestive of the massive demand wall they face.

lawlessone

Can these be repurposed for other things? Encoding/decoding video? Graphics processing etc?

edit: >It’s a move from responsive AI models that provide real-time information for people to interpret, to models that provide the proactive generation of insights and interpretation. This is what we call the “age of inference” where AI agents will proactively retrieve and generate data to collaboratively deliver insights and answers, not just data.

Maybe I will sound like a Luddite, but I'm not sure I want this.

I'd rather AI/ML only do what I ask it to.

vinkelhake

Google already has custom ASICs for video transcoding. YouTube has been running those for many years now.

https://streaminglearningcenter.com/encoding/asics-vs-softwa...

lawlessone

Thank you :)

cavisne

The JAX docs have a good explanation for how a TPU works

https://docs.jax.dev/en/latest/pallas/tpu/details.html#what-...

It's not really useful for other workloads (unless your workload looks like a bunch of matrix multiplications).

null

[deleted]