100x defect tolerance: How we solved the yield problem
77 comments
· January 15, 2025 · ChuckMcM
enragedcacti
Any thoughts on why they are disabling so many cores in their current product? I did some quick noodling based on the 46/970000 number and the only way I ended up close to 900,000 was by assuming that an entire row or column would be disabled if any core within it was faulty. But doing that gave me a ~6% yield as most trials had active core counts in the high 800,000s
__Joker
"While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage"
Can you please explain more why you think so ?
Thank you.
ajb
So they massively reduce the area lost to defects per wafer, from 361 to 2.2 square mm. But from the figures in this blog, this is massively outweighed by the fact that they only get 46222 sq mm useable area out of the wafer, as opposed to 56247 that the H100 gets - because they are using a single square die instead of filling the circular wafer with smaller square dies, they lose 10,025 sq mm!
Not sure how that's a win.
Unless the rest of the wafer is useable for some other customer?
nine_k
It's a win because they have to test one chip, and don't have to spend resources on connecting the chiplets. The latter costs a lot (though it has other advantages). I suspect that a chiplet-based device with total 900k cores would just be not viable due to the size constraints.
If their routing around the defects is automated enough (given the highly regular structure), it may be a massive economy of efforts on testing and packaging the chip.
olejorgenb
Is the wafer itself so expensive? I assume they don't pattern the unused area, so the process should be quicker?
addaon
> I assume they don't pattern the unused area
I’m out of date on this stuff, so it’s possible things have changed, but I wouldn’t make that assumption. It is (used to be?) standard to pattern the entire wafer, with partially-off-the-wafer dice around the edges of the circle. The reason for this is that etching behavior depends heavily on the surrounding area: the amount of silicon or copper or whatever etched in your neighborhood affects the speed of etching for you, which affects line width, and (for a single mask used for the whole wafer) thus either means you need to have more margin on your parameters (equivalent to running on an old process) or have a higher defect rate near the edge of the die (which you do anyway, since you can only take “similar neighborhood” so far). This goes as far as, for hyper-optimized things like SRAM arrays, leaving an unused row and column at each border of the array.
yannyu
> I assume they don't pattern the unused area, so the process should be quicker?
The primary driver of time and cost in the fabrication process is the number of layers for the wafers, not the surface area, since all wafers going through a given process are the same size. So you generally want to maximize the number of devices per wafer, because a large part of your costs will be calculated at the per-wafer level, not a per-device level.
mattashii
Yes, but isn't a big driver of layer costs the cost of the machines to build those layers?
For patterning, a single iteration could be (example values, no actual values used, probably only ballpark accuracy) on a $300M EUV machine with a 5-year write-off cycle, patterning on average 180 full wafers per hour. Excluding energy usage and service time, each wafer that needs full patterning would cost ~$38. If each wafer only needed half the area patterned, the lithography machine might only spend half its usual time on such a wafer, and that could double the throughput of the EUV machine, halving the write-off based cost component of such a patterning step.
Given that each layer generally consists of multiple patterning steps, a 10-20% reduction in those steps could give a meaningful reduction in time spent in the machines whose time spent on the wafer depends on the used wafer area.
This of course doesn't help reduce time in polishing or etching (and other steps that happen with whole wafers at a time), so it won't be as straightforward as % reduction in wafer area usage == % reduction in cost, but I wouldn't be surprised if it was a meaningful percentage.
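The write-off arithmetic above can be sketched in a few lines. The machine price, write-off period, and throughput are the comment's own example values, not real figures, and the function name is just illustrative. Energy, service time, and downtime are ignored, as in the comment.

```python
HOURS_PER_YEAR = 24 * 365


def writeoff_cost_per_wafer(machine_cost_usd, writeoff_years, wafers_per_hour):
    """Depreciation charged to a single wafer pass through the machine."""
    wafers_over_lifetime = wafers_per_hour * HOURS_PER_YEAR * writeoff_years
    return machine_cost_usd / wafers_over_lifetime


# Example values from the comment: $300M machine, 5 years, 180 wafers/hour.
full_wafer = writeoff_cost_per_wafer(300e6, 5, 180)

# Assumption: patterning half the area doubles throughput to 360 wafers/hour.
half_wafer = writeoff_cost_per_wafer(300e6, 5, 360)

print(f"full wafer: ${full_wafer:.2f} per pass")
print(f"half wafer: ${half_wafer:.2f} per pass")
```

This only covers the depreciation share of one exposure pass; the whole-wafer steps mentioned in the comment are not modelled.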
olejorgenb
Yes, but my understanding is that the wafer is exposed in multiple steps, so there would still be fewer exposure steps? Probably insignificant compared to all the rest though. (Etching, moving the wafer, etc.)
EDIT: to clarify - I mean the exposure of one single pattern/layer is done in multiple steps. (https://en.wikipedia.org/wiki/Photolithography#Projection)
ajb
Good question. I think the wafer has a cost per area which is fairly significant, but I don't have any figures. There has historically been a push to utilise them more efficiently, e.g. by building fabs that can process larger wafers. Although mask exposure would be per processed area, I think that there is also some proportion of processing time which is per wafer, so the unprocessed area would have an opportunity cost relating to that.
kristjansson
AIUI Wafer marginal cost is lower than you'd expect. I had $50k in my head, quick google indicates[1] maybe <$20k at AAPL volumes? Regardless seems like the economics for Cerebras would strongly favor yield over wafer area utilization.
[1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-prici...
pulvinar
There's also no reason they couldn't pattern that area with some other suitable commodity chips. Like how sawmills and butchers put all cuts to use.
sitkack
Often those areas are used for test chips and structures for the next version. They are effectively free, so you can use them to test out ideas.
georgeburdell
They probably pattern at least next nearest neighbors for local uniformity. That’s just litho though. The rest of the process is done all at once on the wafer
Scaevolus
Why does their chip have to be rectangular, anyways? Couldn't they cut out a (blocky) circle too?
yannyu
The cost driver for fabbing out wafers is the number of layers and the number of usable devices per wafer. Higher layer count increases cost and tends to decrease yield, and more robust designs with higher yields increase usable devices per wafer. If circles or other shapes could help with either of those, they would likely be used. Generally the end goal is to have the most usable devices per wafer, so they'll be packed as tightly as possible on the wafer so as to have the highest potential output.
nine_k
Rather I wonder why do they even need to cut the extra space, instead of putting something there. I suppose that the structure of the device is highly rectangular from the logical PoV, so there's nothing useful to put there. I suspect smaller unrelated chips can be produced on these areas along the way.
kristjansson
Additional wafer area would be a marginal increase in performance (+~20% core count best case) but increases the complexity of their design, and requires they figure out how to package/connect/house/etc. a non-standard shape. A wafer scale chip is already a huge tech risk, why spend more novelty budget on nonessential weirdness?
ungreased0675
Why does it have to be a square? There’s no need to worry about interchangeable third-party heat sink compatibility. Is it possible to make it an irregular polygon instead of square?
sroussey
It’s a win if you can use the wafer as opposed to throwing it away.
kristjansson
A win is a manufacturing process that results in a functioning product. Wafers, etc. aren't so scarce as to demand every mm2 be used on every one every time.
NickHoff
Neat. What about power density?
An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2 that's 0.86 W/mm^2. If the Cerebras chip has the same power density, that means a Cerebras TDP of 39.8 kW.
That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?
amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams
energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ
time = 154 kJ / 39.8 kW = 3.9 seconds
This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
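The back-of-envelope numbers above can be reproduced with a short script. It uses the comment's inputs (H100 TDP and die size, the 46225 mm^2 wafer die, 1 cm of water, a 20 C to 100 C rise) and assumes a specific heat of 4.186 J/(g K) and that all of the power goes into the water with no losses.

```python
H100_TDP_W = 700.0
H100_DIE_MM2 = 814.0
WAFER_DIE_MM2 = 46225.0

WATER_DEPTH_MM = 10.0               # 1 cm of water over the die
WATER_DENSITY_G_PER_MM3 = 1e-3      # 1 g/cm^3
SPECIFIC_HEAT_J_PER_G_K = 4.186     # assumed constant over the range
DELTA_T_K = 80.0                    # 20 C -> 100 C

# Scale the H100 power density up to the whole wafer-scale die.
power_density_w_per_mm2 = H100_TDP_W / H100_DIE_MM2
wafer_power_w = power_density_w_per_mm2 * WAFER_DIE_MM2

# Heat needed to bring the water layer to its boiling point.
water_mass_g = WAFER_DIE_MM2 * WATER_DEPTH_MM * WATER_DENSITY_G_PER_MM3
energy_j = SPECIFIC_HEAT_J_PER_G_K * DELTA_T_K * water_mass_g

seconds_to_boiling_point = energy_j / wafer_power_w

print(f"power density: {power_density_w_per_mm2:.2f} W/mm^2")
print(f"wafer power:   {wafer_power_w / 1e3:.1f} kW")
print(f"water mass:    {water_mass_g:.0f} g")
print(f"energy:        {energy_j / 1e3:.0f} kJ")
print(f"time:          {seconds_to_boiling_point:.1f} s")
```

Note that this is the time to reach 100 C, not to boil the water away.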
Paul_Clayton
The enthalpy of vaporization of water (at standard pressure) is listed by Wikipedia[1] as 2.257 kJ/g, so boiling 462 grams would require an additional 1.04 MJ, adding 26 seconds. Cerebras claims a "peak sustained system power of 23kW" for the CS-3 16 Rack Unit system[2], so clearly the power density is lower than for an H100.
[1] https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other... [2] https://cerebras.ai/product-system/
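A quick check of the vaporization figure, as a sketch: it takes the 462 g of water and the 2.257 kJ/g enthalpy quoted above, and compares the hypothetical 39.8 kW from the H100-density estimate with the 23 kW system figure Cerebras quotes. It assumes, unrealistically, that all of the power goes into the water.

```python
LATENT_HEAT_J_PER_G = 2257.0   # enthalpy of vaporization at standard pressure
WATER_MASS_G = 462.0

vaporization_energy_j = LATENT_HEAT_J_PER_G * WATER_MASS_G


def seconds_to_vaporize(power_w):
    """Time to turn the already-boiling water into steam at a given power."""
    return vaporization_energy_j / power_w


print(f"energy to vaporize: {vaporization_energy_j / 1e6:.2f} MJ")
print(f"at 39.8 kW: {seconds_to_vaporize(39.8e3):.1f} s")
print(f"at 23 kW:   {seconds_to_vaporize(23e3):.1f} s")
```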
buildbot
A Very Fancy cooling engine: https://www.eetimes.com/powering-and-cooling-a-wafer-scale-d...
throwup238
The machine that actually holds one of their wafers is almost as impressive as the chip itself. Tons of water cooling channels and other interesting hardware for cooling.
jwan584
A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...
flopsamjetsam
Minor correction, the keynote video says ~20 kW
lostlogin
If rack mounted, you are ending up with something like a reverse power station.
So why not use it as an energy source? Spin a turbine.
renhanxue
There's a bunch of places in Europe that use waste heat from datacenters in district heating systems. Same thing with waste heat from various industrial processes. It's relatively common practice.
kristjansson
If you let the chip actually boil enough water to run a turbine you're going to have a hard time keeping the magic smoke inside. Much better to run at reasonable temps and try to recover energy from the waste heat.
ericye16
What if you chose a refrigerant with a lower boiling point?
bentcorner
I'm aware of the efficiency losses but I think it would be amusing to use that turbine to help power the machine generating the heat.
sebzim4500
If my very stale physics is accurate then even with perfect thermodynamic efficiency you would only recover about a third of the energy that you put into the chips.
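The upper bound on recoverable energy is the Carnot efficiency, 1 - T_cold / T_hot with temperatures in kelvin, so the figure depends entirely on the temperatures assumed. The sketch below uses hypothetical temperatures (20 C ambient, 100 C coolant) that are not from the thread, and also solves for the hot-side temperature that would give one third.

```python
KELVIN_OFFSET = 273.15


def carnot_efficiency(t_hot_c, t_cold_c):
    """Ideal heat-engine efficiency between two reservoirs given in Celsius."""
    t_hot_k = t_hot_c + KELVIN_OFFSET
    t_cold_k = t_cold_c + KELVIN_OFFSET
    return 1.0 - t_cold_k / t_hot_k


def hot_side_for_efficiency(efficiency, t_cold_c):
    """Hot-side temperature (Celsius) needed to reach a target Carnot efficiency."""
    t_cold_k = t_cold_c + KELVIN_OFFSET
    return t_cold_k / (1.0 - efficiency) - KELVIN_OFFSET


# Assumed temperatures, for illustration only.
print(f"100 C hot side, 20 C ambient: {carnot_efficiency(100, 20):.1%}")
print(f"hot side needed for 1/3:      {hot_side_for_efficiency(1 / 3, 20):.0f} C")
```

With a hot side that stays at chip-safe temperatures, the ideal figure comes out below a third, and a real heat engine would recover less than the ideal.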
dylan604
1/3 > 0, so even if you don't get a $0 energy bill I'd venture that any company that could get 1/3 off its energy bill would be happy
highfrequency
To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant.
Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.
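The "smaller unit, less contamination" point can be sketched with the standard Poisson yield model, where a core of area A survives with probability exp(-D * A) for defect density D. The defect density and the two core areas below are illustrative assumptions, not figures from the thread, and the model ignores the routing and redundancy overhead that small cores require.

```python
import math


def expected_lost_area_mm2(total_area_mm2, defects_per_mm2, core_area_mm2):
    """Expected silicon area disabled when any defect kills its whole core."""
    p_core_hit = 1.0 - math.exp(-defects_per_mm2 * core_area_mm2)
    return total_area_mm2 * p_core_hit


TOTAL_AREA_MM2 = 46225.0   # wafer-scale die area mentioned in the thread
DEFECTS_PER_MM2 = 0.001    # hypothetical defect density

large_cores = expected_lost_area_mm2(TOTAL_AREA_MM2, DEFECTS_PER_MM2, 6.0)
small_cores = expected_lost_area_mm2(TOTAL_AREA_MM2, DEFECTS_PER_MM2, 0.05)

print(f"6 mm^2 cores:    {large_cores:.1f} mm^2 lost")
print(f"0.05 mm^2 cores: {small_cores:.1f} mm^2 lost")
print(f"ratio:           {large_cores / small_cores:.0f}x")
```

For small D * A the lost area is roughly (number of defects) * (core area), so shrinking the core shrinks the loss almost proportionally.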
IshKebab
TSMC also have a manufacturing process used by Tesla's Dojo where you can cut up the chips, throw away the defective ones, and then reassemble working ones into a sort of wafer scale device (5x5 chips for Dojo). Seems like a more logical design to me.
ryao
I had been under the impression that Nvidia had done something similar here, but they did not talk about deploying the space saving design and instead only talked about the server rack where all of the chips on the mega wafer normally are.
https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...
mhh__
Amazing. I clicked a button in the azure deployment menu today...
bcatanzaro
This is a strange blog post. Their tables say:
Cerebras yields 46225 * .93 = 43000 square millimeters per wafer
NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer
I don't know if their numbers are correct but it is a strange thing for a startup to brag that it is worse than a big company at something important.
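For reference, the two products above worked out exactly, as a small sketch. The areas and fractions are the ones quoted in this comment; whether they match the blog's tables is not checked here.

```python
# (area in mm^2, usable fraction) as quoted in the comment above.
tables = {
    "Cerebras": (46225, 0.93),
    "NVIDIA": (58608, 0.92),
}

usable_mm2 = {name: area * fraction for name, (area, fraction) in tables.items()}

for name, area in usable_mm2.items():
    print(f"{name}: {area:.0f} mm^2 usable per wafer")

gap_mm2 = usable_mm2["NVIDIA"] - usable_mm2["Cerebras"]
print(f"gap: {gap_mm2:.0f} mm^2")
```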
anonymousDan
Very interesting. Am I correct in saying that fault tolerance here is with respect to 'static' errors that occur during manufacturing and are straightforward to detect before reaching the customer? Or can these failures potentially occur later on (and be tolerated) during the normal life of the chip?
ryao
> Take the Nvidia H100 – a massive GPU weighing in at 814mm2. Traditionally this chip would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs but the commercialized product only has 132 SMs active. This means the chip could suffer numerous defects across 12 SMs and still be sold as a flagship part.
Fault tolerance seems to be the wrong term to use here. If I wrote this, I would have written redundant.
jjk166
Redundant cores lead to a fault tolerant chip.
exabrial
I have a dumb question. Why isn't silicon sold in cubes instead of cylinders?
amelius
The silicon ingots have a rotating production process that results in cylinders, not bricks.
bigmattystyles
no matter how you orient a circle on a plane, it's the same
Neywiny
Understanding that there's inherent bias by them being competitors of the other companies, but still this article seems to make some stretches. If you told me you had an 8% core defect rate reduced 100x, I'd assume you got to close to 99% enablement. The table at the end shows... Otherwise.
They also keep flipping between cores, SMs, dies, and maybe other block sizes. At the end of the day I'm not very impressed. They seemingly have marginally better yields despite all that effort.
bee_rider
> Second, a cluster of defects could overwhelm fault tolerant areas and disable the whole chip.
That’s an interesting point. In architecture class (which was basic and abstract so I’m sure Cerebras is doing something much more clever), we learned that defects cluster, but this is a good thing. A bunch of defects clustering on one core takes out the core, a bunch of defects not clustering could take out… a bunch of cores, maybe rendering the whole chip useless.
I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?
Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?
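The textbook intuition that clustering is helpful can be illustrated with a toy simulation: the same number of defects disables far fewer distinct cores when they land close together. Everything here is a made-up model (grid size, defect count, cluster shape) and it does not model the interconnect, which is where the blog's concern about clusters would come in.

```python
import random

GRID = 1000  # hypothetical 1000 x 1000 array of cores


def cores_hit_uniform(n_defects, rng):
    """Each defect lands on an independently chosen core."""
    return {(rng.randrange(GRID), rng.randrange(GRID)) for _ in range(n_defects)}


def cores_hit_clustered(n_defects, n_clusters, rng):
    """Each defect lands within one core of a randomly placed cluster centre."""
    centres = [(rng.randrange(GRID), rng.randrange(GRID)) for _ in range(n_clusters)]
    hit = set()
    for k in range(n_defects):
        cx, cy = centres[k % n_clusters]
        # Clamp to the grid so clusters near the edge stay on the array.
        x = min(GRID - 1, max(0, cx + rng.randint(-1, 1)))
        y = min(GRID - 1, max(0, cy + rng.randint(-1, 1)))
        hit.add((x, y))
    return hit


rng = random.Random(0)
uniform = cores_hit_uniform(200, rng)
clustered = cores_hit_clustered(200, 4, rng)

print(f"uniform defects:   {len(uniform)} cores lost")
print(f"clustered defects: {len(clustered)} cores lost")
```

In this model four clusters can never disable more than 4 * 9 = 36 cores, while uniformly scattered defects almost always hit a different core each.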
abrookewood
Looking at the H100 on the left, why is the chip yield (72) based on a circular layout/constraint? Why do they discard all of the other chips that fall outside the circle?
donavanm
AFAIK all wafer ingots are cylinders, which means the wafers themselves are a circular cross section. So manufacturing is bin-packing rectangles into a circle. Plus there are different effects/defects in the chips based on the distance from the edge of the wafer.
So I believe it's the opposite: why are they representing the larger square and implying lower yield off the wafer in space that doesn't practically exist?
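The "rectangles into a circle" count can be sketched with a simple grid test: a die is kept only if its farthest corner lies inside the wafer. This is a minimal sketch with a grid centred on the wafer; it ignores edge exclusion zones, scribe lines, and grid-offset optimisation, and the 28.5 mm square die is a hypothetical stand-in for an ~814 mm^2 chip, not the real H100 dimensions.

```python
import math


def dies_fully_on_wafer(wafer_diameter_mm, die_w_mm, die_h_mm):
    """Count grid-aligned dies whose four corners all fall inside the wafer."""
    radius = wafer_diameter_mm / 2.0
    nx = math.ceil(radius / die_w_mm)
    ny = math.ceil(radius / die_h_mm)
    count = 0
    for i in range(-nx, nx):
        for j in range(-ny, ny):
            # The corner farthest from the wafer centre decides whether the die fits.
            far_x = max(abs(i * die_w_mm), abs((i + 1) * die_w_mm))
            far_y = max(abs(j * die_h_mm), abs((j + 1) * die_h_mm))
            if far_x * far_x + far_y * far_y <= radius * radius:
                count += 1
    return count


# Hypothetical die size on a 300 mm wafer.
print(dies_fully_on_wafer(300, 28.5, 28.5))
```

Dies that straddle the edge are the partial chips mentioned elsewhere in the thread; they are simply not counted here.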
flumpcakes
Because the circle is the physical silicon. Any chips that fall outside the circle are only part of a full chip. They will be physically missing half the chip.
therealcamino
That's just the shape of the wafer. I don't know why the diagram continued the grid outside it.
I think this is an important step, but it skips over that 'fault tolerant routing architecture' means you're spending die space on routes vs transistors. This is exactly analogous to using bits in your storage for error correcting vs storing data.
That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage it benefits from every core being the same and from not needing to get to every core directly (pin limiting).
In the early 2000's I was looking at a wafer scale startup that had the same idea but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch, "we don't have to have all of our GLUs[2] work because the built in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs so they were pin limited overall but the idea is sound.
While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage, I appreciate that the amount of money people are willing to put at risk here allows folks to try these "out of the box" kinds of ideas.
[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.
[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.