An Interview with Zen Chief Architect Mike Clark
18 comments · March 24, 2025

adgjlsfhk1
tobias216893
Second this, truly a great read!
IshKebab
> it also gives us better density overall by having smaller instructions. x86 can put more work into each instruction byte, so we can have denser binaries and increase performance that way.
Is that really true? I was under the impression that ARM code density was better than x86.
The first Google result I found agrees: https://www.bitsnbites.eu/cisc-vs-risc-code-density/
> It may take a little bit more microarchitectural work for us to account for stronger memory order on the x86 side, but of course the software has to account for the weaker memory ordering on the ARM side. So there are tradeoffs.
I don't think this makes sense either since software has to have synchronisation anyway. I don't think much software relies on x86 TSO (intentionally anyway; I'm sure many bugs are covered up by it).
Anyway fantastic interview! Very refreshing when they're actually technical with interesting questions and answers.
adgjlsfhk1
> The first Google result I found agrees: https://www.bitsnbites.eu/cisc-vs-risc-code-density/
This isn't a great metric, for two reasons. First, it's looking at static rather than dynamic instruction counts. Second, x86 tends to get bigger vectorized implementations, but those often operate on wider vectors, so the "instructions per data" ratio, which is the more important metric in many cases, is lower.
https://news.ycombinator.com/item?id=19329128 has a pretty interesting discussion of this from a few years ago. Two places where x86 shines are the lea and mov instructions: both are incredibly compact encodings of potentially very complicated operations.
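To make the lea point concrete, here's a sketch (the function name is made up, and the asm comment describes typical x86-64 codegen rather than guaranteed output):

    // Scaled-index address arithmetic folds into a single lea.
    // A typical x86-64 compiler emits one short instruction here:
    //     lea rax, [rdi + rsi*8 + 16]
    // which computes base + scale*index + displacement in one go,
    // without even touching the flags register.
    long* pick(long* base, long i) {
        return base + i + 2;   // byte address: base + 8*i + 16
    }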
jamessinghal
I'd assume the synchronization primitives in OSes are written with the memory ordering in mind at the least.
IshKebab
Yeah I think this is true - they can be nops on x86. But it's still going to be better to have explicit sync instructions only where needed than implicit sync everywhere.
dundarious
Relying on TSO is not a bug if your target platform is just amd64.
IshKebab
I was thinking about code that relies on TSO but doesn't insert synchronisation primitives for the compiler, where the compiler just happens not to violate the expected ordering. E.g. the code might break if you increased the optimisation level or switched compilers.
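A minimal sketch of that bug class (names are made up; whether it actually breaks depends on the compiler):

    void use(int);          // hypothetical consumer of the data

    int  payload = 0;
    bool ready   = false;   // plain bool, not std::atomic

    void producer() {
        payload = 42;
        ready = true;       // nothing stops the compiler from moving this
                            // above the payload store, even though x86 TSO
                            // would keep the hardware stores in order
    }

    void consumer() {
        while (!ready) {}   // the compiler may hoist this load out of the
                            // loop entirely and spin forever
        use(payload);       // may observe 0 even on x86
    }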
adgjlsfhk1
Generally the way this works is that when you write atomic algorithms, you are doing two things: telling the compiler what it's allowed to optimize, and controlling the processor. For code that relies on TSO (which is pretty close to the C++ memory model), you add annotations that prevent the compiler from doing some optimizations, and then when the compiler generates native code, on x86 the atomics turn into regular loads/stores, but on ARM they come with additional fence instructions.
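A sketch of what that looks like in C++ (the asm notes describe typical codegen, not guaranteed output):

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // Release store: constrains compiler reordering on both ISAs.
        // Typically a plain mov on x86 (TSO already orders the stores);
        // typically stlr, or a store plus a fence, on AArch64.
        ready.store(true, std::memory_order_release);
    }

    int consumer() {
        // Acquire load: plain mov on x86; ldar or load+fence on AArch64.
        while (!ready.load(std::memory_order_acquire)) {}
        return payload;     // guaranteed to be 42 on both ISAs
    }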
dundarious
You can write data-race-free code on x86-64 with just compiler reordering fences and extremely limited sync primitives, but you cannot on most other ISAs.
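For example, a sketch of that pattern (formally still a data race in the C++ model; it relies on x86-TSO hardware behavior and a single producer and consumer):

    #include <atomic>

    int payload = 0;
    volatile bool ready = false;   // volatile: compiler can't elide or reorder it

    void producer() {
        payload = 42;
        // Compiler-only barrier: emits no instruction, just forbids the
        // compiler from sinking the payload store below the flag store.
        // x86 TSO then keeps the two hardware stores in order; a weakly
        // ordered ISA would additionally need a real fence here.
        std::atomic_signal_fence(std::memory_order_release);
        ready = true;
    }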
whizzter
It's an interesting interview but also horribly frustrating.
There's talk about vectors, longer basic blocks, and cache utilization, and how they wish programmers used more of them, but it misses the real world.
Regardless of how "hardcore" programmers feel about it, so much real-world executed code is written in JS, etc. The JITted code is super-branchy (to cater to deopt fallbacks) and won't use vectors at all.
This is something Apple has gotten right with their vertical integration, as they seem to have put more focus on making real-world code go fast. Even most games will have huge swaths of non-vectorized code that would benefit from a more scalar-focused way of optimizing.
Considering the transistor counts in use today, instead of bigger vector units, etc., the budget could be spent on bigger uop buffers, bigger BTBs, and bigger caches to eat up the branchy-but-linear code flows that are the reality for less "optimal" languages.
eska
If you really care about performance, then you should consider using technologies other than JS and Python, instead of asking hardware vendors to run their implementations faster.
Wumpnot
What is the point of making JS go faster? It is already fast enough, even on older computers, for the stuff it is designed for: making crappy UIs.
ahartmetz
Well. At this point it's much easier to make vectorized code go faster. There is absolutely, positively no low-hanging fruit left in executing single-threaded, branchy code. A few percent of improvement takes a huge effort. It's not that they don't want to... it's just really difficult and hits diminishing returns.
levodelellis
Great interview, more please. Is there a way to submit questions? I'd like to know: 1) What affects branch predictors? From my understanding, return statements do and cmov does not. 2) Why isn't there a conditional exception to replace if (!cond) { __builtin_trap(); }?
wtallis
Conditional instructions that don't branch or otherwise interrupt the program flow don't necessarily have to cause any pipeline stalls or bubbles. The CPU can decode a cmov and then carry on decoding subsequent instructions (and pass the cmov on to subsequent pipeline stages) well before it's known whether the condition is true. For a branching instruction, the CPU doesn't know during the early phases what the next instruction after the branch will be, so it has to predict and speculate.
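A sketch of that contrast (whether a compiler actually emits cmov here depends on its heuristics):

    // Data dependency instead of control dependency: this typically
    // compiles to cmp + cmov on x86-64. Decode can stream right past a
    // cmov before the condition resolves; an actual conditional jump
    // would have to be predicted, with a pipeline flush on a mispredict.
    int clamp_at_zero(int x) {
        return x < 0 ? 0 : x;   // no taken/not-taken decision to predict
    }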
> Why isn't there a conditional exception? to replace if (!cond) { __builtin_trap(); }
There's not really any way to make that into something that doesn't branch; the best you can hope for is only one instruction that may branch but hopefully gets predicted accurately. But in the event of an exception, there really does need to be something that can cause the instruction pointer to do something other than advance to the next byte after the current instruction.
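For reference, a sketch of what compilers typically do with that pattern today (the asm comment is typical GCC/Clang output, not guaranteed):

    void check(bool cond) {
        // Typically compiles to something like:
        //     test dil, dil
        //     je   .trap       ; the one branch, hopefully well predicted
        //     ret
        //   .trap:
        //     ud2              ; the actual trapping instruction
        if (!cond) { __builtin_trap(); }
    }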
This interview was great! I really wish there were more of these sorts of interviews published.