Performance of the Python 3.14 tail-call interpreter
31 comments
March 10, 2025
MattPalmer1086
Benchmarking is just insanely hard to do well. There are so many things which can mislead you.
I recently discovered a way to make an algorithm about 15% faster. At least, that's what all the benchmarks said. At some point I duplicated the faster function in my test harness but did not call the faster version, just the original slower one... and it was still 15% faster. So code that never executed sped up the original code! Obviously, this was due to code and memory layout issues: moving something so that it aligned better with some CPU cache.
It's actually really hard to know whether the speedups you get are because your code is genuinely "better" or just because you lucked out with some better alignment somewhere.
Casey Muratori has a really interesting series about things like this on his Substack.
eru
I vaguely remember some benchmarking project that deliberately randomised these compiler decisions, so that it could give you more stable estimates of how well your code actually performed, and not just how well you won or lost the linker lottery.
FridgeSeal
The Coz profiler from Emery Berger.
It can actually go a step further and give you a decent estimate of which functions you need to change to get the desired latency/throughput increases.
Mond_
You're probably thinking of "Performance Matters" by Emery Berger, a Strange Loop talk. https://youtube.com/watch?v=r-TLSBdHe1A
alpaca128
As already mentioned, this is likely Emery Berger's project: the idea is to intentionally slow down different parts of the program, partly to find out which parts are most sensitive to slowdowns (i.e. have the biggest effect on overall performance), with the assumption that these are also the parts that profit the most from optimisations.
MattPalmer1086
There was Stabilizer [1], which did this, although it is no longer maintained and doesn't work with modern versions of LLVM. I think there is something more current that automates this now, but I can't remember what it's called.
porridgeraisin
That linker lottery led to a 15% improvement? I'm surprised. Do you know in what cases you get such a big improvement from something like that? Is it rare? How did you end up reasoning about it?
MattPalmer1086
Various research has shown that the variation can be much higher than 15% due to things like this. It's not that rare; I keep bumping into it. Compilers and linkers do a reasonable job but fundamentally modern CPUs are extremely complex beasts.
jeeybee
Kudos to the author for diving in and uncovering the real story here. The Python 3.14 tail-call interpreter is still a nice improvement (any few-percent gain in a language runtime is hard-won), just not a magic 15% free lunch. More importantly, this incident gave us valuable lessons about benchmarking rigor and the importance of testing across environments. It even helped surface a compiler bug that can now be fixed for everyone’s benefit. It’s the kind of deep-dive that makes you double-check the next big performance claim. Perhaps the most thought-provoking question is: how many other “X% faster” results out there are actually due to benchmarking artifacts or unknown regressions? And how can we better guard against these pitfalls in the future?
kryptiskt
This is a very good example of how C is not "close to the machine" or "portable assembly": modern optimizers will make drastic changes to the logic as long as they have no observable effect.
As stated in the post: "Thus, we end up in this odd world where clang-19 compiles the computed-goto interpreter “correctly” – in the sense that the resulting binary produces all the same value we expect – but at the same time it produces an output completely at odds with the intention of the optimization. Moreover, we also see other versions of the compiler applying optimizations to the “naive” switch()-based interpreter, which implement the exact same optimization we “intended” to perform by rewriting the source code."
jmillikin
> This is a very good example of how C is not "close to the machine" or
> "portable assembly",
C is very much "portable assembly" from the perspective of other systems programming languages of the 80s-90s era. The C expression `a += 1` can be trusted to increment a numeric value, but the same expression in C++ might allocate memory or unwind the call stack or do who knows what. Similarly, `a = "a"` is a simple pointer assignment in C, but in C++ it might allocate memory or [... etc]. The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
kryptiskt
When the code has hit the IR in clang or gcc, there is no 'a' (we know that with certainty, since SSA form doesn't mutate but assigns to fresh variables). We don't know if there will be an increment of 1; the additions could be coalesced (or elided if the result can be inferred another way). The number can even decrease, say if things have been handled in chunks of 16 and the count needs adjusting down in the last chunk. Or the code may be auto-vectorized and completely rewritten, so that none of the variables at the C level are reflected at the assembly level.
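A small, made-up illustration of that point (not from the article, and whether a particular compiler applies these rewrites depends on its version and flags): both loops below are routinely transformed so that no per-iteration `a += ...` survives in the generated code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Source says "increment a, n times" -- at -O2, gcc and clang typically
 * fold the whole loop away and effectively just return n. */
static uint64_t count_up(uint64_t n) {
    uint64_t a = 0;
    for (uint64_t i = 0; i < n; i++)
        a += 1;
    return a;
}

/* A data-dependent reduction like this is commonly auto-vectorized: the
 * generated code adds many elements per instruction, so nothing in it
 * corresponds one-to-one to the C-level `a += v[i]`. */
static uint64_t sum(const uint32_t *v, size_t n) {
    uint64_t a = 0;
    for (size_t i = 0; i < n; i++)
        a += v[i];
    return a;
}

int main(void) {
    uint32_t v[] = { 1, 2, 3, 4 };
    printf("%llu %llu\n",
           (unsigned long long)count_up(10),
           (unsigned long long)sum(v, 4));
    return 0;
}
```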
fouronnes3
Stretching "no observable effect" all the way to a 10000 word blog post.
thrdbndndn
Great article! One detail caught my attention.
In one of the referenced articles, https://simonwillison.net/2025/Feb/13/python-3140a5/, the author wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro)."
I'm quite confused by this. Did the author run the benchmark while the computer was overloaded with other processes? Wouldn't that make the results completely unreliable? I would have thought these benchmarks are conducted in highly controlled environments to eliminate external variables.
vkazanov
So the compiler is tinkering with the way the loop is organised, and the whole tail-call interpreter thing is not as effective as announced... Not surprised.
1. CPU arch (and arch version) matters a lot. The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally. C was never meant to support this.
2. The C abstract machine is also not low-level enough to express the intent properly. Any implementation becomes super sensitive to a particular compiler's (and compiler version's) quirks.
Certain paranoid interpreter implementations go back to writing assembly directly. LuaJIT is famous for implementing a macro system to make its super-efficient assembly loop implementation portable across architectures. This is also why I find it fun to tinker with these!
Anyway, a few years ago I put together an article and a test of popular interpreter loop implementation approaches:
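For readers who haven't seen these side by side, here is a minimal sketch of the three dispatch styles under discussion, with toy opcodes and made-up names rather than anything from CPython. The computed-goto variant relies on the GCC/Clang "labels as values" extension, and a real tail-call interpreter additionally relies on a compiler guarantee (e.g. Clang's musttail attribute) that handler-to-handler calls become jumps; whether any of these actually wins depends on exactly the compiler and layout effects described in the article.

```c
#include <stdio.h>

enum { OP_INC, OP_ADD2, OP_HALT };

/* 1. switch() dispatch: nominally one shared indirect branch per iteration. */
static int run_switch(const unsigned char *code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc += 1; break;
        case OP_ADD2: acc += 2; break;
        case OP_HALT: return acc;
        }
    }
}

/* 2. Computed goto (GCC/Clang extension): each handler ends in its own
 * indirect jump -- the property the branch predictor is supposed to
 * benefit from, and the one compilers do not always preserve. */
static int run_goto(const unsigned char *code) {
    static void *dispatch[] = { &&do_inc, &&do_add2, &&do_halt };
    int acc = 0;
    goto *dispatch[*code++];
do_inc:  acc += 1; goto *dispatch[*code++];
do_add2: acc += 2; goto *dispatch[*code++];
do_halt: return acc;
}

/* 3. Tail-call style, in the spirit of the 3.14 interpreter: one small
 * function per opcode, each ending by calling the next opcode's handler.
 * The real thing relies on the compiler turning these calls into jumps
 * (e.g. via Clang's musttail); plain C makes no such promise. */
typedef int handler_fn(const unsigned char *code, int acc);
static handler_fn op_inc, op_add2, op_halt;
static handler_fn *const table[] = { op_inc, op_add2, op_halt };

static int op_inc(const unsigned char *code, int acc)  { return table[*code](code + 1, acc + 1); }
static int op_add2(const unsigned char *code, int acc) { return table[*code](code + 1, acc + 2); }
static int op_halt(const unsigned char *code, int acc) { (void)code; return acc; }

static int run_tailcall(const unsigned char *code) {
    return table[*code](code + 1, 0);
}

int main(void) {
    const unsigned char prog[] = { OP_INC, OP_ADD2, OP_INC, OP_HALT };
    printf("%d %d %d\n", run_switch(prog), run_goto(prog), run_tailcall(prog));
    return 0;
}
```

The point of variants 2 and 3 is to give every opcode its own indirect branch; as the post shows, an optimizer is free to merge those branches back together (or to duplicate the switch dispatch) regardless of which form the source uses.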
asicsp
Related discussions:
https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-... --> https://news.ycombinator.com/item?id=42999672 (66 points | 25 days ago | 22 comments)
https://blog.reverberate.org/2025/02/10/tail-call-updates.ht... --> https://news.ycombinator.com/item?id=43076088 (124 points | 18 days ago | 92 comments)
motbus3
I recently ran some benchmarks from Python 3.9 to 3.13, and up to 3.11 it only got better. Python 3.12 and 3.13 were about 10% slower than 3.11.
I thought my homemade benchmark might not be reliable enough, so I deployed it to a core service anyway, and I saw the same changes in our collected metrics. Does anyone else have the same problem?
albertzeyer
To clarify: the situation is still not completely understood? It's not just the computed gotos; there is some other regression in Clang 19? Basically, the difference between clang19.nocg and clang19 is not really clear?
Btw, what about a clang18.tc comparison, i.e. Clang 18 with the new tail-call interpreter? I wonder how that compares to clang19.tc.
tempay
Trying to assess the performance of a Python build is extremely difficult, as there are a lot of build tricks you can do to improve it. Recently the Astral folks ran into this, showing how the conda-forge build is notably faster than most others:
https://github.com/astral-sh/python-build-standalone/pull/54...
I'd be interested to know how the tail-call interpreter performs with other build optimisations that exist.
eru
Compare https://donsbot.com/2009/03/09/evolving-faster-haskell-progr...
The author uses a genetic algorithm to try out lots of different compiler and optimisation flag combinations.
IshKebab
Very interesting! Clearly something else is going on, though, if the 2% vs. 9% thing is true.
unit149
[dead]