Performance of the Python 3.14 tail-call interpreter
179 comments · March 10, 2025
jraph
Reading that you are extremely embarrassed and sorry that you made such a huge oversight, I was imagining you had broken something / worsened CPython's performance.
But it's nothing like that. You announced a 10-15% perf improvement, but that improvement is more like 1-5% on a non-buggy compiler. It's not even that the 10-15% figure is wrong; it's just that it's correct only under very specific conditions, unbeknownst to you.
IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Apparently, you still brought significant performance improvements, and your work also helped uncover a compiler regression. The wrong number seems quite minor in comparison. I wonder who was actually hurt by this. I'm only discovering the "case" right now, but at first glance it doesn't feel like you owe an apology to anyone. Kudos for all this!
ehsankia
In some way, by indirectly helping fix this bug, they led to a ~10% performance increase for everyone who was using that faulty compiler! That's even better than an optional flag that many people won't know about or use.
chowells
That performance regression only hit code that was using a very large number of paths with the same table of computed gotos at the end. Likely only relatively complex interpreters were affected. So it's not a broad performance improvement. But it is nice to have an example of the compiler's new heuristic failing, to provide evidence that it needs to be tunable.
rtollert
> IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Hah! Is this a Gettier problem [0]?
1. True: The PR improves Python performance 10-15%.
2. True: Ken believes that the PR improves Python performance 10-15%.
3. True: Ken is justified in believing that the PR improves Python performance 10-15%.
Of course, PR discussions don't generally revolve around whether or not the PR author "knows" that the PR does what they claim it does. Still: these sorts of epistemological brain teasers seem to come up in the performance measurement field distressingly often. I wholeheartedly agree that Ken deserves all the kudos he has received; still, I also wonder if some of the strategies used to resolve the Gettier problem might be useful for code reviewers to center themselves every once in a while. Murphy's Law and all that.
jraph
Could very well be!
Interesting, I didn't know about the Gettier problem, thanks for sharing. You could try submitting that page as a proper HN post.
DannyBee
FWIW - the fix was merged since you wrote that blog post ;)
Beyond that - 3-5% is a lot for something as old as the python interpreter if it holds up. I would still be highly proud of that.
After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
They happen for sure, but are less common. Often, people are shifting time around, so it just isn't part of your benchmark anymore[1]. Second, benchmarking is often done in controlled environments, to try to isolate the effect. Which seems like the right thing to do. But then the software is run in non-isolated environments (i.e. with a million other things on a VM or desktop computer), which isn't what you benchmarked it in. I've watched plenty of verifiably huge isolated gains disappear or go negative when put in production environments.
You have the particularly hard job that you have to target lots of environments - you can't even do something like say "okay look if it doesn't actually speed it up in production, it didn't really speed it up", because you have no single production target. That's a really hard world to live in and try to improve.
In the end, performance tuning and measurement is really hard. You have nothing to be sorry for, except learning that :)
Don't let this cause you to be afraid to get it wrong - you will get it wrong anyway. We all do. Just do what you are doing here - say 'whoops, I think we screwed this up', try to figure out how to deal with it, and try to figure out how to avoid it in the future (if you can).
[1] This is common not just in performance, but in almost anything, including human processes. For example, to make something up, the team working on the code review tool would say "we've reduced code review time by 15% and thus sped up everyone's workflow!". Awesome! But actually, it turns out they made more work for some other part of the system, so the workflow didn't get any faster, they just moved the 15% into a part of the world they weren't measuring :)
theLiminator
> After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
Laughs in corporate code
DannyBee
Sure, let me amend it to "I'm suspicious of any significant performance improvement in a system where performance actually matters, and that has existed in a state where performance matters for a long time".
colechristensen
I'll believe you if you say 0.5% improvement, I'll believe you if you say 10,000% improvement, but 10%? That's fishy.
miroljub
That's a different case. Corporate code is never optimized for performance. Performance doesn't factor in at all.
haberman
I think it's important to note that a primary motivation of the tail call interpreter design is to be less vulnerable to the whims of the optimizer. From my original blog article about this technique (https://blog.reverberate.org/2021/04/21/musttail-efficient-i...):
> Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate the most optimal code [for a traditional switch()-based interpreter]. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.
That second-to-last sentence is exactly what has happened here. The "buggy" compiler merged identical code paths, leading to worse performance.
The "fixed" compiler no longer does this, but the fix is basically just tweaking a heuristic inside the compiler. There's no actual guarantee that this compiler (or another compiler) will continue to have the heuristic tweaked in the way that benefits us the most.
The tail call interpreter, on the other hand, lets us express the desired machine code pattern in the interpreter itself. Between "musttail", "noinline", and "preserve_none" attributes, we can basically constrain the problem such that we are much less at the mercy of optimizer heuristics.
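To make the pattern concrete, here is a minimal sketch of a tail-call-dispatched interpreter (illustrative only; the opcodes and names are made up, and this is not CPython's generated code, which additionally puts preserve_none on every handler):

    /* Toy tail-call interpreter: one small function per opcode, each ending
       in a guaranteed tail call to the next handler.  Requires clang, since
       musttail is a clang extension (the real thing also wants clang 19+ for
       the preserve_none calling convention, omitted here for brevity). */
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    typedef struct {
        const unsigned char *ip;   /* next opcode to execute */
        long stack[64];
        int sp;                    /* number of values on the stack */
    } VM;

    typedef long (*op_fn)(VM *);

    static long op_push1(VM *vm);
    static long op_add(VM *vm);
    static long op_print(VM *vm);
    static long op_halt(VM *vm);

    static const op_fn table[] = { op_push1, op_add, op_print, op_halt };

    /* musttail forces the call to compile into a jump, so the C stack stays
       flat no matter how much bytecode runs. */
    #define DISPATCH(vm) __attribute__((musttail)) return table[*(vm)->ip++](vm)

    __attribute__((noinline)) static long op_push1(VM *vm) {
        vm->stack[vm->sp++] = 1;
        DISPATCH(vm);
    }
    __attribute__((noinline)) static long op_add(VM *vm) {
        vm->sp--;
        vm->stack[vm->sp - 1] += vm->stack[vm->sp];
        DISPATCH(vm);
    }
    __attribute__((noinline)) static long op_print(VM *vm) {
        printf("%ld\n", vm->stack[vm->sp - 1]);
        DISPATCH(vm);
    }
    __attribute__((noinline)) static long op_halt(VM *vm) {
        return vm->stack[vm->sp - 1];
    }

    int main(void) {
        const unsigned char code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        VM vm = { .ip = code, .sp = 0 };
        return table[*vm.ip++](&vm) == 2 ? 0 : 1;   /* prints "2" */
    }

Because each handler is its own small function and the dispatch is spelled out in the source, the compiler has far less room to merge or restructure the handlers behind your back.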
For this reason, I think the benefit of the tail call interpreter is more than just a 3-5% performance improvement. It's a reliable performance improvement that may be even greater than 3-5% on some compilers.
kenjin4096
This is a good point. We already observed this in our LTO and PGO builds for the computed goto interpreter. On modern compilers, each LTO+PGO build has huge variance (1-2%) for the CPython interpreter. On macOS, we already saw a huge regression in performance because Xcode just decided to stop making LTO and PGO work properly on the interpreter. Presumably, the tail call interpreter would be immune to this.
sunshowers
In full agreement with this. There is tremendous value in having code whose performance is robust to various compiler configurations.
jxjnskkzxxhx
I just wanted to say: respect for being able to say "sorry, I made a mistake". I hate the fake it till you make it mentality that seems to be the norm now.
kzrdude
I understand the frustration but I don't think it needed to be said (the part about mentality, the thanks is of course cool), because that's still not the norm.
Why do I even bring this up? I want to say: let's not let what we see in the news influence our perception of the real people of the world. Just because fraud and crimes get elevated in the news does not mean that the common man is a criminal or a fraud. :)
codr7
Divide and conquer; we're supposed to hate each other and trust the state/elite/technology, at least that's the plan.
The real criminals, the people we should keep an eye on, are the plan's designers and implementers.
ptx
Why didn't this regression in baseline performance show up (or did it?) on the faster-cpython benchmarks page [0]? Could the benchmarks be improved to prevent similar issues in the future?
cb321
That is a better than average benchmark page.
As alluded to in https://news.ycombinator.com/item?id=43319010, I see these tests were collected against just 2 Intel and 2 ARM CPUs. So, if you are looking for feedback to improve, you should probably also include (at least) an AMD Zen4 or Zen5 in there. CPU & compiler people have both been trying to "help perf while not trusting the other camp" for as long as I can remember, and I doubt that problem will ever go away.
A couple more CPUs will help but not solve generalizability of results. E.g., if somebody tests against some ancient 2008 Nehalem hardware, they might get very different answers. Similarly, answers today might not reflect 2035 very well.
The reality of our world of complex hardware deployment (getting worse with GPUs) is that "portable performance" is almost a contradiction in terms. We all just do the best we can at some semblance of the combination. The result is some kind of weighted average of "source that isn't #ifdef'd too heavily" and "a bit faster, informally averaged over our user population & their informally averaged workloads", and this applies at many levels of the whole computing stack.
EDIT: And, of course, a compiled impl like Cython or Nim is another way to go if you care about performance, but I do understand the pull & network effects of the Python ecosystem. So, that may not always be practical.
kenjin4096
We don't normally test with bleeding-edge compilers on the faster cpython benchmarks page because that would invalidate historical data. E.g. if 2 years ago we used GCC 11 or something to compile and run a benchmark, we need to run it with GCC 11 again today to get comparable results.
Clang 19 was released last year. We only benched it a few months ago. We did notice there was a significant slowdown on macOS, but that was against Xcode Clang, which is a different compiler. I thought it might've been an Xcode thing, which has bitten CPython in the past (such as Xcode LTO working/not working versus normal Clang), so I didn't investigate deeper (facepalming now in retrospect) and chalked it up to a compiler difference.
TLDR: We didn't run benchmarks of Clang 19 versus 18. We only ran benchmarks of Clang 19 versus GCC, Xcode Clang, and MSVC. None of those are apples-to-apples with Clang 19, so I naively thought it was a compiler difference.
EDIT: As to how we could improve this process, I'm not too sure, but I know I'll be more discerning now when there's a >4% perf change after upgrading compilers.
delijati
One question arises: does the added code [1] bring any improvement, or does it merely complicate the source? Should it not be removed?
kenjin4096
That's a fair question.
The blog post mentions it brings a 1-5% perf improvement, which is still significant for CPython. It does not complicate the source because we use a DSL to generate CPython's interpreters. So the only complexity is in autogenerated code, which is usually meant for machine consumption anyway.
The other benefit (for us maintainers, I guess) is that it compiles way faster and is more debuggable (perf and other tools work better) when each bytecode is its own small function. So I'm inclined to keep it for perf and productivity reasons.
ot
Being more robust to fragile compiler optimizations is also a nontrivial benefit. An interpreter loop is an extremely specialized piece of code whose control flow is too important to be left to compiler heuristics.
If the desired call structure can be achieved in a portable way, that's a win IMO.
coldtea
There was a plan, looking for funding back in 2022, for a 5x overall speedup. Then a team with Guido and others involved (and MS backing?) got on the same bandwagon and made some announcements about speeding up CPython a lot.
Several releases in, have we seen even a 2x speedup? Or more like 0.2x at best?
Not trying to dismiss the interpreter changes - I just want to know whether those speedup plans were even remotely realistic, and whether anything close to even 1/5 of what was promised will really come out of them...
robertlagrant
You're doing great work pushing Python forwards. Thanks for all your efforts.
pseufaux
The way this was handled is incredibly good form! Thank you for going to such lengths to make sure the mistake was corrected. I respect and thank you (and all the developers working in Python) for all the work you do!
MattPalmer1086
Benchmarking is just insanely hard to do well. There are so many things which can mislead you.
I recently discovered a way to make an algorithm about 15% faster. At least, that's what all the benchmarks said. At some point I duplicated the faster function in my test harness, but did not call the faster version, just the original slower one... And it was still 15% faster. So code that never executed sped up the original code...!!! Obviously, this was due to code and memory layout issues, moving something so it aligned with some CPU cache better.
It's actually really really hard to know if speedups you get are because your code is actually "better" or just because you lucked out with some better alignment somewhere.
Casey Muratori has a really interesting series about things like this in his substack.
porridgeraisin
That linker lottery led to a 15% improvement? I'm surprised. Do you know in what cases you get such a big improvement from something like that? Is it rare? How did you end up reasoning about it?
MattPalmer1086
Various research has shown that the variation can be much higher than 15% due to things like this. It's not that rare; I keep bumping into it. Compilers and linkers do a reasonable job but fundamentally modern CPUs are extremely complex beasts.
I found Casey Muratori's series the best explanation of what is going on at the CPU level.
MattPalmer1086
Some additional context. I was actually writing a benchmarking tool for certain kinds of search algorithms. I spent a long time reducing and controlling for external sources of noise: CPU pinning, doing hundreds of runs with different data, then repeating them several times and taking the best score for each dataset (to control for transient performance issues due to the machine doing other things).
I got the benchmarking tool itself to give reasonably repeatable measurements.
The tool had high precision, but the accuracy of "which algorithm is better" was not reliable just due to these code layout issues.
I basically gave up and shelved the benchmarking project at that point, because it wasn't actually useful to determine which algorithm was better.
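For what it's worth, the noise-control loop being described looks roughly like this (a Linux-specific sketch, not the actual tool; `workload` is just a stand-in):

    /* Pin to one CPU, repeat the measurement many times, and keep the
       minimum, on the theory that transient interference only ever makes
       a run slower.  Linux-specific (sched_setaffinity). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static double bench_min(void (*fn)(void), int reps) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                          /* pin this process to CPU 0 */
        sched_setaffinity(0, sizeof set, &set);

        double best = 1e18;
        for (int i = 0; i < reps; i++) {
            double t0 = now_sec();
            fn();
            double dt = now_sec() - t0;
            if (dt < best) best = dt;              /* best-of-N filters transient noise */
        }
        return best;
    }

    static volatile long sink;                     /* keep the work from being optimized away */
    static void workload(void) {
        long s = 0;
        for (long i = 0; i < 1000000; i++) s += i;
        sink = s;
    }

    int main(void) {
        printf("min time: %.6f s\n", bench_min(workload, 100));
        return 0;
    }

Even with all of that, as noted above, code-layout effects can still dominate the difference you're trying to measure.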
eru
I vaguely remember some benchmarking project that deliberately randomised these compiler decisions, so that it could give you more stable estimates of how well your code actually performed, and not just whether you won or lost the linker lottery.
Mond_
You're probably thinking of "Performance Matters" by Emery Berger, a Strange Loop talk. https://youtube.com/watch?v=r-TLSBdHe1A
MattPalmer1086
There was Stabilizer [1] which did this, although it is no longer maintained and doesn't work with modern versions of LLVM. I think there is something more current now that automates this, but can't remember what it's called.
FridgeSeal
The Coz profiler from Emery Berger.
It can actually go a step further and give you a decent estimate of which functions you need to change to get the desired latency/throughput improvements.
MattPalmer1086
Thanks, I was trying to remember that one!
McP
LLD has a new option "--randomize-section-padding" for this purpose: https://github.com/llvm/llvm-project/pull/117653
MattPalmer1086
Interesting, thanks!
igouy
"Producing wrong data without doing anything obviously wrong!"
[pdf]
https://users.cs.northwestern.edu/~robby/courses/322-2013-sp...
alpaca128
As already mentioned, this is likely Emery Berger's project, with the idea of intentionally slowing down different parts of the program, also to find out which parts are most sensitive to slowdowns (aka have the biggest effect on overall performance), on the assumption that these are also the parts that profit the most from optimisations.
throwaway2037
Aleksey Shipilёv, a long-time Java "performance engineer" (my term), has written and spoken extensively about the challenges of benchmarking. I highly recommend reading some of his blog posts or watching one of his talks about it.
jeeybee
Kudos to the author for diving in and uncovering the real story here. The Python 3.14 tail-call interpreter is still a nice improvement (any few-percent gain in a language runtime is hard-won), just not a magic 15% free lunch. More importantly, this incident gave us valuable lessons about benchmarking rigor and the importance of testing across environments. It even helped surface a compiler bug that can now be fixed for everyone’s benefit. It’s the kind of deep-dive that makes you double-check the next big performance claim. Perhaps the most thought-provoking question is: how many other “X% faster” results out there are actually due to benchmarking artifacts or unknown regressions? And how can we better guard against these pitfalls in the future?
ehsankia
I guess the bigger question for me is, how was a 10% drop in Python performance not detected when that faulty compiler feature was pushed? Do we not benchmark the compilers themselves? Do the existing benchmarks on the compiler or python side not use that specific compiler?
twoodfin
The author makes this point, too, and I agree it’s the most surprising thing about the entire scenario.
LLVM introduced a major CPython performance regression, and nobody noticed for six months?
ltfish
As far as I am aware, the official CPython binaries on Linux have always been built using GCC, so you will have to build your own CPython using both Clang 18 and 19 to notice the speed difference. I think this is partly why no one has noticed the speed difference yet.
kryptiskt
This is a very good example of how C is not "close to the machine" or "portable assembly": modern optimizers will make drastic changes to the logic as long as there is no observable effect.
As stated in the post: "Thus, we end up in this odd world where clang-19 compiles the computed-goto interpreter “correctly” – in the sense that the resulting binary produces all the same value we expect – but at the same time it produces an output completely at odds with the intention of the optimization. Moreover, we also see other versions of the compiler applying optimizations to the “naive” switch()-based interpreter, which implement the exact same optimization we “intended” to perform by rewriting the source code."
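For readers who haven't seen it, the "computed goto" pattern being discussed looks roughly like this toy loop (a sketch using the GCC/clang labels-as-values extension, not CPython's actual code). The point of writing it this way is that every handler ends with its own indirect jump, and that replicated-jump structure is exactly what the compiler was merging back into a single dispatch point:

    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static long run(const unsigned char *ip) {
        /* Table of label addresses, indexed by opcode. */
        static void *labels[] = { &&do_push1, &&do_add, &&do_print, &&do_halt };
        long stack[64];
        int sp = 0;

    /* One indirect jump at the end of every handler: the "replicated"
       dispatch that the source-level optimization intends to preserve. */
    #define DISPATCH() goto *labels[*ip++]

        DISPATCH();
    do_push1:
        stack[sp++] = 1;
        DISPATCH();
    do_add:
        sp--;
        stack[sp - 1] += stack[sp];
        DISPATCH();
    do_print:
        printf("%ld\n", stack[sp - 1]);
        DISPATCH();
    do_halt:
        return stack[sp - 1];
    #undef DISPATCH
    }

    int main(void) {
        const unsigned char code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        return run(code) == 2 ? 0 : 1;   /* prints "2" */
    }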
jmillikin
> This is a very good example of how C is not "close to the machine" or
> "portable assembly",
C is very much "portable assembly" from the perspective of other systems programming languages of the 80s-90s era. The C expression `a += 1` can be trusted to increment a numeric value, but the same expression in C++ might allocate memory or unwind the call stack or do who knows what. Similarly, `a = "a"` is a simple pointer assignment in C, but in C++ it might allocate memory or [... etc].
The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
kryptiskt
When the code has hit the IR in clang or gcc, there is no 'a' (we know that with certainty, since SSA form doesn't mutate but assigns to fresh variables). We don't know if there will be an increment of 1; the additions could be coalesced (or elided if the result can be inferred another way). The number can even decrease, say if things have been handled in chunks of 16 and the count needs to be adjusted down in the last chunk. Or the code may be auto-vectorized and completely rewritten, so that none of the variables at the C level are reflected at the assembly level.
jmillikin
From a high-level academic view, yes, the compiler is allowed to perform any legal transformation. But in practice C compilers are pretty conservative about what they emit, especially when code is compiled without -march=.
You don't have to take my word for it. Go find a moderately complex open-source library written in C, compile it, then open up the result in Hexrays/Ghidra/radare2/whatever. Compare the compiled functions with their original source and you'll see there's not that much magic going on.
titzer
Saying that something "is like XY" when you really mean "is like XY, at least in comparison to C++" isn't what most people mean by the phrase.
C is not a portable assembler.
In C, "a += 1" could overflow, and signed overflow is undefined behavior--even though every individual ISA has completely defined semantics for overflow, and nearly all of them these days do two's complement wraparound arithmetic. With C's notion of undefined behavior, it doesn't even give you the same wraparound in different places in the same program. In fact, wraparound is so undefined that the program could do absolutely anything, and the compiler is not required to even tell you about it. Even without all the C++ abstraction madness, a C compiler can give you absolutely wild results due to optimizations, e.g. by evaluating "a += 1" at compile time and using a different overflow behavior than the target machine. Compile-time evaluation not matching runtime evaluation is one of a huge number of dumb things that C gives you.
Another is that "a += 1" may not even increment the variable. If this occurs as an expression, and not as a statement, e.g. "f(a += 1, a += 1)", you might only get one increment due to sequence points[1]--not to mention that the order of evaluation might be different depending on the target.
C is not a portable assembler.
C is a low-level language where vague machine-like programs get compiled to machine code that may or may not work, depending on whether it violates UB rules or not, and there are precious few diagnostics to tell if that happened, either statically or dynamically.
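A standard illustration of the overflow point (not from the thread): with optimizations on, a compiler may fold the comparison below to a constant 1 using the "signed overflow never happens" rule, while a literal two's-complement evaluation of the same expression at INT_MAX would yield 0.

    #include <limits.h>
    #include <stdio.h>

    /* UB when a == INT_MAX: compilers commonly simplify `a + 1 > a` to 1. */
    static int increments(int a) {
        return a + 1 > a;
    }

    int main(void) {
        /* May print 1 or 0 depending on compiler and optimization level. */
        printf("%d\n", increments(INT_MAX));
        return 0;
    }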
pjc50
> The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
Weasel words. Like a "self driving car" that requires a human driver with constant attention willing to take over within a few hundred milliseconds.
People advocate for C and use it in a way that implies they think it can achieve specific machine outcomes, and it usually does... except when it doesn't. If people want a portable assembler they should build one.
jmillikin
As a general rule if you're reading a technical discussion and every single participant is using a particular phrase in a way that doesn't make sense to you then you should probably do a quick double-check to make sure you're on the same page.
For example, in this discussion about whether C is "portable assembly", you might be tempted to think back to the days of structured programming in assembly using macros. I no longer remember the exact syntax, but programs could be written to look like this:
    .include "some-macro-system.s"
    .include "posix-sym.s"

    .func _start(argc, argv) {
        .asciz message "Hello, world!"
        .call3 _write STDOUT message (.len message)
        .call1 _exit 0
    }
Assembly? Definitely! Portable? Eh, sort of! If you're willing to restrict yourself to DOS + POSIX and write an I/O abstraction layer then it'll probably run on i386/SPARC/Alpha/PA-RISC.
But that's not really what people are discussing, is it?
When someone says "C is portable assembly" they don't mean you can take C code and run it through a platform-specific macro expander. They don't mean it's literally a portable dialect of assembly. They expect the C compiler to perform some transformations -- maybe propagate some constants, maybe inline a small function here and there. Maybe you'd like to have named mutable local variables, which requires a register allocator. Reasonable people can disagree about exactly what transformations are legal, but at that point it's a matter of negotiation.
Anyway, now you've got a language that is more portable than assembler macros but still compiles more-or-less directly to machine code -- not completely divorced from the underlying hardware like Lisp (RIP Symbolics). How would you describe it in a few words? "Like assembly but portable" doesn't seem unreasonable.
pjmlp
Systems languages that predated C could already do that; that is the typical myth.
fouronnes3
Stretching "no observable effect" all the way to a 10000 word blog post.
vkazanov
So, the compiler is tinkering with the way the loop is organised so the whole tail-call interpreter thing is not as effective as announced... Not surprised.
1. CPU arch (and arch version) matters a lot. The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally. C was never meant to support this.
2. The C abstract machine is also not low-level enough to express the intent properly. Any implementation becomes super sensitive to a particular compiler's (and compiler version's) quirks.
Certain paranoid interpreter implementations go back to writing assembly directly. LuaJIT is famous for implementing a macro system to make its super-efficient assembly loop implementation portable across architectures. This is also why I find it fun to tinker with these!
Anyway, a few years ago I put together an article and a test of popular interpreter loop implementation approaches:
nelhage
(author here)
> The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally.
A fun fact I learned while writing this post is that that's no longer true! Modern branch predictors can pretty much accurately predict through a single indirect jump, if the run is long enough and the interpreted code itself has stable behavior!
Here's a paper that studied this (for both real hardware and a certain simulated branch predictor): https://inria.hal.science/hal-01100647/document
My experiments on this project anecdotally agree; they didn't make it into the post but I also explored a few of the interpreters through hardware CPU counters and `perf stat`, and branch misprediction never showed up as a dominant factor.
vkazanov
Yes, this was already becoming true around the time I was writing the linked article. And I also read the paper. :-) I also remember I had access to pre-Haswell era Intel CPUs vs something a bit more recent, and could see that the more complicated dispatcher no longer made as much sense.
Conclusion: the rise of popular interpreter-based languages led to CPUs with smarter branch predictors.
What's interesting is that a token threaded interpreter dominated my benchmark (https://github.com/vkazanov/bytecode-interpreters-post/blob/...).
This trick is meant to simplify dispatching logic and also spread branches in the code a bit.
celeritascelery
How do you reconcile that with the observation that moving to a computed goto style provides better codegen in Zig [1]? They make the claim that using their "labeled switch" (which is essentially computed goto) allows you to have multiple dispatch branches, which improves branch predictor performance. They even get a 13% speedup in their parser going from switch to this style. If modern CPUs are good at predicting through a single indirect branch, I wouldn't expect this feature to make any difference.
[1] https://ziglang.org/download/0.14.0/release-notes.html#Code-...
dwattttt
While it's unlikely to be as neat as this, the blog post we're all commenting on is a case of "I thought we had a 10-15% speedup, but it turned out to be an LLVM optimisation misbehaving". And Zig (for now) uses LLVM for optimised builds too.
tempay
Trying to assess the performance of a Python build is extremely difficult, as there are a lot of build tricks you can do to improve it. Recently the Astral folks ran into this, showing how the conda-forge build is notably faster than most others:
https://github.com/astral-sh/python-build-standalone/pull/54...
I'd be interested to know how the tail-call interpreter performs with other build optimisations that exist.
eru
Compare https://donsbot.com/2009/03/09/evolving-faster-haskell-progr...
The author uses a genetic algorithm to try out lots of different compiler and optimisation flag combinations.
asicsp
Related discussions:
https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-... --> https://news.ycombinator.com/item?id=42999672 (66 points | 25 days ago | 22 comments)
https://blog.reverberate.org/2025/02/10/tail-call-updates.ht... --> https://news.ycombinator.com/item?id=43076088 (124 points | 18 days ago | 92 comments)
thrdbndndn
Great article! One detail caught my attention.
In one of the referenced articles, https://simonwillison.net/2025/Feb/13/python-3140a5/, the author wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro)."
I'm quite confused by this. Did the author run the benchmark while the computer was overloaded with other processes? Wouldn't that make the results completely unreliable? I would have thought these benchmarks are conducted in highly controlled environments to eliminate external variables.
ambivalence
Simon Willison is a great guy, but he's not a Python core developer and his ad hoc benchmark is not what the CPython core team members are using. For the latter, see https://github.com/faster-cpython/benchmarking-public
cb321
While some people here are citing 10% as "large" and 1% as "normal", there are optimizations like partial inlining of doubly recursive Fibonacci that can reduce the actual work (and so also time) exponentially (factors >10X for double-digit arguments or 1000s of %, technically exponential in the difference of the recursion depth, not the problem size [1]).
C compilers can also be very finicky about their code inlining metrics. So, whether that enormous speed-up is realized can be very sensitive to the shape of your code.
So, while part of the problem is definitely that CPUs have gotten quite fancy/complex, another aspect is that compilers "beyond -O0 or -O1" have also gotten fancy/complex.
The article here, while good & worth reading, is also but one of many examples of how two complex things interacting can lead to very surprising results (and this is true outside of computing). People have a rather strong tendency to oversimplify -- no matter how many times this lesson shows up.
EDIT: Also, the article at least uses two CPUs, Intel & Apple M1 and two compilers (gcc & clang), but there are realistic deployment scenarios across many more generations/impls of Intel, AMD, and ARM and probably other compilers, too. So, it only even samples a small part of the total complexity. Also, if you want to be more scientific, esp. for differences like "1.01X" then time measurements should really have error bars of some kind (either std.dev of the mean or maybe better for a case like this std.dev of the min[2]) and to minimize those measurement errors you probably want CPU core affinity scheduling in the OS.
[1] https://stackoverflow.com/questions/360748/computational-com...
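To make the Fibonacci point concrete, here is the transformation done by hand (a sketch of what inlining one level of the recursion plus common-subexpression elimination buys; whether a given compiler actually performs it depends on those finicky inlining heuristics):

    #include <stdio.h>

    /* fib(n) = fib(n-1) + fib(n-2).  Inlining fib(n-1) = fib(n-2) + fib(n-3)
       and sharing the repeated fib(n-2) call gives 2*fib(n-2) + fib(n-3),
       which shrinks the base of the exponential call count. */
    static long fib(long n) {
        if (n < 2) return n;
        if (n == 2) return 1;
        long f2 = fib(n - 2);              /* computed once, used twice */
        return f2 + fib(n - 3) + f2;
    }

    int main(void) {
        printf("%ld\n", fib(40));          /* 102334155 */
        return 0;
    }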
motbus3
I recently did some benchmarking from Python 3.9 to 3.13, and up to 3.11 it only got better. Python 3.12 and 3.13 were about 10% slower than 3.11.
I thought my homemade benchmark wasn't good enough, so I deployed it to a core service anyway and saw the same changes in our collected metrics. Does anyone else have the same problem?
sgarland
Yes, I found a loop performance regression [0] in 3.12 and 3.13.
lazka
I'm also still using 3.11 for a FastAPI app since 3.12 and 3.13 are quite a bit slower.
bjourne
This is exactly the kind of content I love to see on HN. But I do wonder how this optimization is related to tail-call optimization. How the interpreter jump table is implemented shouldn't affect how stack frames are created, should it?
mbel
It's explained under the first link from the article [0]:
"A new type of interpreter has been added to CPython. It uses tail calls between small C functions that implement individual Python opcodes, rather than one large C case statement."
[0] https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-...
thunkingdeep
Well, if you'd bothered to read it, you'd have discovered this is about tail calls in C, not in Python. It has nothing to do with tail recursion in Python. Guido has explicitly said that Python will never have it.
ambivalence
A big chunk of the performance gain comes from the fact that with tail calls the generated code doesn't have to save and restore registers around each handler (configured through the `preserve_none` calling convention).
albertzeyer
To clarify: the situation is still not completely understood? It's not just the computed gotos, but there is some other regression in Clang 19? Basically, the difference between clang19.nocg and clang19 is not really clear?
Btw, what about some clang18.tc comparison, i.e. Clang18 with the new tail-call interpreter? I wonder how that compares to clang19.tc.
nelhage
> Btw, what about some clang18.tc comparison
(post author here) Oh, this is something I could have called out explicitly: The tail-calling interpreter relies on a feature (the `preserve_none` calling convention) that only landed in clang-19. That means you can only test it on that version. That coincidence (that 19 added both this feature, and the regression) is part of why this was so easy to miss at first, and why I had to "triangulate" with so many different benchmarks to be confident I understood what was going on.
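As an aside, a build that wants to use the tail-calling interpreter only where it is supported could gate it on the compiler along these lines (an illustrative sketch, not CPython's actual configure logic; `TAIL_CALL_CC` and `HAVE_TAIL_CALL_INTERP` are made-up names):

    /* preserve_none only exists in clang 19+, so probe for it and fall back
       to the default calling convention (and the switch/computed-goto
       interpreter) everywhere else. */
    #if defined(__clang__) && defined(__has_attribute)
    #  if __has_attribute(preserve_none)
    #    define TAIL_CALL_CC __attribute__((preserve_none))
    #    define HAVE_TAIL_CALL_INTERP 1
    #  endif
    #endif

    #ifndef HAVE_TAIL_CALL_INTERP
    #  define TAIL_CALL_CC             /* default calling convention */
    #  define HAVE_TAIL_CALL_INTERP 0
    #endif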
aeyes
The article includes a link to the pull request which fixes the regression in LLVM: https://github.com/llvm/llvm-project/pull/114990
kenjin4096
Hello. I'm the author of the PR that landed the tail-calling interpreter in CPython.
First, I want to say thank you to Nelson for spending almost a month to get to the root of this issue.
Secondly, I want to say I'm extremely embarrassed and sorry that I made such a huge oversight. I, and probably the rest of the CPython team, did not expect the compiler we were using for the baseline to have that bug.
I posted an apology blog post here: https://fidget-spinner.github.io/posts/apology-tail-call.htm...