An Attempt to Catch Up with JIT Compilers
March 3, 2025 · pizlonator
mintplant
SpiderMonkey actually ditched most of the profiling stuff in favor of transpiling the ICs generated at runtime into the IR used by the optimizing compiler, inlining them into the functions when they're used, and then sending the whole thing through the usual optimization pipeline. The technique is surprisingly effective.
I don't know what the best reference to link for this would be, but look up "Warp" or "WarpMonkey" if you're interested.
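A rough sketch of the idea (the op and type names here are illustrative stand-ins, not SpiderMonkey's actual CacheIR/MIR): the stub an IC has already attached at runtime is lowered directly into the optimizer's IR at the access site, with a bailout if a guard fails.

    #include <utility>
    #include <vector>

    // Illustrative stand-ins for SpiderMonkey's CacheIR ops and optimizer IR;
    // the real op sets and types are much richer.
    enum class CacheOp { GuardShape, LoadFixedSlot };
    struct CacheIRStub { std::vector<std::pair<CacheOp, int>> ops; }; // (op, payload)

    enum class MirOp { GuardShape, LoadSlot };
    struct MirNode { MirOp op; int payload; };

    // Warp-style transpile: rather than re-deriving types from a profiler,
    // lower the stub the IC already built straight into optimizer IR at the
    // property access; a failed guard at runtime bails out to baseline code.
    std::vector<MirNode> transpile(const CacheIRStub& stub) {
        std::vector<MirNode> out;
        out.reserve(stub.ops.size());
        for (const auto& [op, payload] : stub.ops) {
            switch (op) {
            case CacheOp::GuardShape:    out.push_back({MirOp::GuardShape, payload}); break;
            case CacheOp::LoadFixedSlot: out.push_back({MirOp::LoadSlot,   payload}); break;
            }
        }
        return out;
    }

    int main() {
        CacheIRStub stub{{{CacheOp::GuardShape, 7}, {CacheOp::LoadFixedSlot, 0}}};
        return transpile(stub).size() == 2 ? 0 : 1;
    }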
hinkley
My understanding is that branch prediction got better in the ’10s and a bunch of techniques that didn’t work before do now.
pizlonator
The modern VM technique looks almost exactly like what the original PIC papers talked about in the 90s. There are some details that are different, but I'm not sure that the details come down to exploiting changes in branch prediction efficiency. I think the things that changed come mostly down to the fact that the original PIC paper was a first stab by a small team whereas modern VMs involve decades of engineering by larger teams (so everything that could get more complex as a consequence of tuning did get more complex).
So, while it's true that microarches changed in a lot of ways, the overall implications for how you build VMs are not so big.
pizlonator
It sounds like you're describing something similar to what the other JS VMs do.
titzer
Indeed, this was literally the conclusion of the first paper that introduced polymorphic inline caches.
I'll add that the real benefit of ICs isn't just that compiled code is specialized to the seen types, but that deoptimization guards are inserted, which split the control-flow diamonds of the original, general cases so that multiple downstream checks become redundant. So specialization is not just a local simplification but a global simplification to all dominated code in the context of the compilation unit.
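A toy illustration of that point (ordinary C++ standing in for compiler IR): once the guard at the top has established the type, with deoptimization covering the failure case, every access the guard dominates can assume the type, and the per-access diamonds disappear.

    #include <cstdio>

    struct Obj { int type; double x, y; };

    // Generic path: every access re-checks the type -- one little
    // control-flow diamond per field read.
    double generic(const Obj& o) {
        double a = (o.type == 1) ? o.x : 0.0;   // check
        double b = (o.type == 1) ? o.y : 0.0;   // checked again downstream
        return a + b;
    }

    // Specialized path: a single deopt guard up front. Everything it
    // dominates is compiled assuming type == 1, so the downstream checks
    // vanish. Falling back to generic() stands in for deoptimization.
    double specialized(const Obj& o) {
        if (o.type != 1) return generic(o);     // guard splits the diamond
        return o.x + o.y;                       // no further checks needed
    }

    int main() {
        Obj o{1, 2.0, 3.0};
        std::printf("%f %f\n", generic(o), specialized(o));
    }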
sitkack
https://bibliography.selflanguage.org/pics.html for those following along
kannanvijayan
Well, in dynamic languages the ICs do give a nice order-of-magnitude speed-up by themselves, since the guard replaces a whole hashtable (or linear) lookup with (in this case) a single memory indirection.
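To make that trade concrete, a toy version (no real engine's object layout): the generic path hashes the property name on every access, while the IC fast path is one shape compare plus one load.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct Object {
        uint32_t shapeId;                              // layout identifier
        double slots[4];                               // inline storage
        std::unordered_map<std::string, double> dict;  // generic storage
    };

    // Generic access: hash + table lookup on every single hit.
    double getSlow(const Object& o, const std::string& name) {
        return o.dict.at(name);
    }

    // IC fast path, installed after one slow lookup discovered that objects
    // of shape 7 keep "x" in slot 0: one compare plus one load.
    double getFast(const Object& o) {
        if (o.shapeId == 7)          // guard
            return o.slots[0];       // single memory indirection
        return getSlow(o, "x");      // miss: fall back (and re-patch the IC)
    }

    int main() {
        Object o{7, {1.5, 0, 0, 0}, {{"x", 1.5}}};
        return getFast(o) == 1.5 ? 0 : 1;
    }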
But yeah - on SpiderMonkey we found that orienting our ICs towards being stable and easy to work with, as opposed to just being fast, ended up leading to a much better design.
This is a nice result though. Negative, but good that they published it.
What would be a good next step is some QEMU-style transformation: pull out basic blocks, profile them for hotness, incoming arguments at function entry, and dynamic dispatch targets, then use that to recompile the whole thing with a method JIT, in particular inlining across call paths with GVN and DCE applied.
I kind of expect the results to be very positive, just based on intuition, but it'd be cool to see how it actually turns out.
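A bare-bones sketch of the profiling half of that pipeline (all names hypothetical): count block executions and record indirect-call targets, so a later method-JIT pass knows what is hot and what to inline.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical profiling records for a binary-translation-style system.
    struct BlockProfile {
        uint64_t executions = 0;                         // hotness
    };
    struct CallSiteProfile {
        std::unordered_map<uintptr_t, uint64_t> targets; // dynamic dispatch targets
    };

    struct Profiler {
        std::unordered_map<uintptr_t, BlockProfile>    blocks;
        std::unordered_map<uintptr_t, CallSiteProfile> callSites;

        void onBlockEnter(uintptr_t pc) { ++blocks[pc].executions; }
        void onIndirectCall(uintptr_t site, uintptr_t target) {
            ++callSites[site].targets[target];
        }

        // Blocks hot enough to hand to a method JIT for inlining + GVN + DCE.
        std::vector<uintptr_t> hotBlocks(uint64_t threshold) const {
            std::vector<uintptr_t> out;
            for (const auto& [pc, b] : blocks)
                if (b.executions >= threshold) out.push_back(pc);
            return out;
        }
    };

    int main() {
        Profiler p;
        p.onBlockEnter(0x401000);
        p.onIndirectCall(0x401010, 0x402000);
        return p.hotBlocks(1).size() == 1 ? 0 : 1;
    }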
bjoli
A minor nitpick: ICs don't give that much benefit in monomorphic languages like Scheme.
hinkley
One of the last pieces of really good advice I got before I gave up on writing a programming language myself is that if you instrument the paths that are already expected to be slow, you can get most of the value of instrumentation with a fraction of the cost per call. Because people avoid making the slow calls, and if they don’t the app was going to be slower anyway so why not an extra couple percent? Versus the fast path where the instrumentation may be a quarter or more of runtime.
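In code, that advice looks something like this minimal sketch (a hypothetical cache, nothing language-specific): leave the fast path bare and hang the counter off the slow path, which is already paying for a hash lookup.

    #include <atomic>
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    std::atomic<uint64_t> missCount{0};   // profiling lives on the slow path only

    struct Cache {
        std::unordered_map<std::string, int> map;
        const std::string* lastKey = nullptr;
        int lastVal = 0;

        int get(const std::string& k) {
            if (lastKey && *lastKey == k)   // fast path: zero instrumentation
                return lastVal;
            ++missCount;                    // slow path already does a hash
            auto it = map.find(k);          // lookup, so the counter is noise
            lastKey = &it->first;           // (assumes the key is present)
            lastVal = it->second;
            return lastVal;
        }
    };

    int main() {
        Cache c;
        c.map = {{"key", 42}};
        c.get("key");                       // miss: counted
        c.get("key");                       // hit: free
        return missCount.load() == 1 ? 0 : 1;
    }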
sitkack
The answer is always more feedback. I am excited about DNN-powered static profilers. The training data will come from JITs saving the results of their experiments.
pizlonator
That's an exciting direction!
sitkack
Profile Guided Optimization without Profiles: A Machine Learning Approach
https://www.semanticscholar.org/paper/Profile-Guided-Optimiz...
ajross
This seems poorly grounded. In fact, almost three decades after the release of the Java HotSpot runtime, we're still waiting for even one system to produce the promised advantages. I guess the consensus is that V8 has come closest?
But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
noelwelsh
The benchmarks I have seen show HotSpot ahead of V8. E.g. https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-byt...
What makes this very complicated is that 1) language design plays a big part in performance, and 2) CPUs change as well, and this anecdotally seems to have more impact on interpreter performance than on compiler performance.
With regards to 1), consider optimizing JavaScript. It doesn't have machine integers, so you have to do a bunch of analysis to figure out when something is being used as an integer, and only then can you make that code fast. There are many other such cases. Python is even worse in this regard. In comparison, AOT-compiled languages are usually designed to be fast, so they make tradeoffs that favour performance at the cost of some level of abstraction / expressivity. The JVM is somewhere in the middle, and so is its performance. (A sketch of what this costs an engine follows below.)
With regards to 2), this paper is an example, as is https://inria.hal.science/hal-01100647/file/InterpIBr-hal.pd...
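A minimal sketch of the kind of representation game the integer problem forces on an engine (no particular VM's tagging scheme): small integers carry a tag bit so that arithmetic on "numbers that are really ints" can take a fast path.

    #include <cstdint>
    #include <cstdio>

    // Minimal tagged-value sketch, not any particular engine's scheme:
    // small integers are stored shifted with a low tag bit set, so the
    // common int+int case can be detected and handled without boxing.
    struct Value {
        uint64_t bits;
        static Value fromInt(int32_t i) {
            return { (uint64_t(uint32_t(i)) << 1) | 1 };
        }
        bool isInt() const { return bits & 1; }
        int32_t asInt() const { return int32_t(bits >> 1); }
    };

    Value add(Value a, Value b) {
        if (a.isInt() && b.isInt())     // the fast path the JIT wants to prove
            return Value::fromInt(a.asInt() + b.asInt());
        // Slow path: doubles, strings, valueOf(), ... elided in this sketch.
        return Value::fromInt(0);
    }

    int main() {
        std::printf("%d\n", add(Value::fromInt(2), Value::fromInt(3)).asInt());
    }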
ajross
With all respect, that sounds like excuse-making. I mean, yeah, JavaScript and the JVM and .NET are slower runtimes than C or Rust[1]. Nonetheless that's the world we live in, and if you have a performance-sensitive problem to solve you pick up rustc or g++ and not a managed runtime. If that's wrong, someone's got to actually show that it's wrong.
[1] Maybe Go or Swift would be more apples-to-apples. But even then are there clear benchmarks showing Kotlin or C# beating similar AoT code? If anything the general sense of the community is that Go is faster than Java.
noelwelsh
Excuses for what? I'm not the elected representative for JIT compiled languages, sworn to defend them. There are technical reasons they tend to be slower. I was sketching some of them.
pca006132
When things are performance-sensitive, you want things to be tunable and predictable. Good luck playing with the JIT if you rely on it for performance...
titzer
> But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
It's considerably more complicated than that. After working in this area for 25 years, I have vacillated between extremes over decades-long arcs. The reality is much more nuanced than a four-sentence HN comment. Profile and measure and stare at machine code. If you don't do that daily, it's hand-waving and hunches.
IshKebab
I agree, the "JITs can be faster because X Y Z" arguments have never turned into "JITs are actually faster".
Maybe that's because JIT is almost always used in languages that were slow in the first place, e.g. due to GC.
Is there a JITing C compiler, or something like that? Would that even make sense?
sitkack
Binary Translation could be seen as a generalized JIT for native code.
Dynamo: A Transparent Dynamic Optimization System https://dl.acm.org/doi/pdf/10.1145/358438.349303
> We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
https://www.semanticscholar.org/paper/Dynamo%3A-a-transparen...
remexre
Maybe the "allocate as little as possible, use sun.misc.Unsafe a lot, have lots of long-lived global arrays" style of Java programming that some high-performance programs use would come close to being a good stand-in.
pjmlp
C++/CLI is one example; it is C++, not C, but the example holds.
do_not_redeem
Now the money question: can anyone come up with a benchmark where, due to the JIT, C++/CLI runs faster than normal C++ compiled for the same CPU?
zabzonk
It is not C++ (or C) but a Microsoft-invented language - which is OK, but don't confuse it with C++ any more than MS already have.
pjmlp
JVM implementations, especially those with a PGO feedback loop across runs, do quite well.
Likewise, modern Android runs reasonably well with its mix of JIT, AOT with JIT PGO metadata, and baseline profiles shared across devices via the Play Store.
The gold standard for anyone who actually cares about ultimate performance is hand-written Assembly, naturally guided by profilers capable of measuring everything the CPU is doing, like VTune.
ForTheKidz
> I guess consensus is that V8 has come closest?
V8 better than the JVM? Insanity; maybe it can come within an order of magnitude of it in terms of performance.
edflsafoiewq
Comes closest to realizing the concept of a JIT that is better than AOT.
ForTheKidz
I think that's completely silly framing; you can AOT compile any code better—or at least, just as well—if you already know how you want it to perform at runtime. Any efficiency gain would necessarily need to be in the context of total productivity.
pizlonator
> Java HotSpot runtime we're still waiting for even one system to produce the promised advantages.
What promised advantages are you waiting on?
There are lots of systems that have architectures that are similar to HotSpot, or that surpass it in some way. V8 is just one.
CamouflagedKiwi
There were many, many statements made that JIT compilers could be faster than AOT compilers because they have more information to use at runtime. Originally this was mostly aimed at Java/HotSpot, which has not, in practice, significantly displaced languages like C or C++ (or, these days, Rust) from high-performance work.
pizlonator
Yeah those statements were overly optimistic and I don’t think they’re representative of what most people in the JIT field think. It’s also not what I as a JIT engineer would have promised you.
The actual promise is just: JITs make dynamic languages faster and they are better at doing that than AOTs. I think lots of systems have delivered on that promise.
pjmlp
I guess distributed systems and OS GUI frameworks aren't it then.
twoodfin
Hand-optimized AoT builds with solid profile-based feedback, right?
neonsunset
If you pit virtual-call-heavy code written in C++ against C#, C# will come out on top every single time, especially if you consume dynamically-linked dependencies or if you can't afford to wait until the heat death of the universe for all the LTO plugins to finish their job.
Or if you use a SIMD-heavy path and your binary is built against, say, x86-64-v2/3 and the target supports AVX512, .NET will happily use the entirety of AVX512 thanks to the JIT, even when still using 256-bit-wide operations (i.e. a bespoke path that uses Vector256) with AVX512VL. This tends to surpass what you can get out of runtime dispatch under LLVM.
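For contrast, the AOT-side workaround is explicit runtime dispatch baked in at compile time. A minimal sketch using GCC/Clang x86 builtins (the function names are mine): the binary can only pick among the variants it shipped with, whereas a JIT compiles for the exact CPU it finds.

    #include <cstddef>
    #include <immintrin.h>

    static float sum_scalar(const float* p, std::size_t n) {
        float s = 0;
        for (std::size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }

    // AVX2 variant compiled into the same binary via a target attribute.
    __attribute__((target("avx2")))
    static float sum_avx2(const float* p, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, acc);
        float s = 0;
        for (float l : lanes) s += l;
        for (; i < n; ++i) s += p[i];
        return s;
    }

    // The menu is fixed at build time: whichever ISA levels were compiled in.
    float sum(const float* p, std::size_t n) {
        return __builtin_cpu_supports("avx2") ? sum_avx2(p, n)
                                              : sum_scalar(p, n);
    }

    int main() {
        float data[16];
        for (int i = 0; i < 16; ++i) data[i] = 1.0f;
        return sum(data, 16) == 16.0f ? 0 : 1;
    }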
Re: Java challenges - those stem from JVM bytecode being a very difficult optimization target: every call is virtual by default with a complex dispatch strategy, everything is a heap-allocated object by default save for very few primitives, and generics lose type information and are never monomorphized. PGO through tiered compilation, and the resulting guarded devirtualization and object escape analysis, is what reclaims performance in Java and makes it acceptable. C and C++ with templates are a massively easier optimization target for GCC, and GCC does not operate under strict time constraints either. Therefore we have the results that we do.
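A toy illustration of the guarded devirtualization mentioned above, written in C++ as a stand-in for what the JIT emits (real JITs compare a class id or vtable pointer rather than using dynamic_cast): profiling says the receiver is almost always one class, so the compiler tests for it and inlines the method, keeping the virtual call as the fallback.

    #include <cstdio>

    struct Shape {
        virtual double area() const = 0;
        virtual ~Shape() = default;
    };
    struct Circle : Shape {
        double r;
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.14159265358979 * r * r; }
    };

    // What the JIT conceptually emits after the profile shows the receiver
    // is (almost) always Circle: a cheap type guard plus an inlined body,
    // with the generic virtual dispatch kept as the slow path / deopt exit.
    double area_devirt(const Shape* s) {
        if (const Circle* c = dynamic_cast<const Circle*>(s))  // guard
            return 3.14159265358979 * c->r * c->r;             // inlined body
        return s->area();                                      // fallback
    }

    int main() {
        Circle c(2.0);
        std::printf("%f\n", area_devirt(&c));
    }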
Also interesting data points here if you'd like to look at AOT capabilities of higher-level languages:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
mannyv
I wonder if you could use clang/llvm to do a super-JIT by having it recompile its IR as the program runs, taking advantage of profiling to optimize the hot paths.
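Something like this is what LLVM's ORC JIT infrastructure is built for. Here is a conceptual sketch of just the tier-up trigger; recompileAtO2 is a hypothetical stand-in, not a real LLVM API.

    #include <atomic>
    #include <cstdint>

    // Hypothetical tiering glue around a JIT. In a real system, LLVM's
    // ORC/LLJIT would own the IR, run the pass pipeline, and swap stubs.
    struct CompiledFn {
        std::atomic<uint64_t> entries{0};
        std::atomic<bool> reoptimized{false};
        void* code = nullptr;              // current entry point
    };

    // Hypothetical: re-run the optimization pipeline on the function's
    // saved IR with the recorded profile attached, return new machine code.
    void* recompileAtO2(CompiledFn&) { return nullptr; }

    constexpr uint64_t kHotThreshold = 10'000;

    // Instrumentation call emitted into the cheap first-tier code.
    void onEntry(CompiledFn& fn) {
        if (++fn.entries == kHotThreshold && !fn.reoptimized.exchange(true))
            fn.code = recompileAtO2(fn);   // patch in the optimized version
    }

    int main() {
        CompiledFn f;
        for (uint64_t i = 0; i < kHotThreshold; ++i) onEntry(f);
        return f.reoptimized ? 0 : 1;
    }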
tsunego
Chasing inline cache micro-optimizations with dynamic binary modification is a dead end. Modern CPUs are laughing at our outdated compiler tricks. Maybe it's time to accept that clever hacks won't outrun silicon.
pizlonator
I think the missing piece here is that JavaScriptCore (JSC) and other such systems don't just use inline caching to speed up dynamic accesses; they use them as profiling feedback.
So, anytime you have an IC in interpreter, baseline, or lightly optimized code, that IC is monitored to see how polymorphic it gets, and that data is fed back into the optimization pipeline.
Just having an IC as a dead end, where you don't use it for profiling, is way less profitable than having an IC that feeds into profiling.
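A minimal sketch of what that feedback can look like (illustrative only, not JSC's actual data structures): each IC site records the shapes it has seen, and the optimizing tier reads that record to pick a specialization strategy.

    #include <cstddef>
    #include <cstdint>

    // Illustrative only -- not JSC's actual data structures.
    constexpr std::size_t kMaxShapes = 4;

    struct ICSite {
        uint32_t seenShapes[kMaxShapes] = {};
        std::size_t count = 0;                 // distinct shapes observed

        void record(uint32_t shape) {
            for (std::size_t i = 0; i < count; ++i)
                if (seenShapes[i] == shape) return;
            if (count < kMaxShapes) seenShapes[count++] = shape;
            else count = kMaxShapes + 1;       // saturate: megamorphic
        }
    };

    // What the optimizing tier reads off the IC when it recompiles.
    enum class Plan { Monomorphic, Polymorphic, Megamorphic };

    Plan planFor(const ICSite& ic) {
        if (ic.count == 1) return Plan::Monomorphic;        // one guard + deopt
        if (ic.count <= kMaxShapes) return Plan::Polymorphic; // small dispatch tree
        return Plan::Megamorphic;                           // keep a generic lookup
    }

    int main() {
        ICSite ic;
        ic.record(7); ic.record(7); ic.record(9);
        return planFor(ic) == Plan::Polymorphic ? 0 : 1;
    }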