
Tiny JITs for a Faster FFI


142 comments

· February 12, 2025

cchianel

I had to deal with a lot of FFI to enable a Java Constraint Solver (Timefold) to call functions defined in CPython. In my experience, most of the performance problems from FFI come from using proxies to communicate between the host and foreign language.

A direct FFI call using JNI or the new foreign interface is fast, and has roughly the same speed as calling a Java method directly. Alas, the CPython and Java garbage collectors do not play nice, and require black magic in order to keep them in sync.

On the other hand, using proxies (as JPype and GraalPy do) causes significant performance overhead, since parameters and return values need to be converted, and the conversion might trigger additional FFI calls in the other direction. The fun thing is that if you pass a CPython object to Java, Java gets a proxy to the CPython object. And if you pass that proxy back to CPython, a proxy to that proxy is created instead of unwrapping it. The result: JPype proxies are 1402% slower than calling CPython directly using FFI, and GraalPy proxies are 453% slower than calling CPython directly using FFI.

What I ultimately ended up doing was translating CPython bytecode into Java bytecode, and generating Java data structures corresponding to the CPython classes used. As a result, I got a 100x speedup compared to using proxies. (Side note: if you are thinking about translating/reading CPython bytecode, don't; it is highly unstable, poorly documented, and its VM has several quirks that make it hard to map directly to other bytecodes.)

For more details, you can see my blog post on the subject: https://timefold.ai/blog/java-vs-python-speed

LinXitoW

Speaking from zero experience, the FFI stories of both Python and Java to C seem much better. Wouldn't connecting them via a little C bridge be a general solution?

cchianel

JNI/the new Foreign FFI communicate with CPython via CPython's C API. The primary issue is getting the garbage collectors to work with each other. The Java solver works by repeatedly calling user-defined functions when calculating the score. As a result:

- The Java side needs to store opaque Python pointers which may have no references on the CPython side.

- The CPython side needs to store generated proxies for some Java objects (the results of constraint collectors, which are basically aggregations of a solution's data).

Solving runs a long time, typically at least an hour (although you can configure how long it runs). If we don't free memory (by releasing the opaque Python pointer return values), we will run out of memory within a couple of minutes. The only way to free that memory on the Java side is to close the arena holding the opaque Python pointer. However, when that arena is closed, its memory is zeroed out to prevent use-after-free. As a result, if CPython hasn't garbage collected that pointer yet, the next CPython garbage collection cycle will cause a segmentation fault.

JPype (a CPython -> Java bridge) does dark magic to link the JVM's and CPython's garbage collectors, but has performance issues when calling a CPython function inside a Java function, since its proxies have to do a lot of work. Even GraalPy, where Python is run inside a JVM, has performance issues when Python calls Java code which calls Python code.

ignoramous

Also see: cgo is not Go

  Go code and C code have to agree on how resources like address space, signal handlers, and thread TLS slots are to be shared — and when I say agree, I actually mean Go has to work around the C code's assumptions: C code that can assume it always runs on one thread, or is blithely unprepared to work in a multi-threaded environment at all.

  C doesn't know anything about Go's calling convention or growable stacks, so a call down to C code must record all the details of the goroutine stack, switch to the C stack, and run C code which has no knowledge of how it was invoked, or the larger Go runtime in charge of the program.

  It doesn't matter which language you’re writing bindings or wrapping C code with; Python, Java with JNI, some language using libFFI, or Go via cgo; it is C's world, you're just living in it.
https://dave.cheney.net/2016/01/18/cgo-is-not-go / https://archive.vn/GZoMK

high_na_euv

How would IPC methods fit such cases?

Like, talk over some queue, file, http, etc

cchianel

IPC methods were actually used when constructing the foreign API prototype, since if you do not use JPype, the JVM must be launched in its own process. IPC was used at the API level: the JVM started its own CPython interpreter, and CPython and Java used `cloudpickle` to send each other functions/objects.

Using IPC for all internal calls would probably add significant overhead; the user functions are typically small (think `lambda shift: shift.date in employee.unavailable_dates` or `lambda lesson: lesson.teacher`). Depending on how many constraints you have and how complicated your domain model is, there could be hundreds of context switches for a single score calculation. It might be worth prototyping, though.


chris12321

Between Rails At Scale and byroot's blogs, it's currently a fantastic time to be interested in in-depth discussions around Ruby internals and performance! And with all the recent improvements in Ruby and Rails, it's a great time to be a Rubyist in general!

jupp0r

Is it? To me it seems like Ruby is declining [1]. It's still popular for a specific niche of applications, but to me it seems like it's well past its days of glory. Recent improvements are nice, but is a JIT really that exciting technologically in 2025?

[1]: https://www.tiobe.com/tiobe-index/ruby/

chris12321

Ruby will probably never again be the most popular language in the world, and it doesn't need to be for the people who enjoy it to be excited about the recent improvements in performance, documentation, tooling, ecosystem, and community.

faizshah

I think Ruby can get popular again with the sort of contrarian things Rails is doing, like helping developers exit the cloud.

There isn’t really a much more productive web dev setup than Rails + your favorite LLM tool. It will take time to win Gen Z back to Rails, though, and away from Python/TS or Go/Rust.

pier25

It never was "the most popular language in the world".

Rails maybe was popular in the US at some point but it was always niche in the rest of the world.

saagarjha

Was it ever

adamtaylor_13

Rails is experiencing something of a renaissance in recent years. It’s easily one of the most pleasant programming experiences I’ve had in years.

All my new projects will be Rails. (What about projects that don’t lend themselves to Rails? I don’t take on such projects ;)

cship2

Hmm, I thought Crystal was supposed to be a faster Ruby? No?

obiefernandez

Stable mature technology trumps glory.

jupp0r

That’s why the JVM has had JITs since the 1990s, while a JIT is a renaissance-inspiring marvel for Ruby in 2025.

pjmlp

Unfortunately it is, because too many folks still reach for purely interpreted languages for full-blown applications, instead of plain OS and application scripting tasks.

haberman

> Rather than calling out to a 3rd party library, could we just JIT the code required to call the external function?

I am pretty sure this is the basis of the LuaJIT FFI: https://luajit.org/ext_ffi.html

I think LuaJIT's FFI is very fast for this reason.

internetter

"write as much Ruby as possible, especially because YJIT can optimize Ruby code but not C code"

I feel like I'm not getting something. Isn't ruby a pretty slow language? If I was dipping into native I'd want to do as much in native as possible.

hinkley

There was a little drama that played out as Java was getting a proper JIT.

In one major release, there was a bunch of Java code responsible for handling some UI element activities. It was found to be a bottleneck, and rewritten in C code for the next major release.

Then the JIT became properly useful, and the FFI overhead was more than the difference between the hand-tuned C code and what the JIT would spit out on its own. So in the next major release, they rolled back to the all-Java implementation.

Java had a reasonably fast FFI for that generation of programming languages, but they swapped it for a better one a few releases later. By then I wasn't doing a lot of Java UI code, so I had stopped paying attention. But around the same time they were also making a cleaner interface between the platform-specific and the general Java code for UI, so I'm not entirely sure how that played out.

But that's exactly the sort of see-sawing you need to at least keep an eye out for when doing this sort of work. Would you be better off waiting a couple milestones and saving yourself a bunch of hand-tuning work, or do you need it right now for political or technical reasons?

pjmlp

That is where a JIT enters the picture; ideally, a JIT can re-optimize to an ideal state.

While this is suboptimal for one-shot execution, when an application is long-lived, mostly desktop or server workloads, this work pays off over the life of the application.

For example, Dalvik had a pretty lame JIT, so it was faster to call into C for math functions; eventually, with ART, this was no longer needed, as the JIT could outperform the cost of calling into C.

https://developer.android.com/reference/android/util/FloatMa...

genewitch

Depending on the math you need (this is a hedge), FORTRAN is probably faster still. Every time I put together a test and compare Python, Fortran, and C, Fortran wins by a margin, roughly Fortran:C:Python 1:1.2:1.9. I don't count startup; I only time the call until the function returns.

Most recently I did hand-looped matrix math and this ratio bore out.

I used gfortran, gcc, and python3.

pjmlp

Sure, but that doesn't fit the desktop or server workloads I mentioned; I guess we need to carve stuff like HPC out of those server workloads.

I would also add that modern Fortran looks quite sweet, the punched card FORTRAN is long gone, and folks should spend more time learning it instead of reaching out to Python.

kevingadd

When dealing with a managed language that has a JIT or AOT compiler it's often ideal to write lots of stuff in the managed language, because that enables inlining and other optimizations that aren't possible when calling into C.

This is sometimes referred to as "self-hosting", and browsers do it a lot by moving things into privileged JavaScript that might normally have been written in C/C++. Surprisingly large amounts of the standard library end up being not written in native code.

kenhwang

Ruby has realized this as well. When running in YJIT mode, some standard library methods switch to a pure-Ruby implementation instead of the C implementation, because the YJIT-optimized Ruby performs better.
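As an illustration, a hypothetical pure-Ruby iterator in that spirit might look like this (a simplified sketch; `each_in_ruby` is a made-up name, not CRuby's actual implementation):

```ruby
class Array
  # Made-up name; a simplified stand-in for a pure-Ruby Array#each.
  # A JIT like YJIT can compile both this loop and the block it yields to,
  # with no native boundary in between.
  def each_in_ruby
    return to_enum(:each_in_ruby) unless block_given?
    i = 0
    while i < size
      yield self[i]
      i += 1
    end
    self
  end
end

sum = 0
[1, 2, 3].each_in_ruby { |x| sum += x }
sum  # => 6
```

Because everything stays in Ruby, the JIT sees the whole loop body, whereas a C implementation would have to call back into Ruby for every `yield`.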

internetter

Oh, I am indeed surprised! I guess I always assumed that most of the JavaScript standard library was written in C++

achierius

Well, almost all of the compiler, runtime, allocator, garbage collector, object model, etc. are indeed written in C++, and so are many special operations (e.g. crypto functions, array sorts, walking the stack).

But specifically with regard to library functions, as the other commenter said, losing out on inlining hurts, and crossing between JS and native code can be pretty expensive. So even for things like sorting an array it can be better to do it in JS to avoid the overhead, especially in cases where you can provide a callback as your comparator, which is JS, and thus you have to cross back into JS for every element.

So it's a balancing game, and the various engines have gone back and forth on which functions are implemented in which language over time

jitl

Much of the standard library in v8 is written in Torque, a custom language.

https://v8.dev/docs/torque

Example file for array.find(…): https://github.com/v8/v8/blob/5fe0aa3bc79c0a9d3ad546b79211f0...

neonsunset

FFI presents an opaque, unoptimizable boundary of code. Having chatty code like this is going to cost a lot. It is even a factor in much faster languages with zero-cost-ish interop like C#: you still have to make a call, sometimes paying the cost of modifying state flags for the VM (GC transition).

If Ruby YJIT is starting to become a measurable factor (after all, it was slower than other, purely interpreted, languages until recently), then the same rule as above will become more relevant.
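A rough way to feel that boundary cost from Ruby is to benchmark a built-in method against the equivalent libc call made through the stdlib Fiddle binding (a sketch; absolute numbers vary by machine and Ruby version):

```ruby
require 'benchmark'
require 'fiddle'

# Bind libc's strlen through Fiddle, which sits on top of libffi.
strlen = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT['strlen'],
  [Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_SIZE_T
)
s = "hello world"

Benchmark.bm(16) do |b|
  b.report('String#bytesize') { 200_000.times { s.bytesize } }
  b.report('Fiddle strlen')   { 200_000.times { strlen.call(s) } }
end
```

Both compute the same number, but the Fiddle version pays the FFI boundary tax on every call, which is the "chatty code" cost in miniature.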

hahahacorn

There's a phenomenal breakdown by JPCamara (https://jpcamara.com/2024/12/01/speeding-up-ruby.html) on why the `Array#each` method was rewritten in Ruby (https://bugs.ruby-lang.org/issues/20182). And bonus content from Tenderlove: https://railsatscale.com/2023-08-29-ruby-outperforms-c/.

TL;dr - JIT rules.

kazinator

If FFI calls are slow (even slower than Ruby -> Ruby calls), that informs the way you use native code. You look for workflows that avoid frequent calls to an FFI function: e.g. a large number of calls in some inner loop. Suppose such a situation cannot be avoided. Then you may have no recourse but to move that loop out of Ruby into C: create a custom FFI for that use case which you can call once, and have it execute the loop, calling the function you really wanted many times.

If the FFI call can be made faster, maybe you can keep the loop in Ruby.

Of course that is attractive to people writing an application in Ruby.

That's how I interpret keeping as much code in Ruby as possible.

Nobody in their right mind wants to add additional application-specific jigs written in C just to use some C piece.

Once you start doing that, why even have FFI; you can just create a module.

One attractive point about FFI is that you can take some C library and use it in a higher level language without writing a line of C.
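For instance, Ruby's stdlib `Fiddle::Importer` lets you bind C functions by their prototypes (a sketch using two libc functions, assuming the symbols resolve through the default in-process handle):

```ruby
require 'fiddle/import'

# Sketch: bind libc functions by their C prototypes; no C code written.
module MyLibC
  extend Fiddle::Importer
  dlload Fiddle::Handle::DEFAULT  # search symbols already loaded in-process
  extern 'size_t strlen(const char *s)'
  extern 'int abs(int n)'
end

MyLibC.strlen("hello")  # => 5
MyLibC.abs(-42)         # => 42
```

The importer parses the prototype, builds the libffi descriptor, and does the Ruby-to-C argument conversion for you, which is exactly the convenience (and the per-call overhead) being discussed.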

doppp

It's been fast for a while now.

schneems

To add some nuance to the word "fast."

When we optimize Ruby for performance we debate how to eliminate X thousand heap allocations. When people in Rust optimize for performance, they're talking about how to hint to the compiler that the loop would benefit from SIMD.

Two different communities, two wildly different bars for "fast." Ruby is plenty performant. I had a Python developer tell me they were excited for the JIT work in Ruby, hoping Python could adopt something similar. For us, the one to beat (or come closer to) would be Node.js. We are still slower than it (lots of browser companies spent a LOT of time optimizing JavaScript JITs), but I feel that for the relative size of the communities Ruby punches above its weight. I also feel that we should be celebrating tides that raise all ships. Not everything can (or should) be written in C.

I personally celebrate any language getting faster, especially when the people doing it share as widely and are as good of a communicator as Aaron.

Thaxll

Even a 50% or 2x speed improvement still makes it a pretty slow language. It's in the Python range.

CyberDildonics

What is fast here? Ruby has usually been about 1/150th the speed of C.

kenhwang

If the code JITs well, Ruby performs somewhere between Go and NodeJS. Without the JIT, it's similar to Lua.

nicoburns

Has it? I thought Ruby was pretty much the benchmark for the slowest language. What is it faster than?

pansa2

Python. At least, it was a few years ago. Both languages have added JIT compilers since then, so I’m not sure how the most recent versions compare.

epcoa

Tcl, VBScript, bash/sh.

Tcl had its web moment during the first dot-com era with AOLserver.

m00x

Python

eay_dev

I've been using Ruby for more than 10 years, and seeing its development these days is very exciting. I hope

pestatije

FFI - Foreign Function Interface, or how to call C from Ruby

tonetegeatinst

The totally safe and sane approach is to write C code that gets passed data via the command line during execution, then vomits results to the command line or just into a memory page.

Then just execute the C program with your flags or data in the terminal using Ruby and voilà, Ruby can run C code.

fomine3

It's slow

grandempire

This. I think many people do not understand Unix processes and don't realize how rare it is to need bindings, FFI, and many libraries.

How many programs have an https client in them because they need to make one request and didn’t know they could use curl?

nirvdrum

Can you please elaborate on this because I'm struggling to follow your suggestion. Shelling out to psql every time I want to run an SQL query is going to be prohibitively slow. It seems to me you'd need bindings in almost the exact same cases you'd use a shared library if you were writing in C and that's really all bindings are anyway -- a bridge between the VM and a native library.

aidenn0

Why does this need to be JIT compiled? If it could be written in C, then it certainly could just be compiled at load time, no?

nirvdrum

If what could be written in C? The FFI library allows for dynamic binding of library methods for execution from Ruby without the need to write a native extension. That's a huge productivity boost and makes for code that can be shared across CRuby, JRuby, and TruffleRuby.

I suppose if you could statically determine all of the bindings at boot up you could write a stub and insert into the method table. But, that still would happen at runtime, making it JIT. And it wouldn't be able to adapt to the types flowing through the system, so it'd have to be conservative in what it accepts or what it optimizes, which is what libffi already does today. The AOT approach is to write a native extension.

aidenn0

By "it" I meant this part from TFA:

> you should write a native extension with a very very limited API where most work is done in Ruby. Any native code would be a very thin wrapper around the function we actually want to call that just converts Ruby types in to the types required by the native function.

I think our main disagreement is your assertion that any compilation at runtime qualifies as JIT. I consider JIT to be dynamic compilation (and possibly recompilation) of a running program, not merely anything that generates machine code at runtime.

brigandish

It's an aside, but

> Now, usually I steer clear of FFI, and to be honest the reason is simply that it doesn’t provide the same performance as a native extension.

I usually avoid it, or in particular, gems that use it, because compilation can be such a pain. I've found it easier to build it myself and cut out the middleman of Rubygems/bundler.

evacchi

somewhat related, this library uses the JVMCI (JVM Compiler Interface) to generate arm64/amd64 code on the fly to call native libraries without JNI https://github.com/apangin/nalim

nialv7

isn't this exactly what libffi does?

kazinator

libffi is slow; it doesn't JIT as far as I know.

In libffi you build up descriptor objects for functions. These are run-time data structures which indicate the argument and return value types.

When making a FFI call, you must pass in an array of pointers to the values you want to pass, and the descriptor.

Inside libffi there is likely a loop which walks the list of values while traversing the descriptor, and places those values onto the stack in the right way according to the type indicated in the descriptor. When the function is done, it then pulls out the return value according to its type. It's probably switching on type for all these pieces.

Even if the libffi call mechanism were JITted, the preparation of the argument array for it would still be slow. It's less direct than a FFI jit that directly accesses the arguments without going through an intermediate array.

FFI JIT code will directly take the argument values, convert them from the Ruby (or whatever) type to the C type, and stick it into the right place on the stack or register, and do that with inline code for each value. Then call the function, and convert the return value to the Ruby type. Basically as if you wrote extension code by hand:

  // Pseudo-code

  RubyValue *__Generated_Jit_Code(RubyValue *arg1, RubyValue *arg2)
  {
     return CStringToRuby(__targetCFunction(RubyToCString(arg1), RubyToCInt(arg2)));
  }
If there is type inference, the conversion code can skip type checks. If we have assurance that arg1 is a Ruby string, we can use an unsafe, faster version of the RubyToCString function.

The JIT code doesn't have to reflect over anything other than, at worst, the Ruby types. It doesn't have to have any array or list related to the arguments. It knows which C types are being converted to and from, and that is hard-coded: there is no data structure describing the C side that has to be walked at run-time.

nialv7

I am surprised many don't know how libffi works. Yes, it does generate native machine code to handle your call. Look it up.

Yes, it's probably worse than doing the JIT in the Ruby interpreter, since there you can also inline the type conversion calls, but the principles are the same.

Edit: This is wrong, see comments below.

dzaima

It certainly uses native machine code, but I don't think it generates any at runtime outside of the reverse-FFI closures (at least on linux)? PROT_EXEC at least isn't used outside of them, which'd be a minimum requirement for linux JITting.

Running in a debugger an ffi_call to a "int add(int a, int b)" leads me to https://github.com/libffi/libffi/blob/1716f81e9a115d34042950... as the assembly directly before the function is invoked, and, besides clearly not being JITted from me being able to link to it, it is clearly inefficient and unnecessarily general for the given call, loading 7 arguments instead of just the two necessary ones.

tenderlove

libffi can't know how to unwrap Ruby types (since it doesn't know what Ruby is). The advantage presented in this post is that the code for type unboxing is basically "cached" in the generated machine code based on the information the user passes when calling `attach_function`.

dzaima

libffi doesn't JIT for FFI calls; and it still requires you to lay out argument values yourself, i.e. for a string argument you'd still need to write code that converts a Ruby string object to a C string pointer. And libffi is rather slow.

(the tramp.c linked in a sibling comment is for "reverse-FFI", i.e. exposing some dynamic custom operation as a function pointer; and its JITting there amounts to a total of 3 instructions to call into precompiled code)
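Ruby's stdlib exposes that reverse-FFI path via `Fiddle::Closure`, which wraps a Ruby block as a C function pointer; a sketch handing a Ruby comparator to libc's `qsort` for a byte-wise in-place sort:

```ruby
require 'fiddle'

# Forward FFI: bind qsort(void *base, size_t nmemb, size_t size, cmp).
qsort = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT['qsort'],
  [Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T, Fiddle::TYPE_SIZE_T, Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_VOID
)

# Reverse FFI: a Ruby block exposed as a C comparison function.
# This is where libffi builds a small trampoline at runtime.
compare = Fiddle::Closure::BlockCaller.new(
  Fiddle::TYPE_INT, [Fiddle::TYPE_VOIDP, Fiddle::TYPE_VOIDP]
) { |a, b| a[0] <=> b[0] }  # Fiddle::Pointer#[] reads one byte

buf = +"dcba"  # mutable buffer, sorted in place by qsort
qsort.call(buf, buf.bytesize, 1, compare.to_i)
buf  # => "abcd"
```

Note the asymmetry: the forward call goes through libffi's generic dispatch, while only the callback side needed code generated at runtime.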

almostgotcaught

You know, I thought I knew what libffi was doing (I thought it was playing tricks with the GOT or something like that) but I think you're right.

https://github.com/libffi/libffi/blob/master/src/tramp.c

poisonta

I can sense why it didn’t go to tenderlovemaking.com

tenderlove

I think tenderworks wrote this post.

IshKebab

> Even in those cases, I encourage people to write as much Ruby as possible, especially because YJIT can optimize Ruby code but not C code.

But the C code is still going to be waaay faster than the Ruby code even with YJIT. That seems like an odd reason to avoid C. (I think there are other good reasons though.)

Alifatisk

> the C code is still going to be waaay faster than the Ruby code even with YJIT.

I can't find it, but I remember a talk where they showed examples of Ruby + YJIT hitting the same speed as C, and in some cases a bit more. The downside was that it required some warmup time.

IshKebab

I find that hard to believe. I've heard claims JIT can beat C for years, but they usually involve highly artificial microbenchmarks (like Fibonacci) and even for a high performance JITed language like Java it ends up not beating C. There's no way YJIT will.

The YJIT website itself only claims it is around twice as fast as Ruby, which means it is still slower than a flat earth snail.

The benchmarks game has YJIT and it's somewhere between PHP and Python. Very slow.

Alifatisk

I’ll see if I can find the presentation again on Youtube

igouy

Maybe "Ruby Outperforms C" was bait for "Outperforming a C extension" which is about the relative cost of calling C from Ruby?

https://railsatscale.com/2023-08-29-ruby-outperforms-c/