The missing tier for query compilers
4 comments · February 10, 2025

jamii
The tradeoff in SingleStore is interesting. By default, and unlike e.g. Postgres, parameterized queries are planned ignoring the values of the parameters. This allows caching the compiled query, but prevents adapting the plan to the actual values - for the example in the post, SingleStore would pick one plan for both queries.
But you can opt out of this behaviour by wrapping each parameter in NOPARAM (https://docs.singlestore.com/cloud/reference/sql-reference/c...).
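To make the opt-out concrete, here is a hedged sketch of what that looks like in SQL (the `orders` table and the status values are invented for illustration; only the `NOPARAM` wrapper comes from the linked docs):

```sql
-- By default the literal 'rare' is replaced with a placeholder before
-- planning, so the cached plan is value-agnostic.
SELECT * FROM orders WHERE status = 'rare';

-- Wrapping the value in NOPARAM keeps the literal visible to the planner,
-- so skewed values like 'rare' vs 'common' can get different plans -
-- at the cost of compiling and caching a plan per distinct value.
SELECT * FROM orders WHERE status = NOPARAM('rare');
```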
jsmith45
If it ignores the parameter values, then how does it estimate cardinality? Does it optimize for the worst case (which risks choosing join implementations that require far more data from other tables than makes sense when the parameters only match a few rows in the base table), or does it assume a row count for a more average parameter, which risks long running times or timeouts if the parameter happens to be an outlier?
Consider that Microsoft SQL Server, for example, also caches query plans ignoring parameters (and will auto-parameterize queries by default), but it uses the first parameter values it sees to estimate cardinalities (and thus join types, etc.) when creating the plan, under the assumption that those values are reasonably representative of future parameters.
That approach can work, but if the first query it sees has outlier parameter values, it may cache a plan that is terrible for more common values, when users would have preferred the plan tuned for typical values. Of course, it can also be the reverse: users want the plan that handles the outliers well, because that plan is still good enough for the common values.
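This "parameter sniffing" behavior can be steered with SQL Server's query hints; a hedged sketch (the table and `@status` parameter are invented, but `OPTION (RECOMPILE)` and `OPTIMIZE FOR` are the standard T-SQL hints):

```sql
-- By default the plan is built, and cached, using the first value
-- of @status the server happens to see.

-- Force a fresh plan per execution, trading compile time for accuracy:
SELECT * FROM orders WHERE status = @status
OPTION (RECOMPILE);

-- Or build the cached plan for a representative value rather than
-- whatever value arrives first:
SELECT * FROM orders WHERE status = @status
OPTION (OPTIMIZE FOR (@status = 'common'));
```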
convolvatron
Clustrix wrote a compiler. It could call out to C for string intrinsics, communication, and btree operations, so the generated code primarily handled extracting data from serialized formats, simple arithmetic, and constructing serializations for downstream stages.
this was to replace a little bytecode VM, and as far as I recall it took a very good programmer about two weeks for the basic version.
I don't see any reason to even try to bring in a general-purpose JIT. there's a huge benefit from just doing the basics, and you can move the line between precompiled functions and dynamically compiled ones on an ongoing basis. it's also cheap enough that I wouldn't get hung up on needing to amortize the cost with prepared statements.
The author pans meta-tracing and shows a graph of TruffleRuby with wildly erratic runtime behavior. However, the graph looks extremely suspicious - like the kind of thing you'd see if there were a memory leak and the garbage collector was running constantly.
Most people seem really excited about the results coming out of Graal, and TruffleRuby has some genuinely impressive results in general, so the graph is surprising. It's also missing a lot of details, which makes it hard to speak to its veracity - for example, which versions of the runtimes, with what flags, on what hardware?
As a counterexample, there's a different (admittedly old) result where TruffleRuby beats CRuby on the same benchmark by 4.38x. https://eregon.me/blog/2022/01/06/benchmarking-cruby-mjit-yj...