Still Asking: How Good Are Query Optimizers, Really? [pdf]
August 31, 2025
bob1029
Adapting the plan at runtime seems like the most universal solution for optimizer edge cases and is already implemented in the big 3.
If you think about it, the adaptive mechanism doesn't have to be perfect to provide a lot of uplift. Even coarse-grained detection and starting over from zero can make a huge difference if the alternative is a query burning up the server for hours and a failed batch job.
qazxcvbnm
Where can I find more information on the state of the art in runtime adaptation? I confess I don't feel like the databases I regularly use offer such a thing.
Adaptation sounds very compelling. If, instead of emitting a plan based on a cardinality estimate, we emitted a plan together with a range of reasonable intermediate cardinalities and expected times, interrupted the original plan when those expectations are exceeded by an order of magnitude, and switched to alternative plans based on the newly gathered physical information, it seems like it would be greatly beneficial. Are there concrete reasons this has not been done (e.g. cost, complexity)?
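For concreteness, roughly the kind of guard I have in mind, as a minimal sketch; the plan/optimizer objects here are entirely hypothetical, not any real database's API:

```python
# Minimal sketch of "emit a plan plus expected cardinality ranges, interrupt
# and re-plan when reality is off by an order of magnitude". Everything here
# (plan.operators, op.execute, optimizer.replan) is a hypothetical stand-in.

class ReplanSignal(Exception):
    """Raised when an observed cardinality blows past its estimated range."""

def run_with_guard(plan, optimizer, max_replans=2):
    for attempt in range(max_replans + 1):
        observed = {}
        try:
            rows = None
            for op, (lo_est, hi_est) in plan.operators:
                rows = op.execute(rows)          # materialize the intermediate result
                observed[op.name] = len(rows)
                if len(rows) > 10 * hi_est:      # only the high end is checked here
                    raise ReplanSignal(op.name)
            return rows
        except ReplanSignal:
            if attempt == max_replans:
                raise                            # give up and surface the problem
            # Feed back the cardinalities actually seen and retry with a
            # corrected plan, discarding the partial work.
            plan = optimizer.replan(plan, observed)
```

Discarding the partial work and restarting keeps this trivially correct at the cost of redundant execution; keeping partial results across re-plans is where the hard engineering would be.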
mike_hearn
It has been done. It's just that open source databases lag behind the state of the art.
https://docs.oracle.com/en/database/oracle/oracle-database/2...
Adaptive planning also applies to adapting the degree of query parallelism and to automatically banning query plans that take too much time.
Commercial databases have much bigger teams than Postgres does and have a lot more optimizations as a consequence. This includes optimizations that aren't in scope for an open source DB to begin with. For example, Oracle has used the Apple strategy for decades: a lot of Oracle DBs are sold as hardware, not software, shipped to the customer as a full cluster rack or set of racks.

The gear uses a dedicated 100Gb/sec inter-node Ethernet within the cluster to give horizontal scalability for joins and other complex read/write loads; queries and results go in and out over a separate frontend network. There are customizations up and down the stack: kernel, firmware, device drivers, database nodes, and the client drivers. Nodes directly read/write each other's memory using RDMA to fetch the latest version of rows from each other or to inform each other about new transactions.
https://static.rainfocus.com/oracle/ocw24/sess/1716494555804...
You could do stuff like this with Postgres perhaps, if you fork and rewrite enough of it, but it's unlikely to ever be done upstream.
Disclosure: I have a part time role at Oracle. All opinions 100% guaranteed unofficial.
namibj
Both reasons together :D
But yeah, you can snatch traversal statistics from your index accesses as you're executing, and if those hit planner-provided thresholds you go do whatever the planner said to do or check for that specific trigger, then feed your statistics and the extra queried info back to the planner to get the plan adjusted.
Doing this without creating unpredictable performance bugs, let alone correctness bugs, is going to be hard, though. Even more so when you consider that it also has to be fast, because otherwise it wouldn't be worth using.
You may want to check out WCOJ (worst-case optimal join) technology, where you can run a multi-way join even when the cardinalities are of a "could well be >10, don't really know, and it'll vary across data/keys" nature:
1. You index all data as needed for perfect index-lookup joins on ANY non-stupid join ordering the query could be run as.
2. You prepare separate indices that tell you, for each key, the number of values the single-key lookup join would return, for each such possible index-lookup join included in (1).
3. For querying, you start with a key or key range that can be fed to an index from (1). Then, for each tuple at each step between individual joins, you ask the candidate indices of (2) which one would be cheapest to expand with, use the selected index of (1) (as told by the fine-grained cardinality counts of (2)) to expand your tuple, and then choose the next relation to expand with.
If you ever have a choice where you know the cardinality of a key in (1) is always 0 or 1, you don't need an index of (2) for that index of (1), and you always apply those greedily, preferring the ones with the highest chance of killing the currently evaluating partial tuple before it propagates through further indices.
That alone, btw, is sufficient to get the worst-case optimal runtime for triangle listing/counting in arbitrary edge lists: O(n^(3/2)) (incidentally equal to the maximum possible number of triangles) instead of the naive/classic O(n^2), with n being the number of edges.
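To make that concrete, a minimal illustrative sketch of the triangle case, with Python sets standing in for the lookup indices of (1) and their sizes for the cardinality counts of (2); nothing like a tuned implementation:

```python
# Illustrative triangle listing over an edge list, in the spirit of (1)-(3):
# the adjacency sets are the lookup indices of (1), and their sizes are the
# per-key cardinality counts of (2) that drive the greedy expansion choice.
from collections import defaultdict

def triangles(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen = set()
    for a, b in edges:
        # Consult the cardinalities: expand via the cheaper of the two keys.
        small, large = (adj[a], adj[b]) if len(adj[a]) <= len(adj[b]) else (adj[b], adj[a])
        for c in small:
            if c in large:                  # 0/1-cardinality membership probe
                tri = tuple(sorted((a, b, c)))
                if tri not in seen:         # each triangle shows up once per edge
                    seen.add(tri)
                    yield tri

# A 4-cycle with one chord contains exactly two triangles:
print(list(triangles([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])))
```

Scanning the smaller adjacency set for every edge is what gives the O(n^(3/2)) bound; this hash-set version is of course nowhere near the constant factors discussed next.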
Constants here for the triangle case are on the order of 10~50, depending on how hard you want to push vectorization and flat-array "columnar" data structures versus in-memory structures that allow incremental updating without brute-force rebuilding from scratch at every change. E.g., supporting adding (but not deleting) edges while efficiently getting updated results doesn't fit the most aggressive performance optimizations: wanting incremental updates puts you at the ~50 constant factor, vs. a "mere" ~10 without them.
(Break-even against the naive O(n^2) is where c·n^(3/2) = n^2, i.e. at n = c^2: with a constant of 10 that's n = 100, with 50 it's n = 2500. IMO it's therefore quite relevant in many practical situations, given how quickly those asymptotic benefits kick in.)
RaftPeople
> Adapting the plan at runtime seems like the most universal solution for optimizer edge cases and is already implemented in the big 3
But the issues frequently aren't edge cases, and they frequently do show up at runtime (i.e. on a new query); it's just the best the optimizers can do with limited info (cardinality) and a limited amount of time to evaluate alternatives.
EDIT:
I just realized I misunderstood the post I responded to. I thought "adapt" meant updating a previously stored query plan, but I believe the intended meaning is adapting during execution of the query.
LtdJorge
Completely off-topic question, but does anyone more knowledgeable about academia know why basically no paper states the date of writing/publication clearly? Is it because publication may be delayed?
I've read, or at least downloaded, many of these VLDB and other DB papers, and it's difficult to get the year a paper was written just from the PDF; you need to find the DOI reference or something like that. And the date is important, for chronology and to know how "fresh" the topic is.
fritzo
Have worst-case optimal join algorithms made a practical impact since the linked article's first publication in 2015? I've seen them in the context of egglog, but are they used in real-world database management systems?
https://en.wikipedia.org/wiki/Worst-case_optimal_join_algori...
tylerhou
WCOJs guarantee an (asymptotic) upper bound on join complexity, but with the right join plan and appropriate cardinality estimates you can often do much better than the worst-case bound.
The runtime of WCOJ algorithms is even more dependent on good cardinality estimation. For instance, in VAAT (variable-at-a-time) evaluation, the main difficulty is finding an appropriate variable ordering, which relies on knowledge about cardinalities conditioned on particular variables having particular values. If you have the wrong ordering, you still achieve the worst-case optimal bound, but in some cases you could have done far better with other algorithms (e.g. Yannakakis' algorithm for acyclic queries). And as far as I know, many DBMSes do not keep track of this type of conditional cardinality, so it is unlikely that existing WCOJ implementations will be faster in practice.
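For the acyclic case, here's a minimal sketch of the semi-join reduction at the heart of Yannakakis' algorithm, for the toy path query R(a,b) ⋈ S(b,c) ⋈ T(c,d), with relations as plain Python sets of tuples; illustrative only, not how any engine spells it:

```python
# Minimal sketch of Yannakakis-style processing for the acyclic path query
# R(a,b) JOIN S(b,c) JOIN T(c,d): two semi-join passes shrink every relation
# to tuples that can participate in some answer, and only then do we join.

def semijoin(left, right, l_col, r_col):
    """Keep the tuples of `left` whose l_col value appears in right's r_col."""
    keys = {t[r_col] for t in right}
    return {t for t in left if t[l_col] in keys}

def path_join(R, S, T):
    # Bottom-up pass along the join tree R - S - T (leaf T up to root R).
    S = semijoin(S, T, 1, 0)   # keep S tuples whose c has a match in T
    R = semijoin(R, S, 1, 0)   # keep R tuples whose b has a match in reduced S
    # Top-down pass back from the root.
    S = semijoin(S, R, 0, 1)
    T = semijoin(T, S, 0, 1)
    # Every surviving tuple now extends to at least one answer, so plain
    # pairwise joins stay proportional to input + output size.
    return {(a, b, c, d)
            for (a, b) in R
            for (b2, c) in S if b2 == b
            for (c2, d) in T if c2 == c}

# Tiny example: only the 1 -> 2 -> 5 -> 6 path survives the reduction.
print(path_join({(1, 2), (3, 9)}, {(2, 5), (7, 8)}, {(5, 6)}))  # {(1, 2, 5, 6)}
```

After the two reduction passes, no intermediate join result can blow up the way a bad join ordering can, regardless of what a cost model would have picked.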
The new hotness is "instance optimal" joins...
namibj
Materialize.com runs on those techniques; not sure how far this has gotten into the query planner yet, though. It's been a while since I looked at those details for it.
I'm pretty sure some Datalog(-adjacent?) but otherwise quite proprietary solution uses these techniques; it might be Datadog, or something similar enough that I'd have to go dig through old notes/chats to pin down the details. It was mentioned as the affiliation of at least one author of at least one paper in that space, as of about early 2024.
Edit: likely was https://en.wikipedia.org/wiki/LogicBlox
rendaw
I'm a layman, but I was really hoping to see some actual benchmark results in the paper.
serte
Great article, thanks for sharing. The tension between "SQL is declarative" and "Write the query like this or it will OOM" has always made me uncomfortable.
I've used closed, mature systems with custom cost-based optimization and still find that I sometimes need to override the optimizer on join orders. It would be interesting to see whether any of the benchmarks in this paper have similar shapes.