ClickHouse gets lazier and faster: Introducing lazy materialization
41 comments · April 22, 2025
tmoertel
This optimization should provide dramatic speed-ups when taking random samples from massive data sets, especially when the wanted columns can contain large values. That's because the basic SQL recipe relies on a LIMIT clause to determine which rows are in the sample (see the query below), and this new optimization promises to defer reading the big columns until the LIMIT clause has filtered the data set down to a tiny number of lucky rows.
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)
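The recipe in question is presumably of this shape (a sketch; the same query appears in the EXPLAIN verification in a reply below):
-- weighted random sample of 3 rows, using helpful_votes as the weight
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3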
zX41ZdbW
I checked, and yes - it works: https://pastila.nl/?002a2e01/31807bae7e114ca343577d263be7845...
tmoertel
Thanks! That's a nice 5x improvement. Pretty good for a query that offers only modest opportunity, given that the few columns it asks for are fairly small (`title` being the largest, which isn't that large).
tschreiber
Verified:
EXPLAIN plan actions = 1
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3
Lazily read columns: review_body, review_headline, verified_purchase, vine, total_votes, marketplace, star_rating, product_category, customer_id, product_title, product_id, product_parent, review_date, review_id
Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the max n for which lazy materialization kicks in for LIMIT n.
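For larger LIMITs, the threshold can presumably be raised per query via a SETTINGS clause (a sketch, not from the article; the value 100 is arbitrary):
-- allow lazy materialization to kick in for LIMIT values up to 100
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 100
SETTINGS query_plan_max_limit_for_lazy_materialization = 100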
tmoertel
Awesome! Thanks for checking :-)
jurgenkesker
I really like Clickhouse. Discovered it recently, and man, it's such a breath of fresh air compared to suboptimal solutions I used for analytics. It's so fast and the CLI is also a joy to work with.
theLiminator
How does it compare to duckdb and/or polars?
EvanAnderson
Same here. I come from a strong Postgres and Microsoft SQL Server background and I was able to get up to speed with it, ingesting real data from text files, in an afternoon. I was really impressed with the docs as well as the performance of the software.
simonw
Unrelated to the new materialization option, this caught my eye:
"this query sorts all 150 million values in the helpful_votes column (which isn’t part of the table’s sort key) and returns the top 3, in just 70 milliseconds cold (with the OS filesystem cache cleared beforehand) and a processing throughput of 2.15 billion rows/s"
I clearly need to update my mental model of what might be a slow query against modern hardware and software. Looks like that's so fast because in a columnar database it only has to load that 150 million value column. I guess sorting 150 million integers in 70ms shouldn't be surprising.
(Also "Peak memory usage: 3.59 MiB" for that? Nice.)
This is a really great article - very clearly explained, good diagrams, I learned a bunch from it.
amluto
> I guess sorting 150 million integers in 70ms shouldn't be surprising.
I find sorting 150M integers at all to be surprising. The query asks for finding the top 3 elements and returning those elements, sorted. This can be done trivially by keeping the best three found so far and scanning the list. This should operate at nearly the speed of memory and use effectively zero additional storage. I don’t know whether Clickhouse does this optimization, but I didn’t see it mentioned.
Generically, one can find the kth best of n elements in time O(n):
https://en.m.wikipedia.org/wiki/Selection_algorithm
And one can scan again to find the top k, plus some extra if the kth best wasn’t unique, but that issue is manageable and, I think, adds at most a factor of 2 overhead if one is careful (collect up to k elements that compare equal to the kth best and collect up to k that are better than it). Total complexity is O(n) if you don’t need the result sorted or O(n + k log k) if you do.
If you’re not allowed to mutate the input (which probably applies to Clickhouse-style massive streaming reads), you can collect the top k in a separate data structure, and straightforward implementations are O(n log k). I wouldn’t be surprised if using a fancy heap or taking advantage of the data being integers with smallish numbers of bits does better, but I haven’t tried to find a solution or disprove the existence of one.
danlark1
I am the author of the optimization of partial sorting and selection in ClickHouse. It uses the Floyd-Rivest algorithm; we tried a lot of different things back then, see [1].
Overall, ClickHouse reads blocks of fixed size (64k), finds the top elements in each, and then takes the top of the tops until it converges.
[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...
Akronymus
> This can be done trivially by keeping the best three found so far and scanning the list.
That doesn't seem to guarantee correctness. If you don't track all of the unique values, at least, you could be throwing away one of the most common values.
The wiki entry seems to be specifically about the smallest, rather than largest values.
senderista
The max-heap algorithm alluded to above is correct. You fill it with the first k values scanned, then peek at the max element for each subsequent value. If the current value is smaller than the max element, you evict the max element and insert the new element. This streaming top-k algorithm is ubiquitous in both leetcode interviews and applications. (The standard quickselect top-k algorithm is not useful in the streaming context because it requires random access and in-place mutation.)
datadrivenangel
With an equality that returns true/false, this guarantees correctness. If there can be 3 best/biggest/smallest values, this technique works.
recursive
What? The algorithm is completely symmetrical with respect to smallest or largest, and fully correct and general. I don't understand the problem with unique values. Could you provide a minimal input demonstrating the issue?
simonw
Maybe they do have that optimization and that explains the 3.59 MiB peak memory usage for ~600MB of integers.
baq
Slow VMs on overprovisioned cloud hosts which cost as much per month as a dedicated box per year have broken a generation of engineers.
You could host so much from your macbook. The average HN startup could be hosted on a $200 minipc from a closet for the first couple of years if not more - and I'm talking expensive here for the extra RAM you want to not restart every hour when you have a memory leak.
federiconafria
Not only that, you have a pile of layers that could be advantageous in some situations but are overkill in most.
I've seen Spark clusters being replaced by a single container using less than 1 CPU core and a few hundred MB of RAM.
rfoo
> so much from your macbook
At least on cloud I can actually have hundreds of GiBs of RAM. If I want this on my Macbook it's even more expensive than my cloud bill.
baq
You can, but if you need it you’re not searching for a product market fit anymore.
sofixa
Raw compute wise, you're almost right (almost because real cloud hosts aren't overprovisioned, you get the full CPU/memory/disk reserved for you).
But you actually need more than compute. You might need a database, cache, message broker, scheduler, to send emails, and a million other things you can always DIY with FOSS software, but take time. If you have more money than time, get off the shelf services that provide those with guarantees and maintenance; if not, the DIY route is also great for learning.
baq
My point is that all of this can be hosted on a single bare-metal box, and a small one at that! We used to do just that back in the mid-naughts, and computers have only gotten faster. Half of those cloud services are preconfigured FOSS derivatives behind the scenes anyway (probably…)
vjerancrnjak
It's quite amazing how a db like this shows that all of those row-based dbs are doing something wrong; they can't even approach these speeds with B-tree index structures. I know they like transactions more than ClickHouse does, but it's just amazing to see how fast modern machines are: billions of rows per second.
I'm pretty sure they did not even bother to properly compress the dataset; with some tweaking, it could probably have been much smaller than 30 GB. The speed shows that reading the data is slower than decompressing it.
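Incidentally, one way to check how well each column compresses would presumably be a query against ClickHouse's system.columns table (a sketch; the database and table names are assumed from the queries earlier in the thread):
-- compare on-disk vs. uncompressed size per column for the reviews table
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = 'amazon' AND table = 'amazon_reviews'
ORDER BY data_compressed_bytes DESC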
Reminds me of that Cloudflare article where they had a similar idea about encryption being free (slower to read than to decrypt) and found a bug that, when fixed, materialized this behavior.
The compute engine (chdb) is a wonder to use.
ohnoesjmr
Wonder how well this propagates down to subqueries/CTE's
simianwords
Maybe I'm too inexperienced in this field but reading the mechanism I think this would be an obvious optimisation. Is it not?
But credit where it is due, obviously clickhouse is an industry leader.
ahofmann
Obvious solutions are often hard to do right. I bet the code that was needed to pull this off is either very complex or took a long time to write (and test). Or both.
ryanworl
This is a well-known class of optimization and the literature term is “late materialization”. It is a large set of strategies including this one. Late materialization is about as old as column stores themselves.
meta_ai_x
can we take the "packing your luggage" analogy and only pack the things we actually use in the trip and apply that to clickhouse?
Onavo
Reminder: ClickHouse can optionally be embedded; you don't need to reach for Duck just because of the hype (it's buggy as hell every time I've tried it).
https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...
sirfz
Chdb is awesome but so is duckdb