A sharded DuckDB on 63 nodes runs 1T row aggregation challenge in 5 sec
71 comments
October 24, 2025
lolive
Why doesn't such large-scale test the big feature everyone needs, which is inner join at scale?
MobiusHorizons
> Once trusted, each worker executes its local query through DuckDB and streams intermediate Arrow IPC datasets back to the server over secure WebSockets. The server merges and aggregates all results in parallel to produce the final SQL result—often in seconds.
Can someone explain why you would use websockets in an application where neither end is a browser? Why not just use regular sockets and cut the overhead of the http layer? Is there a real benefit I’m missing?
kevincox
> the overhead of the http layer
There isn't much overhead here other than connection setup. For HTTP/1 the connection is just "upgraded" to websockets. For HTTP/2 I think the HTTP layer still lives on a bit so that you can use connection multiplexing (which may be overhead if you have no use for it here) but that is still a very thin layer.
So I think the question isn't so much HTTP overhead but WebSocket overhead. WebSockets add a bit of message framing and whatnot that may be overhead if you don't need it.
In 99% of applications if you need encryption, authentication and message framing you would be hard-pressed to find a significantly more efficient option.
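The "upgrade" mentioned above really is just one round-trip of headers; the only computation is deriving the accept key. A minimal sketch of that step (the standard RFC 6455 derivation, not code from the article):

```python
import base64
import hashlib

# GUID fixed by RFC 6455 for computing the handshake accept value.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def ws_accept(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept header from the client's key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Example key/accept pair from RFC 6455 section 1.3.
print(ws_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

After this exchange the connection is a plain framed byte stream; none of the HTTP machinery runs per message.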
toast0
> In 99% of applications if you need encryption, authentication and message framing you would be hard-pressed to find a significantly more efficient option.
AFAIK, websockets doesn't do authentication? And the encryption it does is minimal, optional xor with a key disclosed in the handshake. It does do framing.
It's not super common, but if all your messages have a 16-bit length, you can just use TLS framing. I would argue that TLS framing is inefficient (multiple length terms), but using it by itself is better than adding a redundant framing layer.
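To illustrate the point about masking: the client-side "mask" is just a repeating 4-byte XOR, trivially reversible by anyone who saw the frame header. A minimal sketch of the RFC 6455 masking rule:

```python
import itertools

def mask_payload(payload: bytes, masking_key: bytes) -> bytes:
    """RFC 6455 client-to-server masking: XOR each payload byte with
    the 4-byte key, cycled. Applying it twice restores the original,
    and the key travels in the clear in the frame header, so this is
    anti-cache-poisoning hygiene, not encryption."""
    return bytes(b ^ k for b, k in zip(payload, itertools.cycle(masking_key)))

data = b"hello"
key = b"\x12\x34\x56\x78"
masked = mask_payload(data, key)
assert mask_payload(masked, key) == data  # involutive, trivially reversible
```

Confidentiality, if you want it, comes from running the whole thing over TLS (wss://), not from this layer.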
simonw
If you're using sockets you still need to come up with some kind of protocol on top of those sockets for the data that's being transferred - message delimiter, a data format etc. Then you have to build client libraries for that protocol.
WebSockets solve a bunch of those low level problems for you, in a well specified way with plenty of existing libraries.
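A length-prefixed protocol is the usual minimal answer to "some kind of protocol on top". A sketch of what you'd end up building yourself on raw sockets (hypothetical helper names, assuming a 4-byte big-endian length prefix):

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the raw bytes.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    # TCP is a byte stream: recv() may return partial data, so loop.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Round-trip over a local socket pair.
a, b = socket.socketpair()
send_msg(a, b"arrow ipc bytes here")
print(recv_msg(b))  # b'arrow ipc bytes here'
```

That's only the framing; authentication, TLS, ping/keepalive, and clean close semantics are all still on you, which is the part WebSockets standardizes.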
zerd
WebSocket doesn't specify data format, it's just bytes, so they have to handle that themselves. It looks like they're using Arrow IPC.
Since they're using Arrow they might look into Flight RPC [1] which is made for this use case.
HumblyTossed
ASCII control codes 1, 2, 3 & 4 are pretty simple to use.
dns_snek
Sure, in principle. Someone already mentioned binary data, then you come up with a framing scheme and get to write protocol documentation, but why? What's the benefit?
jcheng
Not if you're passing binary data
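A quick illustration of why delimiter-based framing breaks down for binary payloads (ETX chosen arbitrarily as the delimiter):

```python
# ASCII 0x03 (ETX) as a record delimiter works for text, but an arbitrary
# binary payload can itself contain that byte, splitting one record in two.
DELIM = b"\x03"

payload = bytes([0x01, 0x03, 0x7F])  # binary data that happens to contain ETX
records = payload.split(DELIM)
print(len(records))  # 2 — the single payload was wrongly split
```

That's why binary protocols use length prefixes (or escaping/byte-stuffing, at extra cost) instead of in-band delimiters.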
philbe77
Hi MobiusHorizons, I happened to use websockets b/c it was the technology I was familiar with. I will try to learn more about normal sockets to see if I could perhaps make them work with the app. Thanks for the suggestion...
gopalv
> will try to learn more about normal sockets to see if I could perhaps make them work with the app.
There's a whole skit in the vein of "What have the Romans ever done for us?" about ZeroMQ[1] which has probably lost to the search index now.
As someone who has held a socket wrench before, fought tcp_cork and dsack, Websockets isn't a bad abstraction to be on top of, especially if you are intending to throw TLS in there anyway.
Low-level sockets are like assembly: you can use them, but they're a whole box of complexity (you might use them completely raw sometimes, like a tickle ack in the ctdb[2] implementation).
DanielHB
if you really want maximum performance maybe consider using CoAP for node-communication:
https://en.wikipedia.org/wiki/Constrained_Application_Protoc...
It is UDP-based but adds handshakes and retransmissions. But I am guessing for your benchmark transmission overhead isn't a major concern.
Websockets are not that bad, only the initial connection is HTTP. As long as you don't create a ton of connections all the time it shouldn't be much slower than a TCP-based socket (purely theoretical assumption on my part, I never tested).
sureglymop
Wait but websockets aren't over http right? Just the initiation and then there is a protocol upgrade or am I wrong? What overhead is there otherwise?
tsimionescu
You're right, WebSockets aren't over HTTP, they just use HTTP for the connection initiation. They do add some overhead in two places: one, when opening a new connection, since you go TCP -> TLS -> HTTP -> WebSockets -> Your protocol ; and two, they do add some per packet overhead, since there is a WebSocket encapsulation of your data - but this is much smaller than typical HTTP request/response overhead.
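That per-frame overhead is small and easy to quantify. A sketch of the RFC 6455 header sizes (masked = client-to-server frames; not measured from the article's setup):

```python
def ws_frame_overhead(payload_len: int, masked: bool = True) -> int:
    """Bytes of WebSocket framing added per message (RFC 6455):
    a 2-byte base header, an extended length field for larger
    payloads, and a 4-byte masking key on client-to-server frames."""
    overhead = 2
    if payload_len >= 65536:
        overhead += 8          # 64-bit extended payload length
    elif payload_len >= 126:
        overhead += 2          # 16-bit extended payload length
    if masked:
        overhead += 4
    return overhead

print(ws_frame_overhead(100))        # 6
print(ws_frame_overhead(1_000_000))  # 14
```

So even a 1 MB Arrow IPC batch carries 14 bytes of WebSocket framing, which is noise next to TCP/TLS record overhead.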
lucideer
> overhead of the http layer
Detail of this well-covered in sibling comments, but at a higher-level, two thoughts on this:
1. I see a lot of backlash lately against everything being HTTP-ified, with little justification other than a presumption that it necessarily adds overhead. Perf-wise, HTTP has come a long way & modern HTTP is a very efficient protocol. I think this has cleared the way for it to be a foundation for many more things than in the past. HTTP/3 being over UDP might clear the way for more of this (albeit I think the overhead of TCP/IP is also often overstated - see e.g. MQTT).
2. Overhead can be defined in two ways: perf. & maintenance complexity. Modern HTTP does add a bit of the latter, so in that context it may be a fair concern, but I think the large range of competing implementations probably obviates any concern here & the alternative usually involves doing something custom (albeit simpler), so you run into inconsistency, re-invented wheels & bus factor issues there.
fweimer
One reason comes to my mind: HTTP is no longer a stable protocol with well-understood security properties. If you deploy it today, people expect interoperability with clients and servers that implement future protocol upgrades, resulting in an ongoing maintenance burden that a different protocol choice would avoid.
lucideer
I'm absolutely not an expert of any kind on protocol details, so pardon my ignorance here but this surprises me: is this true?
High-level spec changes have been infrequent, with long dual-support periods, & have generally seen pretty slow, gradual client & server adoption. 1.1 was 1997 & continues to have widespread support today. 2 & 3 were proposed in 2015 & 2016 - almost 2 decades later - & 2 is only really starting to see wide support today, with 3 still broadly unsupported.
I'm likely missing a lot of nuance in between versioned releases though - I know e.g. 2 saw at least two major additions/updates, though I thought those were mostly additive security features rather than changes to existing protocol features.
immibis
Using stuff like HTTP signals a lack of understanding of the whole stack. IMO it's important for programmers to understand computers. You can write programs without understanding computers, but it's best if you go and learn about computers first. You can use abstractions but you should also understand the abstractions.
There are two ways I've noticed to design an application.
Some people grab some tools out of their toolbox that look like they fit - I need a client/server, I know web clients/servers, so I'll use a web client/server.
Other people think about what the computer actually has to do and then write code to achieve that: Computer A has to send a block of data to computer B, and this has to work on Linux (which means no bit-banging - you can only go as low as raw sockets). This type of person may still take shortcuts, but it's by intention, not because it's the only thing they know: if HTTP is only one function call in Python, it makes sense to use HTTP, not because it's the only thing you know but because it's good enough, you know it works well enough for this problem, and you can change it later if it becomes a bottleneck.
Websockets are an odd choice because they're sort of the worst of both worlds: they're barely more convenient than raw sockets (there's framing, but framing is easy), but they also add a bunch of performance and complexity overhead over raw sockets, and more things that can go wrong. So it doesn't seem to win on the convenience/laziness front nor the performance/security/robustness front. If your client had to be a web browser, or could sometimes be a web browser, or if you wanted to pass the connections through an HTTP reverse proxy, those would be good reasons to choose websockets, but none of them are the case here.
lucideer
Acknowledging that a huge number of people (the vast majority) are going to use the only option they know rather than the best of a set of options they know, I still think that for a person who's fully versed in all available options, Websockets is a better option than you make out.
> they're barely more convenient than raw sockets
Honestly, raw sockets are pretty convenient - I'm not convinced Websockets are more convenient at all (assuming you already know both & there's no learning curve). Raw sockets might even be more convenient.
I think it's features rather than convenience that is more likely to drive Websocket usage when comparing the two.
> they also add a bunch of performance and complexity overhead over raw sockets
This is the part that I was getting at in my above comment. I agree in theory, but I just think that the "a bunch" quantifier is a bit of an exaggeration. They really add very, very little performance overhead in practice: a negligible amount in most cases.
So for a likely-negligible performance loss, & a likely-negligible convenience difference, you're getting a protocol with built-in encryption, widespread documentation & community support - especially important if you're writing code that other people will need to take over & maintain - & as you alluded to: extensibility (you may never need browser support or http proxying, but having the option is compelling when the trade-offs are so negligible).
nurettin
http 101 upgrade isn't much of an overhead and there are tried and tested websocket/ssl libraries with pretty callback interfaces versus your custom binary protocol. I would still choose the latter but I wouldn't recommend it.
djhworld
Interesting and fun
> Workers download, decompress, and materialize their shards into DuckDB databases built from Parquet files.
I'm interested to know whether the 5s query time includes this materialization step of downloading the files etc, or is this result from workers that have been "pre-warmed". Also is the data in DuckDB in memory or on disk?
philbe77
hi djhworld. The 5s does not include the download/materialization step. That part takes the worker about 1 to 2 minutes for this data set. I didn't know that this was going on HackerNews or would be this popular - I will try to get more solid stats on that part, and update the blog accordingly.
You can have GizmoEdge reference cloud (remote) data as well, but of course that would be slower than what I did for the challenge here...
The data is on disk - on locally mounted NVMe on each worker - in the form of a DuckDB database file (once the worker has converted it from parquet). I originally kept the data in parquet, but the duckdb format was about 10 to 15% faster - and since I was trying to squeeze every drop of performance - I went ahead and did that...
Thanks for the questions.
GizmoEdge is not production yet - this was just to demonstrate the art of the possible. I wanted to divide-and-conquer a huge dataset with a lot of power...
philbe77
I've since learned (from a DuckDB blog) that DuckDB seems to do better with the XFS filesystem. I used ext4 for this, so I may be able to get another 10 to 15% (maybe!).
DuckDB blog: https://duckdb.org/2025/10/09/benchmark-results-14-lts
tgv
Impressive, but those 63 nodes were "Azure Standard E64pds v6 nodes, each providing 64 vCPUs and 504 GiB of RAM." That's 4000 CPUs and 30TB memory.
ramraj07
Sounds like the equivalent of a 4xl snowflake warehouse, which for such queries would take 30 seconds, with the added benefit of the data being cold stored in s3. Thus you only pay by the minute.
philbe77
Challenge accepted - I'll try it on a 4XL Snowflake to get actual perf/cost
ralegh
Just noting that 4000 vCPUs usually means 2000 cores, 4000 threads
electroly
It doesn't mean that here. Epdsv6 is 1 core = 1 vCPU.
RamtinJ95
At that scale it cannot be cheaper than just running the same workload on BigQuery or Snowflake or?
philbe77
A Standard E64pds v6 costs: $3.744 / hr on demand. At 63 nodes - the cost is: $235.872 / hr - still cheaper than a Snowflake 4XL cluster - costing: 128 credits / hr at $3/credit = $384 / hr.
philbe77
At 5 seconds - the query technically cost: $0.3276
philbe77
If I used "spot" instances - it would have been 63 x $0.732/hr for a total of: $46.12 / hr.
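For anyone checking the figures in this subthread, the arithmetic is just rate × time (prices as quoted in the comments above, not an authoritative cloud quote):

```python
# Sanity-check the cost figures quoted above.
on_demand_per_node = 3.744   # $/hr, Standard E64pds v6 (as quoted)
nodes = 63

cluster_rate = on_demand_per_node * nodes
print(round(cluster_rate, 3))            # 235.872 ($/hr for the cluster)

query_seconds = 5
print(round(cluster_rate * query_seconds / 3600, 4))  # 0.3276 ($ per query)
```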
sammy2255
How would a 63 node Clickhouse cluster compare? >:)
ta12653421
When reading such extreme numbers, I'm always thinking what I may be doing wrong, when my MSSQL based CRUD application warms up its caches with around 600,000 rows and it takes 30 seconds to load them from DB into RAM on my 4x3GHz machine :-D
Maybe I'm missing something fundamental here
RobinL
Yes - OLAP databases are built with a completely different performance tradeoff. The way data is stored and the query planner are optimised for exactly these types of queries. If you're working in an OLTP system, you're not necessarily doing it wrong, but you may wish to consider exporting the data to use in an OLAP tool if you're frequently doing big queries. And nowadays there are ways to 'do both', e.g. you can run the DuckDB query engine within a Postgres instance
dgan
I also had the misfortune of working with MSSQL; it was so, so unbearably slow, because I couldn't upload data in bulk. I guess it's forbidden technology
zwnow
This type of stuff is usually hyperoptimized for no reason and serves no real purpose, you are doing just fine
rovr138
Would OLAP be better than OLTP for those queries you're doing?
mosselman
Are there any good instructions somewhere on how to set this up? As in not 63 nodes. But a distributed duckdb instance
philbe77
Hi mosselman, GizmoEdge is not open-source. DeepSeek has "smallpond" however, which is open-source: https://github.com/deepseek-ai/smallpond
I plan on getting GizmoEdge to production-grade quality eventually so folks can use it as a service or licensed software. There is a lot of work to do, though :)
shinypenguin
Is the dataset somewhere accessible? Does anyone know more about the "1T challenge", or is it just the 1B challenge moved up a notch?
Would be interesting to see if it would be possible to handle such data on one node, since the servers they are using are quite beefy.
philbe77
Hi shinypenguin - the dataset and challenge are detailed here: https://github.com/coiled/1trc
The data is in a publicly accessible bucket, but the requester is responsible for any egress fees...
shinypenguin
Hi, thank you for the link and quick response! :)
Do you know if anyone attempted to run this on the least amount of hardware possible with reasonable processing times?
philbe77
Yes - I also had GizmoSQL (a single-node DuckDB database engine) take the challenge - with very good performance (2 minutes for $0.10 in cloud compute cost): https://gizmodata.com/blog/gizmosql-one-trillion-row-challen...
achabotl
The One Trillion Row Challenge was proposed by Coiled in 2024. https://docs.coiled.io/blog/1trc.html
maxmcd
Are there any open sourced sharded query planners like this? Something that can aggregate queries across many duckdb/sqlite dbs?
mritchie712
DeepSeek released smallpond
0 - https://github.com/deepseek-ai/smallpond
1 - https://www.definite.app/blog/smallpond (overview for data engineers, practical application)
hobofan
Not directly DuckDB (though I think it might be able to be connected to that), but I think Apache Datafusion Ballista[0] would be a typical modern open source benchmark here.
[0]: https://datafusion.apache.org/ballista/contributors-guide/ar...
boshomi
>“In our talk, we will describe the design rationale of the DuckLake format and its principles of simplicity, scalability, and speed. We will show the DuckDB implementation of DuckLake in action and discuss the implications for data architecture in general.”
Prof. Hannes Mühleisen, cofounder of DuckDB:
[DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen](https://www.youtube.com/watch?v=YQEUkFWa69o) (53 min) Talk from Systems Distributed '25: https://systemsdistributed.com
NorwegianDude
This is very silly. You're not doing the challenge if you do the work up front. The idea is that you start with a file and the goal is to get the result as fast as possible.
How long did it take to distribute and import the data to all workers, what is the total time from file to result?
I can do this a million times faster on one machine, it just depends on what work I do up front.
philbe77
You should do it then, and post it here. I did do it with one machine as well: https://gizmodata.com/blog/gizmosql-one-trillion-row-challen...
NorwegianDude
Nobody cares if I can do it a million times faster, everyone can. It's cheating.
The whole reason you have to account for the time you spend setting it up is so that all work spent processing the data is timed. Otherwise we can just precompute the answer and print it on demand, which is very fast and easy.
Just getting it into memory is a large bottleneck in the actual challenge.
If I first put it into a DB with statistics that tracks the needed min/max/mean then it's basically instant to retrieve, but also slower to set up because that work needs to be done somewhere. That's why the challenge is time from file to result.
The title buries the lede a little
> Our cluster ran on Azure Standard E64pds v6 nodes, each providing 64 vCPUs and 504 GiB of RAM.
Yes, I would _expect_ when each node has that kind of power it should return very impressive speeds.