Smallpond – A lightweight data processing framework built on DuckDB and 3FS
74 comments
· February 28, 2025 · jamesblonde
nyrikki
Some of what they are doing is simply what was lost due to the ubiquitous nature of the relational model.
The hierarchical model is applicable to many problems, and it is part of why moving off mainframes is challenging: IMS is so much more efficient than the relational model for applications like airline tickets.
I am aware of several efforts to leverage object stores the way they did, but it was a hard sell.
The hierarchical model really only works for many-to-one relationships, and its integrity model differs and is not as DRY.
There are lessons to learn here but it requires some relearning.
When you have a shopping cart, having data local to the server handling the transaction is also a benefit.
Codd's relational model has advantages, but it has held back some efforts because we are so used to dealing with the painful parts that we often don't consider other options.
HackerThemAll
DuckDB specializes in efficient storage and fast queries for analytics (OLAP), using columnar storage (in contrast to the row storage used by typical RDBMSes doing OLTP). The approach is nothing new; it has been around for a couple of decades already. But this "distributed" DuckDB can indeed be beneficial for training.
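To make the row-versus-column contrast concrete, here is a minimal pure-Python sketch. It only illustrates the two layouts and is not how DuckDB stores data internally; the table, the column names, and the helper functions are all made up for the example.

```python
# Row-oriented layout: each record is kept together (the usual OLTP shape).
rows = [
    {"order_id": 1, "customer": "ann", "amount": 20.0},
    {"order_id": 2, "customer": "bob", "amount": 35.5},
    {"order_id": 3, "customer": "ann", "amount": 12.5},
]

# Column-oriented layout: each column is kept together (the usual OLAP shape).
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ann", "bob", "ann"],
    "amount": [20.0, 35.5, 12.5],
}


def total_from_rows(records):
    """Aggregate over rows: every record is visited to read one field."""
    return sum(record["amount"] for record in records)


def total_from_columns(cols):
    """Aggregate over columns: only the 'amount' column is scanned."""
    return sum(cols["amount"])


def fetch_order_from_rows(records, order_id):
    """Point lookup over rows: the whole record comes back in one piece."""
    for record in records:
        if record["order_id"] == order_id:
            return record
    return None


def fetch_order_from_columns(cols, order_id):
    """Point lookup over columns: the record is reassembled field by field."""
    position = cols["order_id"].index(order_id)
    return {name: values[position] for name, values in cols.items()}
```

Both layouts hold the same data; the difference is which access pattern touches less of it. An aggregate over one column reads a single list in the columnar layout, while fetching a whole order is simpler in the row layout.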
auxten
Data operations are increasingly happening near the GPU side to boost efficiency, especially for compute-heavy workflows. Arrow file processing and zero-copy queries on DataFrames are becoming crucial for modern data pipelines. I think another option worth considering is chdb, which supports these features and fits well with this shift. (author of chdb here)
agilob
I'm super impressed by how much effort DeepSeek put in and how much of it they open-sourced.
orlp
One thing I found peculiar is that for the GraySort benchmark it dispatches to Polars by default to do the actual sorting, not DuckDB: https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0....
tomnipotent
The function argument defaults to polars, but the actual benchmark code sets duckdb by default.
https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0...
orlp
I see, confusing multiple layers of defaults :)
dang
Related ongoing thread:
Understanding Smallpond and 3FS - https://news.ycombinator.com/item?id=43232410
also:
DuckDB goes distributed? DeepSeek's smallpond takes on Big Data - https://news.ycombinator.com/item?id=43206964 (no comments there, but some people have been recommending that article)
rubenvanwyk
May Data Engineering content keep on hitting front page HN!
HackerThemAll
DuckDB itself is cool enough, especially when combined with SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!
dcreater
How is duckdb combined with SQLite? Aren't they alternatives to each other?
jitl
Not sure what the poster meant but DuckDB is an analytics DB, it doesn’t have a btree index - at least not last time I looked. You could consider it the OLAP embedded DB to SQLite’s OLTP embedded db.
DuckDB can read SQLite so you can even imagine using them side by side in the same system, serving point reads and writes from SQLite and using DuckDB for stuff like aggregates and searches that SQLite is slower at.
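A rough sketch of that side-by-side setup in Python, assuming the third-party `duckdb` package is installed and that its `sqlite` extension can be installed and loaded (the first `INSTALL` typically needs network access). The `orders` table, the file path, and all three helper functions are hypothetical.

```python
import sqlite3


def record_order(db_path, customer, amount):
    """OLTP side: a small transactional write served by SQLite."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS orders ("
                "id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
            )
            cur = con.execute(
                "INSERT INTO orders (customer, amount) VALUES (?, ?)",
                (customer, amount),
            )
            return cur.lastrowid
    finally:
        con.close()


def get_order(db_path, order_id):
    """OLTP side: a point read by primary key."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT customer, amount FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
    finally:
        con.close()


def spend_per_customer(db_path):
    """OLAP side: an aggregate run by DuckDB over the same SQLite file."""
    import duckdb  # third-party; imported lazily so the SQLite half runs without it

    duck = duckdb.connect()  # in-memory DuckDB database
    try:
        duck.execute("INSTALL sqlite")  # assumption: the extension is available
        duck.execute("LOAD sqlite")
        duck.execute(f"ATTACH '{db_path}' AS shop (TYPE sqlite)")
        return duck.execute(
            "SELECT customer, sum(amount) AS total "
            "FROM shop.orders GROUP BY customer ORDER BY customer"
        ).fetchall()
    finally:
        duck.close()
```

Calling `record_order("shop.db", "ann", 20.0)` and `get_order("shop.db", 1)` exercises the SQLite half; `spend_per_customer("shop.db")` would then run the aggregate through DuckDB against the same file.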
HackerThemAll
They are complementary to each other. There's an SQLite extension for use within DuckDB [1], which gives you the power of SQLite's great transactional capabilities and the speed of analytical queries in DuckDB's columnar storage engine, all within a single database.
dcreater
Confused by the example in the repo. What is the use case for this? Is it a replacement for Dask, Ray, etc.? (Not a professional SWE)
fastasucan
What does this do? What is the benefit over DuckDB, Polars, etc.?
articsputnik
Mehdi just wrote about this. Mainly, it runs DAGs in parallel using Ray (core) and their filesystem, 3FS. See https://mehdio.substack.com/p/duckdb-goes-distributed-deepse....
mritchie712
I don't think you get any real benefits over DuckDB unless your data is 10 TB+ or you spin up 3FS (which seems challenging).
ilove196884
Any benchmarks and comparisons?
RyanHamilton
If you want to check out DuckDB, try QStudio. It's a free SQL client with DuckDB integrated: https://www.timestored.com/qstudio/help/duckdb-sql-editor. Disclaimer: I'm the main author.
maximilianroos
Big fan of QStudio! Thanks for building it!
dcreater
What's with the win95 ui?
RyanHamilton
There are many themes to choose from. I recorded the demo on that page and I like Windows 95. I concede it may not be pretty, but I've always found it functional. The default is the Darcula theme, as shown on the main page: https://www.timestored.com/qstudio/
shipp02
Is the code written by the deepseek model?
I should probably give up on being a software engineer if it is.
cavisne
There is a Chinese blog post from 2019 about 3FS, so it predates DeepSeek [1]. It will be interesting to see the benchmarks, but I suspect that without 3FS smallpond is not that useful (the bottleneck would move to the networked file system).
None of the big US clouds support InfiniBand broadly (Azure and Oracle have some support), so 3FS itself is not very useful to US companies that want to use public clouds.
breadwinner
Give up and become what? Most white collar jobs will be automated in the coming years. You think doctors' jobs are safe?
ezst
Not OP, but, anything that actually physically affects the real world for the better? For instance, large infrastructure engineering and construction projects are not going to run themselves any time soon. The world doesn't revolve around ad and fin tech.
nurettin
If your white collar job consists of simply using software, like copying numbers you see into an Excel sheet, maybe. Otherwise they are pretty safe. People have been building tools and automation for thousands of years, yet nobody has invented a fully automated cook for your fancy family dinner.
didntknowyou
You can already google the information; the majority of a doctor's value is not in their information but their people and technical skills.
agilob
>the majority of a doctor's value is not in their information but their people and technical skills.
Not even that. In recent years I haven't met a single doctor who would even care. Their value now is that of a necessary evil: they have the legal powers to refer you to a hospital and write prescriptions. These legal powers will be much harder to change.
rscho
Well, googling the info is one thing. But today, medicine is still mostly a know-how profession. Residency is there mostly to transmit know-how.
rscho
Yes, doctors are safe. Because they do things. With their hands. That no one else does.
aragonite
> Because they do things. With their hands. That no one else does
That's only true of surgeons :) What if your specialty is nonsurgical (internal medicine, pediatrics, psychiatry, etc)?
mdaniel
Also, a hallucination for 'SELECT mising_field FROM borgus_tuble' is one thing, hallucinating that taking a dose of Cl Na O along with CH3 CO2 H will cure covid is another thing entirely
delfinom
Nope.
Healthcare megacorps are buying up independent practices like crazy, all because doctors can't keep up with the bullshit IT required for insurance, state mandates, etc., and that's in addition to the insanity of even renting commercial real estate for an office these days.
These megacorps set quotas and push doctors to nickel and dime like crazy. They sure as shit will spend the money to find robots that can give you a prostate exam with a robot dildo.
lvl155
Looking forward to the next few years, when we can finally abstract away all the back-end tech.
BobbyJo
We ain't even solved garbage collection yet, and you think "back end systems" are going to be abstracted away in the next few years?
tarruda
> We ain't even solved garbage collection yet
Can you elaborate on that?
BobbyJo
People still write in languages that force you to manage your own memory.
Once performance starts to matter (either due to scale or time requirements) abstractions always have tradeoffs you can't accept.
purplerabbit
Maybe they just mean for the type of projects they care about
BobbyJo
Can't you already just use FaaS and managed persistence?
dudeinjapan
[dead]
We are seeing more and more specialized query engines. This is a query engine specialized for training pipelines. It is not general purpose; it is for providing batches of training data to workers. It uses Ray for parallelization. What these workloads need is random reads (to implement shuffling across epochs), Arrow support (zero copy to Pandas DataFrames), and efficient checkpointing.
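A minimal sketch of the shuffling and checkpointing pattern described above, in plain Python. This is not smallpond's or Ray's API: `epoch_order`, `iter_batches`, `make_checkpoint`, and `resume` are hypothetical names, and an in-memory list stands in for a dataset that supports random reads.

```python
import random


def epoch_order(num_rows, epoch, seed=0):
    """Return a deterministic permutation of row indices for one epoch."""
    rng = random.Random(f"{seed}:{epoch}")  # same seed and epoch give the same order
    order = list(range(num_rows))
    rng.shuffle(order)
    return order


def iter_batches(dataset, batch_size, epoch, start_batch=0, seed=0):
    """Yield (batch_index, rows) for one epoch, optionally resuming mid-epoch."""
    order = epoch_order(len(dataset), epoch, seed)
    for batch_index, start in enumerate(range(0, len(order), batch_size)):
        if batch_index < start_batch:
            continue  # already consumed before the checkpoint
        indices = order[start:start + batch_size]
        # Each lookup is a random read; real storage has to make these cheap.
        yield batch_index, [dataset[i] for i in indices]


def make_checkpoint(epoch, batch_index):
    """Record where to resume: the batch after the one just finished."""
    return {"epoch": epoch, "next_batch": batch_index + 1}


def resume(dataset, batch_size, checkpoint, seed=0):
    """Continue an epoch from a checkpoint without storing the shuffled order."""
    return iter_batches(
        dataset, batch_size, checkpoint["epoch"], checkpoint["next_batch"], seed
    )
```

Because the permutation is derived from the seed and the epoch number, the checkpoint only has to store two integers. A worker that restarts can rebuild the same order and skip to the next unfinished batch, which is why cheap random reads matter more here than sequential scan speed.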