Understanding Smallpond and 3FS
47 comments
March 2, 2025 · ok123456
dang
We should probably be having a thread about that actual release, so I've re-upped https://news.ycombinator.com/item?id=43200793, will move most of the comments thither, and will post links to this blog post and the other one that people have been referencing.
mritchie712
updated.
westurner
smallpond: https://github.com/deepseek-ai/smallpond :
> A lightweight data processing framework built on DuckDB and 3FS.
jauntywundrkind
Smallpond. Runs on their RDMA-powered 3FS ("Fire-Flyer File System") filesystem.
https://github.com/deepseek-ai/smallpond
https://news.ycombinator.com/item?id=43200793
I didn't find anything of value in this article.
Did enjoy https://mehdio.substack.com/p/duckdb-goes-distributed-deepse... some, which eventually talks about smallpond being built on Ray, and… Smallpond actually running multiple partitioned duckdb instances?! Wow.
memco
Love this straightforward analysis of use cases:
> Using smallpond and 3FS depends largely on your data size and infrastructure:
> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.
> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.
> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.
Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
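As a rough illustration of the quoted tiers, here is a minimal Python sketch. The function name `suggest_setup`, the returned labels, and treating 1PB as 1000TB are assumptions made for the example; none of this comes from smallpond or the article.

```python
def suggest_setup(size_tb: float) -> str:
    """Map a dataset size in terabytes to the tier quoted from the article.

    Hypothetical helper: the thresholds (10TB, 1PB) are the article's,
    the labels and the 1PB == 1000TB convention are assumptions.
    """
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if size_tb < 10:
        # Under 10TB: the article suggests smallpond is likely unnecessary.
        return "single-node DuckDB"
    if size_tb <= 1000:
        # 10TB to 1PB: a cluster with several nodes and fast storage.
        return "smallpond cluster with several nodes"
    # Over 1PB: a larger cluster and substantial infrastructure.
    return "larger smallpond + 3FS cluster"


if __name__ == "__main__":
    for size in (2, 50, 5000):
        print(f"{size} TB -> {suggest_setup(size)}")
```

The cut-offs are only as precise as the quoted rule of thumb, so sizes near a boundary deserve a closer look than this lookup gives.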
dartos
I very much felt like that entire portion of the article was ai generated, actually.
IMO pretty obvious, surface-level information and some prose on each bullet.
xixixao
Saying something is “obvious” without specifying an audience is meaningless.
(because obviousness is subjective and depends on the knowledge, experience, and context of the audience)
dartos
Notice the “IMO pretty” before the word “obvious”
IMO means “in my opinion.” I used that phrase to express how the following statement is my opinion and not a universal truth. My “audience” in this case is myself.
I do that because otherwise there’s always a comment saying how things like “obvious” can be subjective.
I also used the word “pretty” to, again, soften the word “obvious” so that readers don’t think that it’s a universal truth.
genewitch
with some "no s, sherlock" on the ">1PB will require additional infra."
go on...
like people talking about 1gbit iSCSI, and no one thought to say that 120MB/s, which is technically slower than ATA/133 which came out twenty years ago, might be the bottleneck. Obviously 10gbit will be "as fast as a local drive"!
Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly gbic and fiber optics.
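The bandwidth comparison above is simple arithmetic, sketched here in Python. The helper name `gbit_to_mb_per_s` is hypothetical, the figures are theoretical line rates in decimal units (1 Gbit/s = 1000 Mbit/s), and protocol overhead is ignored, which is why real 1gbit throughput lands nearer the 120MB/s mentioned than the 125MB/s computed.

```python
# Nominal peak transfer rate of the ATA/133 interface, in MB/s.
ATA133_MB_PER_S = 133.0


def gbit_to_mb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8


one_gbit = gbit_to_mb_per_s(1)
ten_gbit = gbit_to_mb_per_s(10)

if __name__ == "__main__":
    print(f"1 Gbit/s  ~ {one_gbit:.0f} MB/s (ATA/133 peak: {ATA133_MB_PER_S:.0f} MB/s)")
    print(f"10 Gbit/s ~ {ten_gbit:.0f} MB/s")
    print("1 Gbit/s slower than ATA/133:", one_gbit < ATA133_MB_PER_S)
```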
mritchie712
I updated the post. In this case, I meant "exotic" infra... e.g. 3FS isn't like adding more EC2 instances.
Adding ec2 instances is trivial, setting up 3FS is hard.
7thpower
You’ve been wanting to get this off your chest for a while, haven’t you?
fs111
The authors are Chinese so they may simply use AI to make it sound right in English
varispeed
I had a Chinese co-worker and something like this was actually his style of writing, with no use of AI. I know because I sat next to him a few times while he was writing documents.
mritchie712
some was AI generated, but I made sure everything was accurate. I'd normally rewrite everything, but I wrote this quickly before I had to leave the house. Didn't think it'd be on the front page!
dartos
Not judging you for using AI for a post like this!
Don’t feel bad. I just didn’t think AI generated bullet points were as impressive as the comment I was replying to did.
jimmyl02
I wonder at which scale spark fits into this picture and what the tradeoffs / benefits would be
mritchie712
spark is certainly the incumbent for this sort of thing.
one benefit for me personally: you should be able to move from local dev to cloud more easily.
benrutter
Yeah I reeeaaally want to see benchmarks! Single node duckdb is absolutely insane (as in fast) performance wise, especially compared to something like Spark. There's been a lot of speed focussed work in the project and I don't know of any faster data processing (I'm not counting traditional SQL since a lot of the speed benefits there come from indexing etc and essentially doing additional work ahead of time).
I guess it comes down to how well written the distributed workflows are, there's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.
My reasoning here is based on Dask, which uses pandas under the hood and is capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is on some benchmarks more than 10x faster than pandas, so you can see where this is going...
DannyPage
“Releases” is used in the article - instead of “drops” - and would be a clearer title.
dang
Ok, fixed now. (Submitted title was "DeepSeek Drops Distributed DuckDB")
Edit: I've since changed the title above to the article title, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). It has been taking me a while to figure out what we're looking at here!
conqrr
Drop in the context of Databases isn't even close to anything being released or launched. Drop = Delete. Release is a much better word for this context.
joshuaturner
Even in the context of an application stack - my initial read had me believing they were moving away from DuckDB
mritchie712
yeah, I thought drop was amusing in this case paired with the tautogram
freehorse
It was, but people here prioritise lexical unambiguity over fun.
djeastm
This is one of my "Kids these days..." moments. I've been caught several times mistaking the meaning of this new slang.
BHSPitMonkey
Not _so_ new:
- https://boards.straightdope.com/t/where-did-the-term-album-d... (2009)
- https://www.talkbass.com/threads/when-did-release-become-dro... (2013)
But it _has_ spread much faster outside of the music scene these last few years, e.g. describing software and products.
wigster
drop should be un-dropped.
dboreham
Not only clearer, but 180 degrees different in meaning.
4ndrewl
I thought "dropped" these days meant released? Not helpful I know...
kaashif
I was surprised because I thought the title meant they dropped support or something. Weird because I'd never heard of distributed DuckDB.
derefr
In denotation, "dropped" can be used equivalently to "released", yes; but in connotation, using "dropped" instead of "released" implies either that:
1. the particular release was sudden, unexpected, and not highly pre-advertised or post-advertised — as in an album being "dropped" by a band (where the band more often "releases" albums.) Usage of "dropped" here evokes the feeling that the releaser is casually "dropping" the thing in the public square and walking away, leaving it there to be studied. A band would release an album by going on tour selling it; or they might just drop an album on Spotify one day.
2. the particular release was a single limited production run / limited-time event — where people were anticipating something would be released at a certain specific time, but there was no advance statement from the releaser of exactly what people would be getting. Strong analogy with the NYE "ball drop" — the release is an event that people count down to or line up for. (Think: dropping a new limited-edition colorway of a product people ravenously collect — sneakers, Stanley cups, etc.)
3. the particular release was a bounded-in-size batch or "tranche" of production, all put out to be purchased at once where "once they sell out, they sell out" for now, but with the expectation that the releaser is producing more, which will take time, during which the item will remain sold out. (Often, the item has actually been produced in quantity, and this limited dribbling-out and repeated fast selling-out is purely a marketing technique to induce hype and demand.) This usage isn't a figurative extension of the literal verb "drop" but rather a shortening of the word "airdrop", as in military resupply and/or NFTs. You would be more likely to see this phrased as "[X] dropped another [Y]" or "[X] dropped more [Y]"; or perhaps "there was a drop of [Y] today."
SteveDR
Yes, most young people would say an artist “dropped” new music instead of saying that they released new music. Still a bad title though
0xCMP
I think to be clearer it would have been written "DeepSeek Drops Distributed version of DuckDB". Otherwise it looks like they used DuckDB (the distributed one?) and they have something new or better they're using now.
KaoruAoiShiho
Dropped could also mean they used to use it but stopped, that's also pretty common parlance in software...
mritchie712
Sorry, I couldn't resist the tautogram.
farts_mckensy
It's pretty clear what is meant to anyone under the age of 50.
ivandenysov
I’m anyone and it wasn’t clear to me
farts_mckensy
You seriously don't what it means to "drop" something? Fuck. I forgot that a lot of you are social retards.
mritchie712
After posting, I started thinking about how you could push Iceberg (or delta) partitions into smallpond. Spinning up 3FS will be a lot of work, but distributing compute on an existing Iceberg catalog would be worth trying.
xnx
"drops" seems to be a fairly recent contronym meaning both "released" and "discontinued".
https://github.com/deepseek-ai/smallpond