
Fire-Flyer File System from DeepSeek

31 comments · February 28, 2025

ammo1662

For those who are interested, the design was originally published here:

(Chinese) https://www.high-flyer.cn/blog/3fs/

This file system has been developed and used by them for several years.

Compared to traditional file systems, it is focused on model training workloads, which involve a lot of random reads. Read caching and prefetching are useless in this case, so they designed the file system without those features to improve performance.

I Google-translated some key parts here:

3FS is a special file system because it is used almost exclusively for batch-reading sample data on compute nodes during AI training, accelerating model training through fast interaction between compute and storage. This is a large-scale random-read workload, and data that has been read will not be reused again soon, so we cannot use the most important tool for optimizing file reads, the read cache, and even readahead is useless. The implementation of 3FS is therefore quite different from other file systems.

Specifically, as shown in the figure above, 3FS uses the Linux AIO and io_uring interfaces to perform sample reads. In the 3FS scenario the file cache provides no benefit at all; it only consumes system memory in a way that is hard for users to control and interferes with subsequent jobs, so we turn off the file cache and read data only in Direct I/O mode. Note, however, that when reading this way the buffer pointer, offset, and length all need to be aligned. Leaving this alignment to the user would incur extra memory copies, so we do the alignment inside the file system, which both improves performance and is more convenient for users.
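
For context, the alignment requirement described here is the standard O_DIRECT constraint on Linux. A minimal sketch of what a user would otherwise have to handle themselves (generic Linux code, not 3FS's; the 4096-byte alignment value and the file name are assumptions for illustration):

    // Minimal sketch of a Direct I/O read on Linux - not 3FS code, just the
    // general pattern the paragraph describes. With O_DIRECT the page cache is
    // bypassed, but the buffer address, file offset, and length must all be aligned.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstdio>

    int main() {
        const size_t kAlign = 4096;          // assumed logical block size
        const size_t kLen   = 1 << 20;       // 1 MiB, a multiple of kAlign
        void* buf = nullptr;
        if (posix_memalign(&buf, kAlign, kLen) != 0) return 1;  // aligned buffer

        int fd = open("sample.bin", O_RDONLY | O_DIRECT);       // hypothetical file
        if (fd < 0) { perror("open"); return 1; }

        off_t offset = 0;                    // must also be kAlign-aligned
        ssize_t n = pread(fd, buf, kLen, offset);               // bypasses page cache
        if (n < 0) perror("pread");

        close(fd);
        free(buf);
        return 0;
    }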

tetron

Was curious how they get such performance with a FUSE-based design. It seems that they sort of cheat: FUSE is used to manage metadata, but to get high performance you have to link in the C++ client library and do all your reads and writes through that. So it isn't general purpose; you have to modify your application to take advantage of it. Still, that's a clever trick, and it makes me wonder if there's an LD_PRELOAD strategy that could generalize.
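
For what it's worth, the LD_PRELOAD idea would look roughly like this - a generic interposition sketch, not anything 3FS ships; the "/3fs/" path prefix and the routing rule are made-up placeholders:

    // Generic LD_PRELOAD interposition sketch - not 3FS code. It intercepts open()
    // and could route selected paths to a native client library instead of the
    // FUSE mount. The "/3fs/" prefix and the routing rule are hypothetical.
    // Build: g++ -shared -fPIC intercept.cpp -o libintercept.so -ldl
    // Use:   LD_PRELOAD=./libintercept.so ./training_job
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <cstdarg>
    #include <cstring>
    #include <cstdio>

    extern "C" int open(const char* path, int flags, ...) {
        // Look up the real open() once.
        using open_fn = int (*)(const char*, int, ...);
        static open_fn real_open = (open_fn)dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {               // mode is only present with O_CREAT
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        if (strncmp(path, "/3fs/", 5) == 0) {
            // Here one could hand the request to a high-performance client
            // library; this sketch just logs and falls through.
            fprintf(stderr, "intercepted open(%s)\n", path);
        }
        return real_open(path, flags, mode);
    }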

grohan

They appear to have Python bindings, which seems reasonable from an API / usability perspective? https://github.com/deepseek-ai/smallpond

In terms of fast FUSE - also my first question; it appears to be `io_uring` + FUSE :)

https://github.com/deepseek-ai/3FS/blob/main/src/lib/api/Usr...
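
For anyone unfamiliar with it, the io_uring read path in general looks like this - a minimal liburing sketch, not the 3FS client API; the file name and sizes are arbitrary:

    // Minimal liburing read sketch - generic io_uring usage, not the 3FS client API.
    // Build (assuming liburing is installed): g++ uring_read.cpp -luring
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) != 0) return 1;  // 64-entry queue

        int fd = open("sample.bin", O_RDONLY);                 // hypothetical file
        if (fd < 0) { perror("open"); return 1; }

        std::vector<char> buf(1 << 20);                        // 1 MiB buffer

        // Queue one read at offset 0 and submit it to the kernel.
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf.data(), buf.size(), 0);
        io_uring_submit(&ring);

        // Block until the completion arrives and check the result.
        io_uring_cqe* cqe = nullptr;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }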

yalogin

It’s not clear to me where and how the current popular systems fall short. Do they talk about it anywhere?

Also, what specifically are the data access patterns for training and inference that differ from traditional use cases?

jpgvm

Well, current popular systems are pretty much limited to Lustre and the new kid Weka - mostly Lustre though, tbh.

You can try "standard" options like MinIO/Ceph (RADOS)/SeaweedFS, but you will very quickly learn those systems aren't remotely fast enough for these use cases.

AI training is what this is used for, not inference (which has absolutely no need for any filesystem at all). What makes the workload somewhat special is that it's entirely random reads and not cacheable at all, as most reads are one-and-done.

Would Lustre be perfectly fine at 6 TiB/s? Yes. Is it a huge pain in the ass to operate and make remotely highly available? Also yes. If this thing is capable of the same throughput but easier to operate and generally more modern and less baroque, it's probably an improvement. TL;DR: Lustre is fast, but that is literally its only redeeming quality. I have lost far too many hours of my life to the Lustre gods.

thohj4234234324

This is very humbling.

OpenAI et al. have also gone pretty deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this much attention to these things.

Great work; hope Deepseek does even more awesome things going forward.

richardw

I’ve assumed that it’s partly because the company has done a lot of HFT, which is very focused on performance. But I’m not an expert in either.

WiSaGaN

Indeed, the blog mentioned in the other comment shows that part of the 3FS code was completed as early as 2019, when this was still a project of the quant fund. In HFT you tend to build a lot of things in-house to achieve low latency and high performance, sometimes simply because an HFT system only needs to do one specific thing, while off-the-shelf software caters to a much wider range of scenarios that HFT doesn't care about. Here you see a similar case: they focus specifically on loading large amounts of data during training and implement that to the extreme.

bee_rider

They sure are productive.

What are we going to see tomorrow? DeepSeek OS or something?

logicallee

>They sure are productive.

I have a theory as to why...

do_not_redeem

Can someone convince me this isn't NIH syndrome? Why would you use this instead of SeaweedFS, Ceph, or MinIO?

mgerdts

> The final aggregate read throughput reached approximately 6.6 TiB/s with background traffic from training jobs.

The Ceph team has been working on Crimson for years to get past performance bottlenecks inherent in the HDD-based design. I'm having trouble finding any Ceph benchmark results that show anything close to 100 GB/s.

nivertech

I'd argue that they don't need a filesystem or an object store; they need a purpose-built data-serving layer optimized for their use case.

jpgvm

None of those are close to fast enough.

The only competitors in the parallel FS space that are useful for this are Lustre and Weka.

Otherwise, if you don't need a single namespace, a bunch of fat AF NFSv4 servers with NFS over RDMA will also get you to 6 TiB/s.

The "surefire" way though is still Lustre, it's the big daddy of distributed parallel filesystems still but it's an absolute beast to setup and operate.

cttet

If NIH syndrome boosts the morale of the team, though, it should still help overall team progress.

startupsfail

It’s not. When you are a high-frequency trader and you’ve mastered RDMA, everything around you looks slow. You are thinking in terms of 20-nanosecond intervals, while everyone around you still thinks that serving a query in under a millisecond is fast.

rfoo

Huh? What kind of RDMA has a completion latency of 20 nanoseconds? It's more like 5 microseconds.

I agree that a lot of the "modern" storage stack is way too slow, though. Last year I tried to find a replication-first object store for crazy-fast random reads over a small number of objects and found none.

ein0p

Seems like Ceph is considerably lower in throughput: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ That's a serious concern when saving hundreds of terabytes of weights and optimizer states every now and again, or loading large precomputed prefix KV caches. MinIO seems to be slower still. IDK about SeaweedFS - they don't mention performance in their selling points at all.

do_not_redeem

It's quite funny that I got two opposite answers right away: you say it's to improve throughput, and sibling says it's to improve latency, and as we know throughput and latency trade off against each other. I'm inclined to agree it's more likely they're prioritizing throughput, since their readme charts throughput but not latency. But OTOH, the project looks like it requires RDMA. I wonder if the authors have written about their motivations and the tradeoffs they made, so we don't have to speculate.

EDIT: Their blog post answered all my questions and more. https://www.high-flyer.cn/blog/3fs/

ein0p

Because the two are interconnected and aren't in conflict with each other. You don't just want high throughput - that by itself would be quite limiting. You want it along with low latency as well, or else it's very easy to end up in a situation where your throughput is effectively zero because the access pattern is "bad".

budududuroiu

Does anyone know if there's a benefit to porting this to an orchestrator like K8s? Maybe overkill for training, but the KVCache might be useful when running multiple replicas for inference.

jauntywundrkind

Man, 6.6 TiB/s across 180 nodes is roughly 300 Gbps per node, or 37.5 GiB/s.

That's with 14 unnamed SSDs per node. I wonder how this would scale with higher-end SSDs, going from PCIe 4 to PCIe 5 or PCIe 6... Particularly whether one could scale down!
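
Spelling that arithmetic out (node and SSD counts as given above; the per-drive figure is just the division):

    6.6 TiB/s / 180 nodes ≈ 37.5 GiB/s per node  (roughly 300 Gbps)
    37.5 GiB/s / 14 SSDs  ≈ 2.7 GiB/s per SSD

So each drive only has to sustain on the order of 2.7 GiB/s of reads, which is within what a single PCIe 4.0 NVMe SSD can deliver.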

pepsi-not-coke

I love it. AWS EFS costs too much. The open source solutions are clunky. I'm hoping DS applied their ingenuity to this one, too. Can't wait to trial it.

WithinReason

Why is this even necessary? Can you just shard your training set to the training nodes ahead of time instead?

jeffbee

Interesting that their GraySort result is CPU bound while they are using 3x more CPUs than the record holder from ten years ago.

sitkack

How can you determine that it's CPU bound from the attached charts?

brcmthrowaway

What does Anthropic use?