
File Systems Unfit as Distributed Storage Back Ends (2019)

acidmath

Before BlueStore, we ran Ceph on ZFS with the ZFS Intent Log on NVDIMM (basically non-volatile RAM backed by a battery). The performance was extremely good. Today, we run BlueStore on ZVOLs on the same setup and, if the zpool is a "hybrid" pool, we put the Ceph OSD databases on an all-NVMe zpool. Ceph WAL wants a disk slice for each OSD, so we don't do Ceph WAL and instead consolidate incoming writes on the ZIL/SLOG on NVDIMM.

nightfly

Why ceph on ZVOLs and not bare disks?

acidmath

In the servers we have only 16 GB to 64 GB of NVDIMM, depending on the density of the NVDIMM and how many slots are populated with it. Whatever the raw NVDIMM capacity is, the usable capacity is half of it, because we mirror the contents for physical redundancy (if we lose a transaction it is fatal to our business). NVMe is amazing, but not everything should be NVMe; petabyte-scale object storage, for example, does not need to be all-NVMe (which is super pricey).
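
To put numbers on the "usable is half" point, here is a tiny sketch of the arithmetic. The function name and the `mirror_copies` parameter are made up for illustration; the only inputs taken from the comment above are the 16 GB to 64 GB raw range and the two-way mirror.

```python
def usable_nvdimm_gb(raw_gb: float, mirror_copies: int = 2) -> float:
    """Usable capacity when every byte is stored `mirror_copies` times.

    Hypothetical helper: assumes a plain N-way mirror with no other overhead.
    """
    if mirror_copies < 1:
        raise ValueError("mirror_copies must be at least 1")
    return raw_gb / mirror_copies


# Raw capacities at the two ends of the range mentioned above.
for raw in (16, 64):
    print(f"{raw} GB raw -> {usable_nvdimm_gb(raw)} GB usable")
```

So under that assumption the log device has somewhere between 8 GB and 32 GB to work with per server.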

In newer DDR5 servers where we can't get NVDIMM, the alternative battery backed RAM options leave us with even less to work with.

Where we have counts of HDDs or SATA/SAS SSDs in the hundreds, we still want the performance improvements provided by a WAL (or a functional equivalent such as ZIL/SLOG) on NVDIMM and some layer-2 (where layer-1 is RAM) caching with NVMe.

Ceph OSDs want a dedicated WAL device. Some places use OpenCAS to make "hybrid" devices out of HDDs by pairing them with SSDs where the SSDs can accelerate reads for that HDD and the Ceph OSD goes on a logical OpenCAS device. OpenCAS is really great, but the devices acting as "caching layer" often end up underutilized.

By placing "big" Ceph OSDs on ZVOLs, we don't have individual disk slices for WAL (or equivalent) or individual disks for layer-2 read caching, but a consolidated layer in the form of the ZFS Intent Log on a "Separate Log" device (NVDIMM) and another consolidated layer in the ZFS disk pool's L2ARC (Level 2 Adaptive Replacement Cache).

The ZVOLs are striped across multiple relatively large RAIDz3 arrays. Yeah, it's "less efficient" in some ways, but the tradeoff is worth it for us.

  https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#devices
  https://open-cas.com/

__turbobrew__

Do you have any recommendations or warnings about running ceph clusters?

shermantanktop

Also known as:

Write! No, fsync! No, really fsync I mean it!

Wait, why is my disk throughput so low? And why am I out of file descriptors?
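
For anyone who hasn't been bitten by this yet, the "really fsync" dance usually looks something like the sketch below. It assumes a POSIX system (opening a directory and calling fsync on it does not work on Windows), and `durable_write` is just an illustrative name, not something from the article or from Ceph.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so the result should survive a crash.

    Sketch only: write to a temp file, fsync it, rename it over the
    target, then fsync the parent directory so the rename is persisted.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push it to the device
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temp file if it is still there, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # The rename lives in the directory entry, so sync the directory too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

That is two fsync calls and a temp file for every logical write, which is where a lot of the throughput goes.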

chupasaurus

The article is focused on Ceph, where the FS is a frontend to the storage backend(s). Now read the title again...

Dylan16807

> Wait, why is my disk throughput so low?

Because many filesystems do fsync wrong, for reasons that are not inherent to filesystems in general.

baruch

It's easier to write the system's front end while paying little attention to the backend and "just" letting a local filesystem do a lot of the work for you, but it doesn't work well. The interesting question is if the result is also that the frontend-to-backend communication abstraction is good enough to replace the backend with a better solution. I'm not familiar enough with Ceph and BlueStore to have a conclusion on that.

I happen to work for a distributed file-system company, and while I don't do the filesystem part itself, the old saying "it takes software 10 years to mature" is so true in this domain.

sitkack

See also "Hierarchical File Systems are Dead" by Margo Seltzer and Nicholas Murphy https://www.usenix.org/legacy/events/hotos09/tech/full_paper...

MR4D

No mention of LATCH theory? (Location, Alphabet, Time, Category, and Hierarchy)

Oddly, no matter how they are organized, their indices will always be a hierarchy (tree).

Personally, I think human brains just have a categorization approach that is built into our brains as hierarchy, so while other methods are definitely useful, they are an add-on, not a replacement.

zokier

Lots of these issues seem not to be specific to distributed systems and also impact local single-node systems. Notable examples are PostgreSQL's fsyncgate, or how mail servers in the past struggled (iirc that was one of the cases where reiserfs shined).

resurrected

Noooo, really?

It all depends on what you want to do. For things that are already in files, like all the data that DeepSeek and other models train on (and for which DeepSeek open sourced their own distributed file system), it makes sense to go with a distributed file system.

For OLTP you need a database with appropriate isolation levels.

I know someone will build a distributed file system on top of FoundationDB if they haven’t yet.

_zoltan_

Around 2006 I built a FUSE fs that used MySQL as a backend, kept all file hashes (not blocks, just whole files) and did deduplication. Good old times.
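
The whole-file dedup idea is small enough to sketch. This is not the original code: it swaps MySQL and FUSE for two in-memory dicts, and SHA-256 is an assumed hash choice. `DedupStore`, `put` and `get` are hypothetical names.

```python
import hashlib


class DedupStore:
    """Toy whole-file deduplicating store (illustrative, in-memory only)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # content hash -> file contents
        self.paths: dict[str, str] = {}    # file path -> content hash

    def put(self, path: str, data: bytes) -> str:
        """Store `data` under `path`, keeping one copy per distinct content."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # only stored if not seen before
        self.paths[path] = digest
        return digest

    def get(self, path: str) -> bytes:
        """Return the contents recorded for `path`."""
        return self.blobs[self.paths[path]]


store = DedupStore()
store.put("/a/report.txt", b"same bytes")
store.put("/b/copy.txt", b"same bytes")
print(len(store.paths), "paths,", len(store.blobs), "stored blob(s)")
```

Two paths end up pointing at one stored blob. A real version would also need reference counting so blobs can be dropped when the last path goes away.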

darkstar_16

Isn't the Cassandra file system something like that?

AtlasBarfed

They did it atop Cassandra.

jeffrallen

They have, at Exoscale. My officemate leads the team doing it.

EGreg

Just use hypercore with hyperdrive. And be free!

Spivak

It really is true. I spent years of my life wrangling a massive GlusterFS cluster and it was awful. You basically can't do any kind of file system operations on it that aren't CRUD on well-known specific paths. Anything else (traversal, moving/copying, linking, updating permissions) would just hang forever. You're also at the mercy of the kernel driver, which does hate you personally. You will have nightmares about uninterruptible sleep. Migrating it all to S3 over Ceph was a beautiful thing.

ted_dunning

That has more to do with gluster's primitive nature than with a general statement of what can work for distributed storage.