
Working with Files Is Hard (2019)


21 comments

· January 23, 2025

continuational

> Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.

> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?

Retr0id

> why is the file API so hard to use that even experts make mistakes?

I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.

__loam

POSIX is also so old and essential that it's hard to imagine an alternative.

jcranmer

Not really, there's been lots of APIs that have improved on the POSIX model.

The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's something that's already easily buildable with regular POSIX APIs). For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only possibly either see no new data or the entire chunk of data at once).
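The replace-via-rename primitive mentioned above really is buildable from plain POSIX calls. A minimal sketch (the function name and paths here are mine, not from the thread; durability details vary by filesystem):

```python
import os

def write_file_atomic(path, data):
    """Write `data` so readers see either the old file or the new one,
    never a partial write: write a temp file, fsync it, then rename."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # flush file contents before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)          # rename() is atomic on POSIX filesystems
    # fsync the containing directory so the rename itself survives a crash
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

write_file_atomic("state.json", b'{"ok": true}')
```

Note that even this sketch glosses over error handling for fsync failures, which is its own can of worms (see the fsyncgate discussion elsewhere in this thread's topic area).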

I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).
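To make the "sparse map of offsets to blocks" idea concrete, here is a small illustration using positional I/O. This is only a sketch of the addressing model: `os.pwrite` writes at an absolute offset without a shared stream position, which is closer to how databases address pages, but it is not itself guaranteed power-loss atomic (the block size and filename are assumptions):

```python
import os

BLOCK = 4096  # assumed page size

# Treat the file as a sparse map from block-aligned offsets to pages.
fd = os.open("pages.db", os.O_RDWR | os.O_CREAT, 0o644)
os.pwrite(fd, b"A" * BLOCK, 0 * BLOCK)   # page 0
os.pwrite(fd, b"B" * BLOCK, 7 * BLOCK)   # page 7; pages 1-6 stay sparse
os.fsync(fd)
page = os.pread(fd, BLOCK, 7 * BLOCK)
os.close(fd)
```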

kccqzy

By the way, LMDB's main developer Howard Chu responded to the paper. He said,

> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.

So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.

This is analogous to the case where someone complains that some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembly to make sure it is correct at the ISA level, even though the C code is wrong at the abstract C machine level, and they commit to keeping that check up in the future.

Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes

> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.

dkarl

> why is the file API so hard to use that even experts make mistakes?

Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.

(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)

liontwist

Something this misses is that all programs make assumptions, for example: "my process is the only one writing this file, because it created it."

Evaluating correctness without that consideration is too high of a bar.

Safety and correctness cannot mean "impossible to misuse."

praptak

Ext4 actually special-handles the rename trick so that it works even if it should not:

"If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and [basically save your ass]"[0]

[0]https://docs.kernel.org/admin-guide/ext4.html

Retr0id

> they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug.

This is why whenever I need to persist any kind of state to disk, SQLite is the first tool I reach for. Filesystem APIs are scary, but SQLite is well-behaved.

Of course, it doesn't always make sense to do that, like the dropbox use case.

nodamage

Before becoming too overconfident in SQLite note that Rebello et al. (https://ramalagappan.github.io/pdfs/papers/cuttlefs.pdf) tested SQLite (along with Redis, LMDB, LevelDB, and PostgreSQL) using a proxy file system to simulate fsync errors and found that none of them handled all failure conditions safely.

In practice I believe I've seen SQLite databases corrupted due to what I suspect are two main causes:

1. The device powering off during the middle of a write, and

2. The device running out of space during the middle of a write.

ablob

I believe it is impossible to prevent data loss if the device powers off during a write. The point about corruption still stands, and from what I skimmed, the paper handles it correctly. Nice reference.

SoftTalker

The only way I know of is something like a RAID controller with a battery-backed write cache. Even that may not be 100% reliable, but it's the closest I know of. Of course, that's not a software solution at all.

ziddoap

>SQLite is the first tool I reach for.

Hopefully in whichever particular mode is referenced!

Retr0id

WAL mode, yes!
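For reference, WAL mode is a persistent pragma: set once on a file-backed database, it sticks in the database file. A minimal sketch (database name and table are mine):

```python
import sqlite3

conn = sqlite3.connect("app.db")
# journal_mode is a persistent setting; the pragma returns the active mode.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES ('greeting', 'hi')")
conn.commit()
conn.close()
```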

gavinhoward

I wonder if, in the Pillai paper, they tested the SQLite rollback option with the default synchronous setting [1] (`NORMAL`, I believe) or with `EXTRA`. I'm thinking it was probably the default.

I kinda think, and I could be wrong, that SQLite rollback would not have any vulnerabilities with `synchronous=EXTRA` (and `fullfsync=F_FULLFSYNC` on macOS [2]).

[1]: https://www.sqlite.org/pragma.html#pragma_synchronous

[2]: https://www.sqlite.org/pragma.html#pragma_fullfsync
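For anyone wanting to test this, the pragmas above can be set per-connection. A sketch of configuring rollback-journal mode with `synchronous=EXTRA` (whether this actually closes the paper's vulnerability is the open question here, not something this snippet demonstrates):

```python
import sqlite3

conn = sqlite3.connect("extra.db")
conn.execute("PRAGMA journal_mode=DELETE")   # classic rollback-journal mode
conn.execute("PRAGMA synchronous=EXTRA")     # also syncs the journal at commit
# Querying the pragma returns its numeric level; EXTRA is 3.
level = conn.execute("PRAGMA synchronous").fetchone()[0]
conn.close()
```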

wruza

No mention of NTFS or Windows in the article, for those interested.

pjdesno

Although the conference this was presented at is platform-agnostic, the author is an expert on Linux, and the motivation for the talk is Linux-specific (Dropbox dropping support for non-ext4 file systems).

The post supports its points with extensive references to prior research - research which hasn't been done in the Microsoft environment. For various reasons (NDAs, etc.) it's likely that no such research will ever be published, either. Basically it's impossible to write a post this detailed about safety issues in Microsoft file systems unless you work there. If you did, it would still take you a year or two of full-time work to do the background stuff, and when you finished, marketing and/or legal wouldn't let you actually tell anyone about it.

yahayahya

Is that because the Windows APIs are better? Or because businesses build their embedded systems/servers with Windows?
