Show HN: Zeekstd – Rust Implementation of the ZSTD Seekable Format
19 comments
· June 15, 2025
rwmj
Seekable formats also allow random reads which lets you do trickery like booting qemu VMs from remotely hosted, compressed files (over HTTPS). We do this already for xz: https://libguestfs.org/nbdkit-xz-filter.1.html https://rwmj.wordpress.com/2018/11/23/nbdkit-xz-curl/
Has zstd actually standardized the seekable version? Last I checked (which was quite a while ago) it had not been declared a standard, so I was reluctant to write a filter for nbdkit, even though it's very much a requested feature.
simeonmiteff
This is very cool. Nice work! At my day job, I have been using a Go library[1] to build tools that require seekable zstd, but felt a bit uncomfortable with the lack of broader support for the format.
Why zeek, BTW? Is it a play on "zstd" and "seek"? My employer is also the custodian of the zeek project (https://zeek.org), so I was confused for a second.
rorosen
Thanks! I was also surprised that there are very few tools to work with the seekable format. I could imagine that at least some people have a use-case for it.
Yes, the name is a combination of zstd and seek. Funnily enough, I originally wanted to name it just zeek, before I learned that the name was already taken, so I switched to zeekstd. You're not the first person to ask whether there is any relation to Zeek, and I understand how that can be misleading. In hindsight the name is a little unfortunate.
etyp
Zeek is well known in "security" spaces, but not as much in "developer" spaces. It did get me a bit excited to see Zeek here until I realized it was unrelated, though :)
stu2010
This is cool, I'd say that the most common tool in this space is bgzip[1]. Have you thought about training a dictionary on the first few chunks of each file and embedding the dictionary in a skippable frame at the start? Likely makes less difference if your chunk size is 2MB, but at smaller chunk sizes that could have significant benefit.
jeroenhd
Looking at the spec (https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...), I don't see any mention of custom dictionaries like you describe.
The spec does mention:
> While only Checksum_Flag currently exists, there are 7 other bits in this field that can be used for future changes to the format, for example the addition of inline dictionaries.
so I don't think seekable zstd supports these dictionaries just yet.
With multiple inline dictionaries, one could detect when new chunks compress badly with the previous dictionary and train new ones on the fly. That could be useful for compressing formats with headers and mixed data (e.g. game files, which can contain a mix of text, audio, and video, or just regular old .tar files, I suppose).
threeducks
Assuming that frames come at a cost, how much larger are seekable zstd files? Perhaps show this as a graph over frame size for different kinds of data (text, binaries, ...).
mcraiha
It depends on the content and compression options. ZSTD has four different literal block types: Raw literals, RLE literals, Compressed literals, and Treeless literals. I assume the last two might suffer the most if the content is split.
ncruces
How's tool support these days for compressing a file with seekable zstd?
Given existing libraries, it should be really simple to create an SQLite VFS for my Go driver that reads (not writes) compressed databases transparently, but tool support was kinda lacking.
Will the zstd CLI ever support it? https://github.com/facebook/zstd/issues/2121
tyilo
I already use zstd_seekable (https://docs.rs/zstd-seekable/) in a project. Could you compare the APIs of that crate and yours?
tyilo
Correct me if I'm wrong, but it doesn't seem like you provide an equivalent of zstd_seekable's Seekable::decompress, which decompresses at a specific offset without having to calculate which frame(s) to decompress.
This is basically the only function I use from zstd_seekable, so it would be nice to have that in zeekstd as well.
b0a04gl
how do you handle cases where the seek table itself gets truncated or corrupted? do you fallback to scanning for frame boundaries or just error out? wondering if there's room to embed a minimal redundant index at the tail too for safety
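For context, one conceivable fallback (I don't know whether zeekstd does this) is scanning the raw bytes for the zstd frame magic number to recover candidate frame boundaries when the seek table is unusable. A minimal sketch:

```rust
// Hypothetical recovery sketch: scan for the zstd frame magic number
// (0xFD2FB528, stored little-endian) to find candidate frame starts.
// Compressed payloads can contain these four bytes by coincidence, so
// every candidate still has to be validated, e.g. by attempting to
// decompress the frame it would start.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

fn candidate_frame_offsets(data: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    for i in 0..data.len().saturating_sub(3) {
        if data[i..i + 4] == ZSTD_MAGIC {
            offsets.push(i);
        }
    }
    offsets
}
```

A redundant tail index would avoid the false-positive problem entirely, but the format spec doesn't currently define one.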
throebrifnr
gzip has an --rsyncable option that does something similar.
Explanation here https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/
Scaevolus
Rsyncable goes further: instead of having fixed size blocks, it makes the block split points deterministically content-dependent. This means that you can edit/insert/delete bytes in the middle of the uncompressed input, and the compressed output will only have a few compressed blocks change.
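The idea can be illustrated with a toy content-defined chunker (not gzip's actual --rsyncable algorithm): a rolling sum over a small window triggers a split whenever its low bits are zero, so boundaries depend only on nearby bytes and an edit only disturbs the chunks around it.

```rust
// Toy content-defined chunking: split whenever the low bits of a
// rolling sum over the last WINDOW bytes are all zero. The constants
// are illustrative, chosen for roughly 256-byte average chunks.
const WINDOW: usize = 16;
const MASK: u32 = 0xFF;

fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut sum: u32 = 0;
    for (i, &b) in data.iter().enumerate() {
        sum = sum.wrapping_add(b as u32);
        if i >= WINDOW {
            // Drop the byte that just left the window.
            sum = sum.wrapping_sub(data[i - WINDOW] as u32);
        }
        if i + 1 >= WINDOW && sum & MASK == 0 {
            boundaries.push(i + 1); // chunk ends after byte i
        }
    }
    boundaries
}
```

Because the split decision looks only at the last WINDOW bytes, changing one byte can move at most the boundaries within WINDOW bytes of the edit; everything else stays put, which is what keeps rsync deltas small.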
andrewaylett
zstd also has an rsyncable option -- as an example of when it's useful, I take a dump of an SQLite database (my Home Assistant DB) using a command like this:
sqlite3 -readonly "${i}" .dump | zstd --fast --rsyncable -v -o "${PART}" -
The DB is 1.2G, the SQL dump is 1.4G, the compressed dump is 286M. And I still only have to sync the parts that have changed to take a backup.
77pt77
BTW, something similar can be done with zlib/gzip.
rwmj
It's true, using some rather non-obvious trickery: https://github.com/madler/zlib/blob/develop/examples/zran.c
I also wrote a tool to make a randomly modifiable gzipped disk image: https://rwmj.wordpress.com/2022/12/01/creating-a-modifiable-...
dafelst
Sure, but zstd soundly beats gzip on every single metric except ubiquity; it is just straight up a better compression/decompression strategy.
Hello,
I would like to share a Rust implementation of the Zstandard seekable format I've been working on.
Regular zstd compressed files consist of a single frame, meaning you have to start decompression at the beginning. The seekable format splits compressed data into a series of independent frames, each compressed individually, so that decompression of a section in the middle of an archive only requires zstd to decompress at most a frame's worth of extra data, instead of the entire archive.
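The seek table stored at the end of the file is what makes this work: it records each frame's compressed and decompressed size, so a reader can map an uncompressed offset to a frame with a simple prefix-sum walk. A minimal sketch of that lookup (hypothetical types, not zeekstd's actual API):

```rust
/// One seek table entry, recording the sizes of a single frame.
/// (Illustrative struct, not zeekstd's real types.)
struct SeekEntry {
    compressed_size: u64,
    decompressed_size: u64,
}

/// Find the frame containing `offset` in the decompressed stream.
/// Returns the frame index and the frame's byte position in the
/// compressed file, or None if the offset is past the end.
fn locate(entries: &[SeekEntry], offset: u64) -> Option<(usize, u64)> {
    let mut c_pos = 0; // position in the compressed file
    let mut d_pos = 0; // position in the decompressed stream
    for (i, e) in entries.iter().enumerate() {
        if offset < d_pos + e.decompressed_size {
            return Some((i, c_pos));
        }
        c_pos += e.compressed_size;
        d_pos += e.decompressed_size;
    }
    None
}
```

A real reader would then seek to `c_pos`, decompress that one frame, and skip `offset - d_pos` bytes into it.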
I started working with the seekable format because I wanted to resume downloads of big zstd compressed files that are decompressed and written to disk on the fly. At first I created and used bindings to the C functions that are available upstream[1]; however, I stumbled over the first segfault rather quickly (it's now fixed) and found out that the functions only allow basic operations. After looking closer at the upstream implementation, I noticed that it uses functions of the core API that are now deprecated and doesn't allow access to low-level (de)compression contexts. To me it looks like a PoC/demo implementation that isn't maintained the same way as the zstd core API; that's probably also the reason it's in the contrib directory.
My use case seemed to require a complete rewrite of the seekable format implementation, so I decided to implement it from scratch in Rust, using bindings to the advanced zstd compression API available since zstd 1.4.0.
The result is a library crate[2] with a single dependency, and a CLI crate[3] for the seekable format that feels similar to the regular zstd tool.
Any feedback is highly appreciated!
[1]: https://github.com/facebook/zstd/tree/dev/contrib/seekable_f... [2]: https://crates.io/crates/zeekstd [3]: https://github.com/rorosen/zeekstd/tree/main/cli