
Taking a Look at Compression Algorithms

jkaptur

Amazing writeup - I needed this a few months ago :)

My impression after my own, shallower dive is that trainable dictionaries are an underappreciated part of the (or at least my) system design toolkit.

For example, say you're serving Wikipedia - a bunch of pages that are kind of static. In order to minimize disk space, you'll be tempted to compress the content. Compressing the whole corpus gets a good compression ratio, but it means that, to read an arbitrary item, you need to decompress everything (or 50% of everything, on average, I guess).

So to get random access, you compress each page. That's ok, but you get a worse compression ratio because every compressor starts from scratch.

But with Zstandard and a trainable dictionary, you could train a dictionary on a couple pages, then use that dictionary to compress and decompress arbitrary items.

As far as I can tell, that's probably the best of both worlds - close to the compression ratio of compressing the whole corpus with gzip, but the random access of compressing each item individually.
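
A minimal sketch of that workflow, using the python-zstandard bindings (the pages, sizes, and sample counts here are made up for illustration):

    import zstandard as zstd

    # Hypothetical corpus: lots of small, similar documents stored individually.
    pages = [
        (f"<html><head><title>Article {i}</title></head><body>"
         + f"<p>Article {i} is mostly boilerplate markup.</p>" * 50
         + "</body></html>").encode()
        for i in range(1000)
    ]

    # Train a shared dictionary on a sample of the corpus.
    dictionary = zstd.train_dictionary(16 * 1024, pages[:500])

    compressor = zstd.ZstdCompressor(dict_data=dictionary)
    decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

    # Each page is still its own frame, so random access stays cheap,
    # while the shared dictionary recovers much of the cross-page redundancy.
    stored = [compressor.compress(p) for p in pages]
    assert decompressor.decompress(stored[42]) == pages[42]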

This seems really generalizable - e.g. maybe Facebook has to store a zillion photos that are very rarely accessed, but 10% of them are selfies. If we use a vector search to find clusters of similar items, we can compress those items with a single dictionary.

In fact, taking another step back, it seems like databases ought to offer this out of the box. Just like the concept of an index, it's not always a win and there are a lot of knobs that you might want to tune, but the benefits seem clear.

Maybe all of this already exists, or there's something I'm missing, but I really appreciate articles like OP's that break things down so clearly.

retrac

What you are describing is sometimes called a shared dictionary and it's a great trick for task-specific compression, where you know what data you're going to be compressing ahead of time.

The Brotli algorithm is a typical LZ plus a shared dictionary aimed at common web documents and markup. It does work well and fast for HTML. A common criticism is that it's basically targeted at compressing Wikipedia: the dictionary is loaded with a bunch of junk, and now every browser needs a copy of that 120 kB of junk, some of which will very rarely be used unless you're compressing Wikipedia. (Both "II, Holy Roman" and "Holy Roman Emperor" are tokens in the Brotli dictionary, for example. Whole dictionary here for the curious: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37... )

svieira

In fact there is a new feature Chrome is championing (and just shipped) called "Compression dictionary transport" - https://datatracker.ietf.org/doc/draft-ietf-httpbis-compress... / https://chromestatus.com/feature/5124977788977152 - that allows any HTTP resource to specify the dictionary it wants to use (including "use me as the dictionary for future requests"), which allows a website to use a dictionary that is specialized to _its_ contents instead of the contents of something completely different.
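
The underlying idea is the same shared-dictionary trick, with a previously delivered response acting as the dictionary. A rough Python illustration of the effect (not the actual wire format, which uses dedicated Brotli/Zstandard dictionary content encodings; the asset contents are invented):

    import zstandard as zstd

    # Hypothetical asset versions: the client already has v1 cached.
    old_bundle = ("function render() { /* v1 */ }\n" * 200).encode()
    new_bundle = ("function render() { /* v1 */ }\n" * 200
                  + "function extra() { /* new in v2 */ }\n").encode()

    # Treat the old version as a raw-content dictionary: only the delta costs much.
    dict_data = zstd.ZstdCompressionDict(old_bundle)
    delta = zstd.ZstdCompressor(dict_data=dict_data).compress(new_bundle)
    print(len(new_bundle), "->", len(delta))

    restored = zstd.ZstdDecompressor(dict_data=dict_data).decompress(delta)
    assert restored == new_bundle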

genewitch

Another state machine in the browser, what can go wrong?

benwills

If anyone is interested in an example of how ZSTD's dictionary compression performs against standard gzip, a number of years ago I put together an example using some Common Crawl data.

"I was able to achive a random WARC file compression size of 793,764,785 bytes vs Gzip's compressed size of 959,016,011" [0]

In hindsight, I could have written that up and tested it better, but it's at least something.

[0] https://github.com/benwills/proposal-warc-to-zstandard?tab=r...

whizzter

RocksDB has support for Zstd and preset dictionaries, which makes sense since, as a fork of LevelDB, it has the same kind of level-span chunking.

Entries are stored in memory / logged (instead of being put into a b-tree like in classic DBs) and then periodically placed in span files that are "linear" for faster search. Because these span files are built in bulk, it makes more sense to compress blocks of them, since a lot of data is handled at once (so even though the file is linear, it's still made of blocks, and a read just produces variable-size blocks by decompressing them).

pronoiac

I remember tools that worked with the Wikipedia dumps, in bzip2, and built indexes to allow decent random access. Once you know where the compressed blocks are, and which Wikipedia entries they contain, you can start from a given block (something like 900 kB) rather than from the beginning of the file. Compressing roughly a megabyte at a time, rather than a page at a time, is a pretty solid win for compressibility.
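
The multistream dumps today still allow this: the index file lists "offset:page_id:title" lines, and each block of ~100 pages is an independent bz2 stream you can seek to and decompress on its own. A rough sketch, assuming that layout (file names illustrative):

    import bz2

    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    INDEX = "enwiki-latest-pages-articles-multistream-index.txt"

    def stream_bounds(title):
        """Find the (start, end) byte offsets of the bz2 stream containing `title`."""
        start, offsets = None, set()
        with open(INDEX, encoding="utf-8") as f:
            for line in f:
                off, _, name = line.rstrip("\n").split(":", 2)
                offsets.add(int(off))
                if name == title:
                    start = int(off)
        if start is None:
            raise KeyError(title)
        later = [o for o in offsets if o > start]
        return start, (min(later) if later else None)

    def read_block(title):
        start, end = stream_bounds(title)
        with open(DUMP, "rb") as f:
            f.seek(start)
            data = f.read(None if end is None else end - start)
        # The block is a self-contained bz2 stream holding ~100 pages of XML.
        return bz2.decompress(data).decode("utf-8")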

MrLeap

Good thoughts. I'm going to keep this in mind. I've been working on custom UDP netcode for a while. I experimented with LZMA and RLE on the binary snapshot diffs I send down, and neither felt great, but RLE beat LZMA for what I was doing 100% of the time so far. Some kind of trained dictionary does sound better.

adgjlsfhk1

In general, it's often worth combining transforms like RLE with general-purpose compression. General compression algorithms don't know the details of your data and typically have a maximum window size, so if RLE would compress your data a lot, LZMA (or most other compressors) will be seeing a giant block of zeros most of the time and won't be able to see nearly as much of the actual data. Running the compressor after RLE squashes those giant runs of zeros down, so the regular compressor can fit non-trivially compressible data within its window and look for improvements more usefully.
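
A toy sketch of that pipeline (the run-length format here is made up, just (count, byte) pairs; the point is the ordering, RLE first and the general-purpose compressor second):

    import lzma

    def rle_encode(data: bytes) -> bytes:
        # Toy run-length encoding: (run_length, byte) pairs, runs capped at 255.
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out += bytes((run, data[i]))
            i += run
        return bytes(out)

    # Hypothetical snapshot diff: long runs of zero bytes with sparse changes.
    diff = bytearray(1_000_000)
    for k in range(0, len(diff), 4096):
        diff[k] = 1 + (k % 200)
    diff = bytes(diff)

    print("lzma alone:   ", len(lzma.compress(diff)))
    print("rle then lzma:", len(lzma.compress(rle_encode(diff))))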

bitschubser_

The conclusion, "One interesting idea is that, at their core, AI models are nothing more than compression models that take the corpus of data humanity has and boil it down to a set of weights", is also good. There is an interesting paper about compressing weather data with neural networks: https://arxiv.org/abs/2210.12538

What's missing a bit is that the comparison is mostly about general-purpose data; there are some very interesting and super fast compression algorithms for, e.g., numbers (TurboPFor, Gorilla, etc.). Daniel Lemire's blog is super interesting on the different algorithms and how to make them faster.
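
For flavor, a toy illustration of the intuition behind Gorilla-style float compression (not the real encoder, which also bit-packs the meaningful middle bits; the readings are invented):

    import struct

    def bits(x: float) -> int:
        # Raw 64-bit IEEE-754 pattern of a double.
        return struct.unpack(">Q", struct.pack(">d", x))[0]

    # Consecutive sensor readings are usually close, so XORing adjacent bit
    # patterns gives values with many leading zero bits (64 of them when the
    # reading repeats exactly), which the encoder can store in very few bits.
    readings = [21.50, 21.50, 21.53, 21.54, 21.54, 21.60]
    for prev, cur in zip(readings, readings[1:]):
        x = bits(prev) ^ bits(cur)
        print(f"{prev:5.2f} -> {cur:5.2f}  leading zero bits: {64 - x.bit_length()}")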

quacksilver

Information Theory, Inference, and Learning Algorithms (2005) by David J.C. MacKay (sadly deceased) was one of my favorite books covering some of the maths in this area. I need to look at it again.

Free link to online version http://www.inference.org.uk/itprnn/book.pdf

userbinator

I believe LZ + Huffman hit a sweet spot in ratio/speed/complexity, and that's why it has remained very popular since the late 80s. It's only more recently that faster hardware made arithmetic coding with higher-order models fast enough to be practically usable.

whizzter

More importantly, there's been a huge shift in the cost of tables vs multiplications.

Back in the 80s and early 90s, multiplication was expensive (multiple cycles) while tables were more or less "free" in comparison; today, cache misses are super expensive (100s of cycles) while multiplications can run in parallel (MMX, SSE, etc.). Sure, a Huffman table will probably mostly be in cache, but it'll still be at the cost of cache space.

In addition to that various arithmetic encoding methods were patented and thus avoided.

pronoiac

I did some benchmarking of compression and decompression last year: Raspberry Pi 4, Debian, and my corpus was a filesystem with a billion files on it, as a sparse tar of a drive image, which I acknowledge is an odd choice, but that's where my focus was. I made graphs, because the tables were overwhelming. (Which also applies to this post, I think.) There's a blog post, but I think these are more quickly useful:

* the graphs - https://gitlab.com/pronoiac/billion-file-fs/-/tree/main/grap...

* the Pareto frontier for compression vs time: zstd -1 and -9, plzip -0, xz -1 and -9. lzop -1 was a bit faster, and plzip -9 a bit smaller, but they had heavy penalties on the other axis.

I wasn't aware of Snappy.

ZhiqiangWang

I really like the conclusion. "Gzip is all you need" has lost its momentum for some time, but there is definitely more gold to dig in that area.

z0r

Does anyone have a favorite arithmetic coding reference implementation? It's something I've never implemented myself, and it's a gap in my tactile understanding of algorithms that I've always meant to fill in.

fraserphysics

"Arithmetic coding for data compression" by Witten Neal and Cleary. 1987. https://dl.acm.org/doi/10.1145/214762.214771 It explains that all that's left to work on for better lossless compression factors is better (higher likelihood) models of sources. It provides C code and a good explanation of arithmetic coding.

paulirish

I really enjoy https://andrewiggins.github.io/gz-heatmap/ to help visualize gzip/deflate at a per-byte level.

WaitWaitWha

If any of you are interested in more details on compression, and in seeing it in use, take a look at "Bit-mapped Graphics" by Steve Rimmer and "The Data Compression Book" by Mark Nelson.

leobg

I’d be interested in hearing more about how to “abuse” compression algorithms to detect outliers in datasets or to filter out repetitions. For example, LLMs or Whisper sometimes get stuck in a loop, repeating a word, a group of words, or whole sentences multiple times. I imagine a compression algorithm would be an interesting tool to filter out such noise. And there’s maybe even more shenanigans you can do with them besides actual compression…
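
One cheap trick in that direction, as a sketch: compare the compressed size of a window of output to its raw size, and flag windows that compress suspiciously well (the 0.3 threshold here is arbitrary and would need tuning):

    import zlib

    def compression_ratio(text: str) -> float:
        raw = text.encode("utf-8")
        return len(zlib.compress(raw, 9)) / max(1, len(raw))

    def looks_stuck(text: str, threshold: float = 0.3) -> bool:
        # A model stuck in a loop produces highly repetitive output, which
        # compresses far better than normal prose, so a very low ratio is a red flag.
        return compression_ratio(text) < threshold

    print(looks_stuck("The meeting covered budgets, hiring plans, and the product roadmap."))
    print(looks_stuck("I'm sorry, I cannot do that. " * 40))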

remram

Why not use a dedicated algorithm?

Rolling hashes and Bloom filters would get you a lot more than trying to glean something out of a compressed stream. Just because compression algorithms use dictionaries doesn't mean that's the only way to build a dictionary...

hinkley

I was promised a description of how FSE works but all I got was an implementation, by a person muttering about how hard ANS was to understand.

So... any links on how it actually works?