Skip to content(if available)orjump to list(if available)

Taking a Look at Compression Algorithms

jkaptur

Amazing writeup - I needed this a few months ago :)

My impression after my own, shallower dive is that trainable dictionaries are an underappreciated part of the (or at least my) system design toolkit.

For example, say you're serving Wikipedia - a bunch of pages that are kind of static. In order to minimize disk space, you'll be tempted to compress the content. Compressing the whole corpus gets a good compression ratio, but it means that, to read an arbitrary item, you need to decompress everything (or 50% of everything, on average, I guess).

So to get random access, you compress each page. That's ok, but you get a worse compression ratio because every compressor starts from scratch.

But with Zstandard and a trainable dictionary, you could train a dictionary on a couple pages, then use that dictionary to compress and decompress arbitrary items.

As far as I can tell, that's probably the best of both worlds - close to the compression ratio of compressing the whole corpus with gzip, but the random access of compressing each item individually.

This seems really generalizable - e.g. maybe Facebook has to store a zillion photos that are very rarely accessed, but 10% of them are selfies. If we use a vector search to find clusters of similar items, we can compress those items with a single dictionary.

In fact, taking another step back, it seems like databases ought to offer this out of the box. Just like the concept of an index, it's not always a win and there are a lot of knobs that you might want to tune, but the benefits seem clear.

Maybe all of this already exists, or there's something I'm missing, but I really appreciate article's like OP's that break things down so clearly.

retrac

What you are describing is called a "shared dictionary" and it's a well-known trick for task-specific compression, where you know what data you're going to be compressing ahead of time.

The Brotli algorithm is typical LZ plus a shared dictionary aimed at common web documents and markup. It does work well (and fast) for HTML. A common criticism is that it's basically targeted at compressing Wikipedia and the dictionary is loaded with a bunch of junk and now every browser needs a copy of that 120 kB junk which will rarely be used. (Both "II, Holy Roman" and "Holy Roman Emperor" are tokens in the Brotli dictionary, for example. Whole thing here: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37... )

MrLeap

Good thoughts. I'm going to keep this in mind. I've been working on a custom udp netcode for a while. I experimented with LZMAing / RLEing my binary snapshot diffs I send down, and neither felt great, but RLEing beat LZMA for what I was doing so far 100% of the time. Some kind of trained dictionary does sound better.

adgjlsfhk1

In general, it's often worth doing transforms like RLE combined with general purpose compression. General compression algorithms don't know about the details of your data and typically have a max window size period, so if RLE compresses your data a lot, it makes LZMA (or most other compressors) will be seeing a giant block of zeros most of the time and won't be able to see nearly as much of the actual data. Running compression after RLE will mean that te giant chunks of zeros will be squashed down so the regular compressor can fit non-trivially compressable data within the window size and more usefully look for improvements.

whizzter

RocksDB has support for Ztd and preset dictionaries and it makes sense since it has the same kind of level-spans chunking being a fork of LevelDB.

Entries are stored in-memory/logged (instead of put into a b-tree like classic DB's) and then periodically placed in span-files that are "linear" for faster search, however as these span files are built in bulk it makes more sense to compress blocks of them since much data is handled at once (so even if it's linear it's still blocks and reading just produces more variable size blocks by decompression upon read).

userbinator

I believe LZ + Huffman hit a sweet spot in ratio/speed/complexity, and that's why it has remained very popular since the late 80s. It's only more recently that faster hardware made arithmetic coding with higher-order models fast enough to be practically usable.

whizzter

More importantly, there's been a huge shift in the cost of tables vs multiplications.

Back in the 80s and early 90s multiplication was expensive (multiple cycles) while tables were more or less "free" in comparison, today cache-misses are super-expensive (100s of cycles) while multiplications can be run in parallel (MMX,SSE,etc). Sure a huffman table will probably mostly be in-cache but it'll still be at the cost of cache space.

In addition to that various arithmetic encoding methods were patented and thus avoided.

WaitWaitWha

If any of you interested in more details on compression, and in use, take a look at "Bit-mapped graphics" by Steve Rimmer and "The data compression book" by Mark Nelson.

spookie

The article is excellent!

Btw, does anyone know of an article or book about GPU texture compression? Would love a good in detail reference.

zOneLetter

Wish he took a look at middle-out too. Kind of a missed opportunity tbh.

tmilard

Tanks for the easy explantations. Very Clear. Doing Compression of Data using different strategies are always usefull to understand : After all, many of us software developers are dealing with saving Data with size issues, speed loading of junk of data. Interesting...

ZhiqiangWang

I really like the conclusion. “Gzip is all you need” has lost it’s momentum for some time, but there is definitely more gold to dig in that area.

jslakro

I didn't find the middle-out compression

Scene_Cast2

Note that this is about general purpose compression. For example, most machine learning (including LLMs) results in a compression algorithm, as the objective function is minimizing entropy (see nn.CrossEntropyLoss).