
Bzip3: A better and stronger spiritual successor to BZip2

mmastrac

I've studied the Burrows-Wheeler Transform, I understand the transformation, I've re-implemented it countless times for kicks, and I see how it improves compressibility, but for the life of me the intuition of _why_ it works has never really clicked.

It's a fantastic bit of algorithmic magic that will always impress me to see it.
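
For reference, a minimal sketch of the forward transform in Python, using the naive sort-all-rotations construction (real implementations build a suffix array instead):

    def bwt(s: str) -> str:
        # Append a sentinel so the transform is invertible.
        s += "\0"
        # The BWT is the last column of the sorted rotation matrix.
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    # bwt("banana") == "annb\0aa": equal characters cluster together,
    # which is what makes the output easy to compress.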

thrtythreeforty

Thank you for the reference. I learned something new today. That algorithm is wild. If you had shown me the transform and asked if it had an inverse, I would have said of course it doesn't, it's too weird.
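
It does have one, though: the sentinel plus repeated sorting are enough to recover the input. A minimal (and deliberately quadratic) inverse, as a sketch matching the forward version above:

    def ibwt(last: str) -> str:
        # Repeatedly prepend the last column and re-sort; after len(last)
        # rounds each row is a complete rotation of the original string.
        table = [""] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        # The row ending in the sentinel is the original text.
        return next(r for r in table if r.endswith("\0")).rstrip("\0")

    # ibwt(bwt("banana")) == "banana"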

unnah

The Burrows-Wheeler transform has been described as a unique algorithmic idea in that there are no non-trivial variations or related algorithms, unlike more conventional compression algorithms, which can be tweaked and improved in so many ways. There is no general compression theory in which the BWT could be described as a special case.

It looks to me like the above still holds: Bzip2 and Bzip3 simply combine more conventional compression algorithms with the BWT, which itself is still the same old transform. Bzip2 does Huffman coding after the BWT, and Bzip3 does arithmetic coding.
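
For a rough picture of why an entropy coder follows the BWT: bzip2's classic pipeline runs move-to-front (plus run-length) coding between the BWT and the Huffman stage, so the clustered symbols become runs of small integers. A minimal move-to-front encoder, purely as a sketch of that principle (bzip3's actual stages differ):

    def mtf_encode(data: bytes) -> list[int]:
        # Symbols that recur close together (as in BWT output) map to
        # small indices, which the entropy coder then encodes cheaply.
        alphabet = list(range(256))
        out = []
        for b in data:
            i = alphabet.index(b)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return out

    # mtf_encode(b"aaabbb") == [97, 0, 0, 98, 0, 0]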

mnw21cam

I remember the lecturer commenting on what sort of sick and twisted mind could come up with such a ridiculously convoluted notion when I was taught it at university.

linsomniac

Highlight (benchmark of Perl source code):

    The results follow:
    xz -T16 -9 -k - 2'056'645'240 bytes (c=12m09s, d=4m40s)
    bzip2 -9 -k   - 3'441'163'911 bytes (c=17m16s, d=9m22s)
    bzip3 -b 256  - 1'001'957'587 bytes (c=7m10s,  d=4m6s? Unclear on source page)
    bzip3 -b 511  -   546'456'978 bytes (c=7m08s,  d=4m6s? Unclear)
    zstd -T12 -16 - 3'076'143'660 bytes (c=6m32s,  d=3m51s)
edit: Adding times and compression levels

zimpenfish

Did a test on a 1.3G text file (output of `find -f -printf ...`); Macbook Pro M3 Max 64GB. All timings are "real" seconds from bash's builtin `time`.

    files.txt         1439563776
    bzip2 -9 -k       1026805779 71.3% c=67 d=53
    zstd --long -19   1002759868 69.7% c=357 d=9
    xz -T16 -9 -k      993376236 69.0% c=93 d=9
    zstd -T12 -16      989246823 68.7% c=14 d=9
    bzip3 -b 256       975153650 67.7% c=174 d=187
    bzip3 -b 256 -j12  975153650 67.7% c=46 d=189
    bzip3 -b 511       974113769 67.6% c=172 d=187
    bzip3 -b 511 -j12  974113769 67.6% c=77  d=186
I'll stick with zstd for now (unless I need to compress the Perl source, I guess.)

(edited to add 12 thread runs of bzip3 and remove superfluous filenames)

linsomniac

My standard test is compressing a "dd" disc image of a Linux install (I use these for work), with unused blocks being zeroed. Results:

    Uncompressed:      7,516,192,768
    zstd:              1,100,323,366
    bzip3 -b 511 -j 4: 1,115,125,019

sevg

To make your comment more useful, you'll want to include compression and decompression times.

Using the results from the readme, it seems like bzip3 performs competitively with zstd on both counts.

idoubtit

I've experimented a bit with bzip3, and I think the results in the readme are not representative. I think they're a hand-picked example, with an uncommon input and unfair choices of parameters. And it was run on an HDD, which skews the results even more.

For instance, with an 800 MB SQL file, for the same compression time and the best parameters I could find, bzip3 produced a smaller file (5.7 % compression ratio) than zstd (6.1 % with `--long -15`). But the decompression was about 20× slower (with all cores or just one).

I'm not claiming my stupid benchmark is better or even right. It's just that my results were very different from bzip3's readme. So I'm suspicious.

dralley

Also the compression levels..

linsomniac

I believe the compression levels are included in the list above.

jl6

A 4x improvement over lzma is an extraordinary claim. I see the author has also given a result after applying lrzip (which removes long-range redundancies in large files), and the difference isn’t so great (but bzip3 still wins). Does the amazing result without lrzip mean bzip3 is somehow managing to exploit some of that long-range redundancy natively?

I’d be astonished if such a 4x result generalized to tarballs that aren’t mostly duplicated files.

wongarsu

Currently running my own benchmarks, my preliminary results are that zstd becomes competitive again once you add the --long option (so `zstd --long -16 all.tar` instead of `zstd -16 all.tar`). Which is an option that not everyone might be aware of, but whose usefulness should be intuitive for this benchmark of >200 very similar files.

miohtama

Interestingly, in the Linux source benchmark the results are more even, with LZMA still holding up well.

What makes the Perl source benchmark special? Deduplication?

mmastrac

An old friend used to say that Perl is line noise that was given sentience.

ape4

This is the source, which is probably C.

loeg

Why -T12 for zstd and T16 for xz? How many threads is bzip3 using?

zimpenfish

From the source, it looks like bzip3 defaults to 1 thread if not explicitly set by arguments.

JoshTriplett

...using zstd level 16, when zstd goes up to 22. And without turning on zstd's long-range mode.

lexicality

> DO NOT COMPRESS ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.

I know every open source project (and quite a lot of expensive proprietary ones!) comes with a "btw this software might wipe your computer, if it does that's your fault lol" clause in its license, but I can't imagine trying to convince anyone that using this for anything remotely serious is a good idea with that line in the readme.

PaulKeeble

I had bzip2 eat some files some years back and ended up losing them. I don't trust bzip2 for backup files, and I wouldn't trust bzip3 either as a result.

cwillu

Simple enough to be safe, at the cost of performance: uncompress and compare to the original.

hinkley

You could have bugs that show up on different hardware or compiler versions. So the round trip is table stakes but not a guarantee.

lexicality

And what do you do if it doesn't match?

vardump

Isn't it obvious? Warn the user, who can now use something else instead.

supertrope

A shell script to run a checksum on the original file, compress, uncompress, and verify output integrity should be trivial.
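
A minimal sketch of that idea in Python rather than shell, assuming bzip2-style flags for the bzip3 CLI (-k to keep the input, -d -c to decompress to stdout); check your compressor's help output before relying on it:

    import hashlib
    import subprocess
    import sys

    def roundtrip_ok(path, compress_cmd, decompress_cmd):
        # Hash the original, compress it, decompress the archive to
        # stdout, and confirm the bytes round-trip exactly.
        with open(path, "rb") as f:
            before = hashlib.sha256(f.read()).hexdigest()
        subprocess.run(compress_cmd, check=True)
        restored = subprocess.run(decompress_cmd, check=True,
                                  capture_output=True).stdout
        return before == hashlib.sha256(restored).hexdigest()

    if __name__ == "__main__":
        target = sys.argv[1]
        # The flags below are assumptions modelled on bzip2's interface.
        ok = roundtrip_ok(target,
                          ["bzip3", "-k", target],
                          ["bzip3", "-d", "-c", target + ".bz3"])
        print("round trip OK" if ok else "MISMATCH: do not trust the archive")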

Arech

I wonder why this isn't built into the program...

hinkley

Honestly, I think this may be part of why in-flight compression has been a growth area lately. If your transport encoding becomes garbled, people notice immediately. If you manage to serve a few billion compressed requests without any incident reports, then maybe you can trust it for data at rest.

cowsandmilk

I mean, hosting downloads that have been compressed with this seems fine. I have the original, I just want a smaller version for those downloading.

pxeger1

The author is super cool. They are one of the very few people to have written any substantial program in Malbolge, a programming language designed to be cryptographically hard to use (or something like that).

jgalt212

I poke around in this space periodically, but I've never found a compelling reason to move away from gzip.

loeg

zstd is faster and provides better compression than gzip at every point on the curve. There is no reason to use gzip these days other than backwards compatibility.

hinkley

Do zless, zcat and zgrep support zstd everywhere? And I mean everywhere? VMs? Alpine? FreeBSD? OSX? Openwrt?

Nothing is shittier than sshing into a box that doesn’t understand half of your command line tricks. Or the clever shell script you just tested six ways to Sunday. It’s like fighting with your hands tied behind your back.

idoubtit

I've reached the same conclusion, and I've been using zstd unless I want extremely fast (de)compression with lz4.

And even when I still have to use gzip (the format), the libdeflate executables are fully compatible and noticeably faster than gzip/gunzip (the tools).

jrockway

This was on HN a while ago https://jolynch.github.io/posts/use_fast_data_algorithms/ and reaches the same conclusion. I am inclined to use Zstd even in cases where people would use lz4. Zstd is very fast.

ThatGuyRaion

zstd is a great setup, for sure, but the build system they use is patently awful. Can someone make an autoconf or, hell, even a CMake build system for them please???

loeg

Just install the binary package from whatever distro you use? Why do you need to build it?

But if it matters, FreeBSD has a pretty trivial BSDmake build of it:

https://github.com/freebsd/freebsd-src/blob/main/lib/libzstd...

https://github.com/freebsd/freebsd-src/blob/main/usr.bin/zst...

You could easily do something similar in GNU make or whatever without the dependencies on the FBSD build system. It's basically just the classic "cc -c a bunch of C files, then link them."

Borg3

Same here. Storage is not that expensive, so I do NOT care to squeeze every byte out of an archive. Also, I'm more into retro, so portability and memory usage are more important for me :)

otterley

(2022)

froh

What do you mean?

Usually this signifies the year an article was finalized and published.

But this is a GitHub repo with recent commits, so that's not correct here.

hulitu

> Bzip3: A better and stronger spiritual successor to BZip2

Please, no. What's next? BZip4 and BZip5, each incompatible with the other?

sitkack

this is the way.

wakawaka28

If they have to be incompatible then it's better to not conceal that. Generalized file formats require you to implement more stuff to support them, and we can't tell the format by looking at the file name.

chefandy

Yeah I'm sick of this. Did you know you can't even use ext2/3/4 together on the same partition? What a mess.

loeg

The ext4/3 filesystems, notably, can read/write ext2 (and for ext4: ext3) filesystems in a compatible way.

chefandy

Oh yeah true

snvzz

The format gets a new major version precisely because it is incompatible.