Writing a competitive BZip2 encoder in Ada from scratch in a few days

nayuki

> Another surprising feature of the BWT is that for reversing the permutation, no extra information is needed!

I think this is not true. The way I learned the BWT, after encoding, you need to store the index of the first character (which is a tiny bit of extra information). https://web.archive.org/web/20170325024404/http://marknelson...
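
To make the point concrete, here is a minimal sketch of the textbook forward transform (sorted rotations, last column plus row index). The function name `bwt_with_index` is made up for illustration, and this naive version is not how BZip2 or the article's encoder computes it. Two different inputs that are rotations of each other produce the same last column, so only the index tells them apart:

```python
def bwt_with_index(s):
    """Naive BWT: sort all rotations of s, return (last column, row of s)."""
    if not s:
        return "", 0
    n = len(s)
    # order[r] is the start offset of the rotation that lands in sorted row r.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The last character of the rotation starting at i is s[i - 1].
    last = "".join(s[i - 1] for i in order)
    # The row holding the unrotated input is the extra information to store.
    return last, order.index(0)


print(bwt_with_index("banana"))  # ('nnbaaa', 3)
print(bwt_with_index("ananab"))  # ('nnbaaa', 2)
```

Both calls return the same string "nnbaaa"; without the stored row number the decoder could not tell which rotation was the input.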

horizion2025

Looking at how the BWT works, I also find it initially surprising that it is invertible with so little extra information as the index of the first character. The transform looks a bit like sorting the letters, and undoing a sort would certainly require a lot more information to put everything back into place. It feels a bit like magic even when you know the inversion algorithm.
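
For reference, a minimal sketch of the usual inversion, assuming the transform output is the last column of the sorted rotations plus the row index of the original string. The name `bwt_inverse` is hypothetical. The trick is that a stable sort of the last column reproduces the first column and pairs up matching occurrences of each character:

```python
def bwt_inverse(last, index):
    """Invert a BWT given the last column and the row of the original string."""
    n = len(last)
    # Stable sort of positions by character: order[j] is the row whose last
    # character is the same occurrence as the first character of row j.
    # That row is row j rotated left by one position.
    order = sorted(range(n), key=lambda i: last[i])
    out = []
    row = index
    for _ in range(n):
        row = order[row]
        out.append(last[row])  # emits the original characters left to right
    return "".join(out)


print(bwt_inverse("nnbaaa", 3))  # banana
print(bwt_inverse("nnbaaa", 2))  # ananab
```

The same last column decodes to different strings depending on the index, which is why the index has to be stored somewhere.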

vintermann

There is a fully bijective version of the BWT which doesn't require the index. It's not what BZip2 uses though.

Someone

That method adds an extra character to your alphabet and uses it as a marker in the transformed string to (effectively) store the index of the first character inside the transformed text.

If, as is often the case, your alphabet contains exactly as many letters as a byte/word can store, that makes for awkward encoding. I also think that, typically, that method will require more memory.
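
A minimal sketch of the sentinel approach described here, assuming "$" as the marker, that it never occurs in the input, and that it sorts below every input symbol. `SENTINEL`, `bwt_with_sentinel` and `implied_index` are names invented for this example:

```python
SENTINEL = "$"  # assumption: absent from the input and smaller than all its symbols


def bwt_with_sentinel(s):
    """Naive BWT of s + SENTINEL; no separate index is returned."""
    if any(c <= SENTINEL for c in s):
        raise ValueError("input symbols must all sort above the sentinel")
    t = s + SENTINEL
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def implied_index(last):
    """The row that ends with the sentinel is the row holding s + SENTINEL."""
    return last.index(SENTINEL)


out = bwt_with_sentinel("banana")
print(out)                 # annb$aa
print(implied_index(out))  # 4
```

The output is one symbol longer than the input and needs an alphabet with one extra symbol, which is the awkward part when the input already uses all 256 byte values.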

vintermann

No, I'm not talking about that. I'm talking about David Scott's bijective BWT variant. It sorts symbols by a slightly different type of context, and so can do away with a sentinel symbol or a stored index entirely. It was described on the Wikipedia page last I checked.

Scott was a bit of a weirdo, but his algorithm actually got recognition in academia; there have been many papers written about it. One character saved doesn't make a lot of difference for compression, but the transform being a permutation does give it some nice properties.
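
A rough sketch of the forward direction of the bijective variant as it is commonly described: split the input into Lyndon words (Duval's algorithm), sort every rotation of every factor by comparing their infinite repetitions, and output the last character of each. The function names are hypothetical, the quadratic sort is only for illustration, and the inverse is omitted:

```python
def lyndon_factors(s):
    """Duval's algorithm: factor s into a non-increasing run of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bijective_bwt(s):
    """Forward bijective BWT sketch: no sentinel and no stored index."""
    n = len(s)
    rotations = [f[i:] + f[:i] for f in lyndon_factors(s) for i in range(len(f))]

    def key(r):
        # Compare rotations as infinitely repeated strings; a prefix of
        # length 2 * n is enough to decide the order of any two of them.
        return (r * (2 * n // len(r) + 1))[:2 * n]

    rotations.sort(key=key)
    return "".join(r[-1] for r in rotations)


print(lyndon_factors("banana"))  # ['b', 'an', 'an', 'a']
print(bijective_bwt("banana"))   # annbaa
```

The output has exactly the same length and the same multiset of symbols as the input, which is the "transform is a permutation" property mentioned above.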

HN

Writing a competitive BZip2 encoder in Ada from scratch in a few days – part 2
