Removing newlines in FASTA file increases ZSTD compression ratio by 10x

mfld

    Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression, reducing compatibility somewhat.

Interesting. Any idea why this can't be stored in the metadata of the compressed file?

nolist_policy

It also uses more memory during decompression (up to 2 GB more), which makes it a potential DoS vector.

leobuskin

What about a specialized dict for FASTA? Shouldn't it increase ZSTD compression significantly?

bede

Yes I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.

rini17

In general, this might be a good preprocessing step: check for punctuation that repeats at fixed intervals, remove it, and restore it after decompression.
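
A minimal Python sketch of that idea, assuming the input is only the wrapped sequence lines of a single record (no FASTA header line) and that every line except the last has the same width. The function names are made up for illustration; real FASTA files with several records would need per-record handling.

```python
def strip_fixed_newlines(text: str) -> tuple[str, int, bool]:
    """Remove newlines that recur at a fixed interval.

    Returns (payload, width, trailing) so the layout can be restored later.
    """
    trailing = text.endswith("\n")
    body = text[:-1] if trailing else text
    lines = body.split("\n")
    width = len(lines[0])
    # Every line but the last must be exactly `width` long, and the last
    # line must be non-empty and no longer than `width`.
    regular = (
        width > 0
        and all(len(line) == width for line in lines[:-1])
        and 0 < len(lines[-1]) <= width
    )
    if not regular:
        raise ValueError("newlines do not occur at a fixed interval")
    return "".join(lines), width, trailing


def restore_newlines(payload: str, width: int, trailing: bool) -> str:
    """Re-insert a newline after every `width` characters."""
    chunks = [payload[i:i + width] for i in range(0, len(payload), width)]
    return "\n".join(chunks) + ("\n" if trailing else "")


wrapped = "ACGTAC\nGTACGT\nAC\n"
payload, width, trailing = strip_fixed_newlines(wrapped)
assert restore_newlines(payload, width, trailing) == wrapped
```

Only `width` and the trailing-newline flag need to be stored next to the compressed payload to make the round trip lossless.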

vintermann

That turns it into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.

bede

Yes, it sounds like 7-Zip/LZMA can do this using custom filters, among other more exotic (and slow) statistical compression approaches.

Kim_Bruning

Now I'm wondering why this works. DNA clearly has some interesting redundancy strategies. (it might also depend on genome?)

dwattttt

The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.

Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, by just representing a GATC with 2 bits, rather than 8.
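
As a rough illustration of the 2-bit idea, here is a Python sketch. It assumes the sequence contains only the uppercase letters A, C, G and T (no N or other ambiguity codes, which is the "depending on what you need to represent" caveat), and the particular bit assignment is an arbitrary choice. The sequence length has to be stored separately because the last byte may be padded.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment
BASES = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack four bases into each byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a partial final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Inverse of pack_bases; `length` is the original base count."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACA"
packed = pack_bases(seq)
assert len(packed) == 2            # 7 bases fit in 2 bytes instead of 7
assert unpack_bases(packed, len(seq)) == seq
```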

Compression on top of that "should" result in the same compressed size as compressing the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.

Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.

hyghjiyhu

I think one important factor you failed to account for is frameshifting. Compression algorithms work on bytes (8 bits). Imagine the exact same sequence occurring twice, but at different offsets mod 4. Then your encoding will give completely different results, and the compression algorithm will be unable to make use of the repetition.
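
The frameshift effect can be seen in a small Python sketch. It assumes a hypothetical 2-bit packing with four bases per byte and only A, C, G, T in the input; the 32-base motif is made up. The same motif is packed once starting on a byte boundary and once shifted by a single base.

```python
def pack_2bit(seq: str) -> bytes:
    """Hypothetical 2-bit packing: four bases per byte, padded with 'A'."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    padded = seq + "A" * (-len(seq) % 4)
    return bytes(
        code[padded[i]] << 6
        | code[padded[i + 1]] << 4
        | code[padded[i + 2]] << 2
        | code[padded[i + 3]]
        for i in range(0, len(padded), 4)
    )


motif = "GATTACAGGCTTAACCGTAGCTAGGATCCATG"  # 32 made-up bases

aligned = pack_2bit("AAAA" + motif)  # motif starts at offset 0 mod 4
shifted = pack_2bit("A" + motif)     # motif starts at offset 1 mod 4

# As text, the motif is a plain substring in both cases.
assert motif in "AAAA" + motif and motif in "A" + motif

# As packed bytes, only the aligned copy still contains the motif's bytes.
assert pack_2bit(motif) in aligned
assert pack_2bit(motif) not in shifted
```

A byte-oriented matcher sees the shifted copy as unrelated data, even though the underlying bases are identical.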

vintermann

This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.

bede

Exactly. The line breaks break the runs of otherwise identical bits in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long range matching are different for otherwise identical subsequences.
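
A small Python sketch of the phase problem, using only the standard library. It does not model zstd's actual hashing; it just measures the longest exact run that two texts share, assuming a made-up random 600-base sequence, 60-column wrapping, and a 7-base prefix that puts the second copy out of phase.

```python
import random
from difflib import SequenceMatcher


def wrap(seq: str, width: int = 60) -> str:
    """Wrap a sequence FASTA-style, one newline every `width` bases."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest substring that appears in both a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


rng = random.Random(0)
shared = "".join(rng.choices("ACGT", k=600))  # stretch both genomes contain

genome_a = shared
genome_b = "ACGTACG" + shared  # same stretch, shifted by 7 bases

unwrapped_run = longest_common_run(genome_a, genome_b)
wrapped_run = longest_common_run(wrap(genome_a), wrap(genome_b))

print("longest shared run without newlines:", unwrapped_run)
print("longest shared run with newlines:   ", wrapped_run)
```

Without newlines the whole 600-base stretch matches as one run. With newlines, the copies are interrupted at different positions, so no shared run can span more than part of a line.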
