Removing newlines in FASTA file increases ZSTD compression ratio by 10x
12 comments
September 12, 2025 · mfld
nolist_policy
It also uses more memory during decompression (up to +2 GB), which makes it a potential DoS vector.
leobuskin
What about a specialized dict for FASTA? Shouldn't it increase ZSTD compression significantly?
bede
Yes I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.
rini17
In general, this might be a good preprocessing step: check for punctuation repeating at fixed intervals, remove it, and restore it after decompression.
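A minimal sketch of that preprocessing idea, assuming the input is a single block of sequence text (no FASTA header lines) that ends with a trailing newline and is wrapped at one fixed width. The function names `strip_fixed_newlines` and `restore_fixed_newlines` are hypothetical, chosen only for illustration.

```python
def strip_fixed_newlines(text: str) -> tuple[str, int]:
    """Remove newlines that recur at a fixed width; return (payload, width).

    Assumes the text ends with a newline. Raises ValueError if the line
    breaks are not at a fixed interval, so the caller can fall back to
    compressing the text unchanged.
    """
    if not text.endswith("\n"):
        raise ValueError("expected a trailing newline")
    lines = text[:-1].split("\n")
    width = len(lines[0])
    if width == 0:
        raise ValueError("empty first line")
    # Every line except the last must be exactly `width` characters long.
    if any(len(line) != width for line in lines[:-1]):
        raise ValueError("newlines are not at a fixed interval")
    # The last line may be shorter, but not empty and not longer.
    if not 0 < len(lines[-1]) <= width:
        raise ValueError("unexpected final line length")
    return "".join(lines), width


def restore_fixed_newlines(payload: str, width: int) -> str:
    """Reinsert a newline after every `width` characters and at the end."""
    return "".join(
        payload[i:i + width] + "\n" for i in range(0, len(payload), width)
    )


text = "ACGTAC\nGTACGT\nAC\n"
payload, width = strip_fixed_newlines(text)
restored = restore_fixed_newlines(payload, width)
```

Only the line width has to be stored alongside the compressed payload, so the round trip costs a few bytes of metadata.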
vintermann
That turns it into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.
bede
Yes, it sounds like 7-Zip/LZMA can do this using custom filters, among other more exotic (and slow) statistical compression approaches.
Kim_Bruning
Now I'm wondering why this works. DNA clearly has some interesting redundancy strategies. (it might also depend on genome?)
dwattttt
The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.
Depending on what you need to represent, you can get a 4x reduction in data size without any compression at all, just by representing each base (G, A, T or C) with 2 bits rather than 8.
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
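To make the 2-bit idea concrete, here is a rough sketch that packs four bases into each byte. It assumes the sequence contains only uppercase A, C, G and T (no N or other ambiguity codes), and the particular bit assignment in `CODES` is an arbitrary choice, not any standard.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment
BASES = "ACGT"                            # inverse of CODES, indexed by code


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, most significant bits first."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads it first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Unpack `length` bases; the length must be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACA"
packed = pack_2bit(seq)            # 7 bases fit in 2 bytes
unpacked = unpack_2bit(packed, len(seq))
```

The sequence length has to travel with the packed bytes, because the padding in the final byte is otherwise indistinguishable from real `A` bases under this assignment.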
hyghjiyhu
I think one important factor you failed to account for is frameshifting. Compression algorithms work on bytes (8 bits). Imagine the exact same sequence occurring at different offsets mod 4. Your encoding will then give completely different bytes, and the compression algorithm will be unable to make use of the repetition.
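A small illustration of the frameshift problem, assuming a simple four-bases-per-byte packing (the `pack_2bit` helper and its bit assignment are made up for this sketch). The same 16-base repeat is packed once on its own and once inside a longer sequence where it starts at offset 1.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte; this sketch requires len(seq) % 4 == 0."""
    if len(seq) % 4:
        raise ValueError("length must be a multiple of 4 in this sketch")
    return bytes(
        (CODES[seq[i]] << 6)
        | (CODES[seq[i + 1]] << 4)
        | (CODES[seq[i + 2]] << 2)
        | CODES[seq[i + 3]]
        for i in range(0, len(seq), 4)
    )


repeat = "GATTACAGGCTTACGA"              # 16 bases
shifted_text = "T" + repeat + "TTT"      # repeat starts at offset 1 (mod 4)
aligned_text = "TTTT" + repeat           # repeat starts at offset 0 (mod 4)

packed_repeat = pack_2bit(repeat)

# As text, the repeat is a plain substring in both cases.
found_in_shifted_text = repeat in shifted_text
# After packing, it is only visible as a byte substring when it is aligned.
found_in_shifted_packed = packed_repeat in pack_2bit(shifted_text)
found_in_aligned_packed = packed_repeat in pack_2bit(aligned_text)
```

A byte-oriented compressor sees only the packed bytes, so with this packing a repeat that lands on a different offset mod 4 no longer looks like a repeat.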
vintermann
This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.
bede
Exactly. The line breaks break the runs of otherwise identical bits in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long range matching are different for otherwise identical subsequences.
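The phase effect can be reproduced at small scale with Python's standard `zlib` module. This is only a stand-in: zlib is not ZSTD and has no long-range matching mode, and the sequences here are random rather than real bacterial genomes, so treat it as a sketch of the mechanism rather than of the 10x result. The 60-column wrap width and the helper name `wrap` are assumptions.

```python
import random
import zlib


def wrap(seq: str, width: int = 60) -> str:
    """Wrap a sequence at a fixed width, FASTA style, with a trailing newline."""
    return "".join(seq[i:i + width] + "\n" for i in range(0, len(seq), width))


def random_bases(rng: random.Random, n: int) -> str:
    return "".join(rng.choice("ACGT") for _ in range(n))


rng = random.Random(0)
shared = random_bases(rng, 5000)          # region common to both "genomes"
genome_a = shared
# A 7-base prefix puts the shared region out of phase with the line breaks.
genome_b_out = random_bases(rng, 7) + shared
# A 60-base prefix (one full line) keeps the shared region in phase.
genome_b_in = random_bases(rng, 60) + shared


def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode("ascii"), 9))


unwrapped_size = compressed_size(genome_a + genome_b_out)
out_of_phase_size = compressed_size(wrap(genome_a) + wrap(genome_b_out))
in_phase_size = compressed_size(wrap(genome_a) + wrap(genome_b_in))

print(unwrapped_size, in_phase_size, out_of_phase_size)
```

In the out-of-phase case every newline in either copy cuts a potential match short, so the second copy is encoded as many short matches instead of a few long ones.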