
How to Store Data on Paper?

81 comments

·May 31, 2025

pastage

Bits per [cm²|cm³|kg] is interesting, like what you get with cuneiform ceramic tablets[1]. These get about 1 word per cm², and cuneiform is crazy dense. I have no real grasp of how Sumerian or Akkadian words worked; I gather from a lecture[2] at the British Museum that they were heavily context-based.

I have seen people do ceramics where information was stacked in layers and had to be destroyed to extract it, the ultimate form of shifting media to preserve and read information. I guess that could be done at better resolution with 3D-printed zirconia (0.1 mm blobs), so roughly 1 Mb/cm³.

Edit: this idea of cold storage is from Footfall by Niven and Pournelle, where information was stored on monoliths whose layers could be incrementally extracted with tools documented on the layers above, i.e. start at 0.1 bit per m² and get denser as you go down, with the practical problems handled in the usual hand-wavy science-fiction way.

[1] https://www.bookandsword.com/2016/10/29/the-information-dens...

[2] https://youtu.be/XVmsfL5LG90

pavel_lishin

I don't recall the Fithp's artefacts requiring destruction to read; I thought they were just created with (presumably) lasers, writing the information in a way that would resist erosion. If a layer does get eroded, you just slice the eroded part away, revealing it again.

pastage

I do not remember, and I tried not to imply destruction. It is just the easiest way to do it on your own with ceramics.

dsign

Akkadian is/was syllabic. The language is pretty well preserved, I believe; some say there is more surviving text in Akkadian than in classical Latin[^1].

[^1]: Can't find the source right now, so take this with a grain of salt.

benhurmarcel

I have this type of issue professionally too, even though we don't use paper. For regulatory reasons, the only approved format we are allowed to use for long term archiving is PDF/A. No attachments, only pages in a single PDF document.

It has proven to be an issue for including data or spreadsheets. Most colleagues just print Excel files to a PDF that gets appended; while that complies with the regulation, it's basically unusable as-is.

anthk

DjVu should be the standard format.

tocs3

I have been thinking about this for a long time. Thanks for the link.

> The biggest advantage of character-based encodings is that they can be decoded by humans (as opposed to dot-based encodings), which means that you don’t need a camera or a scanner to recover the data.

This is an interesting point. In our post-apocalyptic future, scholars will be using their quills to translate archives of these (in my imagination, anyway). Of course, they would have to translate into binary and then into human-readable characters.

I can imagine they will be sad they cannot listen to the MP3s.

> Adding color allows one to code more information per dot (3x more with three colors).

Is this right? Wouldn't it be base-3 encoding? Three bits of binary can count to 8. Three trits of base three can count to 27. Color has all sorts of disadvantages, but maybe a much greater payoff (unless I'm mistaken).

kragen

If a pixel can be printed with no colors (white), cyan, magenta, yellow, cyan and magenta (blue), magenta and yellow (red), yellow and cyan (green), or all three inks (black), that's 8 colors and 3 bits per pixel, not just 3 colors. Laser and inkjet printers typically work more or less like this, but also have a fourth ink, black.
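A minimal sketch of that mapping, assuming one bit per ink channel (the ink names and function here are illustrative, not from the article):

```python
# Sketch: 3 bits per dot, one bit per ink channel (cyan, magenta, yellow).
# Illustrative only; real printers add a separate black ink and use dithering.
INKS = ("cyan", "magenta", "yellow")

def bits_to_inks(bits):
    """Map a 3-bit tuple to the set of inks laid down at one dot."""
    return {ink for ink, bit in zip(INKS, bits) if bit}

print(bits_to_inks((0, 0, 0)))  # set()                        -> white paper
print(bits_to_inks((1, 0, 1)))  # {'cyan', 'yellow'}           -> green
print(bits_to_inks((1, 1, 1)))  # {'cyan','magenta','yellow'}  -> composite black
```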

I am very skeptical of this idea that people will be able to write but unable to produce useful digital computers. Computers are a mathematical discovery, not an electronic invention. Electronics makes them a thousand times faster, but a computer made out of wood, flax threads, and bent copper wire would still be hundreds of times faster than a person at tabulating logarithms, balancing ledgers, calculating ballistic trajectories, casting horoscopes, encrypting messages, forecasting finances, calculating architectural measurements, or calculating compound interest. So I think automatic computation as such is likely to survive if any human intellectual tradition does.

tocs3

> I am very skeptical of this idea that people will be able to write but unable to produce useful digital computers.

I agree. When I first saw the post and the mention of humans on the reading end of the loop, I thought "maybe there is a sci-fi story here". It's hard to imagine a scenario that leaves humans but not many artifacts except caches of paper (or other "printed" media). Maybe a remote tribe of uncontacted people (or another species altogether) inherits the Earth after a modern-world apocalypse kills off everyone in the technologically more advanced world.

A civilization starting from scratch would still need to develop a fair bit of mathematical and scientific/technological sophistication before understanding and starting to use the artifacts left behind. In particular, optical/color paper scanners would have been difficult before the 20th century.

usrbinbash

> In our post apocalyptic future scholars will be using their quills to translate archives of these

Imagine tomes of programming lore, dutifully transcribed by rooms of silent scribes, acolytes carrying freshly finished pages to and fro, each page beautifully illuminated with pictures of the binary saints, to ward off Beelzebug.

sweettea

See also: the first part of A Canticle for Leibowitz.

nulbyte

Thank you for this. I had never heard of this book, but it sounds intriguing, and my local bookstore happens to have a copy.

adzm

The inhernt errror resilience in charactre encoding of human languige is also an intersetnig point.

myself248

This is why, when pulling wire, I write out the numbers longhand on the end of each one. "SEVENTEEN" is a lot more smudge-resistant and unambiguous umop-apisdn than "L1".

mackmgg

> Is this right? Wouldn't it be base-3 encoding? Three bits of binary can count to 8. Three trits of base three can count to 27. Color has all sorts of disadvantages, but maybe a much greater payoff (unless I'm mistaken).

In this case they're not directly using the color to store information; they just have three differently colored QR codes overlaid on top of each other. With that method you can use a filter to separate them back out, and you've got three separate QR codes' worth of data in one place. The way they're added ends up using more than just three colors in that example.

If you were truly to use colored dots to store binary information without worrying about using a standard like QR, I think you'd be going from base-2 (white and black) to base-3 (red, blue, green) or more likely base-4 (white, red, blue, green) or even base-8 (if you were willing to add multiple colors on top of each other) in which case yeah you'd have way more than just 3x the data density.
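A rough sketch of the channel-splitting idea from the first paragraph, assuming the three QR codes were overprinted in cyan, magenta, and yellow ink and scanned cleanly (the file names are hypothetical; a real scan would need calibration and crosstalk correction):

```python
# Cyan ink absorbs red light, magenta absorbs green, yellow absorbs blue,
# so each overprinted QR code shows up most strongly in one RGB channel.
from PIL import Image

scan = Image.open("overlaid_qr_scan.png").convert("RGB")  # hypothetical scan
red, green, blue = scan.split()

for ink, channel in (("cyan", red), ("magenta", green), ("yellow", blue)):
    # Crude binarization; a real decoder would threshold against ink crosstalk.
    channel.point(lambda v: 255 if v > 128 else 0).save(f"qr_{ink}.png")
```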

CorrectHorseBat

> In this case they're not directly using the color to store information; they just have three differently colored QR codes overlaid on top of each other. With that method you can use a filter to separate them back out, and you've got three separate QR codes' worth of data in one place. The way they're added ends up using more than just three colors in that example.

That's only true if you can print and read colors at a high enough resolution that you don't destroy information at 3x the density; I'm not sure that's generally true.

>If you were truly to use colored dots to store binary information without worrying about using a standard like QR, I think you'd be going from base-2 (white and black) to base-3 (red, blue, green) or more likely base-4 (white, red, blue, green) or even base-8 (if you were willing to add multiple colors on top of each other) in which case yeah you'd have way more than just 3x the data density.

Base 8 is exactly 3x the data density. (Log(8)/log(2))

CorrectHorseBat

Adding 3 colors would make it base 5 (BW+rgb) and give log(5)/log(2) or about 2.3 times the information per dot.

LgWoodenBadger

2 dots at 5 possibilities each gives 25 (5^2)

2 dots at 2 possibilities each gives 4 (2^2)

They only diverge from there. Or am I doing my math wrong?

CorrectHorseBat

Information is ~log(possible states) according to Shannon.

Log(25)/log(4) is 2.3. Among other things this definition has the nice property that two disks/pages/drives/bits together contain the sum of the capacities instead of their product.
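For reference, a quick numeric check of the figures in this subthread (information in bits is just log₂ of the number of per-dot states):

```python
from math import log2

print(log2(2))             # 1.0   bits/dot: black/white only
print(log2(5))             # ~2.32 bits/dot: black/white plus red/green/blue
print(log2(8))             # 3.0   bits/dot: all 8 CMY ink combinations
print(log2(25) / log2(4))  # ~2.32, the same ratio computed from two dots
```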


spencerflem

I think for that use-case (copying by quill), just writing plaintext from the start would be the move

mk_stjames

70-100 kilobytes on a single sheet of paper by tiling QR codes is pretty dense.

I find it interesting that if you were to print 4 sheets double-sided, you would have roughly the same amount of information as a 720 KB 5.25" floppy disk, and if you cut and folded them, they would take up roughly the same size and weight.
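A back-of-the-envelope check, assuming the midpoint of the 70-100 KB range quoted above:

```python
kb_per_side = 90            # assumed midpoint of the 70-100 KB/page figure
sides = 4 * 2               # 4 sheets printed double-sided
print(kb_per_side * sides)  # 720 KB, about one 5.25" floppy
```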

lifthrasiir

I have pondered this from time to time and concluded that paper data storage is of very limited use, mainly because of the information density. Any remotely human-readable form is too sparse to be useful (<10 KB/page), while dot-based or color-based approaches are heavily limited by printing techniques (<500 KB/page). It is hard to preserve paper unless you are willing to sacrifice its information density even more.

For this reason, paper is at best useful as a bootstrapping mechanism, which would allow readers to construct a mechanism to read more densely encoded data. My best guess is that the main storage of information in this case would likely be microfilm, which should be at least 100x denser than the ideal paper data storage. Higher density allows for using less dense encodings to aid readers. And as far as I know, microfilm is no harder to preserve than paper.

pastage

It degrades too fast; microfilm archives need to be digitized now, and the solvents, image chemicals, and media are all part of the problem with microfilm. Archival paper is a nice medium that can be stored for a long time. This is of course a question of how long you want to store your information; if you want to do 00500 years, it is probably good.

Or just go with metal https://rosettaproject.org/

Or try to create a culture for humans and store information in that.

xyzzy123

Metal engraving is fairly accessible these days.

A fiber laser in the 100 W range would do it, maybe $10k?

You could do photochemical etching, but it would be more fuss and wouldn't last as long as laser engraving.

You're probably looking at on the order of 1 GB per 1000 kg if using 1 mm 316 plate (napkin math only, naive estimate). Interesting to explore.
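The figure seems roughly consistent. Here is the napkin math spelled out, with assumed values for the material properties and payload (none of these numbers are from the comment itself):

```python
density_316 = 8000           # kg/m^3, approximate density of 316 stainless
thickness = 0.001            # m, 1 mm plate
mass = 1000                  # kg

area_m2 = mass / (density_316 * thickness)  # ~125 m^2 of plate surface
bits = 8e9                                  # ~1 GB expressed in bits
bits_per_cm2 = bits / (area_m2 * 1e4)       # ~6400 bits/cm^2
pitch_mm = 10 / bits_per_cm2 ** 0.5         # ~0.125 mm dot pitch required
print(round(area_m2), round(bits_per_cm2), round(pitch_mm, 3))
```

A dot pitch on the order of 0.1 mm is coarse for a fiber laser, so the estimate looks conservative rather than optimistic.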

WorldPeas

I wonder if glass/plastic would be viable given the availability of DVD/M-DISC burning lasers (though that format is kneecapped by its issues with glue). Is there any good literature about "burning" nonuniform durable materials in a rotary disc burner, or am I off base here about the capabilities of these smaller lasers?

lifthrasiir

Maybe. Anything that can be photographically etched and is durable enough would work well.

tokai

This. The right paper will last significantly longer than microfilm.

IAmBroom

> It is hard to preserve paper, unless you are willing to sacrifice its information density even more.

We have paper books from 500 years ago. Microfiche is already deteriorating.

If you keep paper dry and flat, and use pH-neutral inks and paper, it is extremely stable.

01HNNWZ0MV43FF

Dry and flat... Laminated? Or will the plastic degrade quicker than the paper?

mystified5016

Likely the adhesive in the laminate would degrade the paper over long periods. Lamination also causes additional physical stress on the contained page when handled.

I'd also expect the plastics to go yellow and opaque over long periods, and recovering the document without damage may be difficult or impossible.

bbarnett

I wonder about, as others have said, an easily OCRable font, but maybe with an added compressor: a zip-type program specially designed for the limited character set.

If we just have text files, and maybe vector graphics for simple schematics, that's a lot of info.
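A minimal sketch of the idea, using a generic compressor plus base32 (A-Z and 2-7) as a stand-in for a purpose-built limited-alphabet encoder:

```python
import base64, zlib

def to_printable(text: str) -> str:
    """Compress, then re-encode into an OCR-friendly 32-character alphabet."""
    return base64.b32encode(zlib.compress(text.encode("utf-8"), 9)).decode("ascii")

def from_printable(printed: str) -> str:
    return zlib.decompress(base64.b32decode(printed)).decode("utf-8")

page = "Plain text plus simple vector schematics is a lot of info.\n" * 20
assert from_printable(to_printable(page)) == page
```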

account-5

The color dot encodings are interesting; you could encode data in a floor mosaic. And with my limited understanding, the more colours, the higher the amount of data?

You could encode data in monolithic structures this way. They'd last longer than paper and give future generations lots of confusion trying to figure out the meaning.

datadrivenangel

Except when the colors fade over time and people steal the purple ones to decorate their homes preferentially.

goda90

Just "backup" the data with duplication. For example you could color the floor beneath the mosaic, and the grout used for each tile, so as each layer is removed or faded, it still lasts a little longer. Duplicate your mosaic on both the floor and the ceiling. Duplicate your mosaic in multiple buildings in multiple cities.

rickcarlino

I got curious about OCR as a sort of poor man’s microfiche. I printed a test paragraph on high quality paper with a laser printer. The smallest font I could read under a USB microscope was 2.5pt, though I could probably have gone smaller if I used polymer paper. The fibers of the paper are quite apparent under a microscope. Transparency film paper was too smudgy.

ryukoposting

Fun fact: magazines actually distributed software on paper briefly in the 1980s.

https://youtu.be/mIGotStRCkA?si=toG5xeLMZzjIGTxC

It's more like a long, linear barcode, but still. More often, they put the source code in the magazine and you'd just type it into your machine.

lizknope

I typed in a lot of Atari BASIC code from magazines but I never heard of this. Really cool!

Ghoelian

Oh yeah, I forgot all about those! The ZX Spectrum was way before my time, but for some reason I still spent a lot of time typing code over from a magazine into a Spectrum emulator as a kid.

bn-l

I did my own testing of this. I arrived at using very large QR codes with a lot of redundancy. You can scratch them, etc., and they'll still read. Also, it's an extremely ubiquitous format and everyone knows what it is by looking at it.
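Not necessarily the same setup as described above, but a sketch of the general approach with the third-party qrcode package (pip install qrcode[pil]); the highest error-correction level lets roughly 30% of the codewords be damaged and still decode:

```python
import qrcode

qr = qrcode.QRCode(
    error_correction=qrcode.constants.ERROR_CORRECT_H,  # highest redundancy level
    box_size=20,   # large modules survive printing, scanning, and small scratches
    border=4,
)
qr.add_data("data to archive goes here")
qr.make(fit=True)
qr.make_image(fill_color="black", back_color="white").save("archive_qr.png")
```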

zvr

Interesting.

I am not sure why, for character-based encodings, they used a general-purpose font (Inconsolata) rather than one that is specifically made for OCR -- and whether that would have made it better.

Going further, if you only print a limited alphabet (16, 32 or 39 symbols) why not use a specialized font with only these characters? The final step is to use a bitmap "font" that simply shows different character values as different bit patterns.
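A minimal sketch of that last idea, rendering each byte directly as a row of 8 black/white cells so the printed glyph is the bit pattern itself (the layout is hypothetical, not from the article):

```python
from PIL import Image

def render_bytes(data: bytes, cell: int = 8) -> Image.Image:
    """One byte per row, MSB on the left; black cell = 1, white cell = 0."""
    img = Image.new("1", (8 * cell, len(data) * cell), 1)  # 1-bit image, white bg
    px = img.load()
    for row, byte in enumerate(data):
        for col in range(8):
            if byte & (0x80 >> col):
                for dy in range(cell):
                    for dx in range(cell):
                        px[col * cell + dx, row * cell + dy] = 0  # black cell
    return img

render_bytes(b"DATA").save("bit_pattern.png")
```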

upofadown

The font choice is discussed here:

https://www.monperrus.net/martin/perfect-ocr-digital-data

From the linked article:

> The optimal font varies very much on the considered engine. Monospaced fonts (aka fixed-width), such as Inconsolata, are more appropriate in general. OCR-A and OCR-B give really poor results.

I noticed that they liked using lower case letters for bases where that is a choice. I would think that the larger, upper case letters would be better for OCR. Using lower case for either OCR-A or OCR-B would be a poor idea in any case. The good OCR properties are only provided for the upper case letters. The lower case letters were mostly provided for completeness.

Also, the author might be training on entire blocks of characters rather than individual characters. That isn't really what you want here unless you are using something like words for your representation. OCR-A and OCR-B were designed for character-by-character OCR.

numpad0

Is there like, "from pynebraskaguyocr import decodeocra" just out there? I haven't seen any of those for some reason.

pmontra

The post is 504 now. Alternative link: http://archive.today/N9ZTb

IAmBroom

I archived it by printing screenshots on my Brother printer, if anyone wants me to snail-mail them the original.

talles

I imagine that for most of the examples you would need a printer/scanner with better-than-average resolution, and that the paper would only work if it's in pristine condition, considering how small the visual details are.