
Archival Storage


199 comments

·March 17, 2025

entrepy123

It's kinda mind-blowing that we have (so-called) AI, quantum computing, 6K screens, M.2 NVMe, billions of networked devices, etc., but regular data *can only be expected to last about 5 years* due to the propensity of spinning disks to fail, SSD impermanence, bitrot, etc., and this is only overcome with great attention and significant cost (continually maintaining a JBOD or RAID or NAS, or painstakingly burning to M-Disc Blu-ray, etc.), or by handing it over to someone else to manage (cloud), or both. I mean, maybe you get lucky with a simple 3-2-1, but maybe you don't, and for larger archives of data that is not necessarily a walk in the park either.

Absolutely mindblowing.

Rygian

Emphatic Yes.

I'd like to expand. What I find mindblowing about it is that, as a regular consumer:

* When you need more space you can't just plug in another disk or USB stick. You also have to choose which data lives on which device, and you have to tell all your software to use it. And that may involve shuffling data around.

* As a corollary, you need to remember in which device you put which stuff.

* As an extra corollary, any data loss is catastrophic by default.

* File copy operations still fail, and when they fail, they do so without ACID-strong commit/fallback semantics.

* Backups don't happen by default, and are not transparent to the end user.

* Data corruption can be silent.

Bonus, but related:

* You can't share arbitrary files with people without going through a 3rd party.
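
The point about copies failing without commit/fallback semantics is fixable in user space. Here is a hedged sketch of the standard write-to-temp-then-rename pattern (assuming POSIX semantics, where `os.replace()` is atomic within one filesystem); it's not full ACID, but the destination is never left half-written:

```python
import os
import shutil
import tempfile

def atomic_copy(src: str, dst: str) -> None:
    """Copy src to dst so that dst is either its old content or the
    complete new copy, never a partial file. Assumes dst's directory
    is on one filesystem, where os.replace() is an atomic rename."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Stage the copy in a temp file next to the destination.
    fd, tmp = tempfile.mkstemp(dir=dst_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())  # force data to stable storage
        os.replace(tmp, dst)        # atomic commit
    except BaseException:
        os.unlink(tmp)              # roll back the staged copy
        raise
```

Nothing mainstream does this by default for plain file-manager copies, which is exactly the complaint.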

nine_k

This is because most of the non-technical retail customers don't value reliability too much. They want more capacity and good speed at the lowest price point. Hence the proliferation of QLC / MLC NVMe drives with a small SLC write cache. It's not very stable, and it can only keep the write speed high for small files, but hey, you can get a terabyte for $50, and it loads the latest game real fast!

Similarly, many users don't have much valuable, unique data on their computers. The most important stuff very often lives completely in the cloud, where it's properly backed up, etc.

Also, the most ubiquitous computing device now is a smartphone. It has all the automatic backup stuff built in; you can put an SD card into it, and it will transparently extend the free space without hassle. Even on PCs, MS and Apple nudge users very prominently to use OneDrive and iCloud for backing up their desktops/laptops. But past a certain size it costs money, so many people opt out of that. Again, because most people value the lowest price and just hope for the best; because what could ever happen?

Silent data corruption can still be an issue, but, frankly, malware is a much bigger threat for a typical non-technical user.

Technical people have no trouble setting all that up: mount your disks under LVM, run ZFS on top of it, set up multiple backups, set up their own "magic wormhole" to share files with strangers easily. But they know why they are doing it.

Educating users about IT hygiene is key for improving it, much like educating people about the dangers of not washing hands, or of eating unhealthy stuff, helped improve their health.

acdha

> This is because most of the non-technical retail customers don't value reliability too much. … Similarly, many users don't have much valuable, unique data on their computers. The most important stuff very often lives completely in the cloud, where it's properly backed up, etc.

There’s a contradiction here which I think is worth fully considering. People definitely do value their data – data recovery services have been getting revenue for decades from people who lost things they care about – but there’s been a market failure where operating system vendors left the users to fend for themselves, everyone agreed that was hard, and the majority of users quite rationally decided that cloud services were better than learning to be unpaid sysadmins.

What we should be asking is why this is all so hard. Local file systems with strong integrity checks and seamless replication onto attached storage should be so basic that people can just assume it’ll work without them having to spend time on it.

rietta

And without care, everyone's precious photo albums will go with them to the grave. We live in a time where the books and data we write are likely to vanish long before those of any previous period. We don't leave behind books, codices, scrolls, or stone tablets anymore.

Scholars will have more clues about life in the 4th century, via the Oxyrhynchus Papyri collection, than about 21st-century Terre Haute, Indiana.

account42

> Silent data corruption can still be an issue, but, frankly, malware is a much bigger threat for a typical non-technical user.

Complete BS. Cryptolockers and similar are not a concern for regular users while disk failure is something pretty much everyone gets to experience over their lifetime.

creer

Well, professionally, tape is it - the technology exists and it lasts more than 5 years. Unfortunately, the market for tape has evolved such that it's not very friendly to the non-pros. Not impossible, but not friendly. That probably has to do with the lack of perceived market among non-corporate users - or perhaps the impression that clown storage is where it's at for non-corporate.

To be fair some more, JBOD/RAID and hard drives do work pretty well. Past the 5 year horizon, to be sure.

Product mgt and corp finance have also fallen in love with subscriptions - and clown storage is such an awesome match for that! Who needs to sell long-term terabyte solutions when you can rent them out? Easy to argue against that logic of course, but not easy to fight.

tombert

I have an LTO-6 tape drive. It works fine, but it is a pain in the ass to set up on Linux. It only connects via SAS, you need to load a lot of arcane kernel modules, the logs are non-standardized and often misleading, and the entire interface is command line based.

I don't mind living in the command line and I don't even mind fighting to get everything up, but I don't see most people putting up with it. It's also a huge pain to get working with a laptop, since I don't think most laptops have a SAS connector, so you have to use an eGPU case with a Host Bus Adapter, which is its own share of headaches.

creer

Hehe. If you're trying to live with just a laptop and an LTO drive, I can certainly see it! I'd expect most people who get into LTO drives have a massive set of hard drives, some convoluted desktop-and-up case(s) to run them, a couple laptops and random peripherals such as cameras, scanners and whatnot. USB all over the place. - So that for them both the drive and the command line interface are deeper in a technology map.

But that goes back to the minimal market for LTO among amateurs. They will scratch their itch and write software for it but it's not exactly critical mass.

MisterTea

And here I just bought a NIB LTO-5 SAS drive for home file server backup.

pdimitar

> or perhaps the impression that clown storage is where it's at for non-corporate.

Clown storage.

Thanks for the giggle. :D Needed it, had a pretty rough last couple of days.


EGreg

Why is tape the best thing? One would think two- or three-dimensional storage could pack way more data!

wtallis

Nobody needs to worry about storage density in the sense of how much shelf space you need for cold storage. Tape, NAND flash, and even high-capacity hard drives are all dense enough that the incremental cost of more shelf space is not the most important part of the problem.

And NAND flash is already pretty 3D, with hundreds of layers of memory cells fabricated on the surface of each wafer, and several such dies stacked in each BGA package, and it's not uncommon for U.2 drives to contain two PCBs each with NAND packages on each side.

creer

Tape is a way to get volume. Through easily spooled layers. Not fast access but otherwise not unreasonable. Plus tapes stored outside of the drive.

Another way to look at it: "and yet here we are". Tape is still what has led to the highest density so far (depending on how much storage you look at, and the cost tradeoff). But there's also the cost tradeoff of separating the drive from the medium, which hasn't really worked out: LTO tapes are pro supplies - and so, expensive.

codemac

Good thinking - there are many projects around this. Ceramic, DNA, 3d glass, etc. They all run into the same IO problems tape has though.

UltraSane

Tape on a spool is 3D volumetric storage

mrweasel

Tape is generally available; I don't think any of those more esoteric storage media, like 3D glass cubes, are commercially available.

bob1029

I think it's even more mind-blowing that we can hold back the tide of entropy for as long as we can and with so little energy expended.

Solid state electronics and magnetic media are beyond magical. The odds of keeping terabytes of data on rails are astronomically bad.

rietta

Yes, it is. I actually have been mulling over a fictional world set in the future where the period between the 20th and 25th centuries is a mysterious time that so little is known about. The story follows a professor who is obsessed with the "Bit Rot Era" and finding out just what happened to that civilization.

I have a prototype first chapter written that cold opens with an archeological dig '...John Li Wei looked up from his field journal where he had just written “No artifacts found in Basement Level 1, Site 46-012-0023”, wiping sweat from his brow. "Did you find something, Arnold?" he asked, his voice weary. "Three days in this godforsaken jungle, and we've got nothing but mud to show for it. Every book in this library’s long since turned to muck.” Arnold gestured towards the section of the site he had been laboring for the last 30 minutes, digging through layer after layer of brown muck, with fragments of metal hardware that once supported shelving. A glint of metal caught the filtered light. “Arnold, that’s just another computer case,” John sighed, his shoulders slumping slightly. He could already imagine the corroded metal and the disintegrated components inside. Useless. “Help me pull this out.” The two men strained against the clinging earth, their boots sinking into the mud with each heave. As they finally wrestled the heavy, corroded metal case free, a piercing shriek cut through the jungle sounds – beep, beep, beep, beep....'

lou1306

You may want to check Memoirs Found in a Bathtub by Stanisław Lem :)

NetOpWibby

I'm riveted, please share more.

rietta

I suppose I can publish on my blog and then keep on writing. We can discover this future together, along with anyone who wants to read it. This is my first serious attempt at fiction. I have only just begun.

This is another portion of later in the first chapter.

'...The transport, a ground-based vehicle, levitated silently outside. John glanced at his chronometer as he boarded. Jeg er ked af det, he thought, I'm late. The doors hissed shut, and the vehicle computer announced, “Destination: University. Estimated arrival: 25 minutes.” With a gentle hum, the vehicle glided smoothly along the elevated guideway. The air inside was cool and faintly metallic. Outside, the landscape was a patchwork of green fields, managed forests, and gleaming white research facilities. The transport’s progress was slow; the gentle sway and hum of the engines were a constant reminder of the NAU’s strict energy policies. John sighed, thinking about the upcoming lecture. How could he convey the importance of the Bit Rot Era when so many were focused on the pressing needs of the present?

At the University, John rushed into the classroom, three students already waiting. “Jeg er ked af det, I’m late,” John said, a slight Danish accent coloring his Danglish. The classroom was intimate, a very different design from the great lecture halls of antiquity, designed for perhaps twenty students, with a central platform surrounded by holographic cameras for remote attendees. Historical maps and timelines adorned the interactive displays lining the walls.

John quickly moved to the lectern and carefully removed six artifacts from his bag. “Velkommen to Ancient North America 1,” he began. “Welcome to Ancient North America 1. This is the class where you are going to learn about the past that was and the present that may yet be.” He gestured to a holographic timeline that appeared above the lectern. “In broad strokes, we consider the history of Ancient North America to consist of four periods: the pre-colonial, the rise of the nations – of Canada, the United States of America, and Mexico – the pre-Collapse, and the post-Collapse periods. You can take in-depth courses on most of these. Dr. Jones’ ‘Rise of the Nations’ is well worth your study to learn about the United States, its Constitution, and Canada. This very university is located in a place that was once called Newfoundland, Canada, and in ancient times the climate was quite harsh. Meget forskelligt from the lush green farmland, forests, and even nice beaches we have today. You can also take Dr. Pech’s history of the pre-colonial tribes and empires. That will teach you a great deal about the people who first inhabited the continent and learn about their history, culture, foodstuffs, many of which are still considered extinct. Desværre.

“However,” John continued, his tone shifting, “the class you cannot take is for what we call the Bit Rot Era. A ‘bit’… imagine a light switch. On or off. That’s a bit. One or zero. The basic unit of digital information. The Bit Rot Era, from roughly the 20th to the 25th century, is a complete black box. En sort boks. What we do know comes from fragments of writing on paper. Millions of books were printed, but even those are often lost to time. We have fragments of ancient texts that talk about computers in every home and the ‘digitization’ of information and libraries. Digitization was the process of scanning physical media into computers. We’ve recovered millions of artifacts – disintegrated polycarbonates, silica, and bits of rare metals that were once these computers. But nothing within survived. Then something happened. The books vanished. The period from about the 21st to the 25th centuries was, as it were, expunged from history. Research indicates they went entirely digital, depending on system administrators to maintain the data… until the administrators stopped.”...'

mtillman

Important to remember that M-Disc Blu-ray is marketing only and not a spec, whereas M-Disc DVD is an actual spec for archival purposes.

RGamma

I often think we're living in a dark age (an age characterized by little surviving cultural output)... Depends on how things will go, of course, but I ain't holding my breath.

biofox

An important reason to continue buying printed books and supporting print media.

Most web content I've consumed in my lifetime is already lost, many floppy disks and burned CD-ROMs I held onto are now unreadable, and in 200 years, the situation will only be worse.

But I can go to the British Library and read a 1000+ year old text without much difficulty.

RGamma

FWIW does any of it remain on archive.org? They have some really great collections on there, and writing on digital archiving, like how to preserve CDs as best as possible (spoiler: the brand of CD is a large factor).

account42

It's a mistake to assume that just because there are 1000-year-old texts, the optimized-for-cost prints you can buy today will also last that long.

remus

In some ways it is surprising, but the examples you gave are only currently straightforward because of massive investment over many years by thousands of people. If you wanted to build chatGPT from scratch I'm sure it would be pretty hard, so it doesn't seem so unreasonable that you might pay someone if you care about keeping your data around for extended periods of time.

vodou

How good/bad would it be to have a poor man's tape archival, using standard cassette tapes (C90, C120, etc)?

For example, using something like ggwave [1]. I guess that would last way more than 5 years (although the data density is rather poor).

[1] https://github.com/ggerganov/ggwave

EvanAnderson

> ... (although the data density is rather poor).

"Rather poor" is putting it mildly. This sent me down a sort of rabbit hole. From a Stack Exchange discussion[0] it was a short trip to an exceedingly technical discussion about using QAM encoding[1] to really beef up the storage capability.

With the wacky QAM encoding it looks like maybe 20MB per C90 cassette (and 90 minutes to "read" it back).

[0] https://retrocomputing.stackexchange.com/questions/9260/how-...

[1] https://redfrontdoor.org/blog/?p=795
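
For a sense of scale, the arithmetic is simple: capacity grows linearly with symbol rate (baud) and bits per symbol (QAM order). A hedged sketch (the specific rates below are illustrative, not claims about what a given tape deck can actually sustain):

```python
def cassette_capacity_mb(minutes: float, baud: float, bits_per_symbol: int) -> float:
    """Rough upper bound on cassette data capacity in decimal megabytes."""
    seconds = minutes * 60
    bits = seconds * baud * bits_per_symbol
    return bits / 8 / 1_000_000

# A conservative 1500-baud 16-QAM signal (4 bits/symbol) on a C90:
print(cassette_capacity_mb(90, 1500, 4))   # about 4 MB

# Hitting ~20 MB per C90 implies roughly 30 kbit/s of throughput,
# e.g. ~7500 baud at 4 bits/symbol -- aggressive for audio tape.
print(cassette_capacity_mb(90, 7500, 4))   # about 20 MB
```

So the ~20MB figure sits at the optimistic end of what the math allows.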

rbanffy

It's interesting. I wouldn't dare to go beyond a 1500 baud signaling rate, but, then, audio tape is amenable to QAM, and that could multiply the transmission to 16 bits per token or more depending on the quality of the tape and recorder.

I would be careful with that, however. If you are archiving your data, it's because you like it and, if you like your data, you want it to be readable a long time from now. I'd suggest vinyl records rather than tapes, as they are very robust, and can be read without physical contact.

dmd

A C120 can store, very generously, about 1 megabyte.

An LTO-8 tape stores 12,000,000 megabytes (12 TB native).

kmoser

I have a hundred or so such tapes that contain Commodore PET programs from the early 1980s. Last time I tried to read them (about 10 years ago, when they were about 35 years old), I had...mixed results.

Part of that may be the tape drive (about 40 years old) but the reality is that consumer level cassette tapes aren't built to last: magnetic fields weaken, coating flakes off, tape stretches, and other factors prevent these from being storage solutions beyond 10-20 years (my guess), if that.

They might be a fun nostalgic diversion for listening to old music, where the audio degradation is part of the experience, but for data, they're a non-starter in my book.

jabl

Somewhat related, I think there were some projects to use VHS video cassettes for data storage too. It was much better than C cassettes, but still a very far cry from what one would consider worthwhile these days. IIRC a couple of GB per cassette?

creer

LTO tape tech has gotten into pretty nutty territory - in order to achieve its density and speed. It wasn't "easy". So, so far away from C90 technology.

alnwlsn

I've thought about the "hundreds of years" problem on and off for a while (for some yet to be determined future time capsule project), and I figure that about all we know for sure that will work is:

- engraved/stamped into a material (stone tablets, Edison cylinders, shellac 78s, vinyl, voyager golden record(maybe))

- paper, inked (books) or punched (cards, tape)

- photography; microfiche/microfilm (GitHub Arctic Code Vault), lithography?

I actually looked into what it might take to "print" an archival grade microfilm somewhat recently - there might be a couple options to send out and have one made but 99.99% of all the results are to go the other way, scanning microfilm to make digital copies. This is all at the hobbyist grade cheapness scale mind you, but it seems weird that a pencil drawing I did in 2nd grade has a better chance of lasting a few hundred years than any of my digital stuff.

aisamu

You might be interested in this talk by Will Byrd, of miniKanren fame:

Personal Data Preservation, Inspired by Ancient Writing https://ericnormand.me/clojuresync/will-byrd

jack_pp

Not weird at all that a piece of paper you wrote 20 years ago which has like 5-10KB of info that you can decode without any tech can stand the test of time. Archives are hard because of scale, env factors etc.

PaulHoule

Cost calculations are often different at the enterprise scale from the individual scale. Hypothetically

https://en.wikipedia.org/wiki/Linear_Tape-Open

is an affordable storage medium if you need to store petabytes but for what the drive costs

https://www.bhphotovideo.com/c/product/1724762-REG/quantum_t...

you could buy 400 TB worth of hard drives. Overall I'd have more confidence in the produced-in-volume hard drives compared to LTO tapes, which have sometimes disappeared from the market because vendors were having patent wars. Personally I've also had really bad experiences with tapes, going back to my TRS-80 Color Computer, which was terribly unreliable; getting a Suntape with nothing but zeros on it when the computer center at NMT ended my account; the "successful" recovery of a lost configuration from a tape robot in 18 hours (I had reconstructed it manually long before then), ...

sshagent

My day job (company died a couple of weeks ago). We had > 100,000 LTO tapes in the end. With data archived way back in 2002 until present. We were still regularly restoring data. In our busiest years we were doing what averaged to 177 restores per day (365 days a year). Barely any physically destroyed tapes.

I see a few articles citing robotic failures as a big issue, but really someone can just place a tape in the robot if critical recovery is needed and the robot has died.

dabiged

Curious to hear more details about your previous job? What were you doing to require 100k tapes?

sshagent

VFX. Every commercial, movie, or TV show worked on would have the assets used to create the content archived.

JeremyNT

Tape is reliable and suitable for long term archiving, but it still needs care and feeding.

Having some kind of parity data recorded so losing a single tape does not result in data loss, routine testing and replacement of failing tapes, and a plan to migrate to denser media every x years are all considerations.

Spinning rust just feels simple because the abstractions we use are built on top of a substrate that assumes individual drive (or shelf) failure. Everybody knows that if you use hard drives you'll need people to go around and replace failing hardware for the entire lifetime of the data.

myself248

There's a biiiiiiiiiig asterisk on all tape storage, about temperature and humidity. It's not like paper that you can leave in an attic for a century and still find readable.

People restoring old tapes right now have to do all sorts of exotic things with solvents to remove the mildew and baking the tapes to make the emulsion not immediately fall off the substrate, etc. I have to imagine that at today's density, any such treatment would be much worse for the data.

So those tapes are only as immortal as their HVAC. One hot humid summer in the wrong kind of warehouse may be it.

IIsi50MHz

Similarly, I worked at a place where, before I joined, a system upgrade gone wrong had forced the retrieval of backup tapes stored in a metal safe whose temperature had been below the dew point. Neither the tape cases nor the safe were sealable against moisture. This meant they had no backups of data they were required to retain for five years. And of course, the person who attempted the upgrade resigned.

wmf

This is mentioned in the article.

There's an old presentation from Google where they mentioned that they were the only ones who read back their tapes to make sure they work.

simonw

That little note half way through this that said "The Svalbard archipelago is where I spent the summer of 1969 doing a geological survey" made me want to know more about the author - and WOW they have had a fascinating career: https://blog.dshr.org/p/blog-page.html

See also https://en.wikipedia.org/wiki/David_S._H._Rosenthal

hedgehog

I met him once at a conference, very interesting guy and funny too.

jewel

If you're using cloud storage for backups, don't forget to turn on Object Lock. This isn't as good as offline storage, but it's a lot better than R/W media.

At work we've been using restic to back up to B2. Restic does a deduplicating backup, every time, so there's no difference between a "full" and an "incremental" backup.
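
A minimal sketch of that kind of setup (the bucket name and paths here are placeholders, and the B2 credentials go in the environment):

```shell
# One-time: create the encrypted repository in a B2 bucket.
export B2_ACCOUNT_ID=...        # placeholder credentials
export B2_ACCOUNT_KEY=...
restic -r b2:my-bucket:backups init

# Every run is a full, deduplicated snapshot; only new chunks upload.
restic -r b2:my-bucket:backups backup /home/me/data

# Verify and prune on a schedule.
restic -r b2:my-bucket:backups check
restic -r b2:my-bucket:backups forget --keep-daily 7 --keep-monthly 12 --prune
```

Note that restic's own retention (`forget --prune`) and B2's Object Lock interact: immutability has to be configured so pruning still works within your retention policy.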

rtkwe

I wish tape archival was easier to get into. But because it's niche and mainly enterprise, drives usually start in the multiple thousands of dollar range unless you go way down in capacity to less than a modern SSD.

codemac

No, it's because of IBM's monopoly. It has little to do with it being enterprise.

markhahn

I'm not sure you can distinguish those. It is IBM, and IBM has a preference for who its customers are. So do enterprises, who like the sound of "no one ever got fired for..." And it's also because the market is pretty small (at least in terms of sites) - there's just not that much total accessible market for any competitor.

rtkwe

Where's their monopoly? I see tons of different drive and tape manufacturers. Are they all paying some royalty to IBM for the tech?

markhahn

there are a couple tape makers, regardless of how many companies rebadge the product. afaik, there are only 2-3 drive makers too.

but don't forget that tape doesn't make much sense (in its market) without the robotic library. there might be some off-brands that sell small libraries, but the big ones are, afaik, dominated by IBM.

codemac

Yes, they all pay royalties, and all the drives pay licensing for read heads.

nntwozz

I basically use the 3-2-1 backup strategy.

The 3-2-1 data protection strategy recommends having three copies of your data, stored on two different types of media, with one copy kept off-site.

I keep critical data mirrored on SSDs because I don't trust spinning rust, then I have multiple Blu-ray copies of the most static data (pics/video). Everything is spread across multiple locations at family members.

The reason for Blu-ray is to protect against geomagnetic storms like the Carrington Event in 1859.

[Addendum]

On 23 July 2012, a "Carrington-class" solar superstorm (solar flare, CME, solar electromagnetic pulse) was observed, but its trajectory narrowly missed Earth.

kemotep

3-2-1 has been updated to 3-2-1-1-0 by Veeam’s marketing at least.

At least 3 copies, in 2 different mediums, at least 1 off-site, at least 1 immutable, and 0 detected errors in the data written to the backup and during testing (you are testing your backups regularly?).

nntwozz

All the data is spread across more than 3 sites, both SSDs and Blu-ray (which is immutable). I don't test the SSDs because I trust Rclone, the Blu-ray is only tested after writing.

There is surely risk of Bit rot on the SSDs but it's out of sight and out of mind for my use case.

mindwork

I've been considering getting into Blu-ray backups for a while. Is there a good guide on how to organize your files in order to split them across multiple backup discs? And how to catalogue all your physical discs to keep track of them?

I remember about 20 years ago my friend had a huge catalogue of 100s of disks with media (anime), and he used some kind of application to keep track of where each file was located across 100s of discs in his collection. I assume that software must have improved for that sort of thing?

Dylan16807

> The reason for Blu-ray is to protect against geomagnetic storms like the Carrington Event in 1859.

The danger of such an event is the volts per kilometer it induces in long wires.

An unplugged hard drive will experience no voltage and a super tiny magnetic field. Nothing will happen to it.

ievans

Do you store your SSDs powered? They can lose information if they're not semi-frequently powered on.

wtallis

Powering on the SSD does nothing. There is no mechanism for passively recharging a NAND flash memory cell. You need to actually read the data, forcing it to go through the SSD's error correction pipeline so it has a chance to notice a correctable error before it degrades into an uncorrectable error. You cannot rely on the drive to be doing background data scrubbing on its own in any predictable pattern, because that's all in the black box of the SSD firmware—your drive might be doing data scrubbing, but you don't know how long you need to let it sit idle before it starts, or how long it takes to finish scrubbing, or even if it will eventually check all the data.
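
Forcing that read pass is easy to script: walk the mount point and read every file end to end, discarding the data. A hedged sketch (this exercises the drive's error-correction path for every allocated file block, though it can't reach blocks the filesystem hasn't allocated):

```python
import os

def read_everything(root: str, bufsize: int = 1 << 20) -> int:
    """Read every byte of every file under root and return the total.
    The point is the read itself: it forces the SSD to run its ECC
    over each block, surfacing correctable errors before they degrade
    into uncorrectable ones."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(bufsize):
                        total += len(chunk)
            except OSError as e:
                print(f"read error (possible bad block): {path}: {e}")
    return total
```

Run it from cron every few months on any SSD that sits mostly idle.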

0xR1CK

Adding to this... Spinrite can re-write the bits so their charge doesn't diminish over time. There's a relevant Security Now and GRC article for those curious.

svilen_dobrev

How about a background cron of diff -br copyX copyY , once per week, for each X and Y .. if they are hot/cold-accessible

Although, in my case, the original is evolving, and renaming a folder and a few files makes that diff go awry, needing manual intervention. Or maybe I need content-based naming - $ ln -f x123 /all/sha256-of-x123 - then compare those /all

myself248

I've been reading a lot of eMMC datasheets and I see terms like "static data refresh" advertised quite a bit.

You're quite right that we have no visibility into this process, but that feels like something to bring up with the SFF Committee, who keeps the S.M.A.R.T. standard.

UltraSane

best is to have a filesystem that can do background bit rot scrubbing

lizknope

I've got files going back to 1991. They started on floppy and moved to various formats like hard drives, QIC-80 tape, PD optical media, CD-R, DVD-R, and now back to hard drives.

I don't depend on any media format working forever, tape included. New LTO tape drives are so expensive, and used drives only support smaller-capacity tapes, so I stick with hard drives.

3-2-1 backup strategy: 3 copies, 2 different media, 1 offsite.

Verify all the files by checksum twice a year.

You can over complicate it if you want but when you script things it just means a couple of commands once a week.
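
One hypothetical shape for such a script: build a SHA-256 manifest once, then re-verify it on each run. (The function names and manifest format here are made up for illustration; tools like cshatag, snapraid scrub, or a checksumming filesystem do this more robustly.)

```python
import hashlib
import os

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest

def verify(root: str, manifest: dict[str, str]) -> list[str]:
    """Return paths whose current hash no longer matches the manifest."""
    return [rel for rel, digest in manifest.items()
            if sha256_of(os.path.join(root, rel)) != digest]
```

Persist the manifest (e.g. as JSON) alongside each copy; a cron job runs verify and reports the returned list, and a mismatch gets repaired from one of the other copies.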

globular-toast

I have some going back to my first days with computers (~1997), but it's purely luck. I've certainly lost more files since then than I've kept.

Does that tear me up? Not one bit. And I guess that's the reason why people aren't clamouring for archival storage. We can deal with loss. It's a normal part of life.

It's nice when we do have old pictures etc. but maybe they're only nice because it's rare. If you could readily drop into archives and look at poorly lit pictures of people doing mundane things 50 years ago, how often would you do it?

I'm reminded of something one of my school teachers recognised 20+ years ago: you'd watch your favourite film every time it was on TV, but once you get it on DVD you never watch it again.

I think in general we find it very difficult to value things without scarcity. But maybe we just have to think about things differently. Food is already not valuable because it's scarce. Instead I consider each meal valuable because I enjoy it but can only afford to eat two meals a day if I want to remain in shape. I struggle to think of an analogy for post-scarcity data, though.

gtdawg

What is your process for automating this checksum twice a year? Does it give you a text file dump with the absolute paths of all files that fail checksum for inspection? How often does this failure happen for you?

lizknope

I run snapraid once a night and it has a scrub feature to read every file and compare against the stored checksum.

https://www.snapraid.it/manual

All my drives are Linux ext4 and I just run this program on every file in a for loop. It calculates a checksum and stores it along with a timestamp as extended attribute metadata. Run it again and it compares the values and reports if something changed.

https://github.com/rfjakob/cshatag

These days I would suggest people start with zfs or btrfs that has checksums and scrubbing built in.

Over 400TB of data I get a single failed checksum about every 2 years. So I get a file name and that it failed but since I have 3 copies of every file I check the other 2 copies and overwrite the bad copy. This is after verifying that the hard drive SMART data shows no errors.

gaius_baltar

> What is your process for automating this checksum twice a year?

Backup programs usually do that as a standard feature. Borg, for example, can do a simple checksum verification (for protection against bitrot) or a full repository verification (for protection against malicious modification).

hn_throwaway_99

This article touches on a lot of different topics and is a bit hard for me to get a single coherent takeaway, but the things I'd point out:

1. The article ends with a quote from the Backblaze CTO, "And thus that the moral of the story was 'design for failure and buy the cheapest components you can'". That absolutely makes sense for large enterprises (especially enterprises whose entire business is around providing data storage) that have employees and systems that constantly monitor the health of their storage.

2. I think that absolutely does not make sense for individuals or small companies, who want to write their data somewhere and ensure that it will be there in many years when they might want it, without constant monitoring. Personally, I have a lot of video that I want to archive (multiple terabytes). The approach whose risk I'm most comfortable with is (a) for backup, storing on relatively cheap external 20TB Western Digital hard drives, and (b) for archival storage, writing to M-DISC Blu-rays, which claim to have lifetimes of 1000 years.

nadir_ishiguro

I personally don't believe in archival storage, at least for personal use.

Data has to be living if it is to be kept alive, so keeping the data within reach, moving it to new media over time and keeping redundant copies seems like the best way to me.

Once things are put away, I fear the chances of recovering that data steadily reduce over time.

lurk2

> Once things are put away, I fear the chances of recovering that data steadily reduce over time.

I’ve run into this a lot. You store a backup of some device without really thinking of it, then over time the backup gets migrated to another drive but the device it ran on is lost and can’t be replaced. I remember reading a post years ago where someone commented that you don’t need a better storage solution, you need fewer files in simpler formats. I never took his advice, but I think he might have been right.

andrewaylett

I won't archive anything on portable media.

Cloud, or variants thereof, is fine -- I use rsync.net for backup and archive. But needing to manually run a backup (say, onto a thumb drive) is not sustainable, and even though the author suggests that disks (spinning rust or optical) might actually have a reasonable lifespan, I don't trust myself to be able to recover data from them if I want it.

As the author says, the limiting factor isn't technical. For media, it's economic. For any archival system it's also going to be social. There's a reason that organisations that really need to keep their archives have professional archivists, and it's not because it's easy :).

sigio

Only 'online' data is live/surviving data... So I keep a raid5 array of (currently 4) disks running for my storage needs. This array has been migrated over the years from 4x1 TB, to 2TB, to 4TB, 8TB and now 4x 16TB disks. The raid array is tested monthly (automated). I do make (occasional, manual) offline backups to external HDD's ( a stack of 4/5 TB seagate 2.5" externals), but this is mostly to protect myself from accidental deletions, and not against bitrot/failing drives.

Tapes are way too slow/expensive for this (low) scale, optical discs are way too limited in capacity, topping out at 25/50GB, and then way too expensive to scale.

Dylan16807

You don't need constant monitoring if you have extra disks. If your budget is at least a thousand dollars, you can set up 4 data disks and 4 parity disks and you'll be able to survive a ton of failure. That's easily inside small company range.

wmf

My takeaway is that for personal/SMB use you have to use the cloud.

layer8

You don’t have to (though it can make sense, but I’d encrypt everything and have an additional backup). Another fairly straightforward solution (in that there is ready-made hard- and software for it and doesn’t need much maintenance) is to use a NAS with RAID 5/6, and to have a second NAS at another location (can be a friend or relative — and it can be their NAS) that you auto-backup the first one to over the internet.

null

[deleted]

globular-toast

This article is specifically about digital archival. That is, keeping bit-perfect copies of data for 100+ years. But I think for regular people this is not so obviously useful. People want to keep things like texts (books), photographs, videos etc. Analogue formats are a much better option for these things, for a couple of reasons:

* They gracefully degrade. You don't just start getting weird corruption or completely lose whole files when a bit gets flipped. They might just fade or get dog-eared, but won't become completely unusable.

* It's a more expensive outlay and uses scarce physical space, so you'll think more carefully about what to archive and therefore have a higher quality archive that you (and subsequent generations) are more likely to access.

The downside I guess is backups are far more difficult, but not impossible, and they will be slightly worse quality than the master copy. But if you lose a master copy of something, would it really be the end of the world? Sometimes we lose things. That's life.

squeedles

Simple wins. Always.

I've backed up on just about everything going back to QIC-150s, but today I just use a set of 4Tb drives that I rsync A/B copies to and rotate offsite. That gives me several generations as well as physical redundancy.
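The A/B rotation above can be sketched as a small helper that picks the destination by week parity (a toy illustration, not squeedles' actual script; the mount points are invented):

```python
import datetime
import subprocess

# Hypothetical mount points for the two rotating backup sets.
DESTS = {0: "/mnt/backup_a", 1: "/mnt/backup_b"}

def backup_destination(week=None):
    """Even ISO weeks go to set A, odd weeks to set B."""
    if week is None:
        week = datetime.date.today().isocalendar()[1]
    return DESTS[week % 2]

def run_backup(src, week=None):
    """Mirror src onto this week's destination with rsync in archive mode."""
    subprocess.run(
        ["rsync", "-a", "--delete", src, backup_destination(week)],
        check=True,
    )
```

With the off-rotation drive stored offsite, any single failure (drive death, ransomware, a bad sync) leaves you at most one rotation period behind.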

In the iteration before that, I made multiple sets of Blu-rays, which became unwieldy due to volume but gave me write-once media with multiple physical generations. I miss that, but at one point I needed to restore some files, and even though I used good Verbatim media, a backup from a couple of months prior was unreadable. All copies had a mottled appearance, and the drive that wrote them (and verified them) was unable to read them. I did finally find a drive that would read the disc, but that pushed me over the edge.

I wonder how the author's 18yo media will compare to modern 5yo media. It's been a long time since we've had the rock-solid Taiyo Yuden gold discs ...

wuschel

This made me smile. I have a very similar configuration. Simple but effective. The only thing that worries me is that bitrot might get me. Then again, my body will bitrot, too. So no point worrying too much about some random data in some turbulence in time.

mburns

Why not use mdisc and effectively solve the “has my cd/dvd degraded beyond the point of being readable” question entirely.

rtkwe

You may not even be able to get real MDiscs any more [0] and I'm always extremely dubious of 1000 year lifespans since they're effectively impossible to test.

[0] https://www.reddit.com/r/DataHoarder/comments/yu4j1u/psa_ver...

ndiddy

From the comments of that post:

> Hopefully this can put closure to the speculation. Our organization is a databank and is a big user of mdisc for archiving. We reached out to Verbatim last week about this Media Identification Code (MID) discrepancy. Here is their reply, in their own words ---- "The creator of the MDisc technology- Millenniata went out of business in 2017, they sold the technology to Mitsubishi, who until 2019 owned Verbatim. Due to this, the stamper ID changed, but the formula & the disc materials stayed the same. Mitsubishi sold Verbatim & all the technologies to CMC in December of 2020. Verbatim is the only company authorized to sell the original technology. Any Millenniata discs available were all produced before 2017 when the company shut down and any other brand is not the original technology." ----- So there it is, mdiscs with either the 'VERBAT' or 'MILLEN' prefix are fine. Just different production periods. Cheers.

rtkwe

Ah good catch. One big downside of reddit replacing forums is the default sorting does not make it easy to find late contributions like this.

wmf

M-Disc has such low capacity that you'd probably want a robot to burn it, which is not cheap.

aeroevan

There are 100 GB BDXL flavors of M-Disc, but yeah definitely not enough for really large amounts of data but large enough to store a good chunk of my photos which is mostly what I'd want to keep around.

Videos would fill that up pretty quick though.

jiveturkey

for storage of large volumes of data, mdisc is impractical. for storage of a few very important folders, sure.

note that only the cd-r flavor of mdisc is rated for long lifetime.