Data centers contain 90% crap data
April 6, 2025 · danpalmer
ivraatiems
This is exactly the problem. Consider an order processing system that processes a million orders a day, and retains them for ninety days. What percentage of those 89-day-old orders are actually needed at 89 days old? It could be quite low, maybe a couple thousand out of a million.
But if those orders aren't there, shit hits the fan. PCI compliance audits fail. The ability for customers to reconcile their charges with their purchases breaks. In that 0.01% of cases where the order was fraudulent, placed by mistake, or just didn't have what the customer thought it had in it, not having that data makes the order processor read as, if not malicious, at least incompetent.
The real question is, how much data do we need to store inefficiently, in a way that uses a lot of power and space?
makeitdouble
> The real question is, how much data do we need to store inefficiently, in a way that uses a lot of power and space?
This is indeed the critical question, and it's far from being trivial.
One issue we all hit is moving data from the higher-tier storage to the cheaper, more efficient one. That usually requires syncing and paying for the transfer, but also maintaining two separate access and authorization processes, plus backup and recovery systems, for data that absolutely needs to be accessible for the few years of legal retention and can/must completely disappear afterwards.
In most orgs I've seen, the cost of going through all that complexity just isn't worth it compared to "just" paying for the higher-tier storage for the few-years-long lifetime of the data.
newAccount2025
Seems like we should be solving this at the storage system layer, instead of per application?
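Object stores already expose knobs for this. For instance, an S3 lifecycle rule can tier objects to cold storage and then expire them, with no application logic at all; a sketch with boto3 (the bucket name and prefix are placeholders):

    # Storage-layer retention sketch: tier order records to Glacier at
    # 30 days, delete them at 90, entirely inside the storage system.
    import boto3  # pip install boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-orders-bucket",  # placeholder name
        LifecycleConfiguration={
            "Rules": [{
                "ID": "orders-90-day-retention",
                "Filter": {"Prefix": "orders/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 90},
            }]
        },
    )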
numpad0
> probably a similar percentage of people claim on their car insurance.
> In that 0.01% of cases where the order was fraudulent, placed by mistake,
> how much data do we need to store inefficiently, in a way that uses a lot of power and space?
I feel the real question, as sci-fi as it gets, is: is the winning-ticket data even data, or is it more like a thumbnail of "the whole data" that is 98%+ worthless, rather than a standalone piece of data? The winning ticket ID, e.g. "0x-636e-7461-4e7b", only makes sense in context, as one among the entire cohort of contestants; I can make one up like I just did, but I can't walk out with the payout if the rest of the lottery doesn't exist.
Statistically, philosophically, technically, and all sorts of *-cally speaking, is the 2% data, the winning ticket datum, even data?
AStonesThrow
I was just pondering this today, in terms of how much data and how many objects I create in a Google or Microsoft account, for example, which then create a burden of cost and maintenance for me down the road. Especially deleting old emails and photos: that's arduous, and sometimes poignant, as I flush away my personal life story.
Cloud services make difficult and sometimes even byzantine processes for deleting stuff, and it's often impossible to operate en masse in order to clean up swaths of stuff quickly and efficiently. It's in their interest to retain everything at all costs, because more used storage can mean more profits. Cloud services also profit from unused storage, because if they're charging $20/year to 100,000 users who use 2% of their storage space, ka-ching!
It irks me to this day that standard or even advanced filesystems don't include "expiration dates" or "purge dates" on file objects. If an organization has a data-retention policy that mandates destruction after X date, wouldn't it be logical for the filesystem to simply purge the file automatically? Why does this always get delegated to handmade userland cron jobs? Moreover, to my knowledge, nobody is really interested in devising a way to comb through backup media in order to [selectively] destroy data that shouldn't exist anymore. Not even the read-write media!
Google is now auto-deleting stuff like OTP SMS messages. I'd love it if auto-delete could be configurable account-wide, for more than just our web histories and Maps Timeline and stuff. Unfortunately, to "delete" cloud data means it still exists on backups anyway. But without deleting data in your cloud account, it becomes a juicier hacker target as it ages and accumulates personal stuff that shouldn't fall into the wrong hands. Likewise for any business, it behooves them to delete and destroy data that shouldn't be stolen. At least move it offline so that only physical access can restore it?
I will say that modern encryption techniques can make it easy to "destroy" data, simply by destroying the encryption keys. You can quickly render entire SSDs unreadable in the firmware itself with such a command. Bonus: sometimes it's even done on purpose!
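A minimal sketch of that crypto-shredding idea in Python, assuming the third-party cryptography package; the point is that securely deleting a 32-byte key "destroys" gigabytes of ciphertext:

    # Crypto-shredding sketch: data is only ever written encrypted, so
    # destroying the (tiny) key renders the (huge) ciphertext unreadable.
    import os
    from cryptography.fernet import Fernet  # pip install cryptography

    def store_encrypted(plaintext: bytes, data_path: str, key_path: str) -> None:
        key = Fernet.generate_key()
        with open(key_path, "wb") as f:
            f.write(key)
        with open(data_path, "wb") as f:
            f.write(Fernet(key).encrypt(plaintext))

    def crypto_shred(key_path: str) -> None:
        # Overwrite the key in place, then unlink it. Without the key,
        # the data file is indistinguishable from random bytes.
        with open(key_path, "r+b") as f:
            f.write(os.urandom(os.path.getsize(key_path)))
        os.remove(key_path)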
But even deleting data presents a maintenance cost. So if 90% of an org's data is indeed crap, then 90% or more of your processing resources are going to be wasted on sifting through it at some later date. Imagine when your file formats and storage devices are obsolete, and some grunt needs to retrieve some record that's 30 years old, and 90% of your data was always crap. That grunt is hopefully paid by the hour. We really had this happen at a few of my jobs, where we had old reel-to-reel backup tapes and it was difficult enough to load the data into a modern SunOS machine.
jl6
Unfortunately a lot of data retention policies aren’t so mechanical, but are of the form “delete after 7 years unless there is a legal hold in force”, which is usually just rare enough of an edge case that orgs evaluate it manually and hence only do periodic manual purges. But probably the main reason auto-delete isn’t popular is because a process that can delete your old data is one bug/misconfig away from deleting your new data too.
faust201
> Especially deleting old emails and photos.
Google or Apple could put a big "delete everything" button in their phones/accounts, but then some prankster will press it on a family member's device, and that gives bad PR. Let's be pragmatic.
> It's in their interest to retain everything at all costs, because more used storage can mean more profits.
As a user, I experience the reverse. When I had a local NAS, I dumped anything and everything, assuming it cost next to nothing and I'd clean up later. Once I moved to the cloud, that changed to: if I store crap, it costs me money! Keep it clean.
Once upon a time, Google gave generous unlimited storage to all education and Workspace accounts. They stopped that in the last two years, and you can see most educational institutions and companies now running a tight ship.
Due to backup costs in our organisation, we force people to use a maximum of 100GB for email.
> advanced filesystems don't include "expiration dates" or "purge dates" on file objects. Wouldn't it be logical, if an organization
Totally agree.
Outlook's email service has some kind of "keep only the latest newsletter from this sender" feature.
> auto-delete could be configurable account-wide, for more than just our web histories
- Features are designed with the majority in mind; most users have some sort of nostalgia for reading or keeping old emails, SMS, etc.
zanecodes
I would love a feature like retention policies at the filesystem level. If it could somehow take data provenance into account, that would be even better, i.e. data that I've created directly is irreplaceable and should never be deleted unless I say so and should always be backed up (photographs, writing, project source code, art, game save files, my list of installed applications, etc.); data that I've created indirectly is high priority but may be deleted under certain circumstances or may be omitted from backups (browsing history, shell history, recently opened file lists, frequently used app lists); data that can be easily replaced is low priority and may be cleaned up at any time and need not be backed up at all, contingent on how replaceable it is (application files, cached data).
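Nothing like this exists in mainstream filesystems, but as a thought experiment the policy itself is easy to write down; a hypothetical sketch (the tier names, rules, and the path heuristics standing in for real provenance tracking are all made up):

    # Hypothetical provenance-based retention policy, evaluated by a sweep job.
    from dataclasses import dataclass

    @dataclass
    class Tier:
        backed_up: bool   # include in backups?
        deletable: bool   # may a cleanup job remove it?

    POLICY = {
        "direct":      Tier(backed_up=True,  deletable=False),  # photos, source, saves
        "indirect":    Tier(backed_up=False, deletable=True),   # histories, MRU lists
        "replaceable": Tier(backed_up=False, deletable=True),   # caches, app files
    }

    def classify(path: str) -> str:
        # Real provenance would have to come from the OS; these
        # heuristics are only a stand-in for illustration.
        if "/cache/" in path or path.endswith(".tmp"):
            return "replaceable"
        if path.endswith(("_history", ".recently-used")):
            return "indirect"
        return "direct"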
nn3
I suspect that for most cloud providers, you deleting data is cheaper for them, because the data is not charged by the byte. But then they like having data, maybe just to train their AI models or for bragging rights to their investors.
As for expiration dates, most modern file systems have the concept of arbitrary extended attributes per file. It's quite easy to add metadata like this yourself.
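For example, on Linux you can stash an expiry timestamp in a user xattr and let a cron job sweep it; a minimal sketch (user.expires is a made-up attribute name, and nothing enforces it except the sweep):

    # Expiry dates via extended attributes, plus the cron-style sweep.
    import os, time

    def set_expiry(path: str, epoch_seconds: float) -> None:
        os.setxattr(path, "user.expires", str(epoch_seconds).encode())

    def sweep(root: str) -> None:
        now = time.time()
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    expires = float(os.getxattr(path, "user.expires"))
                except OSError:
                    continue  # no expiry tag on this file
                if expires < now:
                    os.remove(path)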
InsideOutSanta
The article touches on this issue by pointing out that:
>The Cloud is what happens when the cost of storing data is less than the cost of figuring out what to do with the crap
But I think that's wrong. The actual issue is that you often can't figure out "what to do with the crap" because the difference between useful data and crap data is determined at the point in time when you need it, not when you store it.
I'm relatively careful with deleting data, but even so, there were countless instances where I thought something was no longer needed and deleted it, only to figure out that, a month later, I needed it after all.
Also, the article has a few anecdotes like this:
>Scottish Enterprise had 753 pages on its website, with 47 pages getting 80% of visits
But that is completely orthogonal to the question of what is "crap." The fact that some data is rarely needed does not mean it's crap. If anything, the fact that it is rarely needed might make it more valuable because it might make it less likely that the same data can be found elsewhere.
ghaff
I do cull photos and docs but it takes effort and you will make mistakes. On the other hand, intelligent culling makes it easier to find things.
It’s a trade off especially with digital where there’s not that much cost associated with holding onto things relative to the physical world. My whole house is basically in storage because of a kitchen fire and I’m planning to be pretty selective about what I move back in.
InsideOutSanta
>intelligent culling makes it easier to find things
That's true, but things get easier to find over time. I have thousands of digital photographs from the 1990s that I burned on CDs and then copied to a NAS later. Today, for the first time, they're actually searchable because they're now indexed in Immich. So they're sorted by the person in them and searchable by their contents.
If I had culled the photos back then, I would have lost many photos that have become searchable (and thus valuable) only recently.
mrexroad
I have ~2.5TB of photos in iCloud, via Apple Photos. Excluding the various sized previews, I doubt many originals have been accessed in quite a while. I also have about 1TB Lightroom library archived to a different service; representing countless hours of photo processing work spanning over a decade. Haven’t touched that one in years. Neither are crap. (Yes, both have other backups and, yes, I’ve probably forgotten to sync one of them).
eloisius
Sometimes I wish there were a feasible system to audit and reduce redundant photos in my iCloud. I have an embarrassing ton of pictures that I never look at, just like everyone else. I wouldn't mind a tool that said "here are the 10^x photos that are the most similar to all your other photos; would you like to put them in the trash?"
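The usual building block for this is a perceptual hash; a minimal sketch of an 8x8 average hash, assuming Pillow is installed (a real tool would cluster the whole library rather than compare pairs):

    # Near-duplicate detection sketch via average hash (pip install Pillow).
    from PIL import Image

    def ahash(path: str) -> int:
        # Shrink to 8x8 grayscale; each bit records "brighter than average?".
        img = Image.open(path).convert("L").resize((8, 8))
        pixels = list(img.getdata())
        avg = sum(pixels) / 64
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p > avg)
        return bits

    def near_duplicates(a: str, b: str, max_distance: int = 5) -> bool:
        # Small Hamming distance between hashes = visually similar photos.
        return bin(ahash(a) ^ ahash(b)).count("1") <= max_distance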
mrexroad
If you’re using Apple Photos, the feature is there. It will detect both exact copies as well as (nearly) identical visual duplicates. Look under utilities.
makeitdouble
Thing is, the tool needs you to pick the one instance that survives the purge. Which means looking through 10^x of mostly similar photos to pick one.
Removing duplicates isn't that complex already, and these tools exist so anyone can try it. It's just a truly grueling process.
econ
I download them 200 at a time then delete the selection.
prawn
I have about the same in iCloud and then dozens of drives of video footage. Beyond the "keep it in case you need it" aspect, another angle is that it's often cheaper to just keep everything than spend all the time it would take to do a thorough cull. Video is much slower to review, but also chews up far more space, so maybe that balances out.
snapplebobapple
Crap is in the eye of the beholder...
mlinhares
If I knew which data I was going to need in the future, I'd be in the future-predicting business.
But there's an important piece there about data that should not have been stored in the first place. All the big data bullshit made people believe that every single piece of data you could possibly collect from a user is data you should store, and this is both a huge waste of resources and a huge liability, because now a data leak of PII that was useless to you anyway could completely bankrupt your business.
1vuio0pswjnm7
"Photos of kids? Maybe that one will be The One that we end up framing? Miscellaneous business records? Maybe those will be the ones we have to dig out for a tax audit? Web pages on government sites? Maybe there will suddenly be an interest in obscure pages on public health policy if a global pandemic happens."
Perhaps I am missing something, but these examples all sound like candidates for _offline_ storage, for which no third party custodian or data center is required.
The price of large capacity NVMe SSDs continues to fall.
The energy and resource requirements, not to mention the environmental and community impact, of running a car insurance company are minuscule in comparison to those of running a data center.
wongarsu
Offline storage is only easy on the very small and the very large scale. On the very small scale you can get archival-grade CD/DVD/Blurays. On the very large scale you dedicate climate-controlled rooms to an automatic tape library.
In between is a vast gulf where your only good option is disks that you have to occasionally spin up to check integrity (and, if it's an SSD, to let the drive refresh its data), replacing any failed disks and rebuilding their data from redundancy (RAID or whatever you prefer). HDDs die even when powered off [1], and the flash storage in SSDs is only rated to hold data for about three years without power.
Sure, you can roll this yourself, but it can easily go wrong. Easier to either keep the disks spinning or pay someone with a tape archive (e.g. AWS Glacier Deep Archive).
1 https://www.tomshardware.com/pc-components/storage/twenty-pe...
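Rolling it yourself mostly means a periodic scrub; a bare-bones sketch of the idea, comparing files against a SHA-256 manifest each time the disks are spun up:

    # Scrub sketch: detect silent corruption by checking a stored manifest.
    import hashlib, json, os

    def checksum(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub(root: str, manifest_path: str) -> list[str]:
        with open(manifest_path) as f:
            manifest = json.load(f)  # {relative_path: expected_sha256}
        bad = []
        for rel, expected in manifest.items():
            if checksum(os.path.join(root, rel)) != expected:
                bad.append(rel)  # rebuild these from your redundancy
        return bad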
jocaal
On your car note: insurance is only worth it if the thing being insured could ruin you if something happened to it. With cars, you can potentially be ruined by the value of the other party's car. But if you live somewhere where most people drive normal cars, it might be worth going without insurance. Our culture of insuring everything, from your iPhone to your house, is a market failure. If your house were to burn down, the value of the asset is still mostly concentrated in the land.
Pooge
Are you going to be able to find the data you're looking for if judges demand it?
Data that ends up in that sea of crap is very often poorly labeled.
Data that you cannot find again is useless.
If you take more than 3 minutes to find a picture you wanted to show me, it doesn't deserve to be shown anymore.
userbinator
"Better to have and not need, than need and not have."
Storage is cheap, very cheap.
manyturtles
About a decade and a half ago I worked on a large data migration project at a FAANG. Multi-exabyte scale, many clusters across many countries. Once everyone was moved the old storage platform wasn't completely empty, because the number of migrations was large and users were (naturally) more focused on ensuring their data was in place and available on the target platform rather than ensuring every last thing was deleted on the legacy platform. We weren't initially concerned about it because it would all get deleted when we turned down the old setup.
As we were gearing up to declare victory and start turning down the several dozen legacy storage clusters someone mused that given some users were subject to litigation holds -- not allowed to delete any data -- that at least some of the leftover data on the old system might be subject to litigation hold, and we'd need to figure that out before we could delete it or incur legal risk. IIRC the leftover 'junk' data amounted to a few dozen petabytes spread across multiple clusters around the world, in different jurisdictions. We spent several months talking with the lawyers figuring that out. It was an interesting dance, because on the one hand we were quite confident that there was unlikely to be anything in the leftovers which was both meaningful and not migrated to the new platform, while on the other hand explaining that it wasn't practical to just "go and look" through a few dozen PB of data. I recall we ended up somewhere in between, coming up with ways to distinguish categories of data like caches and working data from various pipelines. It added over six months to the project, but was quite an interesting problem to work through that hadn't occurred to any of us earlier on, as we were thinking entirely in technical terms about infrastructure migration.
isaacremuant
That does sound very interesting. Any insights on what you would do differently if you had to do it again? Any way to accelerate things now that you know the pain, or do you think it's quite unavoidable "legal time"?
manyturtles
Doing it in parallel, which of course is only an option if you know about it in advance. And templating that into a general approach and having the legal folks sign off on it.
mrb
Some fun math: according to some estimates there are 175 zettabytes of data worldwide. Assuming 20-terabyte hard drives, this could be stored on 8.8 billion drives. Assuming 10 drives per rack unit, 42 RU per rack cabinet, and 16 square feet per cabinet (including aisle space), you need about 330 million square feet of data center space to host this data. If it was hosted in a single square data center, it would be 3.5 miles long and wide. (I always like to picture the physical space something would occupy.) And energy-wise, assuming 5 watts per drive, it would consume 44 gigawatts, so it could be powered by about two large hydro dams similar to the Three Gorges Dam (22 gigawatts capacity). I am assuming a PUE close to 1.0 for simplicity. Of course one would not be able to spin up all these drives at once, since a drive spinning up consumes about three times more power (15 watts). So you would definitely want staggered spin-up when booting the servers :-)
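For anyone who wants to poke at the assumptions, here is the same arithmetic as a script:

    # The back-of-envelope numbers above, reproduced.
    total_bytes = 175e21            # ~175 ZB of data worldwide (estimate)
    drives = total_bytes / 20e12    # 20 TB per drive -> ~8.8e9 drives
    cabinets = drives / (10 * 42)   # 10 drives/RU, 42 RU per cabinet
    sqft = cabinets * 16            # 16 sq ft per cabinet incl. aisle space
    side_miles = sqft**0.5 / 5280   # side of one square data center
    gigawatts = drives * 5 / 1e9    # 5 W per drive, PUE ~1.0
    print(f"{drives:.2e} drives, {sqft:.2e} sq ft, "
          f"{side_miles:.1f} mi on a side, {gigawatts:.0f} GW")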
If 90% of this data is "crap" and could be cut down, it would still be just a drop in the bucket compared to worldwide energy use.
vb-8448
Furthermore: a large amount of this data is stored on physical tape, not hard drives. An LTO-9 tape can store up to 45TB of compressed data (18TB native) and doesn't require any electricity when it isn't mounted for reading.
andyp-kw
I wonder how it compares to electric car usage, or street lights in the USA.
bnewbold
I agree with the general sentiment here, but don't like the examples. 200 photos per person per year isn't very much! That is all fine.
What really bloats things out is surveillance (video and online behavioral) and logging/tracking/tracing data. Some of this ends up cold, but a lot of it is also warm, for analytics. It bloats CPU/RAM/network, which is pretty resource intensive.
The cost is justified because the margins of big tech companies are so wildly large. I'd argue those profits are mostly because of network effects and rentier behavior, not the actual value in the data being stored. If there were more competitive pressure, these systems could be orders of magnitude more efficient without any significant difference in value/quality/outcome, or really even productivity.
boznz
Don't forget emails. I have everything I ever sent or received, and I have it backed up. I expect 90% of my inbox is the JPG signature logo my clients attach to the bottom of their emails rather than hyperlinking it.
duxup
I ended up working on some software and was deemed the email guy (it's a very small % of my job, but it's the biggest pain).
"I need an email when this happens.. and when this happens."
The requests are endless, and I'm convinced there are people who, if they could, would do their entire job from their inbox and get everything and anything an application can do via email.
The insidious problem is that it never solves anything. "I didn't get the email!" is a constant refrain (yes they did, they always did). "Oh someone didn't do the thing so can you add this to the email too." and so on.
It is such an abused tool.
mrweasel
We had a similar request from a sales team once: "If this fails, we want an email." Okay, but we've never seen this fail, so you'll never get the email. As expected, they never got an email, because there were no failures. So instead they wanted an email every time the job ran (once every night), so they'd know the job had failed if they DIDN'T get the email. The only time that happened was because the email got trapped in a spam filter, I think.
There must be thousands of copies of that email sitting in inboxes saying: "Job X ran successfully @ 04:30."
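The saner inversion of "email me on every success" is a dead man's switch: the job touches a heartbeat file, and a separate check alerts only when the heartbeat goes stale. A minimal sketch (the path and threshold are illustrative):

    # Dead man's switch sketch: silence is only alarming when it lasts too long.
    import os, time

    HEARTBEAT = "/var/run/nightly_job.heartbeat"  # illustrative path
    MAX_AGE = 26 * 3600  # nightly job, so >26h of silence means trouble

    def record_success() -> None:
        # Called at the end of every successful job run.
        with open(HEARTBEAT, "w") as f:
            f.write(str(time.time()))

    def check() -> None:
        # Run from cron or monitoring; replace print with a real pager.
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT)
        except FileNotFoundError:
            age = float("inf")
        if age > MAX_AGE:
            print(f"ALERT: nightly job silent for {age / 3600:.0f}h")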
Dylan16807
> The requests are endless and I'm convinced there are people who if they could would do their entire job from their inbox and get everything and anything an application can do via email.
That sounds like a reasonable goal for a whole lot of job duties. And yes some entire office jobs. (Excluding some direct human communication but a lot of jobs already have too much of that in situations that could have been an email.)
> "I didn't get the email!" is a constant refrain (yes they did, they always did).
Well having to manually check wouldn't improve that, would it?
Cthulhu_
I can see how that would work (email centric workflow), not unlike how some people now try to have a chat / Slack centric workflow.
saalweachter
Yeah, everyone knows the proper thing to do is orchestrate it all through Emacs.
jeffbee
Well "the cloud" will generally store exactly one logical copy of your static jpg, one of the reasons why clouds are pretty good for efficiency.
There is really no sense at all in the article's claims that "we are destroying the environment" to do the x, y, and z things the author whines about. We are destroying the environment to drive a Dodge Ram to the Circle K to buy a 52oz Polar Pop. The information sector doesn't even show up on top-ten lists of things that are destroying the environment.
HPsquared
It's probably deduplicated on the server though, so the millions and millions of messages with that logo likely share the same piece of disk space. Probably one reason why free providers don't tend to offer End-to-End encryption. It prevents deduplication (and otherwise compressing redundant information).
palata
> Probably one reason why free providers don't tend to offer End-to-End encryption.
I would think that email providers tend to not offer E2EE just because it fundamentally isn't practical with email. Providers like Proton try to do it, but it works only if you talk to someone else on Proton (at which point you may as well do it on Signal, which has better encryption).
justsomehnguy
Nope.
Back in the day, Exchange offered SIS (single-instance storage), but in 2010 they ditched it. It's plainly not effective any more. Even regarding the OP's "jpg signature logo" - it's part of the multipart body of the message, not a separate file.
And one more thing: you can't just turn on dedup and be dandy. Now you need to check against the hashes to determine whether each chunk is unique or you already have it, and with TBs of data you need TBs of hashes to check. Unless you have something like 99% dedup efficiency, i.e. 99% of your incoming data is literally data you already have, it isn't worth it.
https://techcommunity.microsoft.com/blog/exchange/dude-where...
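The mechanics in miniature - a toy content-addressed store where identical chunks land on the same hash and are kept once (real systems chunk adaptively and keep the index on disk, which is exactly the "TBs of hashes" cost above):

    # Toy dedup store: chunks are addressed by their SHA-256 digest.
    import hashlib

    class DedupStore:
        def __init__(self, chunk_size: int = 64 * 1024):
            self.chunk_size = chunk_size
            self.chunks: dict[str, bytes] = {}  # the hash index

        def put(self, data: bytes) -> list[str]:
            refs = []
            for i in range(0, len(data), self.chunk_size):
                chunk = data[i:i + self.chunk_size]
                digest = hashlib.sha256(chunk).hexdigest()
                self.chunks.setdefault(digest, chunk)  # store only if new
                refs.append(digest)
            return refs

        def get(self, refs: list[str]) -> bytes:
            return b"".join(self.chunks[r] for r in refs)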
Sprite_tm
Not sure about Microsoft solutions, but modern file systems like zfs and btrfs can do this on the filesystem level, no support from the mail server needed.
LegionMammal978
Also, I can imagine such schemes getting ditched for the same reasons of "now people can detect whether we already have a copy of the file, which is a privacy hole!" that killed cross-origin caching in browsers.
Or if everything in the year 20XX gets pushed into using E2E encryption, since that's pretty much antithetical to deduplication.
hobs
That's not even crap data - that's archived data that might be useful someday (though de-dupe is probably a great idea and email sigs are definitely wasteful trash) - most of the crap data is things that would never have been useful under any circumstances.
I have cleaned up dozens of product databases in cost management efforts and have found anywhere from 50-99% of data stored in product databases is crap, because they are not well managed and any single mistake can lead to a huge outsized impact on storage.
Want to log all those HTTP requests for just a day? Might as well turn that on for all time...
kdamica
To repurpose an old saying: "90% of my data is crap. I just don't know which 90%"
somat
Isn't this just a specific case of Sturgeon's law?
BrenBarn
Yeah it seems like Sturgeon's law combined with a bit of Pareto principle.
precommunicator
We recently discovered we store 500MiB of expired email tokens. Then a copy of them in a history table. Then this data is replicated. Then there are a total of 4 backups of it - backups that are done every 2 hours for half of each day. And we store those backups for up to half a year. I don't even want to add that up...
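Adding it up anyway, reading the retention figures literally and guessing where the comment is silent (the replication factor and full, non-incremental backups are both assumptions):

    # Back-of-envelope footprint of 500 MiB of expired tokens.
    base_gib = 500 / 1024           # the tokens themselves
    live_gib = base_gib * 2 * 2     # + history-table copy, x2 replication (assumed)
    snapshots = 4 * 6 * 182         # 4 backups x 6/day (every 2h, half the day) x ~half a year
    backup_gib = snapshots * base_gib * 2  # each backup captures table + history copy
    print(f"~{live_gib + backup_gib:.0f} GiB to keep 0.5 GiB of dead tokens")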
392
If you believe adding it up linearly is valid, and you care about the results, you should be using more efficient storage software.
tbrownaw
Storage being cheap enough that it's not worth policing doesn't seem very consistent with it being expensive enough to include much energy use (what I assume the "destroying the environment" hyperbole is referencing).
Jean-Papoulos
The other day I had to go through 15-year-old PowerPoint files to grab the originals of pictures made by a now-retired guy who was extremely proficient at creating detailed artwork from PowerPoint shapes. We can now render them to full-HD PNGs instead of the 256x256 BMP files we were using before.
Storing "useless" data makes financial sense.
ein0p
A gross underestimation, IMO. When I was in big data, fewer than 5% of data written was ever touched again, and only a single digit number of our large customers (out of tens of thousands) actually made real use of their "big data", and created most of the load. That's the trouble with "checkbox driven development" - 10 years ago you were required to have a "big data strategy" for anyone to take you seriously, even if your strategy boiled down to just ETL-ing a bunch of crap you're never going to need into the cloud and never touching it again. Now I'm in AI, and the same thing is happening to AI. It's great if you're selling shovels, so to speak, but not so great if you plan on selling them for an extended period of time.
This, by the way, has implications for storage system design. You want something that's cheap yet dense to encode, potentially at the slight expense of decode speed. Normally people lose sleep over decode speed first and foremost, which, while important, does not minimize the overall resource bill.
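You can feel that tradeoff with nothing but the standard library; a quick sketch comparing a cheap codec against a dense one on a stand-in log corpus (the corpus and the codec pair are arbitrary):

    # Measuring encode cost vs. density vs. decode speed, stdlib only.
    import lzma, time, zlib

    data = b"ts=1712345678 path=/api/orders status=200 bytes=5123 " * 200_000

    codecs = [
        ("zlib-1", lambda d: zlib.compress(d, 1), zlib.decompress),
        ("lzma",   lzma.compress,                 lzma.decompress),
    ]
    for name, compress, decompress in codecs:
        t0 = time.perf_counter()
        blob = compress(data)
        t1 = time.perf_counter()
        decompress(blob)
        t2 = time.perf_counter()
        print(f"{name}: {len(blob) / len(data):.1%} of original, "
              f"encode {t1 - t0:.2f}s, decode {t2 - t1:.2f}s")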
otterley
90% of libraries consist of books that are never opened. These books were all produced by destroying and processing trees, sometimes with toxic chemicals, and their information density is orders of magnitude lower than that of a hard disk or SSD. Same with photo processing, where 90% of photos taken are discarded, and the toxicity of the chemicals is even higher.
So the question isn't simply whether storage is wasted; it's how much waste there is relative to the environmental impact. Granted, books and photographs don't need to be continuously fed energy to make the information available. However, the cost of storage is now so cheap that even with 90% waste, it's economically viable to keep it online. So the problem, if you can call it one, is that energy is too cheap, and externalities are not accounted for in the cost.
plasticeagle
> 90% of libraries consist of books that are never opened
I'm reasonably certain that this statistic is completely made up. The best number I can find for the proportion of library books that are never borrowed was from a university library, and was 25%.
mvc
> 90% of libraries consist of books that are never opened.
Citation required. But don't bother because it's a meaningless statistic, or at least one designed to make it look like there's a lot more wastage in libraries than there actually is.
The statistic could be true, and it could still be the case that the vast majority of library books are well utilized.
otterley
I should have been less specific than “library” as such. Libraries aren’t only institutional. Consider every book, magazine, and newspaper that sits today in everyone’s home, office, and in every institution in the world. The vast majority of them have been read once and then left on a shelf or in a box somewhere, taking up more or less valuable space.
vel0city
> 90% of libraries consist of books that are never opened
> The vast majority of them have been read once and then left on a shelf
So opened at least once, with once being higher than never.
mrweasel
Now we're also getting into the topic of "what is waste". Because the majority of the books that are opened are rehashes of the same murder mystery over and over and over.
I'd guess that 75% of all new books sold here are variations on: "Someone is murdered in a brutal fashion. An old drunken cop from somewhere in Scandinavia is assigned the case. He's helped by a young woman, who may be his daughter or with whom he develops a father-daughter relationship. They solve the case, maybe. The end." You just tweak the details a little, but it's the same bloody story over and over.
That seems like such a waste of paper in my mind.
S_Bear
As a librarian, that's only around 35% of new books. The rest are about a hard-working professional woman in the big city that inherits her dead aunt's coffee shop in Cape Cod and has a prickly love-hate relationship with the local handyman. There may be a magic cat involved. And books about vampires having love triangles with a werewolf/zombie/black lagoon monster and a Mary Sue insert.
foobahify
Yes. Rather than compare to paper, compare to yesterday's HDDs. Do we need more HDDs today than then?
> One organization I knew of had 1,500 terabytes of data, with less than 2% ever having been accessed after it was first stored.
On a related note, probably a similar percentage of people claim on their car insurance. If only the rest realised they had "crap insurance" and were paying for nothing, they could save so much money!
This is obviously sarcasm, but I think it's important to remember that much of the data is stored because we don't know what we will need later. Photos of kids? Maybe that one will be The One that we end up framing? Miscellaneous business records? Maybe those will be the ones we have to dig out for a tax audit? Web pages on government sites? Maybe there will suddenly be an interest in obscure pages on public health policy if a global pandemic happens.
Complaining that data is mostly junk is not a particularly interesting conclusion without acknowledging this. Is there wastage? Yeah sure, but accuracy on what needs storing is directly traded off with time spent figuring that out, and often it's cheaper to store the data.