
A 16TB Mirror of Data.gov on Source.Coop

haunter

I'm surprised there is no torrent available

r3trohack3r

If I had to guess this is the problem

> This is a regularly updated mirror

p2p software really struggles with mutability.

We need stable chunking (i.e., rolling-hash chunking around file boundaries) and seeders that serve addressable chunks (similar to Bitswap), so that seeders from the previous revision are preserved when a 16TB archive partially changes.
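A minimal sketch of the stable chunking idea, not the commenter's actual design: it uses a gear-style rolling hash to pick chunk boundaries from the content itself and a SHA-256 digest as each chunk's address. The names `chunk` and `addresses`, the gear table, and the size parameters are all assumptions chosen for illustration.

```python
import hashlib

# Gear table (illustrative): one fixed pseudo-random 64-bit value per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big") for b in range(256)]
MASK64 = (1 << 64) - 1


def chunk(data: bytes, avg_bits: int = 12, min_size: int = 256, max_size: int = 65536):
    """Split data at content-defined boundaries and return the list of chunks."""
    boundary_mask = (1 << avg_bits) - 1
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # The low avg_bits bits of h depend only on the last avg_bits bytes,
        # so a boundary decision is local to the content around position i.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & boundary_mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def addresses(data: bytes):
    """Content address (SHA-256 hex digest) of every chunk, in order."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because boundaries follow the content rather than fixed offsets, inserting a few bytes in the middle of a large file changes only the chunks near the edit. The remaining chunks keep their addresses, so a peer holding the previous revision can still serve them.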

This is something my team is actively researching alongside other p2p problem statements.

If it interests you, you know rust, and you’re looking for a job, email me.

extraduder_ire

There's a BEP for updatable torrents, and a number of features of BitTorrent v2 make it easier.

Most archive torrents I see like this freeze the older contents so they don't change in later releases, or release new torrents with the new data and keep using the old ones.

RIMR

For real, I have a 20TB drive available, and I would seed this indefinitely.

monkeyfun

Yo, just since you mention that -- got any recs for a 20TB drive? Currently the biggest thing I rock is just a Seagate 8TB NAS HDD, which I nabbed pretty cheap, but these days the archiving I do kinda makes me wanna step up even bigger.

doctoboggan

I got this on Black Friday for $270:

https://www.newegg.com/wd-elements-20tb-black-usb-3-0/p/N82E...

It's $350 now, but that's still not bad for 20TB.

ChrisArchitect

SubiculumCode

That is the point. "redundant" duplication of data is needed.

ck2

The public facing data was never the problem.

We have no idea what is being nuked off the backend, or copied offline and into AI.

Employee records, tax records, medical records, things like that.

Then there is the problem of private servers plugged into government networks that are obviously bypassing firewalls and are prime targets for foreign governments, because they aren't secured and are designed to be remotely accessed.

ethagnawl

> Then there is the problem of private servers plugged into government networks that are obviously bypassing firewalls and are prime targets for foreign governments, because they aren't secured and are designed to be remotely accessed.

This isn't getting ... any attention. (Though, I've contacted my congressional representative and I urge everyone else to do the same.) Any concerns have been dismissed as FUD and allayed by reassurances that _their access is read-only_, which is a wet fucking blanket if ever there was one. I know if I were a nation state with access to Pegasus or similar systems, I'd be actively targeting these DOGEngineers. But, sure, cry me a fucking river about DeepSeek.

bongodongobob

Do you have any examples of being able to get sensitive info from an LLM? I've never heard of it.

frontalier

One of the kids was asking on Twitter which LLM to use for PDF parsing.

m3kw9

How do they guarantee none of it is modified?

badlibrarian

BagIt and sha256 and dropping Harvard's name and "look, squirrel!"

https://en.wikipedia.org/wiki/BagIt
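For context on the BagIt and sha256 part: a bag carries a `manifest-sha256.txt` that lists one checksum per payload file, with paths relative to the bag root. Below is a rough sketch of checking a bag against that manifest. The function names `sha256_file` and `verify_bag` are made up for illustration, and the sketch assumes plain paths (it skips BagIt's percent-encoding of special characters and does not look for files missing from the manifest).

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, buf_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large payload files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while block := fh.read(buf_size):
            digest.update(block)
    return digest.hexdigest()


def verify_bag(bag_dir) -> list[str]:
    """Return the manifest paths that are missing or whose checksum differs."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file() or sha256_file(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty result means every listed file matches its recorded digest. This only detects changes relative to the manifest, so the manifest itself still has to be trusted or authenticated some other way.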

nsriv

Comment history certainly living up to your name.
