A 16TB Mirror of Data.gov on Source.Coop
19 comments
· February 7, 2025
haunter
r3trohack3r
If I had to guess, this is the problem:
> This is a regularly updated mirror
p2p software really struggles with mutability.
We need stable chunking (i.e., rolling-hash chunking around file boundaries) and seeders that seed addressable chunks (similar to Bitswap), so that seeders from the previous revision are preserved when a 16TB archive partially changes.
This is something my team is actively researching alongside other p2p problem statements.
If it interests you, you know Rust, and you're looking for a job, email me.
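For readers unfamiliar with the idea, here is a minimal Python sketch of content-defined chunking with a rolling hash. It is an illustration only, not the design this comment's team is working on: the gear-style hash, the `chunk` and `chunk_ids` names, and every size parameter are assumptions chosen for the example.

```python
import hashlib
import random


def _gear_table(seed: int = 0) -> list[int]:
    # Hypothetical fixed table of 256 pseudo-random 32-bit values, one per byte value.
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]


GEAR = _gear_table()


def chunk(data: bytes, bits: int = 10, min_size: int = 64,
          max_size: int = 8192) -> list[bytes]:
    """Split data at content-defined boundaries.

    The hash is shifted left by one bit per byte and truncated to 32 bits,
    so it depends only on the last 32 bytes seen. A boundary is declared
    when its top `bits` bits are all zero (about once per 2**bits bytes),
    subject to the minimum and maximum chunk sizes.
    """
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        at_boundary = size >= min_size and (h >> (32 - bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> list[str]:
    # Address each chunk by the SHA-256 of its content.
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because cut points depend on the bytes in the rolling window rather than on absolute offsets, inserting data near the start of a file typically changes only the first chunk or two. The remaining chunk IDs stay the same, so peers holding the previous revision could keep serving them.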
extraduder_ire
There's a BEP for updatable torrents, and a number of features of BitTorrent v2 make it easier.
Most archive torrents I see like this freeze the older contents so they don't change in later updated releases. Or they release new torrents with the new data and keep using the old ones.
RIMR
For real, I have a 20TB drive available, and I would seed this indefinitely.
monkeyfun
Yo, just since you mention that -- got any recs for a 20TB drive? Currently the biggest thing I rock is just a Seagate 8TB NAS HDD, which I nabbed pretty cheap, but these days the archiving I do kinda makes me wanna step up even bigger.
doctoboggan
I got this on Black Friday for $270:
https://www.newegg.com/wd-elements-20tb-black-usb-3-0/p/N82E...
It's $350 now, but that's still not bad for 20TB.
ChrisArchitect
[dupe]
Discussion: https://news.ycombinator.com/item?id=42970039
SubiculumCode
That is the point. "Redundant" duplication of data is needed.
ck2
The public-facing data was never the problem.
We have no idea what is being nuked off the backend, copied offline, and fed into AI.
Employee records, tax records, medical records, things like that.
Then there is the problem of private servers plugged into government networks that are obviously bypassing firewalls and are prime targets for foreign governments, because they aren't secured and are designed to be remotely accessed.
ethagnawl
> Then there is the problem of private servers plugged into government networks that are obviously bypassing firewalls and prime targets for foreign governments because they aren't secured and designed to be remote accessed.
This isn't getting ... any attention. (Though, I've contacted my congressional representative and I urge everyone else to do the same.) Any concerns have been dismissed as FUD and allayed by reassurances that _their access is read-only_, which is a wet fucking blanket if ever there was one. I know if I were a nation state with access to Pegasus or similar systems, I'd be actively targeting these DOGEngineers. But, sure, cry me a fucking river about DeepSeek.
bongodongobob
Do you have any examples of being able to get sensitive info from an LLM? I've never heard of it.
frontalier
One of the kids was asking on Twitter which LLM to use for PDF parsing.
m3kw9
How do they guarantee none of it is modified?
badlibrarian
BagIt and sha256 and dropping Harvard's name and "look, squirrel!"
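For context on what BagIt plus SHA-256 can and cannot show, here is a minimal Python sketch of checking a bag's payload against its `manifest-sha256.txt`, where each line holds a checksum followed by a path relative to the bag root. The `verify_bag` and `sha256_file` names are hypothetical, the sketch ignores BagIt's percent-encoding of unusual path characters, and it is not presented as the mirror's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, block_size: int = 1 << 20) -> str:
    # Hash the file in blocks so large payload files need not fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_bag(bag_dir: str) -> list[str]:
    """Return the manifest paths that are missing or whose checksum differs."""
    root = Path(bag_dir)
    failures = []
    manifest = root / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line: "<hex checksum> <path relative to bag root>"
        expected, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if not target.is_file() or sha256_file(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

A clean result only shows that the files match the manifest. Whether the manifest itself reflects the original source is a separate question that checksums alone do not answer.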
nsriv
Comment history certainly living up to your name.
I'm surprised there is no torrent available