Announcing the data.gov archive
147 comments
· February 7, 2025
black_puppydog
lloeki
> What I'm missing from this announcement though is any mention of how they intend to secure this "vault" against the current government.
Maybe this?
> In addition to the data collection, we are releasing open source software and documentation for replicating our work and creating similar repositories. With these tools, we aim not only to preserve knowledge ourselves but also to empower others to save and access the data that matters to them.
https://github.com/harvard-lil/data-vault
And since the data lives here: https://source.coop/repositories/harvard-lil/gov-data/descri...
Combined with this:
> To download an individual dataset by name you can construct its URL, such as:
> https://source.coop/harvard-lil/gov-data/collections/data_go...
> https://source.coop/harvard-lil/gov-data/metadata/data_gov/f...
> To download large numbers of files, we recommend the aws or rclone command line tools:
> aws s3 cp s3://us-west-2.opendata.source.coop/harvard-lil/gov-data/collections/data_gov/<name>/v1.zip --no-sign-request
So one could "easily" mirror the whole thing, making it distributed.
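To make the "construct its URL" part concrete, here is a rough sketch that only builds the commands and does not run them. The bucket prefix and the `--no-sign-request` flag are copied from the quoted instructions. The dataset name `example-dataset`, the helper names, and the destination arguments are my own assumptions: the command as pasted has no destination, so the sketch defaults to the current directory. Using `aws s3 sync` on the whole prefix is also my guess at how one would mirror everything, not something the announcement spells out.

```python
import shlex

# Bucket and prefix copied from the `aws s3 cp` line quoted above.
BUCKET_PREFIX = (
    "s3://us-west-2.opendata.source.coop/"
    "harvard-lil/gov-data/collections/data_gov"
)


def dataset_s3_uri(name: str) -> str:
    """Return the S3 URI of one dataset's v1.zip, following the quoted pattern."""
    return f"{BUCKET_PREFIX}/{name}/v1.zip"


def cp_command(name: str, dest_dir: str = ".") -> list[str]:
    """Build (but do not run) an `aws s3 cp` command for a single dataset.

    dest_dir is an assumption: the quoted command omits a destination.
    """
    return ["aws", "s3", "cp", dataset_s3_uri(name), dest_dir, "--no-sign-request"]


def sync_command(dest_dir: str) -> list[str]:
    """Build (but do not run) an `aws s3 sync` command for the whole prefix."""
    return ["aws", "s3", "sync", BUCKET_PREFIX + "/", dest_dir, "--no-sign-request"]


if __name__ == "__main__":
    # "example-dataset" is a hypothetical name, not a real entry in the archive.
    print(shlex.join(cp_command("example-dataset")))
    print(shlex.join(sync_command("./gov-data-mirror")))
```

Printing the commands first lets you check them before committing to a 16 TB transfer. If they look right, paste them into a shell or pass the list to `subprocess.run`.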
beej71
And 16 TB isn't what it used to be (though it is a big transfer). I'm hopeful that it's already being mirrored in multiple places including overseas.
bjackman
I think a fully distributed storage system must be the way here. There must be some IPFS type system where Harvard could say "we designated a set of data that we can add to as needed but only delete from with a critical mass of storage providers' consent, here are some instructions for you to add your spare capacity to become a storage provider".
BorisMelnik
I was just thinking IPFS or possibly even some blockchain solution for this. Bram Cohen (the creator of BitTorrent) also has a new project, I think in the DePIN sector.
Sensitive information just can't be hosted on a centralized server anymore, it has to be distributed for the good of the project.
skeeter2020
This sounds an awful lot like Pied Piper!
tlb
It might not be ironclad, but the ease with which federal workers can fiddle with data they're hosting themselves vs. fiddling with data in Harvard's library is a pretty big difference. And if it ever came to demands for censorship, it wouldn't be Harvard Library's first rodeo.
Thorrez
>how they intend to secure this "vault" against the current government
Is there any risk of the government ordering them to take it down? That seems unlikely to me. The US has strong free speech protection, stronger than European free speech protection.
>keeping this data online against the express will of the government is gonna cost (political) capital.
Costing them political capital (aka the government is unhappy) is different from the government ordering them to take it down. Also, when you say "express will", are you saying the government has explicitly publicly stated that they don't like that Harvard is hosting this data?
dmd
> The US has strong free speech protection
The US literally just told (e.g., [0]) all scientists working for it they are not permitted to publish papers or speak at conferences or travel. What the US "has" is irrelevant when laws are being ignored.
[0] https://www.science.org/content/blog-post/revised-and-extend...
marcinzm
That's an employer telling employees what to do? Google can tell the same thing to its employees and does. I don't see the relevance to the government telling other entities what to do.
saghm
> Is there any risk of the government ordering them to take it down? That seems unlikely to me. The US has strong free speech protection, stronger than European free speech protection.
In other words, you don't think we have to worry about Congress bringing in university presidents to grill them over political activities against the government's current policies occurring on their campuses? Certainly we wouldn't see any of those who didn't give testimony that the government liked end up being forced to resign in the aftermath; we're not in the dark ages anymore like we were in 2023...
Thorrez
I hadn't heard of that case before. However, some government officials saying that someone should resign is a bit different from the government actually forcing someone to resign or be fired.
The comment I replied to said "What I'm missing from this announcement though is any mention of how they intend to secure this "vault" against the current government.", which I believe is talking about a case where the government orders Harvard to take the info down. If Harvard will delete the data because a few government officials say they should (but don't order them to do so), then I don't see what can be done to secure the vault against the government. E.g. hosting it in Europe won't be any help, because Harvard could just delete it from the European hosting.
CalRobert
Whether the US still has this protection of free speech remains to be seen
sofixa
> The US has strong free speech protection, stronger than European free speech protection
Citation needed. "European free speech protection" doesn't exist, each country has its own rules and freedoms. Hungary is drastically less free than e.g. France... but overall, as a rule of thumb, unless for you freedom of speech includes freedom to be a Nazi, European countries are pretty free. And don't take stances such as "spending money is free speech which is sacred so no campaign financing rules". You might get arrested for a Nazi salute though... which, if you think is a bad thing, you haven't been paying attention in history classes nor modern news.
Thorrez
The US values free speech higher than privacy. The EU values privacy higher than free speech.
>Opinions on the right to be forgotten differ greatly between the United States and EU countries. In the United States, accessibility, the right of free speech according to the First Amendment, and the "right to know" are typically favored over removing or increasing difficulty to access truthfully published information regarding individuals and corporations. Although the term "right to be forgotten" is a relatively new idea, the European Court of Justice legally solidified that the "right to be forgotten" is a human right when they ruled against Google in the Costeja case on May 13, 2014.[20]
>The European Union has been advocating for the delinkings requested by EU citizens to be implemented by Google not just in European versions of Google (as in google.co.uk, google.fr, etc.), but on google.com and other international subdomains. Regulators want delinkings to be implemented so that the law cannot be circumvented in any way.
https://en.wikipedia.org/wiki/Right_to_be_forgotten#European...
MollyRealized
> The US has strong free speech protection, stronger than European free speech protection.
For this to be said in the current circumstances shows either a political viewpoint or a lack of knowledge.
Thorrez
See this comment: https://news.ycombinator.com/item?id=42983898
Additionally:
* In some European countries newspapers cannot publish the names of people on trial. In the US they can.
* In some European countries you cannot take pictures of people in public and publish them. In the US you can.
fragmede
What's the name of the law in Germany that puts you in jail for Holocaust denial?
rswail
From the announcement: "This work is made possible with support from the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund."
headcanon
> how they intend to secure this "vault" against the current government
Definitely a concern, if they want to harass Harvard and the other universities they could, but I don't think they'll bother. They know the data will be backed up, that's not the point.
Taking it off of data.gov accomplishes two things:
1) Makes it look like they're doing something, playing to the base. Easy to do
2) Delegitimize any insights the data might have. "Sure you have 'data', is it official data? I don't see it on data.gov. How do we know it's not fraudulent?" It makes it harder to use it to justify policy changes. It adds one more tool to the denial crowd.
zombot
> ...how they intend to secure this "vault" against the current government.
Yup, I was about to ask whether Trump could still force them to delete what he doesn't like. Time will tell, I guess.
LadyCailin
In general, no. By withholding federal funds, and with owning congress and scotus, yes.
bilekas
I'll admit I'm not privy to the US educational system etc., but isn't Harvard a private university? Why would they be receiving significant federal funds in the first place?
tomrod
Public data. License violations might cause some consternation but anything on data.gov is public and intended for the public. Deleting these datasets potentially violates laws requiring their publication.
EnnEmmEss
If I remember correctly, Harvard has immunity to eminent domain under the Massachusetts constitution. Maybe it has a similar right which would make it immune to such attacks?
lou1306
I beg you, please stop applying rule-of-law mindset against might-makes-right adversaries. It creates blind spots giving the illusion that the attack surface is way smaller than it actually is.
Muskolites are taking on the SSN system without any Congressional oversight as we speak. The President is attacking ius soli which is a Constitutional right. If they decide that sending their sleuths to Cambridge MA to physically destroy this data is in their best interest, they will do so and handle the courts later. Just stop pretending they will play by the book.
squigz
How do you recommend engaging with the situation then? Throw out the book too?
globalnode
I'm predicting no more elections for you guys in 4 years. Something makes me think there's gonna be some "reason" to turn them off.
milesrout
[flagged]
black_puppydog
That may be so, but given what Trump and Musk have been up to, the situation of the courts, and how they blatantly don't give a f*k about what's constitutional or not, I wouldn't rely on this so-called "immunity".
0xEF
I really hope I am wrong, but I'm planning on seeing some headlines about Musk shutting down Harvard next week over this.
For anyone who still thinks the existing laws, constitutions and policies mean anything to this current regime, prepare to get some whiplash. They are proving that none of that matters if they simply ignore it and do what they want to do anyway.
cyberlimerence
Is anyone out there archiving USGS/NOAA datasets? It sounds ridiculous, but this appears to be where we are now. There is a submission about NOAA on the frontpage now: "Scientists on alert as NOAA restricts contact with foreign nationals" [1]
fs111
ArchiveTeam is working on everything US-Government and you can help https://tracker.archiveteam.org/
Rebuff5007
I find it amusing that the might of the American government -- in trying to take a bunch of data offline -- is being resisted by a digital "militia" of hobbyist archivers and non profits.
There's something about this that just rings Second Amendment. Personally I think the concept of civilians having weapons to be a check on a nation state is absurd, but in this case it feels pretty empowering.
jppope
Well I wouldn't really call it the "American Government" per se... It's a geriatric former reality TV show host elected to the presidency by offering to do for America what he did for steak or private education. That guy and his cronies really aren't the American Government. They were just elected to be in charge of the American Government.
p3rls
Larping the larpers.
fnands
Does this include all USGS data?
This is a topic that came up at work today as we rely on this data and are considering backing up most of the Lidar data from there ourselves (100s of TB probably)
EDIT: no, looks like it is only the footprints
mindcrime
Very happy this is happening. There's a ridiculous amount of incredibly valuable data, scientific documents, etc. "out there" that are at risk.
I haven't had much time to look at this yet and see what all is there, but whether currently included or not, a couple of things I really hope get archived are the contents of the DTIC (Defense Technical Information Center) document repository (lots of really interesting older scientific publications) and the NASA TRS (Technical Report Server).
I'm working on my own archive of at least some portion of the DTIC stuff just to be on the safe side. So far everything I've tried to access is still there, but who knows how long that will last.
fredoliveira
Honestly a shame it has to come to this. Sure, people elected this administration and I guess with that comes a bunch of things I disagree with. But the removal of years of scientific research and data from the web (paid for by citizens with their taxes) is absolutely unacceptable. Ravaging CDC data, climate data, etc is horrendous and unforgivable.
zombot
It's today's equivalent of book burnings.
"where they burn books, they will ultimately burn people as well."
Those who delete research will ultimately delete people as well.
nxm
[flagged]
d3nj4l
Getting upset that USAID spent the government equivalent of pocket change on programs that nominally helped other countries and practically bound them to US soft power is (1) cutting off your nose to spite your face and (2) not equivalent to deleting data that money was already spent on.
fredoliveira
At face value I agree with everything USAID stands for, but it sounds like you have some examples in mind you want to provide.
matwood
Good luck getting any responses that have been corroborated outside of Musk's ramblings. So much for 'transparency'.
If someone was truly looking to audit places to save money, it's always been odd to me that USAID was first. Now this story comes out...
https://www.newsweek.com/usaid-elon-musk-starlink-probe-ukra...
MeetingsBrowser
We cancelled $100,000 in life saving donations because half a penny was spent on something considered “woke”.
frontalier
this archive is going to disappear before the summer comes
govideo
From the post: Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complete archive of federal public datasets linked by data.gov. It will be updated daily as new datasets are added to data.gov. This is the first release in our new data vault project to preserve and authenticate vital public datasets for academic research, policymaking, and public use.
black_puppydog
Great to see there's some resistance. What I'm missing from this announcement though is any mention of how they intend to secure this "vault" against the current government. I'm assuming good intentions on the part of Harvard, but keeping this data online against the express will of the government is gonna cost (political) capital. And from what I can see, the archive is hosted by US entities on US-controlled servers on US soil?
This is the same thing that's been bothering me with archive.org lately, by the way. I haven't found a good way to simply (for some reasonable definition of "simple") contribute 10 TiB or so of redundant storage on my (European) home server either. That kind of thing might (have to) serve to ensure tamper-resistance for that data, given the current political climate on both sides of the pond. Any pointers welcome.