Archivists work to save disappearing data.gov datasets
262 comments
· January 30, 2025
JackC
I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.
There's lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?
One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.
Some open questions we'd love help on --
* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there's things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.
* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
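A minimal sketch of the kind of data munging that could separate renames from true deletions, assuming (as in the glass-buttes example) that renamed entries differ only by a trailing hash-like suffix; the helper names are hypothetical:

# Heuristic sketch: treat a "deleted" slug as a rename if an "added" slug
# shares the same base once a trailing hash-like suffix is stripped.
import re

def base_slug(name: str) -> str:
    return re.sub(r"-[0-9a-f]{3,8}$", "", name)

def classify(removed: set[str], added: set[str]) -> tuple[set[str], set[str]]:
    added_bases = {base_slug(n) for n in added}
    renamed = {n for n in removed if base_slug(n) in added_bases}
    return renamed, removed - renamed  # (probable renames, probable deletions)

# classify({"2011-glass-buttes-exploration-and-drilling-535cf"},
#          {"2011-glass-buttes-exploration-and-drilling-236cf"})
# -> ({"2011-glass-buttes-exploration-and-drilling-535cf"}, set())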
josh-sematic
A common metric for how much actual content has changed is the Jaccard Index. Even for large numbers of datasets that are too large to fit in memory it can be approximated with various forms of MinHash algorithms. Some write up here: https://blog.nelhage.com/post/fuzzy-dedup/
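For a sense of how the approximation works, here is a toy MinHash sketch (not the linked post's implementation; shingle size and hash count are arbitrary choices):

# Toy MinHash: the fraction of matching signature slots estimates the
# Jaccard index of the two shingle sets, without comparing the full sets.
import hashlib

def minhash(text: str, num_hashes: int = 128) -> list[int]:
    shingles = {text[i:i + 5] for i in range(max(len(text) - 4, 1))}
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)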
codetrotter
> sign archives with email/domain/document certificates
I do a bit of web archival for fun, and have been thinking about something.
Currently I save both response body and response headers and request headers for the data I save from the net.
But I was thinking that maybe, instead of just saving that, I could go a level deeper and preserve the actual TCP packets and TLS key-exchange material.
And then, I might be able to get a lot of data provenance “for free”. Because if, in some decades, we look back at the saved TCP packets and TLS material, we could see that those packets were signed with a certificate chain that matches what the website was serving at the time. Assuming of course that they haven’t accidentally leaked their private keys in the meantime, that the CA hasn’t gone rogue since, etc.
To me it would make sense to build out web archival infra that preserves the CA chain and enough material to verify later that it was valid. And if many people across the world save the right parts, we don’t have to trust each other in order to verify that data the other saved really was sent by the website our archives say it was from.
For example, maybe I only archived a single page from some domain, and you saved a whole bunch of other pages from that domain around the same time, so the same certificate chain was used in the responses to both of us. Then I can know that the data you say you archived from them really was served by their server, because I have the certificate chain I saved to verify it.
tomatocracy
In terms of tooling, there's scoop[0], which does a lot of the capture part of what you're thinking about. The files it creates include request headers and responses, TLS certificates, PDFs and screenshots, and it has support for signing the whole thing as proof of provenance.
Overall though I think archive.org is probably sufficient proof that a specific page had certain content on a certain day for most purposes today.
kro
The idea is good. As far as I understand TLS, however, the cert / asymmetric key is only used to prove the identity/authenticity of the cert, and thus the host, for the session.
But the main content is not signed / checksummed with it; it's encrypted with a symmetric session key, so one could probably manipulate the content in the packet dump anyway.
I read about a Google project named SXG (Signed HTTP Exchanges) that might do related stuff, albeit likely requiring the assistance of the publisher.
Intralexical
"TLS-N", "TLS Sign", and maybe a couple others were supposed to add non-repudiation.
But they didn't really go anywhere:
https://security.stackexchange.com/questions/52135/tls-with-...
https://security.stackexchange.com/questions/103645/does-ssl...
There are some special cases, like I think certain headers for signing e-mails, that do provide non-repudiation.
For that, `tcpdump` with `SSLKEYLOGFILE` will probably get you started on capturing what you need.
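A sketch of how such a capture might be wired up (tcpdump usually needs root, and the client has to honor SSLKEYLOGFILE, which curl built against OpenSSL does; the URL is a placeholder):

# Record packets while TLS session secrets are written to SSLKEYLOGFILE,
# so the encrypted capture can be decrypted later in Wireshark/tshark.
import os
import subprocess

keylog = os.path.abspath("tls_keys.log")
capture = subprocess.Popen(
    ["tcpdump", "-i", "any", "-w", "capture.pcap", "tcp port 443"]  # needs root
)
try:
    subprocess.run(
        ["curl", "-sS", "-o", "page.html", "https://example.gov/"],
        env={**os.environ, "SSLKEYLOGFILE": keylog},
        check=True,
    )
finally:
    capture.terminate()
# tshark -r capture.pcap -o tls.keylog_file:tls_keys.log can then decrypt it.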
ethbr1
To extend this to archival integrity without cooperation from the server/host, you'd need the client to sign the received bytes.
But then you need the client to be trusted, which clashes with distributing.
Hypothetically, what about trusted orgs standing up an endpoint that you could feed a URL, then receive back attestation from them as to the content, then include that in your own archive?
Compute and network traffic are pretty cheap, no?
So if it's just grabbing the same content you are, signing it, then throwing away all the data and returning you the signed hash, that seems pretty scalable?
Then anyone could append that to their archive as a certificate of authenticity.
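A rough sketch of what such an attestation endpoint could return (the key handling, statement format, and field names are all made up for illustration):

# Fetch the URL, hash the body, sign a small statement, discard the content.
import hashlib
import json
import time

import requests
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()  # a real service would use a long-lived, published key

def attest(url: str) -> dict:
    body = requests.get(url, timeout=30).content
    statement = json.dumps(
        {"url": url, "sha256": hashlib.sha256(body).hexdigest(), "fetched_at": int(time.time())},
        sort_keys=True,
    ).encode()
    return {
        "statement": statement.decode(),
        "signature": signing_key.sign(statement).hex(),
    }

Archivers could then store the returned statement and signature alongside their own copy of the content.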
catlifeonmars
Reminds me of timestamp protocol and timestamp authorities.
Not quite the same problem, but similar enough to have a similar solution. https://www.ietf.org/rfc/rfc3161.txt
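Requesting an RFC 3161 token is simple enough to script; a hedged sketch (the TSA URL is a placeholder, and openssl ts does the ASN.1 work):

# Build a timestamp query for a file, POST it to a TSA, save the token.
import subprocess
import requests

TSA_URL = "https://tsa.example.org"  # placeholder; substitute a real TSA endpoint

subprocess.run(
    ["openssl", "ts", "-query", "-data", "archive.warc.gz", "-sha256", "-cert", "-out", "request.tsq"],
    check=True,
)
with open("request.tsq", "rb") as f:
    resp = requests.post(
        TSA_URL,
        data=f.read(),
        headers={"Content-Type": "application/timestamp-query"},
        timeout=30,
    )
with open("response.tsr", "wb") as f:
    f.write(resp.content)
# `openssl ts -reply -in response.tsr -text` pretty-prints the signed token.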
A1kmm
Unfortunately, the standard TLS protocol does not provide a non-repudiation mechanism.
It works by using public key cryptography and key agreement to get both parties to agree on a symmetric key, and then uses the symmetric key to encrypt the actual session data.
Any party who knows the symmetric key can forge arbitrary data, and so a transcript of a TLS session, coupled with the symmetric key, is not proof of provenance.
There are interactive protocols that use multi-party computation (see for example https://tlsnotary.org/) where there are two parties on the client side, plus an unmodified server. tlsnotary only works for TLS1.2. One party controls and can see the content, but neither party has direct access to the symmetric key. At the end, the second party can, by virtue of interactively being part of the protocol, provably know a hash of the transaction. If the second party is a trusted third party, they could sign a certificate.
However, there is not a non-interactive version of the same protocol - you either need to have been in the loop when the data was archived, or trust someone who was.
The trusted third party can be a program running in a trusted execution environment (but note pretty much all current TEEs have known fault injection flaws), or in a cloud provider that offers vTPM attestation and a certificate for the state (e.g. Google signs a certificate saying an endorsement key is authentically from Google, and the vTPM signs a certificate saying a particular key is restricted to the vTPM and only available when the compute instance is running particular known binary code, and that key is used to sign a certificate attesting to a TLS transcript).
I'm working on a simpler solution that doesn't use multiparty computation, and provides cloud attestation - https://lemmy.amxl.com/c/project_uniquonym https://github.com/uniquonym/tls-attestproxy - but it's not usable yet.
Another solution is if the server will cooperate with a TLS extension. TLS-N (https://eprint.iacr.org/2017/578.pdf) does exactly this, and it makes provenance trivial.
Intralexical
As important as cryptography is, I also wonder how much of it is trying to find technical solutions for social problems.
People are still going to be suspicious of each other, and service providers are still going to leak their private keys, and whatnot.
3np
You may be interested in Reclaim Protocol and perhaps zkTLS. They have something very similar going and the sources are free.
https://github.com/reclaimprotocol
https://drive.google.com/file/d/1wmfdtIGPaN9uJBI1DHqN903tP9c...
whatevermom
It’s an interesting idea for sure. Some drawbacks I can think of:
- bigger resource usage: you will need to maintain a dump of the TLS session AND an easily extractable version
- difficulty of verification: OpenSSL / BoringSSL / etc. will all evolve and, say, completely remove support for TLS versions, ciphers, TLS extensions… This might make many dumps unreadable in the future, or require the exact same version of a given piece of software to read them. Perhaps adding the decoding binary to the dump would help, but then you’d run into Linux backwards-compatibility issues.
- compression issues: new compression algorithms will be discovered and could reduce storage usage, but you’ll have a hard time benefiting from them since TLS streams will look random to the compression software.
I don’t know. I feel like it’s a bit overkill — what are the incentives for tampering with this kind of data?
Maybe a simpler way of going about it would be to build a separate system that does the “certification” after the data is dumped; combined with multiple orgs actually dumping the data (reproducibility), this should be enough to prove that a dataset is really what it claims to be.
mrshadowgoose
Just commenting to double-down on the need for cryptographic timestamping - especially in the current era of generative AI.
sunaookami
What about https://opentimestamps.org/ ?
_heimdall
How does that work exactly? Does it all still hinge on trusting a known Time Stamp Authority, or is there some way of timestamping in a trustless manner?
yencabulator
I'm so sad roughtime never got popular. It can be used to piggyback a "proof of known hash at time" mechanism, without blockchain waste.
https://www.imperialviolet.org/2016/09/19/roughtime.html
https://int08h.com/post/to-catch-a-lying-timeserver/
mlyle
You can publish the hash in some durable medium, like the classified section of a newspaper.
This proves you generated it before this time.
You can also include in the hash the close of the stock market and all the sports scores from the previous day. That proves you generated it after that time.
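A toy version of that two-sided bound, assuming you have yesterday's scores/headlines as text:

# Folding in data that could not have existed before yesterday proves
# "not earlier"; publishing the digest durably proves "not later".
import hashlib

def commitment(archive_bytes: bytes, yesterdays_scores: str) -> str:
    h = hashlib.sha256()
    h.update(yesterdays_scores.encode())
    h.update(archive_bytes)
    return h.hexdigest()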
sebmellen
This is the one thing blockchains are truly good for.
mrshadowgoose
You make use of several independent authorities for each timestamped document.
The chance is exceedingly low that the PKI infrastructure of all the authorities becomes compromised.
chrishoyle
I'd love to learn more about what is in scope of the Library Innovation Lab projects. Is it targeting data.gov specifically or all government agency websites?
Given the rapid takedowns of websites (CDC, USAID), do you have a prioritization framework for which pages to capture first, or do you have "comprehensive" coverage of pages (in scope of the project)?
As you allude to, I've been having a hard time learning about what sort of duplicate work might be happening, given that there isn't a great "archived coverage" source of truth for government websites (between projects such as the End of Term archive, the Internet Archive, research labs, and independent archivists).
Your open questions are interesting. Content hashes for each page/resource would be a way to do quick comparisons, but I assume you might want to set some threshold to determine how much it's changed vs if it changed?
Is the second question about figuring out how to prioritize valuable stuff behind depth-two traversals? (e.g. data.gov links to another website and that website has a csv download)
JackC
As a library, the very high level prioritization framework is "what would patrons find useful." That's how we started with data.gov and federal Github repos as broad but principled collections; there's likely to be something in there that's useful and gets lost. Going forward I think we'll be looking for patron stories along the lines of "if you could get this couple of TB of stuff it would cover the core of what my research field depends on."
In practice it's some mix of, there aren't already lots of copies, it's valuable to people, and it's achievable to preserve.
> Is the second question about figuring out how to prioritize valuable stuff behind two depth traversals?
Right -- how do you look at the 300,000 entries and figure out what's not at depth one, is archivable, and is worth preserving? If we started with everything it would be petabytes of raw datasets that probably shouldn't be at the top of the list.
smrtinsert
Thank you for this effort.
jszymborski
Hi! Is there any one place that would be easiest for folks to grab these snapshots from? Would love to try my hand at finding documents that moved/documents that were removed.
JackC
Hmm, I can put them here for now: https://source.coop/harvard-lil/data-gov-metadata
Unfortunately it's a bit messy because we weren't initially thinking about tracking deletions. data_20241119.jsonl.zip (301k rows) and data_20250130.jsonl.zip (305k rows) are simple captures of the API on those dates. data_db_dump_20250130.jsonl.zip (311k rows) is a sqlite dump of all the entries we saw at some point between those dates. My hunch is there's something like 4,000 false positives and 2,000 deletions between the 311k and 305k set, but that could be way off.
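For anyone poking at these dumps, a rough diff sketch (it assumes each JSONL row carries a stable "id" field and that the zip member name matches the file's base name, which may not reflect the actual layout):

# Compare dataset identifiers between two snapshot dumps.
import json
import zipfile

def ids(zip_path: str, member: str) -> set[str]:
    with zipfile.ZipFile(zip_path) as z, z.open(member) as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

before = ids("data_20241119.jsonl.zip", "data_20241119.jsonl")
after = ids("data_20250130.jsonl.zip", "data_20250130.jsonl")
print(f"{len(before - after)} candidate deletions, {len(after - before)} new entries")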
jszymborski
Very cool! I'll take a look :)
LastTrain
How can people help? Sounds like a global index of sources is needed, with the work to validate those sources parceled out over time. Without something coordinated, I feel like it is futile to even jump in.
JackC
I spent a bunch of time on this project feeling like it was futile to jump in and then just jumped in; messing with data is fun even if it turns out someone else has your data. But the government is huge; if you find an interesting report and then poke around for the .gov data catalog or directory index structure or whatever that contains it, you're likely to find a data gathering approach no one else is working on yet.
There are coordinated efforts starting to come together in a bunch of places -- some on r/datahoarders, some around specific topics like climate data (EDGI) or CDC data, and there are datasets being posted on archive.org. I think one way is to find a topic or kind of data that seems important and search around for who's already doing it. Eventually maybe there'll be one answer to rule them all, but maybe not; it's just so big.
glitchcrab
Very tangentially related, but it always makes me smile to see rclone mentioned in the wild - its creator ncw was the CEO of the previous company I worked at.
0n0n0m0uz
One of the USA's greatest strengths is the almost unprecedented degree of transparency of government records going back decades. We can actually see the true facts, including when our government has lied to us or covered things up. Many other nations do not have this luxury, and it has provided the evidentiary basis for both legal cases and "progress" in general. Not surprising that authoritarians would target and destroy data, as it makes their objective of a post-truth society that much easier.
bamboozled
It’s also the driver for a great economy imo. Why can’t Russia and China build and innovate like the USA can?
Because they spend a lot of time censoring and covering things up.
RIP USA.
johnnyanmac
China as of late is indeed running circles around the US. We spent 3 decades depending on them for manufacturing, and surprise: they got really good at manufacturing stuff for themselves when the chips are down.
I guess great economies also give it away. RIP USA.
stackedinserter
> Why can’t Russia and China build and innovate like the USA can?
> Because they spend a lot of time censoring and covering things up.
What a statement! How did you come to this conclusion?
bamboozled
Who built all of Russia’s oil infrastructure? Who has the strongest economy? Who builds the airliners? Who has the most advanced space programs? Who landed men on the moon? Etc.
It’s not just America either. Europe too. Australia.
Authoritarians copy and steal. Democracies innovate and adapt.
Democracies are more transparent and honest, which leads to better outcomes in all areas.
Time spent hiding information to appease the dictator and protecting “the party” is time wasted.
chrishoyle
Beyond federal websites (.gov, .mil) there are lot of gov contractor websites that are being taken down (presumably at the demand of agencies) that contain a wealth of information and years of project research.
Some examples below of contractors that work with USAID:
- https://www.edu-links.org/ (taken down)
- https://www.genderlinks.org/ (taken down)
- https://usaidlearninglab.org/ (taken down)
- https://agrilinks.org/ (presumably at risk)
- https://www.climatelinks.org/ (presumably at risk)
- https://biodiversitylinks.org/ (presumably at risk)
honestSysAdmin
The pedestrian "right", which I encounter on a day-to-day basis during the months I visit client sites a couple hundred miles inland of the Gulf of America, will look at climatelinks.org and say something like: "all I see are foreign countries, why are we spending money on this instead of citizens of the United States?".
bamboozled
Yeah, what has avoiding another plague ever done for the USA.
johnnyanmac
"We're America, we wait until it's too late and then react!"
A rough paraphrasing from Boondocks, said by the richest man in that neighborhood.
cscurmudgeon
[flagged]
harimau777
IMHO, we should do it because the person who pays tends to have more power over what happens. Just like how in high school the kid who drives everyone tends to have a higher than normal say in what the friend group does.
neaden
The US provided 14% of the WHO funding but is 25% of global GDP, so proportionately we don't contribute as much as many other countries.
sanp
We wouldn’t know this if the information weren’t shared. So aren’t you making a case for not removing this information?
kergonath
> Why should US fund WHO ~5-6 times more than China [0] (and more than EU)
The base contributions are a function of GDP. The extra contributions are voluntary, and the US made them because it was in the US’ interests. It’s a rounding error in the US foreign policy budget and was a good investment in terms of goodwill and data for American health research institutions.
The WHO must focus where it is needed most. Public health is much better in the EU (and even in Europe more broadly, accounting for places like Belarus and Ukraine) than in China, and far fewer epidemics emerge in Europe in general.
The whole idea is that if we limit the emergence of epidemics where they are likely to happen, we end up with fewer pandemics after these epidemics spread worldwide (which includes Europe and North America). The whole world is better without another COVID, Ebola, or Polio.
> only to have the WHO be controlled by China
This is bullshit. The WHO is not controlled by China any more than other UN institutions. What is certain, though, is that the US won’t have any say whatsoever once they are out.
Slava_Propanei
[dead]
cle
I’ve been archiving data.gov for over a year now and it’s not unusual to see large fluctuations on the order of hundreds or thousands of datasets. I’ve never bothered trying to figure out what exactly is changing, maybe I should build a tool for that…
nemofoo
Do you mirror these data sets anywhere?
cle
It's not in any sort of format to do this kind of analysis unfortunately. I'm also missing some data b/c I throw away certain kinds of datasets that are not useful for me. I can probably write some scripts to diff my archives with the current data.gov and see what's missing, but it won't be "complete". But it might still be useful...
I did however just write a Python script to pull data.gov from archive.org and check the dataset count on the front page for all of 2024, here are the results:
As you can see, there were multiple drops on the order of ~10,000 during 2024. So it's not that unusual. There could be something bad going on right now, but just from the numbers I can't conclude that yet.
(Specifically it takes the first snapshot of every Wednesday of 2024).
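For anyone who wants to reproduce this, a script along these lines should come close (the regex for the homepage dataset count is a guess and may need adjusting):

# Pull daily data.gov snapshots from the Wayback Machine CDX API
# and scrape the dataset count from each archived front page.
import re
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def snapshots_2024() -> list[str]:
    rows = requests.get(CDX, params={
        "url": "data.gov",
        "from": "2024",
        "to": "2024",
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:8",  # at most one capture per calendar day
    }, timeout=60).json()
    return [row[1] for row in rows[1:]]  # rows[0] is the CDX header row

def dataset_count(timestamp: str):
    html = requests.get(
        f"https://web.archive.org/web/{timestamp}/https://data.gov/", timeout=60
    ).text
    m = re.search(r"([\d,]+)\s+datasets", html)
    return int(m.group(1).replace(",", "")) if m else None

for ts in snapshots_2024():
    print(ts, dataset_count(ts))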
If I get around to re-formatting my archives this week, I'll follow up on HN :).
jl6
> The outlet reports that deleted datasets "disproportionately" come from environmental science agencies like the Department of Energy, National Oceanic and Atmospheric Administration (NOAA), and the Environmental Protection Agency (EPA).
Was there an EO targeting these areas?
_DeadFred_
Looks like the EPA is being targeted (even though ninety-five percent of the funding going to the EPA has not only been appropriated but is locked in as legally obligated grant funding; the Constitution does not give the president a line-item veto over Congress's spending decisions):
https://www.cbsnews.com/news/epa-employees-warned-of-immedia...
arcbyte
The President's ability to affect spending is definitely limited, and hasn't been exercised really since Reagan, but still exists.
Congress rarely makes spending money its goal; rather, it appropriates money to accomplish some goal. Which is to say that if Congress wants a bridge across a river and appropriates $10 billion to build it, the President is not obligated to spend $10 billion if 7 or 8 or 9 will do. In some cases, Congress does appropriate money toward causes and intends all of it to be spent in furtherance of some guiding principle, and in those cases all the money must be spent.
rhinoceraptor
It's not limited if he hijacks the entire government from the inside and installs loyalists everywhere
_DeadFred_
If that money has already been awarded to be given out, the executive cannot arbitrarily withdraw it. The Administrative Procedure Act (APA) of 1946 contains the rule that prevents the U.S. executive branch from acting in an arbitrary and capricious manner. Trump was already slapped down for doing this by the Supreme Court in Department of Homeland Security v. Regents of the University of California.
johnnyanmac
Clinton did get a line-item veto passed, but it was struck down. Pretty much all case law in this area has ruled that Congress's budget is final and very hard to modify.
Trump's case was a complete constitutional overreach. Thankfully the courts shot it down fast.
ks2048
I really don't think "Trump can't do it, if it's illegal" is going to fly this time. I hope so, but I'm not very confident.
Of course, he can't get away with too much alone, but with the right appointees, judges, etc, who knows.
ks2048
And to the people saying "Congress controls spending, not the President": there are already reports of Musk trying to take control of the system that sends out money.
https://newrepublic.com/post/190983/top-treasury-official-qu...
EDIT: apparently Musk succeeded: https://bsky.app/profile/wyden.senate.gov/post/3lh5ejpwncc23
Nuzzerino
Hmm, I'm not sure "is it legal" would be at the top of my priority list if I were working (in good faith) on a team to recover the country from a nosedive and prevent World War 3. But it is ok to disagree on whether those things actually are real, imminent risks to fix.
softwaredoug
I don't get why there's no legal action then? Maybe it's a matter of nobody having standing?
toyg
There is some action; certain states have already started suing. But it takes some time and effort to file suits that are watertight. It's harder and slower than just writing illegal, unconstitutional, or inapplicable acts.
johnnyanmac
Trump is basically doing a blitzkrieg of EOs. Lawsuits are happening slowly, between judges enjoining Trump's budget freeze, employees suing the government over the OP issues, and states suing over various orders. But so much is happening at once that it's hard to keep track of if your full-time job isn't politics.
TheBlight
[flagged]
dangrossman
> "While 5 CFR 315 does permit immediate termination, it does not permit arbitrary termination. The termination must be related to unsatisfactory performance or conduct (section 804) or conditions arising before employment, which usually means something from your background investigation (section 805)..."
https://www.reddit.com/r/fednews/comments/1id7ud2/comment/m9...
_DeadFred_
The Administrative Procedure Act (APA) of 1946 contains the rule that prevents the U.S. executive branch from acting in an arbitrary and capricious manner. Definitely not legit.
johnneville
Politico reports that USDA landing pages regarding climate change were ordered to be deleted by a directive from the USDA's office of communications.
I think it is likely that the orders to these other agencies follow this model. Many other datasets are being targeted via EO 14168, which has quite wide impacts but doesn't at first glance appear to apply to what I would expect to be part of NOAA and EPA reports.
https://www.politico.com/news/2025/01/31/usda-climate-change...
bilbo0s
Don’t worry, it is a matter of great doctrinal import that all scientific datasets be replaced with datasets that have been properly refined in accordance with scripture. /s
Maybe this administration will get better over time?
uni_rule
Nah, the whole executive branch is getting Jack Welch'ed. Hopefully your tap water cleanliness regulations are strong on a state level.
BeefWellington
After the utter bullshit pulled in California, better hope your state is willing to defend its water reservoirs or for some places clean tap water may be the least of their problems.
dang
Related ongoing thread:
CDC data are disappearing - https://news.ycombinator.com/item?id=42897696 - Feb 2025 (216 comments)
eh_why_not
What's a good way to be an "Archivist" on a low budget these days?
Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).
Is this a use case for Torrents? What's the most suitable architecture available today for this?
josh-sematic
I’m not an expert in such things, but this seems like a good use case for IPFS. Kinda similar to a torrent except that it is natively content-addressed (essentially the key to access is a hash of the data).
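The content-addressing idea in miniature (a toy in-memory store, not the real IPFS API, which also chunks data and produces multihash CIDs):

# The retrieval key is derived from the bytes themselves, so any tampering
# changes the key and is detectable on retrieval.
import hashlib

store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def get(key: str) -> bytes:
    data = store[key]
    assert hashlib.sha256(data).hexdigest() == key  # integrity check comes for free
    return data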
smallerize
https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
Set up a scrape using ArchiveTeam's fork of wget. It can save all the requests and responses into a single WARC file. Then you can use https://replayweb.page/ or some other tool to browse the contents.
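Once you have a WARC, a library such as warcio (not mentioned above, but widely used for this) makes it easy to walk the records; a small sketch with a hypothetical filename:

# List the URL and HTTP status of every response record in a WARC.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, uri)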
honestSysAdmin
In my experience, to archive effectively you need a physical datacenter footprint, or to rent capacity from someone who has one. Over a longer timespan (even just 6 months), having your own footprint is a lower total cost of ownership, provided you have the skills, or access to someone with the skills, to run Kubernetes + Ceph (or something similar).
> Is this a use case for Torrents?
Yes, provided you have a good way to dynamically append to a distributed index of torrents, and users willing to run that software in addition to the torrent software. Should be easy enough to define in container-compose.
crowcroft
Still, even with best efforts this is such a shame. There are always going to be questions around governance of the data, its integrity, and potentially the chain of custody as well. If the goal is to muddy the waters and create a narrative that whatever might be in this data isn't reliable or accurate, then mission accomplished. I don't see how anything can stop that.
Not to say the data isn't incredibly valuable and should be preserved for many other reasons of course. All the best to anyone archiving this, this is important work.
chrishoyle
Related ongoing discussion
The government information crisis is bigger than you think it is - https://news.ycombinator.com/item?id=42895331
debeloo
Is this normal when there's a change in presidency?
meesles
From the article:
> Changes in presidential administrations have led to datasets being deleted in the past, either on purpose or by accident. When Biden took office, 1,000 datasets were deleted according to the Wayback Machine, via 404 Media's reporting.
derbOac
I think the question is the nature of the losses in the two cases, the transparency circumstances about them, and who exactly is making the decisions about specific datasets.
Time will tell, but the loss of public datasets is generally not a good thing.
doener
Yes, the context in which this happens could provide clues as to the nature of these losses: https://news.ycombinator.com/item?id=42898165
animal_spirits
This is not a direct quote; the actual quote from the article is:
> But archivists who have been working on analyzing the deletions and archiving the data it held say that while some of the deletions are surely malicious information scrubbing, some are likely routine artifacts of an administration change, and they are working to determine which is which. For example, in the days after Joe Biden was inaugurated, data.gov showed about 1,000 datasets being deleted as compared to a day before his inauguration, according to the Wayback Machine.
akudha
Why are the datasets deleted though? Biden or Trump, Democrats or Republicans - What do they gain?
smrtinsert
Are datasets mirrored anywhere where the govt doesn't automatically have take-down authority? If not, there should be a mirroring effort.
lhl
There's been a lot of discussion in https://www.reddit.com/r/DataHoarder/
Here's documentation on independent backup efforts of various government websites: https://www.reddit.com/r/DataHoarder/comments/1ifalwe/us_gov...
Also here: https://www.reddit.com/r/DataHoarder/comments/1idj6dm/all_us...
Apparently, much of the data has been backed up here: https://eotarchive.org/
Here's also a discussion on whether the Internet Archive is sufficiently backed up/decentralized (it is not): https://www.reddit.com/r/DataHoarder/comments/1if32iq/does_i...
sunk1st
I don’t see a list of the datasets that have gone missing. Is there a list?
mistrial9
[flagged]
NortySpock
And you could also run your own archive bot (x86 only). I've got one running in a docker container, it downloads a webpage and auto-uploads it to archive.org
https://tracker.archiveteam.org/
Edit to add:
docker_compose.yml example:
services:
  archiveteam:
    image: atdr.meo.ws/archiveteam/warrior-dockerfile
    ports:
      - '8101:8001'
    mem_limit: 4G
    cpus: 3
    dns:
      - 9.9.9.10
      - 8.8.8.8
    labels:
      - com.centurylinklabs.watchtower.enable=true
    container_name: archiveteam-warrior
    environment:
      - DOWNLOADER=asdf # Change this to your nickname
      - SELECTED_PROJECT=auto # Change this to your project of preference or let the archiveteam decide with 'auto'
      - CONCURRENT_ITEMS=6 # Change this to the amount of concurrent download threads you can handle
  watchtower:
    command: '--label-enable --include-restarting --cleanup --interval 3600'
    cpu_shares: 128
    mem_limit: 1G
    cpus: 1
    image: containrrr/watchtower
    volumes:
      - '/var/run/docker.sock:/var/run/docker.sock'
    container_name: watchtower