Compiler Explorer and the promise of URLs that last forever

kccqzy

Before 2010 I had this unquestioned assumption that links are supposed to last forever. I used my browser's bookmark feature extensively. Some time later, I discovered that a large fraction of my bookmarks were essentially unusable due to link rot. My modus operandi after that was to print the webpage as a PDF. A bit later, when reader views became popular and reliable, I just copy-pasted the content from the reader view into an RTF file.

lappa

I use the SingleFile extension to archive every page I visit.

It's easy to set up, but be warned, it takes up a lot of disk space.

    $ du -sh ~/archive/webpages
    1.1T /home/andrew/archive/webpages
https://github.com/gildas-lormeau/SingleFile

davidcollantes

How do you manage those? Do you have a way to search them, or a specific way to catalogue them, which will make it easy to find exactly what you need from them?

nirav72

KaraKeep is a decent self-hostable app that can receive SingleFile pages: you point the SingleFile browser extension at the KaraKeep API. That lets me search archived pages (plus auto-summarization and tagging via an LLM).

snthpy

Thanks. I didn't know about this and it looks great.

A couple of questions:

- do you store them compressed or plain?

- what about private info like bank accounts or health insurance?

I guess for privacy one could train oneself to use private browsing mode.

Regarding compression: with thousands of files, don't all those per-file headers add up? Wouldn't there be space savings from a global compression dictionary, storing only the encoded data?

d4mi3n

> do you store them compressed or plain?

Can't speak to your other questions, but I'd think the right file system will save you here. Hopefully someone with more insight can add color, but my understanding is that file systems like ZFS were built for exactly this kind of use case: a large set of data you want to store space-efficiently. Rather than a global compression dictionary, I believe ZFS simply compresses the blocks it writes to disk.
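
If it helps, here's a minimal sketch of what that looks like in practice, assuming an OpenZFS pool named "tank" and giving the archive its own dataset (the names are hypothetical):

    # create a dataset for the archive and turn on transparent compression
    $ zfs create tank/webpages
    $ zfs set compression=zstd tank/webpages
    # later: check how much space compression is actually saving
    $ zfs get compressratio tank/webpages

Compression is applied per block as data is written, so existing files only shrink once they are rewritten into the dataset.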

genewitch

By default, singlefile only saves when you tell it to, so there's no worry about leaking personal information.

Because SingleFile works so well, I haven't put in the effort to build a "bookmark server" that would accomplish what SingleFile does, but on the internet.

internetter

storage is cheap, but if you wanted to improve this:

1. find a way to dedup media

2. ensure content blockers are doing well

3. for news articles, run them through readability and store the markdown instead. If you wanted to be really fancy, you could instead programmatically build a "template" of each site you've visited (across its various endpoints), so the style is retained without being stored again for every page. Alternatively, a good compression algorithm gets you much of the way there if your directory looked like /home/andrew/archive/boehs.org.tar.gz, with all the boehs.org pages you visited saved inside that one tarball (a rough sketch of this follows the list)

4. add full-text search and embeddings over the pages
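
For what it's worth, item 3's tarball idea is roughly this (a sketch only; the paths and filenames are made up):

    # one compressed tarball per site, so pages from the same site share
    # compression context instead of each carrying its own overhead
    $ cd ~/archive/webpages
    $ tar czf boehs.org.tar.gz boehs.org/ && rm -r boehs.org/
    # read one page back out without unpacking the whole archive
    $ tar xzf boehs.org.tar.gz -O boehs.org/some-page.html | less

gzip only shares context within a small window, so swapping it for zstd (via tar's --use-compress-program option) would squeeze repeated boilerplate across pages harder.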

windward

>storage is cheap

It is. 1.1TB is both:

- objectively an incredibly huge amount of information

- something that can be stored for the cost of less than a day of this industry's work

Half my reluctance to store big files is just an irrational fear of the effort of managing it.

ashirviskas

1 and partly 3 - I use btrfs with compression and deduping for games and other stuff. Works really well and is "invisible" to you.
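
The moving parts are roughly these, assuming the archive lives on an existing btrfs mount (the mount point and options here are illustrative):

    # transparent compression for newly written files (or set compress=zstd in fstab)
    $ sudo mount -o remount,compress=zstd /home
    # offline deduplication of identical extents with duperemove
    $ duperemove -dhr ~/archive/webpages

Since compress= only affects data written afterwards, btrfs filesystem defragment -r -czstd can recompress what is already on disk.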

nyarlathotep_

Are you automating this in some fashion? Is there another extension you've authored, or something similar, that invokes SingleFile on each new page load?

shwouchk

i was considering a similar setup, but i don't really trust extensions. i'm curious:

- Do you also archive logged-in pages, infinite scrollers, banking sites, fb etc?

- How many entries is that?

- How often do you go back to the archive? Is stuff easy to find?

- Do you have any organization or additional process (eg bookmarks)?

did you try integrating it with llms/rag etc yet?

eddd-ddde

You can just fork it, audit the code, add your own changes, and self host / publish.

dataflow

Have you tried MHTML?

RiverCrochet

SingleFile is way more convenient as it saves to a standard HTML file. The only thing I know that easily reads MHTML/.mht files is Internet Explorer.

90s_dev

You must have several TB of the internet on disk by now...

flexagoon

By the way, if you install the official Web Archive browser extension, you can configure it to automatically archive every page you visit

petethomas

This is a good suggestion, with the caveat that entire domains can and do disappear: https://help.archive.org/help/how-do-i-request-to-remove-som...

Akronymus

That's especially annoying when a formerly useful site gets abandoned, a new owner picks up the domain, then gets IA to delete the old archives as well.

Or even worse, when a domain parking company does that: https://archive.org/post/423432/domainsponsorcom-erasing-pri...

internetter

Recently I've come to believe that even IA and especially archive.is are ephemeral. I've watched sites I've saved disappear without a trace, except in my self-hosted archives.

A technological conundrum, however, is the fact that I have no way to prove that my archive is an accurate representation of a site at a point in time. Hmmm, or maybe I do? Maybe something funky with cert chains could be done.

shwouchk

sign it with gpg and upload the sig to bitcoin

edit: sorry, that would only prove when it was taken, not that it wasn’t fabricated.

akoboldfrying

There are timestamping services out there, some of which may be free. It should (I think) be possible to basically submit the target site's URL to the timestamping service, and get back a certificate saying "I, Timestamps-R-US, assert that the contents of https://targetsite.com/foo/bar downloaded at 12:34pm on 29/5/2025 hashes to abc12345 with SHA-1", signed with their private key and verifiable (by anyone) with their public key. Then you download the same URL, and check that the hashes match.

IIUC the timestamping service needs to independently download the contents itself in order to hash it, so if you need to be logged in to see the content there might be complications, and if there's a lot of content they'll probably want to charge you.
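
One concrete version of this is RFC 3161 timestamping, which signs a hash you compute locally rather than content the service downloads itself. Here's a rough sketch with OpenSSL, using freetsa.org as one example of a free TSA (any RFC 3161 endpoint works the same way); as noted above, this only proves when you had these bytes, not that the site actually served them:

    # build a timestamp request over the saved page's SHA-256 hash
    $ openssl ts -query -data page.html -sha256 -cert -out page.tsq
    # send it to the timestamping authority and keep the signed response
    $ curl -s -H 'Content-Type: application/timestamp-query' \
        --data-binary @page.tsq https://freetsa.org/tsr -o page.tsr
    # anyone can later verify the response against the TSA's CA certificate
    $ openssl ts -verify -data page.html -in page.tsr -CAfile tsa-cacert.pem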

vitorsr

> you can configure it to automatically archive every page you visit

What?? I am a heavy user of the Internet Archive services, not just the Wayback Machine, including official and "unofficial" clients and endpoints, and I had absolutely no idea the extension could do this.

To bulk archive, I would do it manually via the web interface or automate it in batches. The limitations of doing it one by one are obvious, and the limitation of doing it in batches is that it requires, well, keeping batches (lists).
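
For reference, the batch automation can be as simple as looping a URL list over the Wayback Machine's Save Page Now endpoint (a sketch; urls.txt is hypothetical, heavy use is rate limited, and the authenticated SPN2 API is the polite route for big batches):

    # archive each URL in urls.txt, pausing between requests
    $ while read -r url; do
          curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' \
              "https://web.archive.org/save/${url}"
          sleep 10
      done < urls.txt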

90s_dev

My solution has been to just remember the important stuff, or at least where to find it. I'm not dead yet so I guess it works.

TeMPOraL

It was my solution too, and I liked it, but over the past decade or so, I noticed that even when I remember where to find some stuff, hell, even if I just remember how to find it, when I actually try and find it, it often isn't there anymore. "Search rot" is just as big a problem as link rot.

As for being still alive, by that measure hardly anything anyone does is important in the modern world. It's pretty hard to fail at thinking or remembering so badly that it becomes a life-or-death thing.

90s_dev

> hardly anything anyone does is important

Agreed.

mock-possum

I’ve found that whenever I think “why don’t other people just do X” it’s because I’m misunderstanding what’s involved in X for them, and that generally if they could ‘just’ do X then they would.

“Why don’t you just” is a red flag now for me.

90s_dev

Not always. I love it when people offer me a much simpler solution to a problem I overengineered, so I can throw away my solution and use the simpler one.

Half the time someone is offered a better way, it's because they were actually doing it wrong: they had the solution's requirements wrong in the first place, and the outside perspective helps.

chii

this applies to basically any suggested solution to any problem.

"Why don't you just ..." is just lazy idea suggestion from armchair internet warriors.

mycall

Is there some browser extension that automatically goes to web.archive.org if the link times out?

theblazehen

I use the Resurrect Pages addon

macawfish

shwouchk

warc is not a panacea; for example, gemini makes it super annoying to get a transcript of your conversation, so i started saving those as pdf and warc.

turns out that, unlike with most webpages, the pdf export captures only a single page showing what is visible on screen.

turns out also that opening the warc immediately triggers a js redirect that is planted in the page. i can still extract the text manually - it’s embedded there - but i cannot “just open” the warc in my browser and expect an offline “archive” version - im interacting with a live webpage! this sucks from all sides - usability, privacy, security.

Admittedly, i don’t use webrecorder - does it solve this problem? did you verify?

weinzierl

Not sure if you tried this: Chrome has a full-page screenshot command. Just open the command bar in dev tools, search for "full", and you will find it. Firefox has it right in the context menu, no need for dev tools.

Unfortunately there are sites where it does not work.

andai

Is there some kind of thing that turns a web page into a text file? I know you can do it with beautiful soup (or like 4 lines of python stdlib), but I usually need it on my phone, where I don't know a good option.

My phone browser has a "reader view" popup but it only appears sometimes, and usually not on pages that need it!

Edit: Just installed w3m in Termux... the things we can do nowadays!
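
For anyone else looking, the text-mode browsers handle this directly once installed:

    # render a page to plain text with w3m
    $ w3m -dump 'https://example.com/some-article' > article.txt
    # lynx does the same; -nolist drops the appended link list
    $ lynx -dump -nolist 'https://example.com/some-article' > article.txt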

XorNot

You want Zotero.

It's for bibliographies, but it also archives and stores web pages locally with a browser integration.

_huayra_

I frankly don't know how I'd collect any useful info without it.

I'm sure there are bookmark services that also allow notes, but having the tagging, links between related things, etc. all in one app is awesome, plus the ability to export BibTeX when writing a paper!

nonethewiser

A reference is a bet on continuity.

At a fundamental level, broken website links and dangling pointers in C are the same.

taeric

That assumption isn't true of any sources? Things flat out change. Some literally, others more in meaning. Some because they are corrected, but there are other reasons.

Not that I don't think there is some benefit in what you are attempting, of course. A similar thing I still wish I could do is to "archive" someone's phone number from my contact list. Be it a number that used to be ours, or family/friends that have passed.

mananaysiempre

May be worth cooperating with ArchiveTeam’s project[1] on Goo.gl?

> url shortening was a fucking awful idea[2]

[1] https://wiki.archiveteam.org/index.php/Goo.gl

[2] https://wiki.archiveteam.org/index.php/URLTeam

tech234a

Real-time status for that project indicates 7.5 billion goo.gl URLs found out of 42 billion goo.gl URLs scanned: https://tracker.archiveteam.org:1338/status

MallocVoidstar

IIRC ArchiveTeam were bruteforcing Goo.gl short URLs, not going through 'known' links, so I'd assume they have many/all of Compiler Explorer's URLs. (So, good idea to contact them)

s17n

URLs lasting forever was a beautiful dream but in reality, it seems that 99% of URLs don't in fact last forever. Rather than endlessly fighting a losing battle, maybe we should build the technology around the assumption that infrastructure isn't permanent?

nonethewiser

>maybe we should build the technology around the assumption that infrastructure isn't permanent?

Yes. Also not using a url shortener as infrastructure.

dreamcompiler

URNs were supposed to solve that problem by separating the identity of the thing from the location of the thing.

But they never became popular and then link shorteners reimplemented the idea, badly.

https://en.m.wikipedia.org/wiki/Uniform_Resource_Name

hoppp

Yes.

Domain names often change hands, and a URL that is supposed to last forever can turn into a malicious phishing link over time.

emaro

In theory a content-addressed system like IPFS would be the best: if someone online still has a copy, you can get it too.

mananaysiempre

It feels as though, much like cryptography in general reduces almost all confidentiality-adjacent problems to key distribution (which is damn near unsolvable in large uncoordinated deployments like Web PKI or PGP), content-addressable storage reduces almost all data-persistence-adjacent problems to maintenance of mutable name-to-hash mappings (which is damn near unsolvable in large uncoordinated deployments like BitTorrent, Git, or IP[FN]S).

immibis

Note that IPFS is now on the EU Piracy Watchlist which may be a precursor to making it illegal.

jjmarr

URLs identify the location of a resource on a network, not the resource itself, and so are not required to be permanent or unique. That's why they're called "uniform resource locators".

This problem was recognized in 1997 and is why the Digital Object Identifier was invented.

creatonez

There's something poetic about abusing a link shortener as a database and then later having to retrieve all your precious links from random corners of the internet because you've lost the original reference.

rs186

Shortening long URLs is the intended use case for a ... URL shortener.

The real abusers are the people who use a shortener to hide scam/spam/illegal websites behind a common domain and post it everywhere.

creatonez

These are not just "long URLs". These are URLs where the entire content is stored in the fragment suffix of the URL. They are blobs, and always have been.
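
As an illustration only (not Compiler Explorer's actual encoding), packing content into a fragment amounts to compressing the state and making it URL-safe:

    # hypothetical: stuff editor state into the URL fragment as a blob
    $ state='{"source":"int main() { return 0; }","compiler":"gcc"}'
    $ frag=$(printf '%s' "$state" | gzip -9 | base64 | tr -d '\n' | tr '+/' '-_')
    $ echo "https://example.invalid/#${frag}"

Browsers never send the fragment to the server, so the whole program really does live in the link itself; a shortener that maps a short code to that long URL ends up holding the only other copy.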

nonethewiser

Didn't they just use the link shortener to compress the URL? They used the URL itself as the "database" (i.e. holding the compiler state).

Arcuru

They didn't store anything themselves since they encoded the full state in the urls that were given out. So the link shortener was the only place where the "database", the urls, were being stored.

nonethewiser

Yeah, but the purpose of the URL shortener was not to store the data, it was to shorten the URL. The fact that the data was persisted on Google's server somewhere is incidental.

In other words, every shortened URL is "using the URL shortener as a database" in that sense. Taking a URL with a long query parameter and running it through a URL shortener does not constitute "abusing a link shortener as a database."

amiga386

https://killedbygoogle.com/

> Google Go Links (2010–2021)

> Killed about 4 years ago, (also known as Google Short Links) was a URL shortening service. It also supported custom domain for customers of Google Workspace (formerly G Suite (formerly Google Apps)). It was about 11 years old.

zerocrates

"Killing" the service in the sense of minting new ones is no big deal and hardly merits mention.

Killing the existing ones is much more of a jerk move. Particularly so since Google is still keeping it around in some form for internal use by their own apps.

ruune

Don't they use https://g.co now? Or are there still new internal goo.gl links created?

Edit: Google uses a g.co link on the "Your device is booting another OS" screen that appears when booting up my Pixel running GrapheneOS. It will be awkward when they kill that service and the hard-coded link in the phone's firmware is just dead.

zerocrates

Google Maps creates "maps.app.goo.gl" links; I don't know if there are others, they called Maps out specifically in their message.

Possibly those other ones are just using the domain name and the underlying service is totally different, not sure.

olalonde

> This article was written by a human, but links were suggested by and grammar checked by an LLM.

This is the second time today I've seen a disclaimer like this. Looks like we're witnessing the start of a new trend.

tester756

It's crazy that people feel that they need to put such disclaimers

actuallyalys

It makes sense to me. After seeing a bunch of AI slop, people started putting no AI buttons and disclaimers. Then some people using AI for little things wanted to clarify it wasn’t AI generated wholesale without falsely claiming AI wasn’t involved at all.

layer8

It’s more a claimer than a disclaimer. ;)

danadam

I'd probably call it "disclosure".

psychoslave

This comment was written by a human with no check by any automaton, but how will you check that?

acquisitionsilk

Business emails, other comments here and there of a more throwaway or ephemeral nature - who cares if LLMs helped?

Personal blogs, essays, articles, creative writing, "serious work" - please tell us if LLMs were used, if they were, and to what extent. If I read a blog and it seems human and there's no mention of LLMs, I'd like to be able to safely assume it's a human who wrote it. Is that so much to ask?

qingcharles

That's exactly what a bot would say!

chii

i don't find the need to have such a disclaimer at all.

If the content can stand on its own, then it is sufficient. If the content is slop, then why does it matter whether it is ai-generated slop or human-generated slop?

The only reason anyone wants to know/have the disclaimer is if they cannot themselves discern the quality of the contents, and are using ai generation as a proxy for (bad) quality.

johannes1234321

For the author it matters. To which degree do they want to be associated with the resulting text.

And I differentiate between "Matt Godbolt" who is an expert in some areas and in my experience careful about avoiding wrong information and an LLM which may produce additional depth, but may also make up things.

And well, "discern the quality of the contents" - I often read texts to learn new things. On new things I don't have enough knowledge to qualify the statements, but I may have experience with regards to the author or publisher.

chii

and what do you do to make this differentiation if what you're reading is a scientific paper?

layer8

I find it somewhat surprising that it’s worth the effort for Google to shut down the read-only version. Unless they fear some legal risks of leaving redirects to private links online.

actuallyalys

Hard to say from the outside, but it's possible the service relies on some outdated or insecure library, runtime, service, etc. that they want to stop running. Although frankly, it seems just as possible that it's a trivial expense and they're cutting it because it's still a net expense, goodwill and past promises be damned.

Scaevolus

Typically services like these are side projects of just a few Google employees, and when the last one leaves they are shut down.

mbac32768

yeah but nobody wants to put "spent two months migrating goo.gl url shortener to work with Sisyphus release manager and Dante 7 SRE monitoring" in their perf packet

that's a negative credit activity

mmooss

Another possibility is that it's a distraction - whatever the marginal costs, there's a fixed cost to each system in terms of cognitive overhead, if not documentation, legal issues (which can change as laws and regulations change), etc. Removing distractions is basic management.

2YwaZHXV

Presumably there's no way to get someone at Google to query their database and find all the shortened links that go to godbolt.org?

wrs

I hate to say it, but unless there’s a really well-funded foundation involved, Compiler Explorer and godbolt.org won’t last forever either. (Maybe by then all the info will have been distilled into the 487 quadrillion parameter model of everything…)

mattgodbolt

We've done alright so far: 13 years this week. I have funding for another year and change even assuming growth and all our current sponsors pull out.

I /am/ thinking about a foundation or similar though: the single point of failure is not funding but "me".

badmintonbaseba

Well, that's true, but at least now Compiler Explorer links will only stop working when Compiler Explorer itself vanishes, not before.

I think the most valuable long-living compiler explorer links are in bug reports. I like to link to compiler explorer in bug reports for convenience, but I also include the code in the report itself, and specify what compiler I used with what version to reproduce the bug. I don't expect compiler explorer to vanish anytime soon, but making bug reports self-contained like this protects against that.

layer8

Thanks to the no-hiding theorem, the information will live forever. ;)

swyx

idk man how can URLs last forever if it costs money to keep a domain name alive?

i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.

johannes1234321

Historians however would love to have more garbage from history, to get more insights on "real" life rather than just the parts one considered worth keeping.

If I could time jump, it would be interesting to see how historians in a thousand years will look back at our period, in which a lot of information will just disappear without a trace as digital media rots.

swyx

we'd keep the curiosities around, like so much Ea Nasir Sells Shit Copper. we have room for like 5-10 of those per century. not like 8 billion. much of life is mundane.

johannes1234321

Yes, and at the same time we'd be excited about more mundane sources from history. The legends about the mighty are interesting, but what do we actually know about everyday life for people a thousand years ago? Very little. Most of it is speculation based on objects (tools etc.), the structure of buildings, and so on. If we go back just a few hundred years there is (from a European perspective) a somewhat interesting source in court cases from legal conflicts between "average" people, but in older times more or less all written material is about the powerful, be it worldly or religious power, and it often describes the rulers in an extra-positive way (from their own perspective) and their opponents as extra weak.

Having more average sources certainly helps, and we aren't good judges now of what will be relevant in the future. We can only try to keep some of everything.

woodruffw

> much of life is mundane.

The things that make (or fail to make) life mundane at some point in history are themselves subjects of significant academic interest.

(And of course we have no way to tell what things are "curiosities" or not. Preservation can be seen as a way to minimize survivorship bias.)

cortesoft

Today’s mundane is tomorrow’s fascination

shakna

We also have rooms full of footprints. In a thousand years, your mundane is the fascination of the world.

rightbyte

Imagine being judged thousands of years later by some Yelp reviews, like poor Ea-Nasir.

mrguyorama

I regularly wonder whether modern educated people journal less than the educated people of previous centuries, who were kind of rare.

Maybe we should get a journaling boom going.

But it has to be written, because pen and paper is literally ten times more durable than even good digital storage.

swyx

> pen and paper is literally ten times more durable than even good digital storage.

citation needed lol. data replication >>>> paper's single point of failure.

internetter

> i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.

agreed. I wrote some thoughts on this a while back: https://boehs.org/node/internet-evanescence

sedatk

Surprisingly, purl.org URLs still work after a quarter century, thanks to Internet Archive.

sebstefan

>Over the last few days, I’ve been scraping everywhere I can think of, collating the links I can find out in the wild, and compiling my own database of links – and importantly, the URLs they redirect to. So far, I’ve found 12,000 links from scraping:

>Google (using their web search API)

>GitHub (using their API)

>Our own (somewhat limited) web logs

>The archive.org Stack Overflow data dumps

>Archive.org’s own list of archived webpages

You're an angel Matt