Proof-of-work to protect lore.kernel.org and git.kernel.org against AI crawlers

83 comments

April 2, 2025

cowboylowrez

>Difficulty is set at 4 leading zeroes, unless you're coming from US in which case there's also a tariff of 5 more leading zeroes.

isn't linux afraid of retaliatory tariffs? should I stock up on linuxes just in case? I've already beefed up toilet paper reserves.

perihelions

If they go overboard people will start switching to FreeTrade-BSD

BrenBarn

If tariffs are imposed on Antarctica, penguin prices will go up. This will in turn raise Linux prices because penguins are crucial mascot components in all Linux systems.

flowerthoughts

I haven't heard of penguins being deported by ICE or having their visas revoked, so I assume that at least the penguin tourism industry is still doing well in the US?

Of course, they're not allowed to work as mascots while touristing. wink wink

lionkor

Sanctioning Linux, due to its totalitarian government (bdfl)

notherhack

See https://news.ycombinator.com/item?id=43556521

wherein a hosting company sees AI bot scans that appear to be coming from millions of unique addresses, thousands of ASNs, many residential and often with a single connection from an IP. The AI bots are proxying through either hacked IoT devices or apps that pay people pennies to let their phone be used as a proxy.

Likely your proof of work will be distributed to the proxies. It'll just make millions of webcams and phones run a little hotter without slowing down the AI bots at all.

throwCPjuh9kR

I don't know about the implementation details, but there are a lot of cryptocurrency proof-of-work algorithms that require a lot of memory access, like Monero's RandomX. Those can't realistically be run on most underpowered devices.

perihelions

Any chance there's some way, going forwards, to dual-purpose these webserver PoW's, so they solve some socially beneficial compute problem at the same time? I recall reading ideas like that in the early days of cryptocurrency, before humans ruined it.

- Server: here's a bit of a cancer protein

- Client: okay, here's some compute

- Verifier: the compute checks out

- Server: okay, you are authorized to access cat.gif

xmcqdpt2

You could do something like a captcha, but for computers. Here are some molecules; find the ground state geometry of all of them. You include some you have already solved, just to root out whether the solver is actually solving or faking it.
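A minimal sketch of that spot-check idea, assuming each problem has a single numeric answer that can be compared within a tolerance. The function names, the problem labels, and the answer values are all hypothetical placeholders, not real chemistry and not any existing system's API.

```python
import random

def make_batch(unsolved, solved, n_checks=2, rng=random):
    """Mix a few already-solved problems into a batch of unsolved ones.

    `solved` maps problem -> known answer. Returns the shuffled batch and the
    list of planted problems, which the server keeps secret.
    """
    planted = rng.sample(list(solved), n_checks)
    batch = list(unsolved) + planted
    rng.shuffle(batch)
    return batch, planted

def looks_honest(answers, planted, solved, tolerance=1e-6):
    """Accept a submission only if every planted problem has the known answer."""
    for problem in planted:
        if problem not in answers:
            return False  # skipped a planted problem
        if abs(answers[problem] - solved[problem]) > tolerance:
            return False  # wrong answer on a problem we can check
    return True

# Placeholder data: the values are made up for illustration only.
solved = {"known-a": 1.0, "known-b": 2.0, "known-c": 3.0}
unsolved = ["new-1", "new-2", "new-3"]

batch, planted = make_batch(unsolved, solved, rng=random.Random(0))
honest = {p: solved.get(p, 0.5) for p in batch}   # gets planted answers right
faker = {p: 0.0 for p in batch}                   # returns junk for everything
print(looks_honest(honest, planted, solved), looks_honest(faker, planted, solved))
```

The client cannot tell which problems are planted, so faking the whole batch fails the check, while the answers to the unsolved problems are only as trustworthy as the spot-check rate allows.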

wizzwizz4

It's difficult to break important problems down into NP-hard problems. Search problems are, afaik, the current state-of-the-art; but to my knowledge "a bit of a cancer protein" isn't useful, and "an entire cancer protein" would take a few hours at least.

xnx

Very true. Perhaps a better system would be to credit "points" for solving fewer/larger problems that could be spent a bit at a time? That sounds even more complex than charging regular money though.

unsnap_biceps

It appears they're using https://github.com/TecharoHQ/anubis for the proof of work proxy

stevenhuang

I enjoyed their succinct project description:

> Weighs the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

Havoc

oh that's fantastic

xena

This is absolutely surreal to see in action! I hope that I can manage to afford to not have to do my dayjob anymore.

dharmab

Context for others: Xe is the author of the software used for this (https://anubis.techaro.lol/docs/)

tym0

Is it possible to solve the PoW "manually" (i.e. without browser JS execution) for personal use?

Like picking up the problem from the HTTP headers and returning it as a follow-up query.

Or is running a browser part of the PoW essentially?

xena

Yes, but I'm not going to implement that to avoid the implementation "leaking" and ossifying the current transitionary hack.

tym0

Cool, thanks. Will keep an eye on it.

I've not been able to read your blog with my personal news reader so I was hoping to implement that.

xxprogamerxy

I'm a bit skeptical that this will do the trick. These PoW challenges can be parallelized across different websites and may not be as off-putting as intended. Here's some quick back-of-the-napkin math:

DeepMind's MassiveText dataset was sourced from ~2.35B documents. A difficulty of 4 leading zeros requires an expected 16^4 SHA-256 hashes per site. Benchmarks [1] show an H100 at ~12k MH/s, meaning it would take just ~3.5 hours to solve for all 2.35B pages.

[1] https://gist.github.com/Chick3nman/e1417339accfbb0b040bcd0a0...
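The arithmetic above can be reproduced in a few lines. This is only a restatement of the comment's own figures, assuming one challenge per document and difficulty counted in hexadecimal digits (hence 16^4); the variable and function names are made up for the example.

```python
# Figures taken from the comment above; nothing here is measured.
documents = 2.35e9              # ~2.35B documents in MassiveText
hashes_per_challenge = 16 ** 4  # 4 leading hex zeros -> 65,536 expected hashes
hash_rate = 12_000e6            # ~12k MH/s of SHA-256 on one H100, per [1]

def total_hours(docs, per_challenge, rate):
    """Expected wall-clock hours for one device to solve every challenge."""
    return docs * per_challenge / rate / 3600

hours = total_hours(documents, hashes_per_challenge, hash_rate)
print(f"{hours:.2f} hours on a single GPU")
```

Each extra leading hex zero multiplies the result by 16, so the estimate is very sensitive to the difficulty setting.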

xena

SHA-256 is a hack to buy time. This will be replaced with something better. It would be faster for me to replace it if I didn't have to do my dayjob: https://patreon.com/cadey

xxprogamerxy

Not meant as a criticism of the project in general. I appreciate people working on this.

I'm curious, what other approaches are you currently considering? In my mind, all roads lead to rate-limiting identifiers with privacy through zk-proofs.

xena

I'm looking at using equi-x, but failing that I may unironically do protein folding.

chr15m

This is infinitely better than using CloudFlare. I hope it works and more people adopt it.

ranger_danger

This does not help against real DDoS attacks (which don't even speak HTTP most of the time) or full-browser headless bots, besides warming the planet more. It also only looks at Mozilla user agents (even though one of the reasons given for its development was bots changing user agents), so it's extremely easy to bypass. But solutions like CF's or similar are better tailored for anti-DDoS purposes where the threat is from massive amounts of bandwidth, not well-behaved AI crawler bots clogging up your logs.

And if your argument is that it helps DDoS by being a frontend proxy, well, you still need more bandwidth than the DDoS uses, in which case you could do this with a simple "click here" page just as easily.

But please prove me wrong if I've misunderstood something.

ToucanLoucan

Genuine question: How? Is there a downside to CloudFlare I'm not aware of?

Rebelgecko

Cloudflare will just straight up block me sometimes, with no way to see the page. For whatever reason this used to happen to me a lot with car dealer websites. Maybe checking lots of different dealerships' inventory looking for a specific car made me look like a bot.

And even in cases where Cloudflare forces a captcha, this PoW ran much more quickly than I could solve one by hand.

nosioptar

It was nearly instant on my shitty old phone.

chr15m

If you're not aware of the downsides I don't have time to explain them to you. If you genuinely want to know, 5 minutes research will give you answers.

thayne

Besides the downsides mentioned by others, Cloudflare heavily punishes anyone using a browser that isn't Chrome, especially if it is something other than Chrome/Safari/Edge/Firefox.

megous

It has been blocking me from contributing to any GitLab-hosted project for ~4 years already. I wanted to send a glib2 patch today and again realized that, no, I still can't sign up to CF-protected GitLab instances. :)

Makes me appreciate the Linux kernel's mailing-list-based contribution method. Very open, very simple.

At this point I guess CF will never fix compatibility bugs in their interstitial pages, and in captcha, with non-default setup of Firefox.

g-b-r

For what it's worth, ensuring that JIT is enabled for challenges.cloudflare.com can help a lot.

No, not to the point of making it bearable, but at least it becomes rarer for it to take minutes.

g-b-r

It routinely takes at least a minute overall on gitlab, from a budget phone.

Other sites with Cloudflare only take some nice twenty seconds, others just never ever let you go through.

Those checks are a serious contender for the worst thing that ever happened to the web.

skeptrune

I am really enjoying seeing this use-case for PoW gain popularity. Hopefully it normalizes the technique and it can start to become more common for anti-spam systems.

codetrotter

https://blog.torproject.org/introducing-proof-of-work-defens...

Tor has similarly been using Proof of Work as part of the defense for onion services for around a year and a half now.

I’ve also seen some clear net web sites that use PoW to slow down account creation. Some websites will even adjust the difficulty for individual visitors depending on the recent number of sign-ups coming from their IP block. More signups from an IP block -> higher PoW difficulty for anyone from that IP block -> fewer accounts created by anyone in that IP block over a span of time.

Sesse__

Why do you assume that spammers and AI crawlers do not have access to large amounts of compute? You can make it more expensive, but these crawlers already have made it clear that they do not care particularly about cost (or they would not crawl so completely indiscriminately).

arccy

um no? sending an http request is quite a bit different than some forced pow calculation

Sesse__

Why? Don't you think these companies can use Puppeteer or similar and just take that second or so of compute to get a cookie for lore.kernel.org?

charcircuit

Spammers are willing to dedicate more processing power than regular users. It doesn't make sense to do this. It's either meaningless or ruins the user experience for normal people.

losvedir

Regular users aren't trying to load billions of pages.

charcircuit

What's your point?

sva_

> Difficulty is set at 4 leading zeroes, unless you're coming from US in which case there's also a tariff of 5 more leading zeroes.

> You can see it in action on this recently decommissioned system I'm using for testing purposes: https://ams.source.kernel.org/

Something seriously wrong with it. When I run it with my normal German/EU home connection, it does ~17k iterations. When I run it with a US Atlanta VPN, it only takes ~6k iterations.

xena

It's luck-based, I'm working on making a check that's more deterministic, but I'm also trying to figure out how to not lock out big-endian systems in the process.

I may have to just give up on that though :(
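To illustrate the "luck-based" point: a toy solver, assuming the challenge is a string, the nonce is appended to it, and success means the SHA-256 hex digest starts with N zeros. That format is an assumption for the example, not Anubis's actual wire format.

```python
import hashlib
import statistics

def solve(challenge: str, difficulty: int = 4) -> int:
    """Return how many hashes it took to find a digest with `difficulty` leading hex zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce + 1  # nonces start at 0, so attempts = nonce + 1
        nonce += 1

# Difficulty 3 keeps the demo fast (about 16**3 = 4,096 attempts on average).
attempts = [solve(f"challenge-{i}", difficulty=3) for i in range(20)]
print("min", min(attempts), "median", statistics.median(attempts), "max", max(attempts))
```

Every attempt succeeds independently with probability 16^-N, so the attempt count follows a geometric distribution with mean 16^N (65,536 for N = 4) and a spread of about the same size. Under that model, ~17k on one run and ~6k on another are both ordinary draws and say nothing about the source country.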

Rebelgecko

I think that part was a joke

lionkor

I think OP, like me, wishes it wasn't (it would be very funny)

thayne

Not for those of us who live in the US. It would basically lock out real people in the US, while doing nothing to block bots, which could just use a different source ip.

Rebelgecko

I think 1 or 2 extra 0s would be funny but 5 seems excessive

sakras

Maybe I'm missing something, but why do people expect PoW to be effective against companies whose whole existence revolves around acquiring more compute?

xmcqdpt2

I was under the impression that the bad crawlers exist because it's cheaper to reload the data all the time than to cache it somewhere. If this changes the cost balance, those companies might decide to download only once instead of over and over again, which would probably be satisfactory to everyone.

kklisura

So, market/companies refused to regulate themselves (by adhering to the robots.txt) so we're now forced to innovate some solutions against them.

neurostimulant

I thought the Anubis author doesn't want you to remove the anime girl images? I guess kernel.org is exempted. gitlab.gnome.org still has the anime girl though.

https://anubis.techaro.lol/docs/funding

mariusor

Seeing as it's open-source under an MIT license, I would think that everyone is allowed to modify the source and do whatever they want with it.

The monetary payment for removal of the logo is just for companies that don't possess the know-how to do that and can instead pay a consulting fee.
