FOSS infrastructure is under attack by AI companies
570 comments
·March 20, 2025ericholscher
pjc50
> just burning their goodwill to the ground
AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.
UncleMeat
Yep. And it is much more far reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need good will when they own everything?
"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.
rectang
> Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet.
And this is why AI training is not "fair use". The AI companies seek to train models in order to compete with the authors of the content used to train the models.
A possible eventual downfall of AI is that the risk of losing a copyright infringement lawsuit is not going away. If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
maaaaattttt
I have this line of thought as well but then I wonder, if we are all out of jobs and out of substantial capital to spend, how do these owners make money ultimately? It's a genuine question and I'm probably missing something obvious. I can see a benevolant/post-scarcity spin to this but the non-benevolant one seems self defeating.
pkdpic
And how could they possibly base their actions on good when their technology is more important than fire? History is depending on them to do everything possible to increase their market cap.
chii
> remake the entire world into one where the owners of these companies own everything and are completely unconstrained
how has this been any different from the past 10,000 years of human conquest and domination?
DrillShopper
> The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained.
I agree with you in the case of AI companies, but the desire to own everything an bee completely unconstrained is the dream of every large corporation.
outside1234
The thing is that this will be their destruction as well. If workers don't have any money (because they don't have jobs), nobody can afford what the owners have to sell?
yubblegum
They are also gutting the profession of software engineering. It's a clever scam actually: to develop software a company will need to pay utility fees to A"I" companies and since their products are error prone voila use more A"I" tools to correct the errors of the other tools. Meanwhile software knowledge will atrophy and soon ala WALE we'll have software "developers" with 'soft bones' floating around on conveyed seats slurping 'sugar water' and getting fat and not knowing even how to tie their software shoelaces.
davidmurdoch
We, the people, might need to come up with a few proverbial tranquilizer guns here soon
Sharlin
Maxim 1: "Pillage, then burn."
Coffeewine
Another Schlock Mercenary fan? Or does this adage have many adherents?
ferguess_k
That's pretty much what our future would look like -- you are irrelevant. Well I mean we are already pretty much irrelevant nowadays, but the more so in the "progressive" future of AI.
anthk
AI tarpits && lim (human curated contant/mediocre AI answers -> 0) = AI's crumbling into dust by themselves.
bbarnett
Yes, like the Pixel camera app, which mangles photos with AI processing, and users complain that it won't let people take pics.
One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.
Which is what AI is, really.
asveikau
Rules and laws are for other people. A lot of people reading this comment having mistaken "fake it til you make it" or "better to not ask permission" for good life advice are responsible for perpetrating these attitudes, which are fundamentally narcissistic.
lgeek
> One crawler downloaded 73 TB of zipped HTML files in May 2024 [...] This cost us over $5,000 in bandwidth charges
I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.
Ray20
I think uncontrolled price of cloud traffic - is a real fraud and way bigger problem then some AI companies that ignore robot.txt. One time we went over limit on Netlify or something, and they charged over thousand for a couple TB.
Suppafly
>which I then emailed 3x and never got a reply.
Send a bill to their accounts payable team instead.
ldoughty
Detect AI scraper and inject an in-page notice that by continuing they accept your terms of use.
Terms of use charges them per page load in some terminology of abuse.
Profit... By sending them invoices :-)
TuringNYC
>> which I then emailed 3x and never got a reply.
At which point does the crawling cease to be a bug/oversight and constitute a DDOS?
ferguess_k
Maybe just feed them dynamically generated garbage information? More fun than no information.
gnz11
OP’s linked blog post mentioned they got hit with a large spike in bandwidth charges. Sending them garbage information costs money.
ferguess_k
Yeah you have a point, hmmm, wish there were a way to somehow generate those garbages with minimum bandwidth. Something like, I can send you a very compressed 256 bytes of data which expands to something like 1 mega bytes.
Steltek
Tarpit instead? Trickle out a dead end response (no links) at bytes-per-second speeds until the bot times out.
InfamousRece
It does not even have to be dynamically generated. Just pre-generate a few thousand static pages of AI slop and serve that. Probably cheaper than dynamic generation.
m463
I kind of suspect some of these companies probably have more horsepower and bandwidth in one crawler than a lot of these projects have in their entire infrastructure.
spenczar5
Thanks for writing about this. Is it clear that this is from crawlers, as opposed to dynamic requests triggered by LLM tools, like Claude Code fetching docs on the fly?
Freebytes
Along with having block lists, perhaps you could add poison to your results that generates random bad code that will not work, and that is only seen by bots (display: none when rendered), and the bots will use it, but a human never would.
aspir
Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been for 10+ years https://www.fastly.com/fast-forward (disclaimer, I work for Fastly and help with this program)
Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.
Aachen
I've been running into bot detection on at least five different websites in the past two months (not even including captcha walls)
Not sure what to tell you but I surely feel quite human
Three of the pages told me to contact customer support and the other two were a hard and useless block wall. Only from Codeberg did I get a useful response, the other two customer supports were the typical "have you tried clearing your cookies" and restart the router advice — which is counterproductive because cookie tracking is often what lets one pass. Support is not prepared to deal with this, which means I can't shop at the stores that have blocking algorithms erroneously going off. I also don't think any normal person would ever contact support, I only do it to help them realise there's a problem and they're blocking legitimate people from using the internet normally
Beware if you employ this...
RVuRnvbM2e
Were the walls you hit caused by Fastly's bot detection? I've found it to be quite accurate.
On the other hand CloudFlare and Akamai mistakenly block me all the damn time.
Aachen
It's not like they say, but it's at least three different implementations and I don't think any were cloudflare because I've been running into those pages for years and they've got captchas (functional or not). One of them was Akamai I think indeed
999900000999
To be fair,
>I'm Not a Robot (film) https://en.m.wikipedia.org/wiki/I%27m_Not_a_Robot_(film)
Aachen
Oh my, a Dutch film that actually sounds good?! I get to watch a movie that's originally in my native language for perhaps the second time in my life, thanks for linking this :D
Edit: and it's on YouTube in full! Was wondering which streaming service I'd have to buy for this niche genre of Dutch sci-fi but that makes life easy: https://www.youtube.com/watch?v=4VrLQXR7mKU
Final update: well, that was certainly special. Favorite moment was 10:26–10:36 ^^. Don't think that comes fully across in the baked-in subtitles in English though. Overall it could have been an episode of Dark Mirror, just shorter. Thanks again for the tip :)
xena
It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.
diggan
Nice work :)
One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?
I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the Gnome issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on, I initially thought I got blocked or that the captcha failed to load.
Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.
xena
> One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?
ranger_danger
also if you're using JShelter, which blocks Worker by default, there is no indication that it's never going to work, and the spinner just goes on forever doing nothing
clvx
I really like this. I don't mind Internet acting like the Wild Wild West but I do mind there's no accountability. This is a nice way to pass the economic burden to the crawlers for sites who still want to stay freely available. You want the data, spend money on your side to get it. Even though the downside is your site could be delisted from search engines, there's no reason why you cannot register your service in a global or p2p indexer.
lukan
"why you cannot register your service in a global or p2p indexer"
Network effects anyone? So yes, we should work on a different way of indexing the web again, than via google, but easier said than done I think ..
null
isoprophlex
Loving it, great work as always.
Also
> https://news.ycombinator.com/item?id=43422781
Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have the another actually legitimately useful application of cryptocurrencies on our hands..!
reginald78
Maybe I'm missing something, but doesn't this mean the work has to be done by the client AND the server every time a challenge is issued? I think ideally you'd want work that was easy for the server and difficult for the server. And what is to stop being DDoS'd by clients that are challenged but neglect to perform the challenge?
Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.
client
xena
The magic of proof of work is that it's something that's really hard to do but easy to validate. Anubis' proof of work works like this:
A sha256 hash is a bunch of bytes like this:
394d1cc82924c2368d4e34fa450c6b30d5d02f8ae4bb6310e2296593008ff89f
We usually write it out in hex form, but that's literally what the bytes in ram look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this: await sha256(`${challenge}${nonce}`);
The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or ram is always always always going to be so much cheaper than your webapp ever will be.
I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.
I'm shocked that this all works so well and I'm so glad to see it take off like it has.
k1tanaka
I am sorry if this question is dumb, but how does proof of work deter bots/scrappers from accessing a website?
I imagine it costs more resource to access the protected website but would this stop the bots? Wouldn't they be able to pass the challenge and scrap the data after? Or normal scrapbots usually timeout after a small amount of time/ resources is used?
diggan
> I think ideally you'd want work that was easy for the server and difficult for the server.
That's exactly how it works (easy for server, hard for client). Once the client completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge, it only needs to validate that the results checks out.
Similar to how in Proof-of-Work blockchains where coming up with the block hashes is difficult, but validating them isn't nearly as compute-intensive.
This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work
Fun fact: it seems Proof-of-Work was used as a DoS preventing technique before it was used in Bitcoin/blockchains, so seems we've gone full circle :)
namaria
I think going full circle would be something like bitcoin being created on top of DoS prevention software and then eventually DoS prevention starting to use bitcoin. A tool being used for something than something else than the first something again is just... nothing? Happens all the time?
vhcr
Anubis is only going to work as long as it doesn't gets famous, if that happens crawlers will start using GPUs / ASICs for the proof of work and it's game over.
bashfulpup
The entire reason bots are so agressive is because they are cheap to run.
If a GPU was required per scrape then >90% simply couldn't afford it at scale.
GaggiX
The AI anime girl has 6 fingers btw, combating AI bot with AI girls.
Edit: I will probably send a pull request to fix it.
xena
I'm commissioning an artist to make better assets. These are the placeholders that I used with the original rageware implementation. I never thought it would take off like this!
knowaveragejoe
I love that I seem to stumble upon something by you randomly every so often. I'd just like to say that I enjoy your approach to explanations in blog form and will look further into Anubis!
brushfoot
At this rate, it's more than FOSS infrastructure -- although that's a canary in the coalmine I especially sympathize with -- it's anonymous Internet access altogether.
Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.
What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.
Freak_NL
We're half way there already. It always hits me whenever I am doing some mapping for OpenStreetMap and I'm looking up local businesses without their own internet presence. They use Facebook, Instagram, X, etc. for their digital calling card. I normally don't use Facebook (or Instagram, and gave up on X) and have no account there, and every time I follow one of those links, you get some info, and then you get a dialogue screen telling you to make an account or get lost, or you just get some obscure error.
I don't mind registering an account for private communities, but for stuff which people put up thinking it is just going to be publicly visible it's really annoying.
yurishimo
> ... but for stuff which people put up thinking it is just going to be publicly visible ...
I don't think these business owners really understand. Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.
I agree with you that it is extremely frustrating.
Suppafly
>Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.
The people without a basic internet presence aren't likely to be customers anyway so it's not a huge loss. It's trivial to setup a basic account for any site that doesn't contain any personal data you want to keep hidden, if you aren't willing to do that, you're in a tiny minority.
photonthug
> What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.
Just to say the quiet part out loud here.. one of the biggest reasons this is depressing is that it's not only vandalism but actually vandalism with huge compounding benefits for the assholes involved and grabbing the content is just the beginning. If they take down the site forever due to increasing costs? Great, because people have to use AI for all documentation. If we retreat from captcha and force people to put in credit cards or telephone numbers? Great, because the internet is that much less anonymous. Data exfiltration leads to extra fraud? Great, you're gonna need AI to combat that. It's all pretty win-win for the bad actors.
People have discussed things like the balkanization of the internet for a long time now. One might think that the threat of that and/or the fact that it's such an unmitigated dumpster fire already might lead to some caution about making it worse! But pushing the bounds of harassment and friction that people are willing to deal with is moot anyway, because of course they have no real choice in the matter.
clvx
For your next X requests you need to process these many tokens. I mean that sounds utopian for sure.
vhcr
That already existed a few years ago, Coinhive, and everybody hated it.
noosphr
You don't need an authorization wall to have your stuff behind, you can just as easily use an anonymous micropayment service for each request.
That we live in an internet where getting too many visitors is an existential crisis for websites should tell you that our internet is not one that can survive long.
nonchalantsui
Are there any reputable anonymous micropayment services?
danaris
I dunno. I run a small browser-game, and while my server has been periodically getting absolutely pulverized by LLM scrapers, I have yet to see a single new account that looks remotely like it was created by a bot. (Also, the rate of new signups hasn't changed notably.) This is true for both the game and its Wiki—which is where most of the scraping traffic has been. (And which I will almost certainly have to set to be almost-completely authwalled if the scraping doesn't let up.)
Cthulhu_
Back when search engines caused this, the industry made an agreement and designed the robots.txt spec in order to avoid legal frameworks being made to stop them. Because of that, legal frameworks weren't being made.
Now there's a new generation of hungry hungry hippo indexers that didn't agree to that and who feel intense pressure from competition to scoop up as much data as they can, who just ignore it.
Legislation should have been made anyway, and those that ignore robots.txt blocked / fined / throttled / etc.
bitmasher9
I’m not sure that I like this plan. We shouldn’t let the illegal AIs gain more knowledge and usefulness than the legal ones.
There’s other options besides a blanket ban.
aqfamnzc
As it is now, unethical AIs have a huge advantage over ethical ones.
tremon
Unethical behaviour always has a huge advantage over ethical behaviour, that's nothing new and pretty much by definition. The only way to prevent a race to the bottom is to make the unethical behaviour illegal or unprofitable.
chneu
I'd be shocked if a single member of the US House or Congress know what robots.txt is.
diggan
I'm pleased to declare you shocked: https://www.google.com/search?q=site%3Acongress.gov+%22robot...
skyyler
I don't know if "robots.txt" appearing in congressional record really counts. Do any of the decision makers appear to have a command of what the file does? Or do they typically relegate to industry professionals, as they often do?
TheRealPomax
How would legislation in the US or EU stop traffic from China or Thailand or Russia? At best you'd be fragmenting the internet, which isn't really a "best", that's a terrible idea.
j2kun
This is the key point, but if US laws are being violated and AI is considered part of national security, that could be used by the US government in international negotiations, and for justification for sanctions, etc. It would be a good deterrent.
Ndymium
I was also under attack recently [0]. The little Forgejo instance where I host my code (of several open source packages so it needs to be open) was run into the ground and the disk was filled with generated zip archives. I'm not the only one who has suffered the same fate. For me, the attacks subsided (for now) when I banned Alibaba Cloud's IP range.
If you are hosting a Forgejo instance, I strongly recommend setting DISABLE_DOWNLOAD_SOURCE_ARCHIVES to true. The crawlers will still peg your CPU but at least your disk won't be filled with zip files.
zoobab
"disk was filled with generated zip archives"
That's bad software design to generate ZIP files on the fly.
abound
They could very well just be temp files, I know the Go standard library will write large multi-part file uploads to temp disk, for example.
It'd be better to totally stream it of course, but that's not always an option for one reason or another.
Ndymium
They're deleted by default every 24 hours and that time is configurable. Not useful when you get 60 requests per second though.
diggan
> They're deleted by default every 24 hours
Hm, so it's a cache then? Requesting the same tarball 100 times shouldn't create 100 zip files if they're cached, and if they aren't cached they shouldn't fill up the disk.
devit
Or perhaps switch to well-engineered software actually properly designed to be served on the public Internet.
Clearly generating zip files, writing them fully to disk and then sending them to the client all at once is a completely awful and unusable design, compared to the proper design of incrementally generating and transmitting them to the client with minimal memory consumption and no disk usage at all.
The fact that such an absurd design is present is a sign that most likely the developers completely disregarded efficiency when making the software, and it's thus probably full of similar catastrophic issues.
For example, from a cursory look at the Forgejo source code, it appears that it spawns "git" processes to perform all git operations rather than using a dedicated library and while I haven't checked, I wouldn't be surprised if those operations were extremely far from the most efficient way of performing a given operation.
It's not surprising that the CPU is pegged at 100% load and the server is unavailable when running such extremely poor software.
Ndymium
Just noting that the archives are written to disk on purpose, as they are cached for 24 hours (by default). But when you have a several thousand commit repository, and the bots tend to generate all the archive formats for every commit…
But Forgejo is not the only piece of software that can have CPU intensive endpoints. If I can't fence those off with robots.txt, should I just not be allowed to have them in the open? And if I forced people to have an account to view my packages, then surely I'd have close to 0 users for them.
devit
Well then such a cache needs obviously to have limit to the disk space it uses and some sort of cache replacement policy, since if one can generate a zip file for each tag, that means that the total disk space of the cache is O(n^2) where n is the disk usage of the git repositories (imagine a single repository where each commit is tagged and adds a new file of constant size), so unless one's total disk space is a million/billion times larger than the disk space used by the repositories, it's guaranteed to fill the disk without such a limit.
lelanthran
The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.
This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.
There's probably going to be a small lag as the current knowledge in the current LLMs dry up because no one can scrape the web in an automated fashion anymore.
It'll all burn down.
djha-skin
I actually envision Liapunov stability, like wolf and rabbit populations. In this scenario, we're the rabbits. Human content will increase when AI populations decrease, this providing more food for AI, which will then increase. This drowns out human expression, and the humans will grow quieter. This provides less fodder for the AI, and they decrease. This means less noise and the humans grow louder. The cycle repeats and nauseam.
GolfPopper
Until broken by the Butlerian Jihad, "Though shalt not make a machine in the likeness of the mind of man."
keyringlight
I've thought along similar lines for art, what ecological niches are there where AI can't participate, are harder to pull training data from or not economical, where humans can flourish.
bashfulpup
Anything we humans deem private in nature from other humans.
InfamousRece
If the logistic driving parameter is large enough it can also lead to complete chaos.
cle
IMO this was one of the real motives for Web Environment Integrity. Allow Google to index but nobody else.
We're kind of stuck between a rock and a hard place here. Which do you prefer, entrenched incumbents or affordable/open hosting?
lelandfe
I’m supremely confident that attestation will arrive in one form or another in the near future.
Anonymous browsing and potentially-malicious bots look identical. This was sort of OK up until now.
cle
Agreed, it seems inevitable. Unfortunately I think it will also result in further centralization & consolidation into a handful of "trusted" megacorps.
If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.
breckenedge
Google is already scraping your site and presenting answers directly in search results. If I cared about traffic (hence selling ad space), why would I want my site indexed by Google at all anymore? Lots of advertising-supported sites are going to go dark because only bots will visit them.
sgc
It will entrench established search engines even more if they have to move to auth-based crawling, so that the only crawlers will be those you invite. Most people will do this for google, bing, and maybe one or two others if there is a simple tool to do so.
ethan_smith
What about the next-gen of AI that would be able to signup autonomously? Even if implemented auth-walls everywhere right now, whats stopping the companies to get some real cheap labor to create accounts on websites and use them to scrape the content?
Is it going to become another race like the adblocker -> detect adblocker -> bypass adblocker detector and so on...?
nicce
> The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.
AI companies with best anti-captcha mechanics will win and will inject ads to LLM output in more sophisticated way.
renegat0x0
This cannot be further from the truth. Ad business is not going anywhere. It will grow even bigger.
OpenAI goes through initial cycle of enshittification. Google is too big right now. Once they establish dominance you will have to see 5 unskippable ads between prompts, even for paid plan.
I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.
Example 'search' project: https://rumca-js.github.io/search
nicce
The stated problem was about indexing, accessing content and advertising in that context.
> I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.
> Example 'search' project: https://rumca-js.github.io/search
That is not really solution. Since typical indexing still works for masses, your approach is currently unique. But in the end, bots will be capable of reading on web page context if human is capable on reading them. And we get back to the original problem where we try to detect bots from humans. It's the only way.
sir-alien
Can we not just have a whitelist for allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can provide a list of IP addresses that their crawlers will come from. Then simply just don't include major LLM providers like OpenAI
danieldk
How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.
Only if you operate on the scale of Cloudflare, etc. you can see which IP addresses are hitting a large number of servers in a short time span.
(I am pretty sure next they will hand out N free LLM requests per month in exchange of user machines doing the scraping if blocking gets more succesful.)
I fear the only solution in the end are CDNs, making visits expensive using challenges, or requiring users to log in.
regularfry
How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.
what
Google publishes the ip addresses that google bot uses. If someone claims to be google bot but is not from one of those addresses, it’s a fake.
Thorrez
The problem is many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'll have to ban humans.
prmoustache
I am pretty sure a number of crawlers are running inside mobile apps of mobile phone users so they can get residential ip pools.
ATechGuy
This is scary!
__MatrixMan__
You can have a whitelist for allowed users and ban everyone else by default, which I think is where this will eventually take us.
insane_dreamer
Or an open list of IPs that are identified as AI companies that is updated regularly and firewalls can be easily updated with? (Same idea as open source AV)
nonrandomstring
This sort of positive security model with behavioural analysis is the future. We need to get it built-in to Apache,Nginx,Caddy etc. The trick is spotting crawlers from users. It can be done though.
xyzal
In case anyone is interested in a tiny bit of sabotage, I am under the impression I managed to 'drown' true information on my microblog by generating contradicting posts with LLaMa (tens of them for each real post) and invisibly linking them, so a human would not click through.
You know, flood the zone with s***, Bannon-style ...
RobKohr
Yep, just make sure to add them to your robots.txt file so that only the bad robots are harmed.
ethan_smith
Temporary solution and only works if only some of us are doing it. What if these bots have a "manager" LLM agent that takes the decision on what pages to scrape?
kod
Besides flooding them with junk, what about outright sabotage in the form of serving zip bombs or other ways to waste computational resources?
knowaveragejoe
This is an approach I've seen used and I'm not sure what success it has had. But logically it seems sound: explicitly reference paths that no human would actually see - traffic hitting those paths are bots. They can't help themselves.
jbk
VideoLAN here.
Same for us, our forum and our Gitlab are getting hammered by AI companies bots.
Most of them don’t respect robots.txt…
greybox
Insane, I wonder if we eventually end up with a non-search-engine indexed version of the web that's more like browsing in the 90s where websites just had to link to oneanother to get noticed . . . .
I love that the solution to LLM scraping is to serve the browser a proof of work, before they allow access - I wonder if things like new sites start to do this . . . It would mean they won't be indexed by search engines, but it would help to protect the IP
xena
Hi! I do this! See https://github.com/TecharoHQ/anubis for more info!
hmry
I hope lots of websites adopt this, mainly because I want to see more happy jackal girls while browsing.
xena
My monetization strategy is unironically to offer a de-anime'd version under the name Techaro BotStopper or something.
noveltyaccount
Does the PoW make money via crypto mining? Or is it just to waste the caller's CPU cycles? If you could monetize the PoW then you could re-challenge at an interval tuned so that the caller pays for their usage.
xena
It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20 foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.
greybox
thank you for your contribution to society!
corytheboyd
I never thought about it until now, but it’s insane that the companies who offer both LLM products and cloud compute services are double dipping— they get the LLM product to sell, as well as the elevated load egress (and compute, etc.) money. When you look at it that way, where’s the incentive to even care about inefficient LLM scraping, leaving it terrible makes you money from your other empire, cloud storage egress costs.
nzeid
We need a project in the spirit of Spamhaus to actively maintain a list of perpetrating IPs. If they're cycling through IPs and IP blocks I don't know how sustainable a CAPTCHA-like solution is.
mrweasel
Just block all of AWS, Alibaba, GCP and Azure, or throttle them aggressively. If you have clients/customers that need more requests per second then have them provide you with their IPs.
The problem is that these companies are fairly well funded and renting infrastructure isn't an issue.
kijin
Exactly. They're renting infrastructure on well-known clouds, not cycling through consumer IPs like yesterday's botnets. Block all web traffic from well-known cloud IPs, and you can keep 99% of the LLM bots away. Alibaba seems to be the most common source of bot traffic on my infrastructure lately, and I also see Huawei Cloud from time to time. Not much AWS, probably because of their high IPv4 pricing.
You can allow API access from cloud IPs, as long as you don't do anything expensive before you've authenticated the client.
cgh
From the article:
“…they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses - mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure - actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.”
So it looks like much of the traffic, particularly from China, is indeed using consumer ips to disguise itself. That’s why they blocked based on browser type (MS Edge, in this case).
blueflow
Why only in the "spirit of Spamhaus"? Spamhaus still exists. Add Google and Microsoft AS to the DROP/NOROUTE list, that would be hilarious.
danaris
Because while this is clearly related to spam, it's not the same thing, and presumably if Spamhaus themselves felt it was within their wheelhouse, they'd already be doing it.
null
voidUpdate
This sounds backwards to me, if you maintain a list of IPs but they are constantly cycling them, it'll get out of date quickly, but a captcha-like system will (hopefully) always stop bot traffic
pavon
While some of the residential IPs are from malware, a lot of it is from residential IP proxies, where people are paid to run proxy software from their home. If it starts getting around that people who run this software quickly become blocked by the majority of the internet that will lessen that part of the problem.
nzeid
Only if your CAPTCHA-like is hurled at every client indiscriminately. Otherwise you'll end up right back where Spamhaus started: maintaining your own list of good and bad actors.
The advantage of a third party service is that you're sharing intel of bad actors.
voidUpdate
I can't confirm but I believe it is applied to every client
Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know has a similar story who is running large internet infrastructure -- this post does a great job of rounding a bunch of them up in 1 place.
I called it when I wrote it, they are just burning their goodwill to the ground.
I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.