FOSS infrastructure is under attack by AI companies

630 comments

·March 20, 2025

ericholscher

Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know has a similar story who is running large internet infrastructure -- this post does a great job of rounding a bunch of them up in 1 place.

I called it when I wrote it, they are just burning their goodwill to the ground.

I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.

pjc50

> just burning their goodwill to the ground

AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.

UncleMeat

Yep. And it is much more far reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need good will when they own everything?

"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.

rectang

> Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet.

And this is why AI training is not "fair use". The AI companies seek to train models in order to compete with the authors of the content used to train the models.

A possible eventual downfall of AI is that the risk of losing a copyright infringement lawsuit is not going away. If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.

maaaaattttt

I have this line of thought as well but then I wonder, if we are all out of jobs and out of substantial capital to spend, how do these owners make money ultimately? It's a genuine question and I'm probably missing something obvious. I can see a benevolant/post-scarcity spin to this but the non-benevolant one seems self defeating.

pkdpic

And how could they possibly base their actions on good when their technology is more important than fire? History is depending on them to do everything possible to increase their market cap.

DrillShopper

> The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained.

I agree with you in the case of AI companies, but the desire to own everything an bee completely unconstrained is the dream of every large corporation.

chii

> remake the entire world into one where the owners of these companies own everything and are completely unconstrained

how has this been any different from the past 10,000 years of human conquest and domination?

outside1234

The thing is that this will be their destruction as well. If workers don't have any money (because they don't have jobs), nobody can afford what the owners have to sell?

yubblegum

They are also gutting the profession of software engineering. It's a clever scam actually: to develop software a company will need to pay utility fees to A"I" companies and since their products are error prone voila use more A"I" tools to correct the errors of the other tools. Meanwhile software knowledge will atrophy and soon ala WALE we'll have software "developers" with 'soft bones' floating around on conveyed seats slurping 'sugar water' and getting fat and not knowing even how to tie their software shoelaces.

bbarnett

Yes, like the Pixel camera app, which mangles photos with AI processing, and users complain that it won't let people take pics.

One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.

Which is what AI is, really.

anthk

AI tarpits && lim (human curated contant/mediocre AI answers -> 0) = AI's crumbling into dust by themselves.

davidmurdoch

We, the people, might need to come up with a few proverbial tranquilizer guns here soon

Sharlin

Maxim 1: "Pillage, then burn."

Coffeewine

Another Schlock Mercenary fan? Or does this adage have many adherents?

ferguess_k

That's pretty much what our future would look like -- you are irrelevant. Well I mean we are already pretty much irrelevant nowadays, but the more so in the "progressive" future of AI.

speerer

https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-domi...

links to this comment.

lgeek

> One crawler downloaded 73 TB of zipped HTML files in May 2024 [...] This cost us over $5,000 in bandwidth charges

I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.

Ray20

I think uncontrolled price of cloud traffic - is a real fraud and way bigger problem then some AI companies that ignore robot.txt. One time we went over limit on Netlify or something, and they charged over thousand for a couple TB.

joepie91_

> I think uncontrolled price of cloud traffic - is a real fraud

Yes, it is.

> and way bigger problem then some AI companies that ignore robot.txt.

No, it absolutely is not. I think you underestimate just how hard these AI companies hammer services - it is bringing down systems that have weathered significant past traffic spikes with no issues, and the traffic volumes are at the level where literally any other kind of company would've been banned by their upstream for "carrying out DDoS attacks" months ago.

Suppafly

>which I then emailed 3x and never got a reply.

Send a bill to their accounts payable team instead.

ldoughty

Detect AI scraper and inject an in-page notice that by continuing they accept your terms of use.

Profit... By sending them invoices :-)

dabockster

Honestly this is crazy enough to work. Bonus points if both you and the scraping company reside in the same state.

TuringNYC

>> which I then emailed 3x and never got a reply.

At which point does the crawling cease to be a bug/oversight and constitute a DDOS?

ferguess_k

Maybe just feed them dynamically generated garbage information? More fun than no information.

gnz11

OP’s linked blog post mentioned they got hit with a large spike in bandwidth charges. Sending them garbage information costs money.

ferguess_k

Yeah you have a point, hmmm, wish there were a way to somehow generate those garbages with minimum bandwidth. Something like, I can send you a very compressed 256 bytes of data which expands to something like 1 mega bytes.

Steltek

Tarpit instead? Trickle out a dead end response (no links) at bytes-per-second speeds until the bot times out.

https://en.wikipedia.org/wiki/Tarpit_(networking)

InfamousRece

It does not even have to be dynamically generated. Just pre-generate a few thousand static pages of AI slop and serve that. Probably cheaper than dynamic generation.

m463

I kind of suspect some of these companies probably have more horsepower and bandwidth in one crawler than a lot of these projects have in their entire infrastructure.

spenczar5

Thanks for writing about this. Is it clear that this is from crawlers, as opposed to dynamic requests triggered by LLM tools, like Claude Code fetching docs on the fly?

Freebytes

Along with having block lists, perhaps you could add poison to your results that generates random bad code that will not work, and that is only seen by bots (display: none when rendered), and the bots will use it, but a human never would.

xena

It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.

diggan

Nice work :)

One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?

I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the Gnome issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on, I initially thought I got blocked or that the captcha failed to load.

Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.

xena

> One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?

Will do! https://github.com/TecharoHQ/anubis/issues/25

ranger_danger

also if you're using JShelter, which blocks Worker by default, there is no indication that it's never going to work, and the spinner just goes on forever doing nothing

hartator

Maybe a progress bar?

xena

There's no way to really make a progress bar make sense, it's a luck-based mechanic.

clvx

I really like this. I don't mind Internet acting like the Wild Wild West but I do mind there's no accountability. This is a nice way to pass the economic burden to the crawlers for sites who still want to stay freely available. You want the data, spend money on your side to get it. Even though the downside is your site could be delisted from search engines, there's no reason why you cannot register your service in a global or p2p indexer.

lukan

"why you cannot register your service in a global or p2p indexer"

Network effects anyone? So yes, we should work on a different way of indexing the web again, than via google, but easier said than done I think ..

isoprophlex

Loving it, great work as always.

Also

> https://news.ycombinator.com/item?id=43422781

Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have the another actually legitimately useful application of cryptocurrencies on our hands..!

vhcr

Anubis is only going to work as long as it doesn't gets famous, if that happens crawlers will start using GPUs / ASICs for the proof of work and it's game over.

bashfulpup

The entire reason bots are so agressive is because they are cheap to run.

If a GPU was required per scrape then >90% simply couldn't afford it at scale.

xena

Author of Anubis here. If that happens, I win.

eb0la

If that happens, count with me to use Anubis to factor large primes or whatever science needs as a background task.

knowaveragejoe

I love that I seem to stumble upon something by you randomly every so often. I'd just like to say that I enjoy your approach to explanations in blog form and will look further into Anubis!

reginald78

Maybe I'm missing something, but doesn't this mean the work has to be done by the client AND the server every time a challenge is issued? I think ideally you'd want work that was easy for the server and difficult for the server. And what is to stop being DDoS'd by clients that are challenged but neglect to perform the challenge?

Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.

client

xena

The magic of proof of work is that it's something that's really hard to do but easy to validate. Anubis' proof of work works like this:

A sha256 hash is a bunch of bytes like this:

  394d1cc82924c2368d4e34fa450c6b30d5d02f8ae4bb6310e2296593008ff89f

We usually write it out in hex form, but that's literally what the bytes in ram look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this:

  await sha256(`${challenge}${nonce}`);

The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.

The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or ram is always always always going to be so much cheaper than your webapp ever will be.

I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.

I'm shocked that this all works so well and I'm so glad to see it take off like it has.

k1tanaka

I am sorry if this question is dumb, but how does proof of work deter bots/scrappers from accessing a website?

I imagine it costs more resource to access the protected website but would this stop the bots? Wouldn't they be able to pass the challenge and scrap the data after? Or normal scrapbots usually timeout after a small amount of time/ resources is used?

diggan

> I think ideally you'd want work that was easy for the server and difficult for the server.

That's exactly how it works (easy for server, hard for client). Once the client completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge, it only needs to validate that the results checks out.

Similar to how in Proof-of-Work blockchains where coming up with the block hashes is difficult, but validating them isn't nearly as compute-intensive.

This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work

Fun fact: it seems Proof-of-Work was used as a DoS preventing technique before it was used in Bitcoin/blockchains, so seems we've gone full circle :)

namaria

I think going full circle would be something like bitcoin being created on top of DoS prevention software and then eventually DoS prevention starting to use bitcoin. A tool being used for something than something else than the first something again is just... nothing? Happens all the time?

GaggiX

The AI anime girl has 6 fingers btw, combating AI bot with AI girls.

Edit: I will probably send a pull request to fix it.

xena

I'm commissioning an artist to make better assets. These are the placeholders that I used with the original rageware implementation. I never thought it would take off like this!

null

[deleted]

brushfoot

At this rate, it's more than FOSS infrastructure -- although that's a canary in the coalmine I especially sympathize with -- it's anonymous Internet access altogether.

Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.

What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.

Freak_NL

We're half way there already. It always hits me whenever I am doing some mapping for OpenStreetMap and I'm looking up local businesses without their own internet presence. They use Facebook, Instagram, X, etc. for their digital calling card. I normally don't use Facebook (or Instagram, and gave up on X) and have no account there, and every time I follow one of those links, you get some info, and then you get a dialogue screen telling you to make an account or get lost, or you just get some obscure error.

I don't mind registering an account for private communities, but for stuff which people put up thinking it is just going to be publicly visible it's really annoying.

yurishimo

> ... but for stuff which people put up thinking it is just going to be publicly visible ...

I don't think these business owners really understand. Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.

I agree with you that it is extremely frustrating.

Suppafly

>Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.

The people without a basic internet presence aren't likely to be customers anyway so it's not a huge loss. It's trivial to setup a basic account for any site that doesn't contain any personal data you want to keep hidden, if you aren't willing to do that, you're in a tiny minority.

photonthug

> What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.

Just to say the quiet part out loud here.. one of the biggest reasons this is depressing is that it's not only vandalism but actually vandalism with huge compounding benefits for the assholes involved and grabbing the content is just the beginning. If they take down the site forever due to increasing costs? Great, because people have to use AI for all documentation. If we retreat from captcha and force people to put in credit cards or telephone numbers? Great, because the internet is that much less anonymous. Data exfiltration leads to extra fraud? Great, you're gonna need AI to combat that. It's all pretty win-win for the bad actors.

People have discussed things like the balkanization of the internet for a long time now. One might think that the threat of that and/or the fact that it's such an unmitigated dumpster fire already might lead to some caution about making it worse! But pushing the bounds of harassment and friction that people are willing to deal with is moot anyway, because of course they have no real choice in the matter.

danaris

I dunno. I run a small browser-game, and while my server has been periodically getting absolutely pulverized by LLM scrapers, I have yet to see a single new account that looks remotely like it was created by a bot. (Also, the rate of new signups hasn't changed notably.) This is true for both the game and its Wiki—which is where most of the scraping traffic has been. (And which I will almost certainly have to set to be almost-completely authwalled if the scraping doesn't let up.)

clvx

For your next X requests you need to process these many tokens. I mean that sounds utopian for sure.

vhcr

That already existed a few years ago, Coinhive, and everybody hated it.

noosphr

You don't need an authorization wall to have your stuff behind, you can just as easily use an anonymous micropayment service for each request.

That we live in an internet where getting too many visitors is an existential crisis for websites should tell you that our internet is not one that can survive long.

nonchalantsui

Are there any reputable anonymous micropayment services?

Cthulhu_

Back when search engines caused this, the industry made an agreement and designed the robots.txt spec in order to avoid legal frameworks being made to stop them. Because of that, legal frameworks weren't being made.

Now there's a new generation of hungry hungry hippo indexers that didn't agree to that and who feel intense pressure from competition to scoop up as much data as they can, who just ignore it.

Legislation should have been made anyway, and those that ignore robots.txt blocked / fined / throttled / etc.

bitmasher9

I’m not sure that I like this plan. We shouldn’t let the illegal AIs gain more knowledge and usefulness than the legal ones.

There’s other options besides a blanket ban.

aqfamnzc

As it is now, unethical AIs have a huge advantage over ethical ones.

tremon

Unethical behaviour always has a huge advantage over ethical behaviour, that's nothing new and pretty much by definition. The only way to prevent a race to the bottom is to make the unethical behaviour illegal or unprofitable.

joepie91_

None of the AIs have any 'knowledge' to begin with, so that's an easy one to satisfy.

chneu

I'd be shocked if a single member of the US House or Congress know what robots.txt is.

diggan

I'm pleased to declare you shocked: https://www.google.com/search?q=site%3Acongress.gov+%22robot...

skyyler

I don't know if "robots.txt" appearing in congressional record really counts. Do any of the decision makers appear to have a command of what the file does? Or do they typically relegate to industry professionals, as they often do?

TheRealPomax

How would legislation in the US or EU stop traffic from China or Thailand or Russia? At best you'd be fragmenting the internet, which isn't really a "best", that's a terrible idea.

j2kun

This is the key point, but if US laws are being violated and AI is considered part of national security, that could be used by the US government in international negotiations, and for justification for sanctions, etc. It would be a good deterrent.

Ndymium

I was also under attack recently [0]. The little Forgejo instance where I host my code (of several open source packages so it needs to be open) was run into the ground and the disk was filled with generated zip archives. I'm not the only one who has suffered the same fate. For me, the attacks subsided (for now) when I banned Alibaba Cloud's IP range.

If you are hosting a Forgejo instance, I strongly recommend setting DISABLE_DOWNLOAD_SOURCE_ARCHIVES to true. The crawlers will still peg your CPU but at least your disk won't be filled with zip files.

[0] https://blog.nytsoi.net/2025/03/01/obliterated-by-ai

zoobab

"disk was filled with generated zip archives"

That's bad software design to generate ZIP files on the fly.

abound

They could very well just be temp files, I know the Go standard library will write large multi-part file uploads to temp disk, for example.

It'd be better to totally stream it of course, but that's not always an option for one reason or another.

Ndymium

They're deleted by default every 24 hours and that time is configurable. Not useful when you get 60 requests per second though.

diggan

> They're deleted by default every 24 hours

Hm, so it's a cache then? Requesting the same tarball 100 times shouldn't create 100 zip files if they're cached, and if they aren't cached they shouldn't fill up the disk.

devit

Or perhaps switch to well-engineered software actually properly designed to be served on the public Internet.

Clearly generating zip files, writing them fully to disk and then sending them to the client all at once is a completely awful and unusable design, compared to the proper design of incrementally generating and transmitting them to the client with minimal memory consumption and no disk usage at all.

The fact that such an absurd design is present is a sign that most likely the developers completely disregarded efficiency when making the software, and it's thus probably full of similar catastrophic issues.

For example, from a cursory look at the Forgejo source code, it appears that it spawns "git" processes to perform all git operations rather than using a dedicated library and while I haven't checked, I wouldn't be surprised if those operations were extremely far from the most efficient way of performing a given operation.

It's not surprising that the CPU is pegged at 100% load and the server is unavailable when running such extremely poor software.

Ndymium

Just noting that the archives are written to disk on purpose, as they are cached for 24 hours (by default). But when you have a several thousand commit repository, and the bots tend to generate all the archive formats for every commit…

But Forgejo is not the only piece of software that can have CPU intensive endpoints. If I can't fence those off with robots.txt, should I just not be allowed to have them in the open? And if I forced people to have an account to view my packages, then surely I'd have close to 0 users for them.

devit

Well then such a cache needs obviously to have limit to the disk space it uses and some sort of cache replacement policy, since if one can generate a zip file for each tag, that means that the total disk space of the cache is O(n^2) where n is the disk usage of the git repositories (imagine a single repository where each commit is tagged and adds a new file of constant size), so unless one's total disk space is a million/billion times larger than the disk space used by the repositories, it's guaranteed to fill the disk without such a limit.

lelanthran

The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.

There's probably going to be a small lag as the current knowledge in the current LLMs dry up because no one can scrape the web in an automated fashion anymore.

It'll all burn down.

djha-skin

I actually envision Liapunov stability, like wolf and rabbit populations. In this scenario, we're the rabbits. Human content will increase when AI populations decrease, this providing more food for AI, which will then increase. This drowns out human expression, and the humans will grow quieter. This provides less fodder for the AI, and they decrease. This means less noise and the humans grow louder. The cycle repeats and nauseam.

GolfPopper

Until broken by the Butlerian Jihad, "Though shalt not make a machine in the likeness of the mind of man."

keyringlight

I've thought along similar lines for art, what ecological niches are there where AI can't participate, are harder to pull training data from or not economical, where humans can flourish.

bashfulpup

Anything we humans deem private in nature from other humans.

Y_Y

See e.g. https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...

InfamousRece

If the logistic driving parameter is large enough it can also lead to complete chaos.

cle

IMO this was one of the real motives for Web Environment Integrity. Allow Google to index but nobody else.

We're kind of stuck between a rock and a hard place here. Which do you prefer, entrenched incumbents or affordable/open hosting?

lelandfe

I’m supremely confident that attestation will arrive in one form or another in the near future.

Anonymous browsing and potentially-malicious bots look identical. This was sort of OK up until now.

cle

Agreed, it seems inevitable. Unfortunately I think it will also result in further centralization & consolidation into a handful of "trusted" megacorps.

If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.

breckenedge

Google is already scraping your site and presenting answers directly in search results. If I cared about traffic (hence selling ad space), why would I want my site indexed by Google at all anymore? Lots of advertising-supported sites are going to go dark because only bots will visit them.

sgc

It will entrench established search engines even more if they have to move to auth-based crawling, so that the only crawlers will be those you invite. Most people will do this for google, bing, and maybe one or two others if there is a simple tool to do so.

nicce

> The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

AI companies with best anti-captcha mechanics will win and will inject ads to LLM output in more sophisticated way.

renegat0x0

This cannot be further from the truth. Ad business is not going anywhere. It will grow even bigger.

OpenAI goes through initial cycle of enshittification. Google is too big right now. Once they establish dominance you will have to see 5 unskippable ads between prompts, even for paid plan.

I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.

Example 'search' project: https://rumca-js.github.io/search

nicce

The stated problem was about indexing, accessing content and advertising in that context.

> I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.

> Example 'search' project: https://rumca-js.github.io/search

That is not really solution. Since typical indexing still works for masses, your approach is currently unique. But in the end, bots will be capable of reading on web page context if human is capable on reading them. And we get back to the original problem where we try to detect bots from humans. It's the only way.

ethan_smith

What about the next-gen of AI that would be able to signup autonomously? Even if implemented auth-walls everywhere right now, whats stopping the companies to get some real cheap labor to create accounts on websites and use them to scrape the content?

Is it going to become another race like the adblocker -> detect adblocker -> bypass adblocker detector and so on...?

sir-alien

Can we not just have a whitelist for allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can provide a list of IP addresses that their crawlers will come from. Then simply just don't include major LLM providers like OpenAI

microtonal

How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.

Only if you operate on the scale of Cloudflare, etc. you can see which IP addresses are hitting a large number of servers in a short time span.

(I am pretty sure next they will hand out N free LLM requests per month in exchange of user machines doing the scraping if blocking gets more succesful.)

I fear the only solution in the end are CDNs, making visits expensive using challenges, or requiring users to log in.

regularfry

How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.

what

Google publishes the ip addresses that google bot uses. If someone claims to be google bot but is not from one of those addresses, it’s a fake.

KTibow

We actually can do this already.

https://duckduckgo.com/duckduckgo-help-pages/results/duckduc...

https://developers.google.com/search/docs/crawling-indexing/...

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...

prmoustache

I am pretty sure a number of crawlers are running inside mobile apps of mobile phone users so they can get residential ip pools.

ATechGuy

This is scary!

Thorrez

The problem is many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'll have to ban humans.

nonrandomstring

This sort of positive security model with behavioural analysis is the future. We need to get it built-in to Apache,Nginx,Caddy etc. The trick is spotting crawlers from users. It can be done though.

insane_dreamer

Or an open list of IPs that are identified as AI companies that is updated regularly and firewalls can be easily updated with? (Same idea as open source AV)

lelanthran

> Or an open list of IPs that are identified as AI companies that is updated regularly and firewalls can be easily updated with? (Same idea as open source AV)

I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]

[1] The AI SaaS will simply run a background worker on the client to do their search indexing.

__MatrixMan__

You can have a whitelist for allowed users and ban everyone else by default, which I think is where this will eventually take us.

xyzal

In case anyone is interested in a tiny bit of sabotage, I am under the impression I managed to 'drown' true information on my microblog by generating contradicting posts with LLaMa (tens of them for each real post) and invisibly linking them, so a human would not click through.

You know, flood the zone with s***, Bannon-style ...

knowaveragejoe

This is an approach I've seen used and I'm not sure what success it has had. But logically it seems sound: explicitly reference paths that no human would actually see - traffic hitting those paths are bots. They can't help themselves.

xyzal

Regarding pollution of the LLM weights, at least the Russian propaganda machine seems to prove it works, see https://www.france24.com/en/live-news/20250310-russian-disin...

RobKohr

Yep, just make sure to add them to your robots.txt file so that only the bad robots are harmed.

ethan_smith

Temporary solution and only works if only some of us are doing it. What if these bots have a "manager" LLM agent that takes the decision on what pages to scrape?

joquarky

When I read this yesterday, I was contemplating one possible way to mitigate this at a larger scale is if websites could create random virtual paths/endpoints that drive the bot into a locally served Library of Babel[0] that poisons the spiders with lots of useless text.

It won't work for well-structured sites where the bots know the exact endpoint they want to scrape, but might slow down the more exploratory spider threads.

[0] https://libraryofbabel.info/

kod

Besides flooding them with junk, what about outright sabotage in the form of serving zip bombs or other ways to waste computational resources?

xyzal

I think we need to aim for the bots to get _negative_ utility value from visiting our traps, not just zero value.

Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.

Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).

A request rate too inhuman? Here, take those generated articles about positive effect of catching measles on performance in bed.

And so on, and so forth ...

dabockster

Even though I agree with what you're doing in principle, I feel it's necessary to remind and warn everyone here that sabotaging bots could be viewed as a violation of laws such as the US's Computer Fraud and Abuse Act[1]. I mean, unless the Second Amendment is suddenly interpreted to include cyberweapons.

[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act

jbk

VideoLAN here.

Same for us, our forum and our Gitlab are getting hammered by AI companies bots.

Most of them don’t respect robots.txt…

factchecking

Did you document the measures you took to remedy this?

dabockster

I doubt they'd post it since it would mean those AI firms could use it to adapt to their countermeasures.

greybox

Insane, I wonder if we eventually end up with a non-search-engine indexed version of the web that's more like browsing in the 90s where websites just had to link to oneanother to get noticed . . . .

I love that the solution to LLM scraping is to serve the browser a proof of work, before they allow access - I wonder if things like new sites start to do this . . . It would mean they won't be indexed by search engines, but it would help to protect the IP

xena

Hi! I do this! See https://github.com/TecharoHQ/anubis for more info!

hmry

I hope lots of websites adopt this, mainly because I want to see more happy jackal girls while browsing.

xena

My monetization strategy is unironically to offer a de-anime'd version under the name Techaro BotStopper or something.

noveltyaccount

Does the PoW make money via crypto mining? Or is it just to waste the caller's CPU cycles? If you could monetize the PoW then you could re-challenge at an interval tuned so that the caller pays for their usage.

xena

It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20 foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.

joquarky

This feels like one of the few ways to potentially avoid what seems like the inevitability of attestation.

greybox

thank you for your contribution to society!

corytheboyd

I never thought about it until now, but it’s insane that the companies who offer both LLM products and cloud compute services are double dipping— they get the LLM product to sell, as well as the elevated load egress (and compute, etc.) money. When you look at it that way, where’s the incentive to even care about inefficient LLM scraping, leaving it terrible makes you money from your other empire, cloud storage egress costs.

nzeid

We need a project in the spirit of Spamhaus to actively maintain a list of perpetrating IPs. If they're cycling through IPs and IP blocks I don't know how sustainable a CAPTCHA-like solution is.

mrweasel

Just block all of AWS, Alibaba, GCP and Azure, or throttle them aggressively. If you have clients/customers that need more requests per second then have them provide you with their IPs.

The problem is that these companies are fairly well funded and renting infrastructure isn't an issue.

kijin

Exactly. They're renting infrastructure on well-known clouds, not cycling through consumer IPs like yesterday's botnets. Block all web traffic from well-known cloud IPs, and you can keep 99% of the LLM bots away. Alibaba seems to be the most common source of bot traffic on my infrastructure lately, and I also see Huawei Cloud from time to time. Not much AWS, probably because of their high IPv4 pricing.

You can allow API access from cloud IPs, as long as you don't do anything expensive before you've authenticated the client.

cgh

From the article:

“…they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses - mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure - actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.”

So it looks like much of the traffic, particularly from China, is indeed using consumer ips to disguise itself. That’s why they blocked based on browser type (MS Edge, in this case).

blueflow

Why only in the "spirit of Spamhaus"? Spamhaus still exists. Add Google and Microsoft AS to the DROP/NOROUTE list, that would be hilarious.

danaris

Because while this is clearly related to spam, it's not the same thing, and presumably if Spamhaus themselves felt it was within their wheelhouse, they'd already be doing it.

null

[deleted]

voidUpdate

This sounds backwards to me, if you maintain a list of IPs but they are constantly cycling them, it'll get out of date quickly, but a captcha-like system will (hopefully) always stop bot traffic

pavon

While some of the residential IPs are from malware, a lot of it is from residential IP proxies, where people are paid to run proxy software from their home. If it starts getting around that people who run this software quickly become blocked by the majority of the internet that will lessen that part of the problem.

nzeid

Only if your CAPTCHA-like is hurled at every client indiscriminately. Otherwise you'll end up right back where Spamhaus started: maintaining your own list of good and bad actors.

The advantage of a third party service is that you're sharing intel of bad actors.

voidUpdate

I can't confirm but I believe it is applied to every client

diggan

> According to Drew, LLM crawlers don't respect robots.txt requirements and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.

How do they know that these are LLM crawlers and not anything else?

TonyTrapp

As someone that is also affected by this: We see a manifold increase in requests since this LLM crap is going on. Many of these IPs come from companies that obviously work with LLM technology, but the problem is that it's 100s of IPs doing 1 request, not 1 IP doing 100s of requests. It's just extremely unlikely that anyone else is responsible for this.

diggan

> IPs come from companies that obviously work with LLM technology

Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?

> is that it's 100s of IPs doing 1 request

Are all of those IPs within the same ranges or scattered?

Thanks a lot for taking the time to talk about your experience btw, as someone who hasn't been hit by this it's interesting to have more details about it before it eventually happens.

TonyTrapp

> Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?

Those are the ones that make it obvious, yes. It's not exclusive, though, but enough to connect the dots.

> Are all of those IPs within the same ranges or scattered?

The IP ranges are all over the place. Alibaba seems to have tons of small ASNs, for instance.

boris

> How do they know that these are LLM crawlers and not anything else?

I can tell you what it looks like in case of a git web interface like cgit: you get a burst of one or two isolated requests from a large number of IPs each for very obscure (but different) URLs, like a file contents at a specific commit id. And the user agent suggesting it's coming from IPhone or Android.

JimDabell

That was my reaction. It seems like the article is saying two mutually exclusive things:

- We cannot block them because we can’t differentiate legitimate traffic from illegitimate traffic…

- …but we can conclusively identify this traffic as coming from AI crawlers.

xena

It's a situation where it's difficult to tell for individual requests at request handling time, but easy to see when you look at the total request volume.

chneu

It's the behavior of the traffic in hindsight that's obvious. It's difficult to identify in the moment. This is by design.

Getting caught isn't a big deal. Getting caught in the act is. As long as they get their data, it doesn't matter if they're caught afterwards.

shadowfacts

In my case, no small fraction of the traffic was from OpenAI and Anthropic. There were also other user agents that literally said "AI".

HN

FOSS infrastructure is under attack by AI companies

FOSS infrastructure is under attack by AI companies