Devs say AI crawlers dominate traffic, forcing blocks on entire countries
290 comments
·March 25, 2025
sigmoid10
This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.
hec126
You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.
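A minimal sketch of how that could look, assuming a Flask app (the /do-not-follow path, the in-memory ban set, and the inline style are all illustrative, not a real deployment): disallow a decoy path in robots.txt, link to it invisibly, and ban whatever requests it.

```python
# Sketch only: a honeypot route behind a robots.txt disallow. The ban store is
# in-memory for illustration; a real setup would need persistence and expiry.
from flask import Flask, abort, request

app = Flask(__name__)
BANNED = set()  # IPs that followed the decoy link

@app.before_request
def reject_banned():
    if request.remote_addr in BANNED:
        abort(403)

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers are told to stay away from the decoy.
    return "User-agent: *\nDisallow: /do-not-follow/\n", 200, {"Content-Type": "text/plain"}

@app.route("/do-not-follow/<path:anything>")
def honeypot(anything):
    # Only a client ignoring robots.txt and following hidden links lands here.
    BANNED.add(request.remote_addr)
    abort(403)

@app.route("/")
def index():
    # The decoy link is moved off-screen for sighted users; HTML-parsing bots still see it.
    return '<a href="/do-not-follow/page" style="position:absolute;left:-9999px">.</a>Hello'
```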
rustc
This would be terrible for accessibility for users using a screen reader.
soco
But the very comment you answered explains how to do it: a page forbidden in robots.txt. Does this method need an explanation of why it's ideal for sorting humans and Google from malicious crawlers?
majewsky
robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.
sigmoid10
Detection and bypass are trivial: access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
delichon
Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?
gdcbe
That’s vaguely what https://blog.cloudflare.com/ai-labyrinth/ is about
TonyTrapp
We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.
sokoloff
> dozens of IPs, so every IP just makes 1-2 requests in total
Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.
lucb1e
I'm also affected. I presume this is per day, not just once, yet each IP makes fewer requests than a human often would, so you can't block on that basis. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. That's not counting several other scrapers that did set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note: I've only blocked them from one specific service that proxies and caches data from a third party, which I'm caching precisely because the third-party site struggled with load, so more load from these bots isn't helping.)
That's up to 37e6/24/60/60 = 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more, some a few thousand requests per year, some a few dozen; but thankfully they don't unleash the whole IP range on me at once, the traffic occasionally rotates to new ranges to bypass blocks.
aorth
Parent probably meant hundreds or thousands of IPs.
Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.
Edit: LOL, didn't read the article until after posting. They mention the Fedora Pagure server getting this traffic from Brazil last week too!
Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.
Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?
I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
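For what it's worth, a rough sketch of subnet-level rate limiting like the above, done in application code rather than nginx (the prefix lengths, limits, and in-memory bucket store are arbitrary placeholders):

```python
# Sketch: token-bucket rate limiting keyed on the /24 (or /64) a client falls in,
# so traffic spread across one provider's address space still shares one budget.
import ipaddress
import time

BUCKET_CAPACITY = 60       # burst size, in requests
REFILL_PER_SECOND = 1.0    # sustained rate per subnet
buckets = {}               # subnet -> (tokens, last_refill)

def subnet_key(ip_str: str) -> str:
    ip = ipaddress.ip_address(ip_str)
    prefix = 24 if ip.version == 4 else 64
    return str(ipaddress.ip_network(f"{ip_str}/{prefix}", strict=False))

def allow(ip_str: str) -> bool:
    key = subnet_key(ip_str)
    now = time.monotonic()
    tokens, last = buckets.get(key, (BUCKET_CAPACITY, now))
    tokens = min(BUCKET_CAPACITY, tokens + (now - last) * REFILL_PER_SECOND)
    if tokens < 1:
        buckets[key] = (tokens, now)
        return False          # caller should answer 429
    buckets[key] = (tokens - 1, now)
    return True
```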
TonyTrapp
If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.
giantg2
That's probably per day, per bot. Now how does it look when there are thousands of bots? In most cases I think you're right, but I can also see how it can add up.
gchamonlive
If I'm hosting my site independently, with a rented machine and a Cloudflare CDN, hosting my code on a self-managed GitLab instance, how should I go about implementing this? Is there something plug-and-play I can drop into nginx that would do the work of serving bogus content and leaving my GitLab instance unscathed by bots?
lomonosov
Russia is already doing poisoning with success, so it is a viable tactic!
https://www.heise.de/en/news/Poisoning-training-data-Russian...
lucb1e
> Is your user-agent too suspicious?
Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser; I guess I'll have to go back to the fully-featured browser I migrated away from because it was much slower.
> A request rate too inhuman?
I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to resolve when someone follows a link directly to a subpage and thus doesn't have a cookie set yet. Either way, blocks left and right are fun! More please!
jajko
That's an absolutely brilliant, f*cked up idea: poisoning the AI while fending it off.
Gotta get my daily dose of bleach for enhanced performance, chatgpt said so.
tedunangst
> It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.
If the target goes down after you scrape it, that's a feature.
prisenco
This has me wondering what it would take to do a bcrypt style slow hashing requirement to retrieve data from a site. Something fast enough that a single mobile client for a user wouldn't really feel the difference. But an automated scraper would get bogged down in the calculations.
Data is presented to the user with multiple layers of encryption that they use their personal key to decrypt. This might add an extra 200ms to decrypt. Degrades the user experience slightly but creates a bottleneck for large-scale bots.
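A simplified sketch of that idea, reduced to a slow-KDF gate rather than full client-side decryption (the iteration count, salt handling, and names are illustrative, not a real design): the server derives a per-resource token once and caches it, while every client has to burn a couple hundred milliseconds re-deriving it per fetch.

```python
# Sketch of a slow-KDF gate: the server's cost is amortized via caching, the
# client's cost is paid per resource, which is what bogs down bulk scraping.
import hashlib
import hmac
from functools import lru_cache

ITERATIONS = 400_000             # tune so one derivation costs roughly 100-300 ms
DAILY_SALT = b"rotate-me-daily"  # illustrative; rotate periodically in practice

@lru_cache(maxsize=10_000)
def expected_token(resource_id: str) -> bytes:
    # The server pays this cost once per resource (until the salt rotates), then caches it.
    return hashlib.pbkdf2_hmac("sha256", resource_id.encode(), DAILY_SALT, ITERATIONS)

def client_derive_token(resource_id: str, salt: bytes) -> bytes:
    # Every client (or scraper) pays the full cost for every resource it wants.
    return hashlib.pbkdf2_hmac("sha256", resource_id.encode(), salt, ITERATIONS)

def authorize(resource_id: str, submitted: bytes) -> bool:
    return hmac.compare_digest(expected_token(resource_id), submitted)
```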
tschwimmer
Check out Anubis - it's not quite what you're suggesting but similar in concept: https://anubis.techaro.lol/
LPisGood
How does it work? I don’t have time to read the code, and the website/docs seem to be under construction.
Does it have the client do a bunch of SHA-256 hashes?
prisenco
Interesting! Not how I'd approach it but certainly thinking along the same lines.
userbinator
I just hit a site with this --- and hit the back button immediately.
joeblubaugh
Maybe at some point you can use something more professional as an interstitial page.
puchatek
If we are able to detect AI scrapers then I would welcome a more strategic solution: feed them garbage data instead of the real content. If enough sites did that then the inference quality would take a hit and eventually the perpetrators, too.
But of course this is the more expensive option that can't really be asked of sites that already provide public services (even if those are paid for by ads).
brookst
I really don’t think we have such a lack of misinformation that we need to invest in creating more of it, no matter the motive.
kevin_thibedeau
The problem is that the server also has to do the work. Fine for an infrequent auth challenge. Not so fine for every single data request.
tedunangst
Tons of problems are easier to verify than to solve.
saganus
Maybe there is a way for the server to ask the client to do the work?
Something similar to proof-of-work but on a much smaller scale than Bitcoin.
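Something in the hashcash spirit could look roughly like this (the difficulty value, challenge handling, and expiry are all glossed over): solving costs many hash attempts, verifying costs one.

```python
# Sketch of a small-scale proof of work: the client brute-forces a nonce so the
# hash has enough leading zero bits; the server checks it with a single hash.
import hashlib
import os

DIFFICULTY_BITS = 20  # ~1M expected attempts; trivial for one visitor, costly at crawler scale

def make_challenge() -> bytes:
    # Sent to the client along with the difficulty; sign and expire it in practice.
    return os.urandom(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: bytes) -> int:
    # Client side: try nonces until the hash clears the difficulty bar.
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    # Server side: a single hash, so the cost asymmetry favors the defender.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS
```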
prisenco
Right, it would need an algorithm with widely different encryption vs. decryption speeds. Lattice-based cryptography, maybe?
koakuma-chan
Companies running those bots have more than enough resources
prisenco
Nobody has unlimited resources. Everything is a cost-benefit analysis.
For highly valuable information, they might throw the GDP of a small country at scraping your site. But most information isn't worth that.
And there are a lot of bad actors who don't have the resources you're thinking of that are trying to compete with the big guys on a budget. This would cut them out of the equation.
MathMonkeyMan
Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?
randmeerkat
> Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?
The next scraper doesn’t get the data. People don’t realize we’re not compute limited for ai, we’re data limited. What we’re watching is the “data war”.
cyanydeez
at this point we're _good data_ limited, which has little to do with scraping.
DaSHacka
Now the only way to obtain that information is through them
TuxMark5
I guess one could make a point that the competition will no longer have access to the scraped data.
xena
Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!
shric
You're well into https://refactoringenglish.com/tools/hn-popularity/ so enjoy the fame!
seafoamteal
I've seen Anubis a couple times irl, mostly on Sourcehut, and the first time I saw it I was like, "Hey, I remember that blog post!" Congratulations on making something both useful and usable!
true_blue
On the few sites I've seen using it so far, it's been a more pleasant (and cuter) experience for me than the captchas I'd probably get otherwise. good work!
xena
Thanks! The artist I'm contracting and I are in discussions on how to make the mascot better. It will be improved. And more Canadian.
Figs
Hmm. Instead of requiring JS on the client, why don't you add a delay on the server side (e.g. 1 second default, adjustable by server admin) for requests that don't have a session cookie? For each session keep a counter and a timestamp. Every time you get a request from a session, look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp. If the counter is greater than a configured threshold, slow-walk the response (e.g. add a delay before forwarding the request to the shielded web server -- or transfer the response back out at reduced bytes/second, etc.)
You can periodically remove tracking data for entries older than a threshold -- e.g. once a minute or so (adjustable) remove tracked entries that haven't made a request in the past minute to keep memory usage down.
That'd effectively rate limit the worst offenders with minimal impact on most well-behaved edge-case users (like me running NoScript for security) while also wasting less energy globally on unnecessary computation through the proof-of-work scheme, wouldn't it? Is there some reason I'm not thinking of that would prevent that from working?
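A sketch of that throttle, with all the thresholds as placeholders and the actual proxying and sleeping left to the surrounding server:

```python
# Sketch of the session-cookie throttle described above: a per-session counter
# and last-seen timestamp, a delay once the counter crosses a threshold, and
# periodic eviction of stale entries.
import time

THRESHOLD = 30          # requests before slow-walking kicks in
DELAY_SECONDS = 1.0     # penalty added per request past the threshold
EVICT_AFTER = 60.0      # forget sessions idle this long

sessions = {}           # session_id -> [count, last_seen]

def penalty_for(session_id: str) -> float:
    """Return how long to sleep before forwarding this request upstream."""
    now = time.monotonic()
    count, _ = sessions.get(session_id, (0, now))
    sessions[session_id] = [count + 1, now]
    return DELAY_SECONDS if count + 1 > THRESHOLD else 0.0

def evict_stale() -> None:
    """Call roughly once a minute (e.g. from a background thread) to cap memory use."""
    cutoff = time.monotonic() - EVICT_AFTER
    for sid in [s for s, (_, last) in sessions.items() if last < cutoff]:
        del sessions[sid]
```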
viraptor
> look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp
Not sure about author's motivation, but this part is why I don't track usage - PoW allows you to do everything statelessly and not keep any centralised database or write any data. The benefit of a system slowing down crawling should be minimal resource usage for the server.
PufPufPuf
In a "denial of service prevention" scenario, you need your cost to be lower than the cost of the attacker. "Delay on the server side" means keeping a TCP connection open for that long, and that's a limited resource.
techjamie
The ffmpeg website is also using it. First time I actually saw it in the wild.
rfurmani
After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.
keyle
Has someone made a honeypot for AI yet?
Take regular papers, change their words or keywords to something outrageous, and watch the AI feed it back to users.
MIC132
This kinda fits, though it's on a personal blog level:
https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr...
puchatek
If there were a non-profit dedicated to this, I would donate.
karlgkk
One thing that worked well for me was layering obstacles
It really sucks that this is the way things are, but what I did was
10 requests for pages in a minute and you get captcha'd (with a little apology and the option to bypass it by logging in). Asset loads don't count.
After a captcha pass, 100 requests in an hour gets you auth-walled.
It’s really shitty but my industry is used to content scraping.
This allows legit users to get what they need. Although my users maybe don’t need prolonged access ahem.
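A rough sketch of that layering, using the thresholds from the comment; the state store and the captcha/auth hooks are hypothetical glue:

```python
# Sketch of layered obstacles: unauthenticated clients hit a captcha past one
# threshold, and an auth wall past a second one after the captcha pass.
import time

page_hits = {}  # key (e.g. IP or session) -> timestamps of page requests (assets excluded)

def _record_and_count(key: str, window: float) -> int:
    now = time.monotonic()
    hits = page_hits.get(key, [])
    hits.append(now)
    hits = [t for t in hits if now - t < 3600]  # keep at most an hour of history
    page_hits[key] = hits
    return sum(1 for t in hits if now - t < window)

def decide(key: str, passed_captcha: bool, logged_in: bool) -> str:
    if logged_in:
        return "serve"
    if not passed_captcha:
        # Layer 1: more than 10 page requests in a minute -> captcha (login bypasses it).
        return "captcha" if _record_and_count(key, 60) > 10 else "serve"
    # Layer 2: after a captcha pass, more than 100 page requests in an hour -> auth wall.
    return "auth_wall" if _record_and_count(key, 3600) > 100 else "serve"
```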
nomel
What happens if you use the proper rate limiting status of 429? It includes a next retry time [1]. I'm curious what (probably small) fraction would respect it.
[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
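For illustration, answering over-limit clients with 429 plus Retry-After might look like this (Flask and the over_limit() stub are assumptions, not anyone's actual setup):

```python
# Sketch: return 429 with a Retry-After header instead of silently dropping
# or blocking over-limit clients.
from flask import Flask, request

app = Flask(__name__)

def over_limit(ip: str) -> bool:
    return False  # plug in a real rate limiter here

@app.before_request
def maybe_throttle():
    if over_limit(request.remote_addr):
        return "Slow down.\n", 429, {"Retry-After": "120"}  # seconds until next allowed try
```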
karlgkk
Probably makes sense for a b2b app where you publish status codes as part of the api
Bad actors don’t care and annoying actors would make fun of you for it on twitter
rfurmani
I've wanted to but wasn't sure how to keep track of individuals. What works for you? IP Addresses, cookies, something else?
karlgkk
I use the IP address. Users behind CGNAT are already used to getting a captcha the first time around.
There's some stuff you can do, like creating risk scores (if a user changes IP and reuses the same captcha token, increase the score). Many vendors do that, as does my captcha provider.
nukem222
> This allows legit users to get what they need.
Of course they could have just used the site directly.
karlgkk
If bots and scrapers respected the robots and tos, we wouldn’t be here
It sucks!
GoblinSlayer
Or just buy cloudflare :)
LPisGood
What is your website?
Nckpz
I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"
ndiddy
This is yet another great example of the innovation that the AI industry is delivering. Why just limit your scraper bots to GET requests when there might be some juicy data to train on hidden behind that form? There's a reason why the a16z funded cracked vibe coder ninjas are taking over software engineering, they're full of wonderful ideas like this.
ohgr
I work with one of those guys. Morally bankrupt at every level of his existence.
bendangelo
There are bots that scrape HTTPS registration sites; that's how they usually find you.
ipaddr
I've shut down a number of content sites in the last few days because of the toll these aggressive AI bots take. Alexa seems like the worst.
These were created 20 years ago and updated over the years. I used to get traffic, but that's slowed to 1,000 or fewer legitimate visitors over the last year. And now I have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.
Aurornis
> Alexa seems like the worst.
Many of the bots disguise themselves as coming from Amazon or another big company.
Amazon has a page where you can check some details to see if it’s really their crawler or someone imitating it.
spookie
yup, actually most that I've seen are impersonating amazon
svelle
It's funny how every time this topic comes up, someone says "I've had this happen and X is the worst", with X being any of the big AI providers. Just a couple of minutes ago I read the same in another thread, and it was Anthropic. A couple of weeks back it was Meta.
My conclusion is that they're all equally terrible, then.
Aurornis
All of the crawlers present themselves as being from one of the major companies, even if they’re not.
Setting user-agent headers is easy.
cyanydeez
At the same time, all the AI providers have some kind of web-based AI agent, so let's not pretend they're crafting their services with care for other people's websites.
ipaddr
I agree they are all creating negative value for site owners. In my personal experience this week blocking Amazon solved my server overload issue.
rco8786
I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.
Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.
i5heu
At which level of DDoS can one claim damages from them?
Neil44
I've got claude bot blocked too. It regularly took sites offline and ignored robots.txt. Claude bot is an asshole.
throwaway2037
robots.txt did not work?
seabird
Of course it didn't work. At best, the dorks doing this think there's a gamechanging LLM application to justify the insane valuations right around the corner if they just scrape every backwater site they can find. At worst, they're doing it because it's paying good money. Either way, they don't care, they're just going to ignore robots.txt.
hsbauauvhabzb
I have not monitored traffic in this way, but I imagine most AI companies would explicitly follow links listed in robots.txt, even if they're not mentioned elsewhere on the site.
epc
I’ve been doing web sites for thirty years, robots.txt is at best a request to polite user agents to respect the server’s desires. None of the malicious crawlers respect it. None of the AI crawlers respect it.
I’ve resorted to returning xml and zip bombs in canary pages. At best it slows them down until I block their network.
lemper
Bro, since when do VC-funded AI companies have the courtesy to respect robots.txt?
userbinator
All these JS-heavy "anti bot" measures do is further entrench the browser monopoly, making it much harder for the minority of independents, while those who pay big $$$ can still bypass them. Instead I recommend a simple HTML form that asks questions that LLMs cannot yet answer or that they get consistently wrong. The more related the questions are to the site's content, the better; I remember some electronics forums would have similar "skill-testing" questions on their registration forms, and while some of them may be LLM'able now, I suspect many of them are still really CAPTCHAs that only humans can solve.
IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.
mvdtnz
How does JS entrench a browser monopoly? If you're not using vendor-specific JS extensions or non-standard APIs any browser should be able to execute your JS. Like most web developers I don't have a lot of patience for the people who refuse to run JS on their clients.
userbinator
The effort required to implement a JS engine and keep trendchasing the latest changes with it is a huge barrier to entry, not to mention the insane amount of fingerprinting and other privacy-hostile, anti-user techniques it enables.
Seeing what used to be simple HTML forms turned into bloated invasive webapps to accomplish the exact same thing seriously angers me; and everyone else who wanted an easily accessible and freedom-preserving Internet.
Terr_
Or require every fresh "unique" visitor to run some JS that takes X seconds to compute.
It's not nice for visitors using a very old smartphone, but it's arguably less-exclusionary than some of the tests and third-party gatekeepers that exist now.
In many cases we don't actually care about telling if someone is truly a human alone, as much as ensuring that they aren't a throwaway sockpuppet of a larger automated system that doesn't care about good behavior because a replacement is so easy to make.
userbinator
> that takes X seconds to compute.
Those who have the computing resources to do commercial scraping will easily get past that.
In contrast, there are still many questions which a human can easily answer, but even the best LLMs currently can't.
Terr_
It doesn't have to be bulletproof, it just has to create a cost that doesn't scale economically for them.
kristiandupont
>there are still many questions which a human can easily answer, but even the best LLMs currently can't.
I am genuinely curious: what is an example of such a question, if it's for a person you don't know (i.e. where you cannot rely on inside knowledge)?
a2128
IIRC that's basically already part of what Cloudflare Turnstile does
GoblinSlayer
The algorithm is inspired by hashcash https://raw.githubusercontent.com/TecharoHQ/anubis/refs/head...
everdrive
One more reason we're moving away from privacy. Didn't load all the javascript domains? You're probably a bot. Not signed in? You're probably a bot. The web we knew is dying step by step.
One interesting thought: do we know if these AI crawlers intentionally avoid certain topics? Is pornography totally left unscathed by these bots? How about extreme political opinions?
BiteCode_dev
That's an interesting thought: put up things about all the dictators currently in power that they don't want to hear, and maybe they will back off.
banq
I have blocked these IP ranges from that country: 190.0.0.0/8 207.248.0.0/16 177.0.0.0/8 200.0.0.0/8 201.0.0.0/8 145.0.0.0/8 168.0.0.0/8 187.0.0.0/8 186.0.0.0/8 45.0.0.0/8 131.0.0.0/16 191.0.0.0/8 160.238.0.0/16 179.0.0.0/8 186.192.0.0/10 187.0.0.0/8 189.0.0.0/8
HermanMartinus
I literally just published a post on this and how it affects Bear Blog.
zzo38computer
My issue is not preventing others from obtaining copies of the files, using Lynx or curl, disabling JavaScript, CSS, and pictures, etc. It is preventing others from overloading the server with badly behaved software.
I had briefly set up port knocking for the HTTP server (and only for HTTP; other protocols are accessible without port knocking), but due to a kernel panic I removed it and now the HTTP server is not accessible. (I may later put it back on once I can fix this problem.)
As far as I can tell, the LLM scrapers do not attempt to be "smart" about it at this time; if they do in the future, you might try to take advantage of that somehow.
However, even if they don't, there are probably things that can be done. For example, check whether the declared user-agent claims to do things that the client isn't actually doing, and display an error message if so (users who use Lynx will then remain unaffected and will still be able to access it). Another possibility is to try to confuse the scrapers however they are working, e.g. invalid redirects, valid redirects (e.g. to internal API functions of the companies that made them), invalid UTF-8, invalid compressed data, ZIP bombs (you can use the compression functions of HTTP to serve a small file that is far too big when decompressed), EICAR test files, reverse pings (if you know who they really are), etc. What works and what doesn't depends on what software they are using.
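As one example of the "user-agent declares things it isn't doing" check, a heuristic sketch (the thresholds, session keying, and asset extensions are made up): a client claiming to be a modern graphical browser that never fetches a single stylesheet or script over several page loads is probably not a browser.

```python
# Sketch: flag sessions whose UA string claims a graphical browser but which
# never request any static assets. Text browsers like Lynx don't claim to be
# Chrome, so they pass untouched.
sessions = {}  # session_id -> {"pages": int, "assets": int}

def observe(session_id: str, path: str, user_agent: str) -> bool:
    """Return True if the request should get an error page instead of content."""
    s = sessions.setdefault(session_id, {"pages": 0, "assets": 0})
    if path.endswith((".css", ".js", ".png", ".jpg", ".svg", ".woff2")):
        s["assets"] += 1
    else:
        s["pages"] += 1
    claims_graphical = "Chrome/" in user_agent or "Firefox/" in user_agent
    return claims_graphical and s["pages"] >= 5 and s["assets"] == 0
```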
xyzal
I think the proper answer is to aim for the bots to get _negative_ utility value from visiting our sites, that is, poisoning their well, not just zero value, that is, blocking them.
Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.
Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).
A request rate too inhuman? Here, take these generated articles about the positive effect of catching measles on performance in bed.
And so on, and so forth ...
Nepenthes is nice, but word salad can be detected easily. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.
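Tying the triggers above together, a toy dispatch might look like this (the trigger checks and the decoy corpus are placeholders; the point is to route tripped clients to poisoned pages instead of a plain 403):

```python
# Sketch of the "negative utility" dispatch: each trigger maps to a different
# pile of decoy pages rather than an outright block.
import random

DECOYS = {
    "canary": ["bleach-benefits-01.html", "bleach-benefits-02.html"],
    "bad_ua": ["insecure-snippets-01.html"],
    "inhuman_rate": ["measles-performance-01.html"],
}

def pick_response(hit_canary: bool, suspicious_ua: bool, inhuman_rate: bool):
    if hit_canary:
        return random.choice(DECOYS["canary"])
    if suspicious_ua:
        return random.choice(DECOYS["bad_ua"])
    if inhuman_rate:
        return random.choice(DECOYS["inhuman_rate"])
    return None  # None means: serve the real page
```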