Devs say AI crawlers dominate traffic, forcing blocks on entire countries
227 comments
·March 25, 2025
xyzal
I think the proper answer is to aim for the bots to get _negative_ utility value from visiting our sites, that is, poisoning their well, rather than just zero value by blocking them.
Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucketload of articles on the benefits of drinking bleach.
Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).
A request rate too inhuman? Here, take these generated articles about the positive effects of catching measles on performance in bed.
And so on, and so forth...
Nepenthes is nice, but word salad can be detected easily. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.
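A minimal sketch of the canary-page idea, assuming a Flask front end; the paths, the in-memory flag set, and the filler text are all made up:

```python
# Minimal sketch: serve a canary URL that robots.txt forbids, flag anything
# that fetches it anyway, and feed flagged clients filler from then on.
from flask import Flask, Response, request

app = Flask(__name__)
flagged_ips = set()  # in-memory only; a real setup needs persistence/expiry

ROBOTS_TXT = "User-agent: *\nDisallow: /canary-page\n"

@app.route("/robots.txt")
def robots():
    return Response(ROBOTS_TXT, mimetype="text/plain")

@app.route("/canary-page")
def canary():
    # Only a client that read robots.txt and ignored it should land here.
    flagged_ips.add(request.remote_addr)
    return nonsense()

@app.route("/")
@app.route("/<path:page>")
def content(page="index"):
    if request.remote_addr in flagged_ips:
        return nonsense()              # poison the well from now on
    return f"Real content for {page}"  # placeholder for the actual site

def nonsense():
    # Placeholder; the point above is that this should be plausible-looking
    # generated text, not obvious word salad.
    return Response("Plausible-sounding but useless filler text.",
                    mimetype="text/html")
```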
sigmoid10
This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.
hec126
You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.
rustc
This would be terrible for accessibility for users using a screen reader.
soco
But the very comment you answered explains how to do it: a page forbidden in robots.txt. Does this method really need an explanation of why it's ideal for sorting humans and Google from malicious crawlers?
delichon
Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?
j-bos
This makes tons of sense: the AI trainers spend endless resources aligning their LLMs; the least we could do is spend a few minutes aligning their owners. Fixing things at the incentive level.
gdcbe
That’s vaguely what https://blog.cloudflare.com/ai-labyrinth/ is about
TonyTrapp
We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.
sokoloff
> dozens of IPs, so every IP just makes 1-2 requests in total
Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.
lucb1e
I'm also affected. I presume this is per day, not just once, yet it's fewer requests than a human would often make, so you can't block on that alone. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to the IP address WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. This is not including several other scrapers that did set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note: I've only blocked them from one specific service that proxies and caches data from a third party, which I'm caching precisely because the third-party site struggled with load, so more load from these bots isn't helping.)
That's up to 37e6/24/60/60 = 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more, some of them a few thousand per year, some of them a few dozen per year; but thankfully they don't unleash the whole IP range on me at once, it occasionally rotates through to new ranges to bypass blocks
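For what it's worth, checking a client against a handful of ranges is cheap with the standard library; a sketch with placeholder networks, not the actual ranges mentioned above:

```python
# Checking a client against blocked ranges with the standard library.
import ipaddress

BLOCKED_RANGES = [ipaddress.ip_network(n) for n in (
    "203.0.113.0/24",   # TEST-NET-3, standing in for a scraper's range
    "198.51.100.0/24",  # TEST-NET-2
)]

def is_blocked(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)

# 37 million addresses at one request per address per day averages
# 37e6 / 86400 ≈ 428 requests per second, the ~430 figure quoted above.
print(is_blocked("203.0.113.42"))  # True
```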
TonyTrapp
If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.
gchamonlive
If I'm hosting my site independently, with a rented machine behind a Cloudflare CDN and my code on a self-managed GitLab instance, how should I go about implementing this? Is there something plug-and-play I can drop into nginx that would do the work of serving bogus content and leaving my GitLab instance unscathed by bots?
lucb1e
> Is your user-agent too suspicious?
Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser; I guess I'll have to go back to the fully-featured browser I migrated away from because it was much slower.
> A request rate too inhuman?
I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno, but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to handle the case where someone follows a link directly to a subpage and thus doesn't have a cookie set yet. Either way, blocks left and right are fun! More please!
lomonosov
Russia is already doing poisoning with success, so it is a viable tactic!
https://www.heise.de/en/news/Poisoning-training-data-Russian...
zkmon
An open-source repo asking who is responsible for this AI invasion? Well, it is you who are responsible for all this. What did you think when you helped tech advance so rapidly, outpacing the needs of humans? Read the Panchatantra story of the four brothers who brought a dead tiger back to life, just to boast of their skill and greatness.
tedunangst
> It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.
If the target goes down after you scrape it, that's a feature.
prisenco
This has me wondering what it would take to do a bcrypt-style slow-hashing requirement to retrieve data from a site. Something fast enough that a single mobile client for a user wouldn't really feel the difference. But an automated scraper would get bogged down in the calculations.
Data is presented to the user with multiple layers of encryption that they use their personal key to decrypt. This might add an extra 200ms to decrypt. Degrades the user experience slightly but creates a bottleneck for large-scale bots.
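One way to get that kind of client-side cost is key stretching; a rough sketch where the key needed to read a response is derived with PBKDF2 at an iteration count tuned to take roughly 200 ms (the iteration count, secret handling, and the actual encryption step are made up or omitted; note the objection further down that the server also pays when it prepares the payload):

```python
# Tune the iteration count until key derivation costs the client ~200 ms;
# the derived key would then be needed to decrypt the response body
# (the encryption step itself is omitted from this sketch).
import hashlib
import os
import time

ITERATIONS = 400_000  # adjust per target hardware

def derive_content_key(user_secret: bytes, salt: bytes) -> bytes:
    # The client burns CPU here before it can read the payload.
    return hashlib.pbkdf2_hmac("sha256", user_secret, salt, ITERATIONS)

salt = os.urandom(16)
start = time.perf_counter()
key = derive_content_key(b"per-user secret", salt)
print(f"derived key in {time.perf_counter() - start:.3f}s")
```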
tschwimmer
Check out Anubis - it's not quite what you're suggesting but similar in concept: https://anubis.techaro.lol/
prisenco
Interesting! Not how I'd approach it but certainly thinking along the same lines.
userbinator
I just hit a site with this --- and hit the back button immediately.
LPisGood
How does it work? I don’t have time to read the code, and the website/docs seem to be under construction.
Does it have the client do a bunch of SHA-256 hashes?
joeblubaugh
Maybe at some point you could use something more professional as an interstitial page.
puchatek
If we are able to detect AI scrapers then I would welcome a more strategic solution: feed them garbage data instead of the real content. If enough sites did that then the inference quality would take a hit and eventually the perpetrators, too.
But of course this is the more expensive option that can't really be asked of sites that already provide public services (even if those are paid for by ads).
kevin_thibedeau
The problem is that the server also has to do the work. Fine for an infrequent auth challenge. Not so fine for every single data request.
tedunangst
Tons of problems are easier to verify than to solve.
saganus
Maybe there is a way for the server to ask the client to do the work?
Something similar to proof-of-work but on a much smaller scale than Bitcoin.
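A minimal hashcash-style sketch of that idea (difficulty and encoding are arbitrary, and this is not necessarily how Anubis does it): the client grinds through nonces, the server checks the result with a single hash, which is the "easier to verify than to solve" asymmetry mentioned above.

```python
# The client grinds nonces until the hash has enough leading zero bits;
# the server checks the submitted nonce with a single hash.
import hashlib
from itertools import count

DIFFICULTY_BITS = 20  # ~1M hashes expected for the solver, 1 for the verifier

def _ok(digest: bytes) -> bool:
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

def solve(challenge: bytes) -> int:
    for nonce in count():
        if _ok(hashlib.sha256(challenge + str(nonce).encode()).digest()):
            return nonce

def verify(challenge: bytes, nonce: int) -> bool:
    return _ok(hashlib.sha256(challenge + str(nonce).encode()).digest())

challenge = b"per-request server nonce"
nonce = solve(challenge)         # expensive, done in the visitor's browser
assert verify(challenge, nonce)  # cheap, done by the server
```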
prisenco
Right, it would need an algorithm with widely different encryption vs. decryption speeds. Lattice-based cryptography maybe?
koakuma-chan
Companies running those bots have more than enough resources
prisenco
Nobody has unlimited resources. Everything is a cost-benefit analysis.
For highly valuable information, they might throw the GDP of a small country at scraping your site. But most information isn't worth that.
And there are a lot of bad actors who don't have the resources you're thinking of that are trying to compete with the big guys on a budget. This would cut them out of the equation.
MathMonkeyMan
Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?
randmeerkat
> Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?
The next scraper doesn't get the data. People don't realize we're not compute-limited for AI; we're data-limited. What we're watching is the "data war".
cyanydeez
at this point we're _good data_ limited, which has little to do with scraping.
DaSHacka
Now the only way to obtain that information is through them
TuxMark5
I guess one could make a point that the competition will no longer have access to the scraped data.
everdrive
One more reason we're moving away from privacy. Didn't load all the javascript domains? You're probably a bot. Not signed in? You're probably a bot. The web we knew is dying step by step.
One interesting thought: do we know if these AI crawlers intentionally avoid certain topics? Is pornography totally left unscathed by these bots? How about extreme political opinions?
BiteCode_dev
That's an interesting thought: publish things about all the dictators currently in power that they don't want to hear, and maybe the crawlers will back off.
banq
I have blocked these IP ranges from that country: 190.0.0.0/8 207.248.0.0/16 177.0.0.0/8 200.0.0.0/8 201.0.0.0/8 145.0.0.0/8 168.0.0.0/8 187.0.0.0/8 186.0.0.0/8 45.0.0.0/8 131.0.0.0/16 191.0.0.0/8 160.238.0.0/16 179.0.0.0/8 186.192.0.0/10 187.0.0.0/8 189.0.0.0/8
kh_hk
Proof of work is sufficient (although easy to bypass on targeted crawls) for protecting endpoints that are accessed via browsers, but plain public APIs have to resort to other more primitive methods like rate limiting.
Blocking by UA is stupid, and by country kind of wrong. I am currently exploring JA4 fingerprints, which together with other metrics (country, ASN, block lists) might give me a good tool to stop malicious usage.
My point is, this is a lot of work, and it takes time off the budget you give to side projects.
rfurmani
After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.
keyle
Has someone made a honeypot for AI yet?
Take all regular papers, change their words or keywords to something outrageous, and watch the AI feed it to its users.
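A toy sketch of that substitution idea (the word map is invented; real poisoning would have to be far subtler to survive data cleaning):

```python
# Swap key terms in otherwise ordinary text before serving it to crawlers.
import re

SUBSTITUTIONS = {
    "increases": "decreases",
    "safe": "hazardous",
    "proven": "disproven",
}
PATTERN = re.compile(r"\b(" + "|".join(SUBSTITUTIONS) + r")\b", re.IGNORECASE)

def poison(text: str) -> str:
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SUBSTITUTIONS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    return PATTERN.sub(swap, text)

print(poison("The treatment is safe and proven to work."))
# -> "The treatment is hazardous and disproven to work."
```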
MIC132
This kinda fits, though it's on a personal blog level:
https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr...
puchatek
If there were a non-profit dedicated to this, I would donate.
karlgkk
One thing that worked well for me was layering obstacles
It really sucks that this is the way things are, but here's what I did:
10 page requests in a minute and you get captcha'd (with a little apology and the option to bypass it by logging in). Asset loads don't count.
After a captcha pass, 100 requests in an hour gets you auth-walled.
It’s really shitty but my industry is used to content scraping.
This allows legit users to get what they need. Although my users maybe don’t need prolonged access ahem.
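A sketch of that tiered scheme (thresholds copied from the comment; how clients are identified and where the state lives are left open):

```python
# Page requests only (asset loads aren't recorded); returns what to do next.
import time
from collections import defaultdict, deque

PAGE_LIMIT_PER_MINUTE = 10  # over this: show a captcha
PAGE_LIMIT_PER_HOUR = 100   # over this after a captcha pass: require login

history = defaultdict(deque)  # client id -> timestamps of recent page requests

def record_and_classify(client_id: str, passed_captcha: bool) -> str:
    """Return "ok", "captcha", or "auth_wall" for this page request."""
    now = time.time()
    q = history[client_id]
    q.append(now)
    while q and q[0] < now - 3600:  # keep one hour of history
        q.popleft()
    last_minute = sum(1 for t in q if t > now - 60)
    if passed_captcha:
        return "auth_wall" if len(q) > PAGE_LIMIT_PER_HOUR else "ok"
    return "captcha" if last_minute > PAGE_LIMIT_PER_MINUTE else "ok"
```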
nomel
What happens if you use the proper rate limiting status of 429? It includes a next retry time [1]. I'm curious what (probably small) fraction would respect it.
[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
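For reference, this is roughly what a crawler that respects the header would do (the URL is a placeholder, and only the plain-seconds form of Retry-After is handled):

```python
# A client that backs off when told to; most scrapers presumably won't bother.
import time
import urllib.request
from urllib.error import HTTPError

def polite_get(url: str, max_retries: int = 3) -> bytes:
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code != 429:
                raise
            retry_after = err.headers.get("Retry-After", "60")
            # Retry-After can also be an HTTP date; only the plain-seconds
            # form is handled in this sketch.
            delay = int(retry_after) if retry_after.isdigit() else 60
            time.sleep(delay)
    raise RuntimeError("still rate limited after retries")

# polite_get("https://example.com/some/page")
```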
rfurmani
I've wanted to but wasn't sure how to keep track of individuals. What works for you? IP Addresses, cookies, something else?
nukem222
> This allows legit users to get what they need.
Of course they could have just used the site directly.
LPisGood
What is your website?
Nckpz
I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"
ndiddy
This is yet another great example of the innovation that the AI industry is delivering. Why limit your scraper bots to GET requests when there might be some juicy data to train on hidden behind that form? There's a reason why the a16z-funded cracked vibe-coder ninjas are taking over software engineering: they're full of wonderful ideas like this.
ohgr
I work with one of those guys. Morally bankrupt at every level of his existence.
bendangelo
There are bots that scrape HTTPS certificate registrations; that's how they usually find you.
xena
Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!
Figs
Hmm. Instead of requiring JS on the client, why don't you add a delay on the server side (e.g. 1 second default, adjustable by server admin) for requests that don't have a session cookie? For each session keep a counter and a timestamp. Every time you get a request from a session, look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp. If the counter is greater than a configured threshold, slow-walk the response (e.g. add a delay before forwarding the request to the shielded web server -- or transfer the response back out at reduced bytes/second, etc.)
You can periodically remove tracking data for entries older than a threshold -- e.g. once a minute or so (adjustable) remove tracked entries that haven't made a request in the past minute to keep memory usage down.
That'd effectively rate limit the worst offenders with minimal impact on most well-behaved edge-case users (like me running NoScript for security) while also wasting less energy globally on unnecessary computation through the proof-of-work scheme, wouldn't it? Is there some reason I'm not thinking of that would prevent that from working?
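A sketch of that slow-walk scheme, with arbitrary threshold, delay, and cleanup interval:

```python
# Per-session counters with a delay once a threshold is crossed, plus a
# periodic cleanup pass to keep memory usage down.
import threading
import time

THRESHOLD = 30        # requests before the slow-walk kicks in
DELAY_SECONDS = 1.0   # extra latency for anyone over the threshold
IDLE_EXPIRY = 60      # drop tracking data after a minute of silence

sessions = {}         # session id -> (request count, last seen timestamp)
lock = threading.Lock()

def before_proxy(session_id: str) -> None:
    """Call before forwarding a request to the shielded web server."""
    now = time.time()
    with lock:
        count, _ = sessions.get(session_id, (0, now))
        sessions[session_id] = (count + 1, now)
    if count + 1 > THRESHOLD:
        time.sleep(DELAY_SECONDS)  # slow-walk the response

def cleanup() -> None:
    """Run periodically (e.g. once a minute) to drop idle sessions."""
    cutoff = time.time() - IDLE_EXPIRY
    with lock:
        for sid in [s for s, (_, seen) in sessions.items() if seen < cutoff]:
            del sessions[sid]
```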
viraptor
> look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp
Not sure about the author's motivation, but this part is why I don't track usage: PoW lets you do everything statelessly, without keeping any centralised database or writing any data. The benefit of a system slowing down crawling should be minimal resource usage for the server.
shric
You're well into https://refactoringenglish.com/tools/hn-popularity/ so enjoy the fame!
seafoamteal
I've seen Anubis a couple times irl, mostly on Sourcehut, and the first time I saw it I was like, "Hey, I remember that blog post!" Congratulations on making something both useful and usable!
true_blue
On the few sites I've seen using it so far, it's been a more pleasant (and cuter) experience for me than the captchas I'd probably get otherwise. good work!
xena
Thanks! The artist I'm contracting and I are in discussions on how to make the mascot better. It will be improved. And more Canadian.
techjamie
The ffmpeg website is also using it. First time I actually saw it in the wild.
rco8786
I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.
Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.
i5heu
At what level of DDoS can one claim damages from them?
Neil44
I've got claude bot blocked too. It regularly took sites offline and ignored robots.txt. Claude bot is an asshole.
throwaway2037
robots.txt did not work?
seabird
Of course it didn't work. At best, the dorks doing this think there's a gamechanging LLM application to justify the insane valuations right around the corner if they just scrape every backwater site they can find. At worst, they're doing it because it's paying good money. Either way, they don't care, they're just going to ignore robots.txt.
hsbauauvhabzb
I have not monitored traffic in this way, but I imagine most AI companies would explicitly follow links listed in robots, even if not mentioned elsewhere on the site.
lemper
bro, since when do VC-funded AI companies have the courtesy to respect robots.txt?
ipaddr
I have a number of content sites, and I've shut down a few of them in the last few days because of the toll these aggressive AI bots take. Alexa seems like the worst.
These sites were created 20 years ago and updated over the years. I used to get traffic, but that has slowed to 1,000 or fewer legitimate visitors over the last year. Now I also have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.
Aurornis
> Alexa seems like the worst.
Many of the bots disguise themselves as coming from Amazon or another big company.
Amazon has a page where you can check some details to see if it’s really their crawler or someone imitating it.
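The generic version of that check is forward-confirmed reverse DNS; a sketch (the hostname suffix is only illustrative, use whatever the vendor's verification page documents):

```python
# Reverse-resolve the client IP, check the hostname suffix, then resolve the
# hostname forward again and confirm it maps back to the same IP.
import socket

def crawler_is_genuine(ip: str, expected_suffix: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        if not host.endswith(expected_suffix):
            return False
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
        return ip in forward_ips               # forward-confirm
    except OSError:
        return False

# crawler_is_genuine("1.2.3.4", ".crawl.amazonbot.amazon")  # suffix illustrative
```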
spookie
yup, actually most that I've seen are impersonating amazon
svelle
It's funny how every time this topic comes up, someone says "I've had this happen and x is the worst", with x being any of the big AI providers. Just a couple of minutes ago I read the same in another thread, and it was Anthropic. A couple of weeks back it was Meta.
My conclusion is that they're all equally terrible then.
Aurornis
All of the crawlers present themselves as being from one of the major companies, even if they’re not.
Setting user-agent headers is easy.
cyanydeez
at the same time, all the AI providers have some kind of web-based AI agent, so let's not pretend they're crafting their services with care for other people's websites.
ipaddr
I agree they are all creating negative value for site owners. In my personal experience this week blocking Amazon solved my server overload issue.
HermanMartinus
I literally just published a post on this and how it affects Bear Blog.