Cloudflare Introduces Default Blocking of A.I. Data Scrapers

336 comments

·July 2, 2025

abalashov

Few people realise that virtually everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop.

It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.

andy99

It's Cloudflare and parasites like them that will make the internet un-free. It's already happening: I'm either blocked or back to 1998 load times because of "checking your browser". They are destroying the internet and will make it so only people who do approved things on approved browsers (meaning let advertising companies monetize their online activity) will get real access.

Cloudflare isn't solving a problem, they are just inserting themselves as an intermediary to extract a profit, and making everything worse.

slenk

How is Cloudflare a parasite? I can use Cloudflare, and get their AI protection, for free. I have dozens of domains I have used with Cloudflare at one point and I haven't paid them a dime.

brendyn

A parasite leeches off its host to the host's harm. Maybe it's not a good analogy, but I'm in China, and it's painful, after paying money for a VPN to bypass censorship, to find myself routinely blocked by CDNs because they decided I'm not human. I'm honestly feeling more oppressed by these middlemen than by the government sometimes. For example, maybe I can't log in to a game due to being blocked by the login API, and the game company just responds by telling me to run an antivirus scanner and try again, since they are not personally developing that system and lack awareness of it. People with a genuine need for VPNs and privacy tools are the sacrifice for this system.

fsflover

They put themselves as a middle man for almost the whole Internet, collect huge usage data about everyone and block anybody who doesn't use mainstream tools:

https://news.ycombinator.com/item?id=42953508

https://news.ycombinator.com/item?id=13718752

https://news.ycombinator.com/item?id=23897705

https://news.ycombinator.com/item?id=41864632

https://news.ycombinator.com/item?id=42577076

chaoskitty

Serious question: You put Cloudflare between all your domains and all your visitors without looking into how this would affect your site's reachability? If so, that's interesting, considering that many people in this community are negatively affected by Cloudflare because they're using Linux and/or some less-than-mainstream browser.

You might want to read some threads on here about Cloudflare.

lxgr

> I have dozens of domains I have used with Cloudflare at one point and I haven't paid them a dime.

Maybe you haven't, but your users (primarily those using "suspicious" operating systems and browsers) certainly have – with their time spent solving captchas.

bombcar

Download Brave.

Turn on Tor and browse for a week.

Now you know what “undesirables” feel like, where “undesirables” can be from a poor country, a bad IP block, outdated browsers, etc.

It sucks.

shlomo_z

Did you read his comment? He explained the issue he has with Cloudflare...

axus

From the server perspective Cloudflare is solving problems and not causing problems to other servers.

Analogy: locks for high-value items in grocery stores are annoying to customers, but other stores aren't being coerced by the locksmith to use them.

carlhjerpe

I use Firefox with adblocking and some anti-fingerprinting measures and I rarely hit their challenges. Your IP reputation must be bad.

They have an addon [1] that helps you bypass Cloudflare challenges anonymously somehow, but it feels wrong to install a plugin in your browser from the ones who make your web experience worse.

1: https://developers.cloudflare.com/waf/tools/privacy-pass/

chrismorgan

> Your IP reputation must be bad.

And for an extremely large number of honest users, they cannot realistically avoid this.

I live in India. Mobile data and fibre are all through tainted CGNAT, and I encounter Cloudflare challenges all the time. The two fibre providers I know about use CGNAT, and I expect others do too. I did (with difficulty!) ask my ISP about getting a static IP address (having in mind maybe ditching my small VPS in favour of hosting from home), but they said ₹500/month, which is way above market rate for leasing IPv4 addresses, more than I pay for my entire VPS in fact, so it definitely doesn’t make things cheaper. And I’m sceptical that it’d have good reputation with Cloudflare even then. It’ll probably still be in a blacklisted range.

henrixd

I'm having lots of problems with fingerprinting protection on Librewolf and ungoogled-chromium. I use the uBlock Origin and JShelter extensions on both. I'm always getting "your browser is out of date" despite always having the newest versions.

Some sites like Stackexchange will work after just reloading the page. The rest of the sites usually work when I remove Javascript protection and Fingerprint detection from JShelter. Still, not all of them. So they maybe/probably want to reliably fingerprint my browser before letting me continue.

If I use crappy fingerprint protection, I don't have problems, but if I actually randomize some values then sites won't work. JShelter deterministically randomizes some values, using the session identifier and eTLD+1 domain as a key, to avoid breaking site functionality, but apparently Cloudflare is being really picky. Tor Browser doesn't have these problems, but it uses a different strategy to protect itself from fingerprinting: it doesn't randomize values but tries to present unified values across different users, making identification impossible.
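To illustrate what "deterministic randomization keyed on session and eTLD+1" can mean, here is a minimal sketch. It is not JShelter's actual algorithm; the function name `domain_noise`, the use of HMAC-SHA256, and the 4-byte output are all assumptions chosen for illustration. The point is only that the same session and site always yield the same noise, while a different site or a new session yields different noise.

```python
import hashlib
import hmac


def domain_noise(session_key: bytes, site: str, n_bytes: int = 4) -> bytes:
    """Derive per-site noise that is stable within one browsing session.

    session_key: random bytes generated once per session (assumption).
    site: the eTLD+1 of the visited page, e.g. "example.com".
    """
    digest = hmac.new(session_key, site.encode("utf-8"), hashlib.sha256).digest()
    return digest[:n_bytes]


# Hypothetical usage: perturb a reported value with the derived noise.
session = b"random-per-session-secret"
noise_a = domain_noise(session, "example.com")
noise_b = domain_noise(session, "example.org")
```

Because the noise is stable per site, a page sees consistent values across reloads, which is what keeps most sites functional. A detector that compares the reported values against what a stock browser would report can still notice the perturbation.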

godelski

I'm in a pretty similar boat except I frequently hit challenges. Especially if I use a VPN (which is more trustworthy than my ISP). Ironically, I'm using Cloudflare for DoH

rockskon

LLM scrapers have dramatically been increasing the cost of hosting various small websites.

Without something being done, the data that these scrapers rely on would eventually no longer exist.

benjiro

I think the correct framing is that unrestricted LLM scrapers have dramatically been increasing the cost of hosting various small websites.

It's not an issue when somebody does "ethical" scraping with, for instance, a 250ms delay between requests and an active cache that checks specific pages (like news article links) to rescrape at 12 or 24h intervals. This type of scraping results in almost no pressure on the websites.

The issue that I have seen is that the more unscrupulous parties just let their scrapers go wild, constantly rescraping again and again because the cost of scraping is extremely low. A small VM can easily push 1000s of scrapes per second, let alone somebody with more dedicated resources.

Actually building an "ethical" scraper involves more time, as you need to fine-tune it per website. Unfortunately, this behavior is going to cost the more ethical scrapers a ton, as anti-scraping efforts will increase the cost on our side.
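As a rough sketch of the pacing described above (a 250ms gap between requests plus a cache that only revisits a page after 12 hours), something like the following would do. The class name `PoliteScheduler` and its methods are hypothetical, and the sketch only makes scheduling decisions; it performs no network I/O.

```python
from dataclasses import dataclass, field

MIN_DELAY_S = 0.25             # 250 ms between requests, per the comment
RESCRAPE_AFTER_S = 12 * 3600   # revisit a page at most every 12 hours


@dataclass
class PoliteScheduler:
    """Decides when a URL may be fetched; does no fetching itself."""
    last_request_at: float = float("-inf")
    last_fetched: dict = field(default_factory=dict)  # url -> timestamp

    def is_fresh(self, url: str, now: float) -> bool:
        """True if the cached copy is younger than the rescrape interval."""
        fetched_at = self.last_fetched.get(url)
        return fetched_at is not None and now - fetched_at < RESCRAPE_AFTER_S

    def wait_needed(self, now: float) -> float:
        """Seconds to sleep so consecutive requests stay MIN_DELAY_S apart."""
        return max(0.0, self.last_request_at + MIN_DELAY_S - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Remember when a URL was fetched and when the last request went out."""
        self.last_request_at = now
        self.last_fetched[url] = now
```

A crawl loop would skip any URL for which `is_fresh` is true, sleep for `wait_needed(now)` seconds, fetch, and then call `record_fetch`. Per-site tuning, which the comment says is the expensive part, would mean choosing these two constants per host.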

trollbridge

I use Cloudflare and edge caching, so it doesn’t really affect me, but the amount of LLM scraping of various static assets for apps I host is ridiculous.

We’re talking a JavaScript file of strings to respond like “login failed”, “reset your password” just over and over again. Hundreds of fetches a day, often from what appears to be the same system.

brumar

Correction: extract monstrous profits. When I read about the revenues associated with the Reddit AI deals, I can't even imagine what deals that cover half of the internet could be worth. Cynically speaking, it's a genius-level move.

dceddia

Yep this terrifies me, 100%. We’re slowly losing the open internet and the frog is being boiled slowly enough that people are very happy to defend the rising temperature.

If DDoS wasn’t a scary enough boogeyman to get people to install Cloudflare as a man-in-the-middle on all their website traffic, maybe the threat of AI scrapers will do the trick?

The thing about this slow slide is it’s always defensible. Someone can always say “but I don’t want my site to be scraped, and this service is free, or even better yet, I can set up my own toll booth and collect money! They’re wonderful!”

Trouble is, one day, at this rate, almost all internet traffic will be going through that same gate. And once they have literally everyone (and all their traffic)… well, internet access is an immense amount of power to wield and I can’t see a world in which it remains untainted by commercial and government interests forever.

And “forever” is what’s at stake, because it’ll be near impossible to recover from once 99% of the population is happy to use one of the 3 approved browsers on the 2 approved devices (latest version only). Feels like we’re already accepting that future at an increasing rate.

RiverCrochet

The Internet is not the first global network. Before the Internet, you had the global telephone network. It, too, strangled end users, but eventually became stagnant, overpriced, and irrelevant. Super long-term, the current Internet is not immune from this. Internet standards are getting about as complicated and quirky as the old Bell stuff that was trying to make miles of buried copper the future, and if regulatory/commercial forces freeze this stuff in place, it's going to lead to stagnation eventually.

Something coming down the pike I think, for example, is that IPv4 addresses are going to get realllly expensive soon. That's going to lead to all sorts of interesting things in the Internet landscape and their applications.

I'm sure we'll probably have to spend some decades in the "approved devices and browsers only" world before the next wave comes.

mattl

We need a reasonable alternative to some of what Cloudflare does that can be easily installed as a package on Linux distributions, without requiring any of the following to install it:

* curl | bash

* Docker

* Anything that smacks of cryptocurrency or other scams

Just a standard repo for Debian and RHEL derived distros. Fully open source so everyone can use it. (apt/dnf install no-bad-actors)

Until that exists, using Cloudflare is inevitable.

It needs to be able to at least:

* provide some basic security (something to check for sql injection, etc)

* rate limiting

* User agent blocking

* IP address and ASN blocking

Make it easy to set up with sensible defaults and a way to subscribe to blocklists.
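A minimal sketch of three items from that wish list (rate limiting, User-Agent blocking, IP range blocking) might look like this. Everything here is an assumption for illustration: the class name `RequestFilter`, the token-bucket parameters, the blocked substring, and the blocked network (a reserved documentation range). ASN blocking is left out because it needs an external IP-to-ASN database.

```python
import ipaddress
from collections import defaultdict

BLOCKED_UA_SUBSTRINGS = ("badbot",)                           # hypothetical blocklist entry
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]   # documentation range (TEST-NET-3)
RATE_PER_S = 5.0   # tokens refilled per second, per client IP
BURST = 10.0       # bucket capacity


class RequestFilter:
    def __init__(self):
        # ip -> (tokens, last_seen); a new client starts with a full bucket
        self._buckets = defaultdict(lambda: (BURST, None))

    def allow(self, ip: str, user_agent: str, now: float) -> bool:
        """Return True if the request passes all three checks."""
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in BLOCKED_NETWORKS):
            return False
        ua = user_agent.lower()
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            return False
        # Token bucket: refill by elapsed time, then spend one token.
        tokens, last = self._buckets[ip]
        if last is not None:
            tokens = min(BURST, tokens + (now - last) * RATE_PER_S)
        if tokens < 1.0:
            self._buckets[ip] = (tokens, now)
            return False
        self._buckets[ip] = (tokens - 1.0, now)
        return True
```

The "subscribe to blocklists" part would amount to refreshing `BLOCKED_NETWORKS` and `BLOCKED_UA_SUBSTRINGS` from a feed. A real package would also need to evict idle buckets so memory does not grow without bound.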

nickjj

Yep, it's really annoying.

I'm using Firefox with a normal adblocker (uBlock Origin).

I get hit with a Cloudflare captcha often and that page itself takes a few seconds before I can even click the checkbox. It's probably an extra 6-7 seconds and it happens quite a few times a day.

It's like calling into a billion dollar company and it taking 4 minutes to reach a human because you're forced through an automated system where you need to choose 9 things before you even have a chance to reach a human. Of course it rattles through a bunch of non-skippable stuff that isn't related to your issue for the first minute, like how much the company is there to offer excellent customer support and how much they value you.

It's not about the 8 seconds or 4 minutes. It's the feeling that you're getting put into really poor experiences from companies with near-unlimited resources with no control over the situation while you slowly watch everything get worse over time.

The Cloudflare situation is worse because you have no options as an end user. If a site uses it, your only option is to stop using the site and that might not be an option if they are providing you an important service you depend on.

Secondly, they now have a complete profile of your browsing history for any site that has CF enabled, and there's not much you can do about it except stop using the 20% (or whatever their market share is) of the internet they front, and also do a DNS lookup for every domain you visit from an anonymous machine to see if it's in a Cloudflare IP range.

In case you didn't know, CF offers a partial CNAME / DNS feature where your primary DNS can be hosted anywhere and then you can proxy traffic from CF to your back-end on a per domain / sub-domain level. Basically you can't just check a site's DNS provider to see if they are on CF. You would have to check each domain and sub-domain to see if it resolves to a CF IP range which is documented here: https://www.cloudflare.com/ips-v4/# and https://www.cloudflare.com/ips-v6/#
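The per-domain check described above can be sketched with the standard library. The three CIDR blocks below are a small sample that appeared in Cloudflare's published lists at the time of writing; treat them as an illustrative snapshot and load the full, current lists from the two URLs above in practice. The function names are hypothetical.

```python
import ipaddress
import socket

# Illustrative subset only; the authoritative lists are at the URLs above.
SAMPLE_CF_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("104.16.0.0/13", "172.64.0.0/13", "2606:4700::/32")
]


def in_ranges(ip: str, ranges=SAMPLE_CF_RANGES) -> bool:
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


def resolves_into_ranges(hostname: str, ranges=SAMPLE_CF_RANGES) -> bool:
    """Resolve a hostname (performs a DNS lookup) and check every address."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return any(in_ranges(info[4][0], ranges) for info in infos)
```

Since proxying is configured per domain and sub-domain, `resolves_into_ranges` would have to be called for each hostname a page loads from, not just the site's apex domain.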

MichaelZuo

If you're on IPv6, I think they have to for IPv6 addresses… there’s just way too many bots and way too many addresses to feasibly do anything more precise.

If you're on IPv4, you should check whether you're behind a NAT; otherwise you may have gotten an address that was previously used by a bot network.

lxgr

> I think they have to for ipv6 addresses… there’s just way too many bots and way too many addresses

Are you really arguing that it's legitimate to consider all IPv6 browsing traffic "suspicious"?

If anything, I'd say that IPv4 is probably harder, given that NATs can hide hundreds or thousands of users behind a single IPv4 address, some of which might be malicious.

> you may have gotten an address that was previously used by a bot network.

Great, another "credit score" to worry about...

jefftk

I write online (comments here, open source software, blogging, etc) because I have ideas I want to share. Whether it's "I did a thing and here's how" or "we should change policy in this specific way" or "does anyone know how to X" I'm happy for this to go into training models just like I'm happy for it to go into humans reading.

dolebirchwood

Thank you for having this attitude. I have never attempted any blogging because I always figured no one is actually going to read it. With LLMs, however, I know they will. I actually see this as a motivation to blog, as we are in a position to shape this emerging knowledge base. I don't find it discouraging that others may be profiting off our freely published work, just as I myself have benefited tremendously from open source and the freely published works of others.

arkmm

This is an interesting take, thanks for sharing. I wonder how someone should adjust their blogging if they believe their primary audience will be LLMs.

godelski

Tbh, that content I'm mostly fine with. My only real issue is that people are making trillions off the free labor of people like you and me, giving less time to create that OSS and blogs. But this isn't new to AI, it is just scaled.

What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.

I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.

We write OSS and blog because information should be free. But that information is then being locked behind paywalls and becoming more difficult to find through search. Frankly, that's not okay.

lxgr

> What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.

Of course they do, to some extent. Just because it's been infeasible to track the exact "graph of influence", that's literally how humans have learned to speak and write for as long as we've had language and writing.

> I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.

That's a much more serious concern, in my view. But I believe that LLMs are both the problem and solution here: "Remove style entropy" is just a prompt away, these days.

BeetleB

> A person may learn from the words I write but that person doesn't end up mimicking the way I write.

Oh, I wish I could get AI to mimic the way I write! I'd pay money for it. I often want to type up an email/doc/whatever but don't because of occasional RSI issues. If I could get an AI to type it up for me while still sounding like me - that would be a big boon for my health.

bob1029

> OSS

> people are making trillions off the free labor of people like you and me

I read "No Discrimination Against Fields of Endeavor" to also include LLMs and especially the cases that we most deeply disagree with.

Either we believe in the principles of OSS or we do not. If you do not like the idea of your intellectual property being used for commercial purposes then this model is definitely not for you.

There is no shame in keeping your source code and other IP a secret. If you have strong expectations of being compensated for your work, then perhaps a different licensing and distribution model is what you are after.

> that information is then being locked behinds paywalls and becoming more difficult to be found through search

Sure - If you give up and delete everything. No one is forcing you to put your blog and GH repos behind a paywall.

bawolff

I think it's 100% ok to freely train on public internet data.

What is absolutely not ok is to crawl at such an excessive speed that it makes it difficult to host small scale websites.

Truly a tragedy of the commons.

tedd4u

Agree. The problem lately is that even if each single scraper is doing so “reasonably,” there are so many individuals and groups doing this that it’s still too onerous for many sites. And of course many are not “reasonable.”

SchemaLoad

This is the attitude that's going to kill the public internet. Because you're right, it is a free for all right now with the only way to opt out being putting content behind restricted platforms.

visarga

> everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop

I think, on the contrary, whoever sets the prompts stands to get the benefits, the AI provider gets a flat fee, and authors get nothing except the same AI tools as anyone else. That is natural: since the users are bringing the problem to the AI, of course they have the lion's share here.

AI is useless until applied to a specific task owned by a person or company. Within such a task there is opportunity for AI to generate value. AI does not generate its own opportunities, users do.

Because users are distributed across society, benefits follow the same curve. They don't flow to the center but mainly remain at the edge. In this sense LLMs are like Linux: they serve every user in their specific way, but the contributors to the open source code don't get directly compensated.

qskousen

That's a really interesting way to think about it, thank you! I've always had a kind of "gut feeling" that AI training on our data is fine with me, but without really thinking too much about why. I think this explains what I've been feeling.

jowea

Is it even possible that Cloudflare could manage to block all AI data scraping? I think this measure is just going to make it harder and more expensive, which will stop AI scrapers from hitting every single page every single day and creating expenses for publishers, but not actually stop their data from ending up in a few datasets.

godelski

Including your comment, including this comment.

HN itself is routinely scraped. What makes me most uncomfortable is deanonymization via speech analysis. It's something we can already do but is hard to do at scale. This is the ultimate tool for authoritarians. There's no hidden identities because your speech is your identifier. It is without borders. It doesn't matter if your government is good, a bad acting government (or even large corporate entity) has the power to blackmail individuals in other countries.

We really are quickly headed towards a dystopia. It could result in the entire destruction of the internet or an unprecedented level of self-censorship. We already have algospeak because of platform censorship[0]. But this would be a different type of censorship. Much more invasive, much more personal. There are things worse than the dark forest.

[0] literally yesterday YouTube gave me, a person in the 25-60 age bracket, a content warning because there was a video about a person that got removed from a plane because they wore a shirt saying "End veteran suicide".

[0.1] Even as I type this I'm censored! Apple will allow me to swipe the word suicidal but not suicide! Jesus fuck guys! You don't reduce the mental health crisis by preventing people from even being able to discuss their problems, you only make it worse!

trollbridge

The degree to which people say “self-delete” and “unalive” is absurd these days and I now hear it in real life.

It’s Orwellian in the truest sense of the word.

baq

Orwell was the optimist. It’s Huxley’s vision we should be really worried about. Brave new world indeed.

cmeacham98

Cutting humans out of what loop? What jobs or opportunities were people posting Reddit comments or whatever getting that are now going to AI?

Larrikin

People who used to post gained knowledge from their profession or hobby. I don't bother posting any of that information on large sites like Reddit anymore, for various reasons, but AI scraping solidified that decision.

I'll still post on the increasingly fewer hobby message boards that are out there.

kamarg

> What jobs or opportunities were people posting Reddit comments or whatever getting that are now going to AI?

Content writing, product reviews (real & fake), creative writing, customer support, photography/art to name a few off the top of my head.

fkyoureadthedoc

Now the astroturfing is done by AI agents instead of hard working serfs in a call center, you hate to see it

jasonthorsness

I turned this on and it adjusts the robots.txt automatically; not sure what else it is doing.

   # NOTICE: The collection of content and other data on this
   # site through automated means, including any device, tool,
   # or process designed to data mine or scrape content, is
   # prohibited except (1) for the purpose of search engine indexing or
   # artificial intelligence retrieval augmented generation or (2) with express
   # written permission from this site’s operator.

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site’s operator directly.

   # BEGIN Cloudflare Managed content

   User-agent: Amazonbot
   Disallow: /

   User-agent: Applebot-Extended
   Disallow: /

   User-agent: Bytespider
   Disallow: /

   User-agent: CCBot
   Disallow: /

   User-agent: ClaudeBot
   Disallow: /

   User-agent: Google-Extended
   Disallow: /

   User-agent: GPTBot
   Disallow: /

   User-agent: meta-externalagent
   Disallow: /

   # END Cloudflare Managed Content

   User-agent: *
   Disallow: /*
   Allow: /$
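For anyone wondering how a well-behaved crawler would read the managed section, here is a small sketch using Python's standard `urllib.robotparser`. The excerpt is shortened to two of the listed agents, `example.com` is a placeholder, and `may_fetch` is a hypothetical helper. The trailing wildcard rules (`/*` and `/$`) are left out on purpose because, as far as I know, the standard-library parser does plain prefix matching rather than wildcard matching.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the managed block quoted above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str = "https://example.com/some/page") -> bool:
    """Ask the parsed rules whether this agent is permitted to fetch the URL."""
    return parser.can_fetch(user_agent, url)


blocked = may_fetch("GPTBot")          # listed agent, disallowed everywhere
unlisted = may_fetch("ExampleBrowser")  # no matching group in this excerpt
```

This only models what a cooperating crawler would decide for itself; nothing in robots.txt enforces the answer.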

1vuio0pswjnm7

"User-agent: CCBot disallow: /"

Is Common Crawl exclusively for "AI"

CCBot was already in so many robots.txt prior to this

How is CC supposed to know or control how people use the archive contents

What if CC is relying on fair use

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees

Is it common for website terms and conditions to permit site operators to sublicense other peoples' ("users") work for use in creating LLMs for a fee

Is this fee shared with the rights holders

ronsor

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

Scrapers don't accept the terms of service.

Ironically, I've only ever scraped sites that block CCBot, otherwise I'd rather go to Common Crawl for the data.

nemomarx

Read a ToS and you'll notice that you give the site operators an unlimited license to reproduce or spread your works, on almost any site. It's essentially required in order to host and show the content.

postalcoder

This is interesting. The reasoning and response don't line up.

  > Cloudflare is making the change to protect original content on the internet, Mr. Prince said. If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content, he said

  >  prohibited except for the purpose of [..] artificial intelligence retrieval augmented generation

This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.

fennecfoxy

With that opinion, are you also suggesting that we ban ad blockers? Because it's better I not click & consume resources than click and not be served ads, basically just costing the host money.

It makes sense to allow RAG in the same way that search engines provide a snippet of an important chunk of the page.

A blog author could not complain that their blog is getting ragged when they're extremely liable to be Google/whatever searching all day and basically consuming others' content in exactly the same way that they're trying to disparage.

ijk

What I want to know is if the flood of scraping everyone has been complaining about is coming from people trying to scrape for training or bots doing RAG search.

I get that everyone wants data, but presumably the big players already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet scale data?

I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.

postalcoder

I don't think we should ban ad blockers, but I also think it's fair to suggest that the loss of organic traffic could be affecting the incentive to create new digital content, at least as much as the fear of having your content absorbed into an LLM's training data.

lxgr

More and more people use ChatGPT for search, so blocking that doesn't seem like a successful strategy long-term.

bee_rider

I wonder… Google scrapes for indexing and for AI, right? I wonder if they will eventually say: ok, you can have me or not, if you don’t want to help train my AI you won’t get my searches either. That’s a tough deal but it is sort of self-consistent.

mrweasel

Very few people seem to be complaining that Google crashes their sites. Google also publishes its crawlers' IP ranges, but you really don't need to rate-limit Google; they know how to back off and not overload sites.

Symbiote

In theory — in practise I've had to limit Google on two large sites at work. I currently have them limited to 10/s for non-cached requests.

giancarlostoro

"Embrace, Extend, Extinguish" Google's mantra. And yes, I know about Microsoft's history with that phrase ;) But Google has done this with email, browsers (Google has web apps that run fine on Firefox but request you use Chrome), Linux (Android), and I'm sure there's others I am forgetting about.

So yeah, I too could see them doing this.

xyst

So this is in addition to updating the robots.txt file, which really only blocks a small number of them.

Seems CF has been gathering data and profiling these malicious agents.

This post by CF elaborates a bit further: https://blog.cloudflare.com/declaring-your-aindependence-blo...

Basically becomes a game of cat and mouse.

Bender

For my silly hobby sites I just return status 444 (close the connection) for anything that has a case-insensitive "bot" in the UA and is requesting anything other than robots.txt, humans.txt, favicon.ico, etc... This would also drop search engines, but I blackhole route most of their CIDR blocks anyway. I'm probably the only one here that would do this.
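The rule is simple enough to sketch as a single function. This is only an illustration of the logic, not the actual server config; the names `status_for` and `ALLOWED_PATHS` are made up, and the allow-list contains just the three files named above (the "etc." is not filled in). Status 444 is an nginx-specific code that closes the connection without sending a response.

```python
ALLOWED_PATHS = {"/robots.txt", "/humans.txt", "/favicon.ico"}
DROP_STATUS = 444  # nginx-specific: close the connection without a response


def status_for(user_agent: str, path: str):
    """Return 444 for 'bot' user agents outside the allow-list, else None."""
    if "bot" in user_agent.lower() and path not in ALLOWED_PATHS:
        return DROP_STATUS
    return None  # None means: serve the request normally
```

Anything that omits "bot" from its User-Agent passes straight through, so this only catches crawlers that identify themselves.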

sneak

How does a bot scraping your silly hobby sites for any purpose harm or negatively affect you in any way?

Bender

Only if they push me over my bandwidth limits but they can't do that if I just drop them on the floor.

pixl97

Depends if they hit a site enough to make it cost something. It's not hard for bots to flood servers.

lxgr

That's at least a more reasonable default than what I've seen at least one newspaper do, which is to explicitly block both LLM scrapers and things like ChatGPT's search feature.

slenk

I thought I saw cloudflare insert noindex links?

swyx

what actually are the consequences of ignoring robots.txt (apart from DDOS)? have any of these cases ended up in court at all?

v5v3

The BBC recently served a cease and desist on Perplexity, demanding that it stop and delete all existing content.

https://www.bbc.co.uk/news/articles/cy7ndgylzzmo

So an AI company can just be naughty until asked to stop, and then exclude the one company that has the financial resources to go legal.

btown

The headline is somewhat misleading: sites using Cloudflare now have an opt-in option to quickly block all AI bots, but it won't be turned on by default for sites using Cloudflare.

The idea that Cloudflare could do the latter at the sole discretion of its leadership, though, is indicative of the level of power Cloudflare holds.

GrayShade

> sites using Cloudflare now have an opt-in option to quickly block all AI bots, but it won't be turned on by default for sites using Cloudflare

Do you have a source for that? https://blog.cloudflare.com/content-independence-day-no-ai-c... does say "changing the default".

mattcollins

"This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard."

https://blog.cloudflare.com/control-content-use-for-ai-train...

bitpush

It is now an adversarial relationship between AI bots and websites, and Cloudflare is merely reacting to it.

Would you say the same for ddos protection? Isn't that the same as well?

TechDebtDevin

They aren't doing anything. They are attempting to insert themselves into the middle of a marketplace (that doesn't exist and never will) where scrapers pay for IP. They think they're going to profit off the bots, not protect your site. Don't fall for their scam.

bitpush

What do you mean they are trying to insert themselves? If I have a website that I host with Cloudflare, I (as the rightful website owner) have inserted Cloudflare in between.

It isn't CF going around saying, "That's a nice website you have there. I'm gonna put myself in between."

TechDebtDevin

They can't do anything other than bog down the internet. I haven't found a single CF-provided challenge I haven't been able to get past in < half a day.

This is simply just the first step in them implementing a marketplace and trying to get into LLM SEO. They don't care about your site or protecting it. They are gearing up to start taking a cut in the middle between scrapers and publishers. Why wouldn't I go DIRECTLY to the publisher and make a deal? So dumb. I hate CF so much.

The only thing Cloudflare knows how to do is MITM attacks.

Marsymars

So what would you suggest as an alternative if I have a site where I don’t want the content used for LLM training?

fkyoureadthedoc

Auth? Because whatever Cloudflare is doing isn't going to stop anyone serious about scraping data.

sct202

My data served by Cloudflare has increased to 100 GB/month compared to <20 GB about 2 years ago, and these are all fairly static hobby sites. Actual people traffic is down by about half in the same time frame, so I imagine a lot of this is probably cost savings for Cloudflare to reduce resource usage.

Apofis

Makes total sense, bandwidth on this scale is expensive.

Meekro

I've heard lots of people on HN complaining about bot traffic bogging down their websites, and as a website operator myself I'm honestly puzzled. If you're already using Cloudflare, some basic cache configuration should guarantee that most bot traffic hits the cache and doesn't bog down your servers. And even if you don't want to do that, bandwidth and CPU are so cheap these days that it shouldn't make a difference. Why is everyone so upset?

noodle

As someone who had some outages due to AI traffic and is now using CloudFlare's tools:

Most of my site is cached in multiple different layers. But some things that I surface to unauthenticated public can't be cached while still being functional. Hammering those endpoints has taken my app down.

Additionally, even though there are multiple layers, things that are expensive to generate can still slip through the cracks. My site has millions of public-facing pages, and a batch of misses that happen at the same time on heavier pages to regenerate can back up requests, which leads to errors, and errors don't result in caches successfully being filled. So the AI traffic keeps hitting those endpoints, they keep not getting cached and keep throwing errors. And it spirals from there.

Symbiote

That's a pretty big assumption.

The largest site I work on has 100,000s of pages, each in around 10 languages — that's already millions of pages.

It generally works fine. Yesterday it served just under 1000 RPS over the day.

AI crawlers have brought it down when a single crawler has added 100, 200 or more RPS distributed over a wide range of IPs — it's not so much the number of extra requests, though it's very disproportionate for one "user", but they can end up hitting an expensive endpoint excluded by robots.txt and protected by other rate-limiting measures, which didn't anticipate a DDoS.

Meekro

Ok, clearly I had no idea of the scale of it. 200RPS from a single bot sounds pretty bad! Do all 100,000+ pages have to be live to be useful, or could many be served from a cache that is minutes/hours/days old?

Symbiote

The main data for those pages is in a column store, so it can sustain many thousand RPS (at least).

The problem is we have things like

  Disallow: /the-search-page
  Disallow: /some-statistics-pages

in robots.txt, which is respected by most search engine (etc) crawlers, but completely ignored by the AI crawlers.
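
A minimal sketch of what honoring those rules looks like on the crawler side, using Python's standard `urllib.robotparser`. The `User-agent: *` line, the `ExampleBot` name and the `/some-article` path are assumptions added for illustration; only the two `Disallow` lines come from the rules quoted above.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow rules are the ones quoted above; the User-agent line
# is an assumption so that the rules apply to every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /the-search-page
Disallow: /some-statistics-pages
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleBot" and "/some-article" are hypothetical names.
    print(is_allowed("ExampleBot", "/the-search-page?q=cats"))
    print(is_allowed("ExampleBot", "/some-article"))
```

A well-behaved crawler would run a check like `is_allowed` before every fetch. Nothing on the server side enforces it, which is exactly the gap being described here.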

By chance, this morning I found a legacy site down, because in the last 8 hours it's had 2 million hits (70/s) to a location disallowed in robots.txt. These hits have come from over 1.5 million different IP addresses, so the existing rate-limit-by-IP didn't catch it.
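
For scale: 2 million hits in 8 hours is about 69 requests per second, and spread over 1.5 million addresses that is roughly 1.3 requests per IP, so no per-IP threshold can trigger. One possible complement, sketched here under assumptions (the class name and the limits are hypothetical, not something the site above runs), is a sliding-window limit on the endpoint itself, regardless of source address.

```python
from collections import deque


class EndpointRateLimiter:
    """Sliding-window cap on total requests to one endpoint, ignoring source IP."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # arrival times of accepted requests

    def allow(self, now: float) -> bool:
        # Forget accepted requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False  # caller would answer with e.g. HTTP 429
        self._timestamps.append(now)
        return True


if __name__ == "__main__":
    # Hypothetical budget: at most 3 requests per second to an expensive page.
    limiter = EndpointRateLimiter(max_requests=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3, 1.05):
        print(t, limiter.allow(t))
```

The trade-off is that a global cap also throttles legitimate visitors while the flood lasts, so it protects the backend rather than separating humans from bots.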

The User-Agents are a huge mixture of real-looking web browsers; the IPs look to come from residential, commercial and sometimes cloud ranges, so it's probably all hacked computers.

I could see Cloudflare might have data to block this better. They don't just get 1 or 2 requests from an IP, they presumably see a stream of them to different sites. They could see many different user agents being used from that IP, and other patterns, and can assign a reputation score.

I think we will need to add a proof-of-work thing in front of these pages and probably whitelist some 'good' bots (Wikipedia, Internet Archive etc). It is annoying since this was working fine in its current form for over 5 years.
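
A rough sketch of the proof-of-work idea, hashcash style: the server hands out a challenge string and the client must find a nonce whose SHA-256 hash has a given number of leading zero bits. This is an illustration under assumptions (the function names, the `challenge:nonce` format and the difficulty are made up here) and not a description of how any particular tool implements it.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap server-side check: a single hash per submitted answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Expensive client-side search: about 2**difficulty_bits hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


if __name__ == "__main__":
    # Hypothetical challenge string and difficulty.
    nonce = solve("example-challenge", 12)
    print(nonce, verify("example-challenge", nonce, 12))
```

The asymmetry is the point: verifying costs one hash, while solving costs thousands, which is negligible for one human page view but adds up for a crawler making millions of requests.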

conductr

The presumption that I’m already using Cloudflare is a start. Is this a requirement for maintaining a simple website now?

haiku2077

Either that or Anubis (https://anubis.techaro.lol/docs), yes.

roguecoder

So these companies broke the internet

jtolmar

The stories I've heard have been mostly about scraper bots finding APIs like "get all posts in date range" and then hammering that with every combo of start/end date.

x0x0

It's not complex. I worked on a big site. We did not have the compute or i/o (most particularly db iops) to live generate the site. Massive crawls both generated cold pages / objects (cpu + iops) and yanked them into cache, dramatically worsening cache hit rates. This could easily take down the site.

Cache is expensive at scale. So permitting big or frequent crawls by stupid crawlers either requires significant investments in cache or slows down and worsens the site for all users. For whom we, you know, built the site, not to provide training data for companies.

As others have mentioned, Google is significantly more competent than 99.9% of the others. They are very careful to not take your site down and provide, or used to provide, traffic via their search. So it was a trade, not a taking.

Not to mention I prefer not to do business with Cloudflare because I don't like companies that don't publish quota. If going over X means I need an enterprise account that starts at $10k/mo, I need to know the X. Cloudflare's business practice appears to be letting customers exceed that quota then aggressively demanding they pay or they'll be kicked off the service nearly immediately.

jauntywundrkind

I too am a bit confused / mystified at the strong reaction. But I do expect a lot of badly optimized sites that just want out.

I struggle to think of a web related library that has spread faster than Anubis checker. It's everywhere now! https://github.com/TecharoHQ/anubis

I'm surprised we don't see more efforts to rate limit. I assume many of these are distributed crawlers, but it feels like there have got to be pools of activity spinning up on a handful of IPs, and that they would be pretty clearly time-correlated. Maybe that's not true. But it feels like the web, more than anything else, needs some open source software to add a lot more 420 Enhance Your Calm responses. https://http.dev/420

zerocrates

The reaction comes from some combination of

- opposition to generative AI in general

- a view that AI, unlike search which also relies on crawling, offers you no benefits in return

- crawlers from the AI firms being less well-behaved than the legacy search crawlers, not obeying robots.txt, crawling more often, more aggressively, more completely, more redundantly, from more widely-distributed addresses

- companies sneaking in AI crawling underneath their existing tolerated/whitelisted user-agents (Facebook was pretty clearly doing this with "facebookexternalhit" that people would have allowed to get Facebook previews; they eventually made a new agent for their crawling activity)

- a simultaneous huge spike in obvious crawler activity with spoofed user agents: e.g. a constant random cycling between every version of Chrome or Firefox or any browser ever released; who this is or how many different actors it is and whether they're even doing crawling for AI, who knows, but it's a fair bet.

Better optimization and caching can make this all not matter so much but not everything can be cached, and plenty of small operations got by just fine without all this extra traffic, and would get by just fine without it, so can you really blame them for turning to blocking?

jowea

I'm not an expert on website hosting, but after reading some of the blog posts on Anubis, those people were truly at wit's end trying to block AI scrapers with techniques like the ones you imply.

jauntywundrkind

https://xeiaso.net/blog/2025/anubis/ links to https://pod.geraspora.de/posts/17342163 which says:

> If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

My gut is that the switch between IP addresses can't be that hard to follow. That the access pattern is pretty obvious to follow across identities.

But it would be non trivial, it would entail crafting new systems and doing new work per request (when traffic starts to be elevated, as a first gate).

Just making the client run through some math gauntlet is an obvious win that aggressors probably can't break. But I still think there's probably some really good low-hanging fruit for identifying and rate limiting even these rather more annoying traffic patterns: the behavior itself leaves a fingerprint that can't be hidden and which can absolutely be rate limited. And I'd like to see that area explored.

Edit: oh heck yes, new submission with 1.7 TB of logs of what AI crawlers do. Now we can machine learn some better rate limiting techniques! https://news.ycombinator.com/item?id=44450352 https://huggingface.co/datasets/lee101/webfiddle-internet-ra...

deepsiml

Not much into that kind of DevOps. What is a good basic caching in this instance?

haiku2077

It comes down to:

1. Use the Cache-Control header to express how to cache your site correctly (https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Cac...)

2. Use a CDN service, or at least a caching reverse proxy, to serve most of the cacheable requests to reduce load on the (typically much more expensive) origin servers
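
As a concrete illustration of point 1, here is a small sketch that picks a `Cache-Control` value per request path. The path prefixes and lifetimes are hypothetical and would need tuning per site; the directives themselves (`public`, `private`, `no-store`, `max-age`, `s-maxage`, `immutable`, `stale-while-revalidate`) are standard ones.

```python
def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header value for a request path.

    The prefixes and lifetimes below are illustrative assumptions only.
    """
    if path.startswith("/static/"):
        # Fingerprinted assets: cache for a year and never revalidate.
        return "public, max-age=31536000, immutable"
    if path.startswith("/account/"):
        # Per-user pages: keep them out of every cache.
        return "private, no-store"
    # Ordinary pages: short browser lifetime, longer shared-cache (CDN) lifetime.
    return "public, max-age=300, s-maxage=3600, stale-while-revalidate=60"


if __name__ == "__main__":
    for p in ("/static/app.css", "/account/settings", "/posts/42"):
        print(p, "->", cache_control_for(p))
```

With `s-maxage` set, a CDN or caching reverse proxy (point 2) can answer repeat requests for an hour without touching the origin, even when browsers revalidate sooner.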

mrweasel

Just note that many AI scrapers will go to great lengths to do cache busting. For some reason many of them feel like they need to get the absolute latest version and don't trust your cache.
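
One common mitigation, sketched here with hypothetical parameter names, is to normalize the cache key so that unknown query parameters (a typical cache-busting trick) do not create new cache entries. It only helps against busting via query strings, and the `ALLOWED_PARAMS` set is an assumption about which parameters the site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical: the only query parameters that change what the page renders.
ALLOWED_PARAMS = {"lang", "page"}


def cache_key(url: str) -> str:
    """Build a cache key from the path plus only the recognized, sorted parameters."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS)
    query = urlencode(kept)
    return f"{parts.path}?{query}" if query else parts.path


if __name__ == "__main__":
    # A random "_" timestamp no longer produces a distinct cache entry.
    print(cache_key("/docs?page=2&_=1719900000&lang=en"))
    print(cache_key("/docs?cb=abc123"))
```

Sorting the kept parameters also means `?page=2&lang=en` and `?lang=en&page=2` share one entry.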

TechDebtDevin

Cloudflare and other CDNs will usually automatically cache your static pages.

alganet

> If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content

I don't see a way out of this happening. AI fundamentally discourages other forms of digital interaction as it grows.

Its mechanism of growing is killing other kinds of digital content. It will eventually kill the web, which is, ironically, its main source of food.

spwa4

Yes, what everyone wants to do with AI (generate entertainment and interactions with humans, including economic ones) will need to happen or AI will starve.

alganet

That's what is going to make it starve. Belly full, but of its own shit being tossed around humans seeking cheap copouts of doing actual work.

fennecfoxy

Additionally, ad blocker usage is apparently at 30%. So it's a redundant or more nuanced argument, really.

account42

Ad blockers only discourage commercialized content creation, not all of it. IMO that actually improves the quality of the content created.

BrouteMinou

Just like cancer?

preachermon

[flagged]

alganet

These kinds of comparisons rarely lead to good discussions.

Let's instead be focused and talk about real stuff.

Consider https://learnpythonthehardway.org/ for example. It has influenced a generation of Python developers. Not just the main website, but the tons of Python code and Python-related content it inspired.

Why would anyone write these kinds of textbooks/websites/guides if AI can replace them? AI companies are effectively broadcasting you don't need the hard way anymore, you can just vibe.

Arguably though, without the existence of Learn Python the Hard Way and similar content, AI would be worse at writing Python stuff. That's what I mean by "main source of food": good content that influences a lot of people. The net-positive effects are hard to predict or even identify except for the more popular cases (such as LPTHW).

If my prediction is right, no one will notice that good content has stopped being produced. It will appear as if content is being created in generally the same way as before, but in reality, these long tail initiatives like LPTHW will have ceased before anyone can do anything about it.

Again, I don't see a way out of this scenario. Not for AI companies, not for content writers. This is going to happen. The world in which I'm wrong is the best one.

mfost

In a similar vein, I remember people advocating for replacing new untrained hires with AI. After all, a competent senior engineer is needed to validate the contributions of the new hires anyway and they can do the same checking the AI code.

But then, how would you even train and replace those competent senior engineers that do the filtering when they retire? The whole system was predicated on having a chain of new hires that gain experience in the process.

baq

Now? Always has been. Compared to alternatives it’s still the best economy-scale resource usage optimization framework we’ve got.

greenchair

nothing's perfect but it is still better than the other options

postalcoder

  > When you enable this feature via a pre-configured managed rule, Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website. The rule has also been expanded to include more signatures of AI bots that do not follow the rules.

We already know companies like Perplexity are masking their traffic. I'm sure there's more than meets the eye, but taking this at face value, doesn't punishing respectful and transparent bots only incentivize obfuscation?

edit: This link[0], posted in a comment elsewhere, addresses this question. tldr, obfuscation doesn't work.

  > We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”

  > When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.

[0] https://blog.cloudflare.com/declaring-your-aindependence-blo...

jerf

"doesn't punishing respectful and transparent bots only incentivize obfuscation?"

Sure, but we crossed that bridge over 20 years ago. It's not creating an arms race where there wasn't already one.

Which is my generic response to everyone bringing similar ideas up. "But the bots could just...", yeah, they've been doing it for 20+ years and people have been fighting it for just as long. Not a new problem, not a new set of solutions, no prospect of the arms race ending any time soon, none of this is new.

hombre_fatal

Next line:

> The rule has also been expanded to include more signatures of AI bots that do not follow the rules.

The Block AI Bots rule on the Super Bot Fight Mode page does filter out most bot traffic. I was getting 10x the traffic from bots than I was from users.

It definitely doesn't rely on robots.txt or user agent. I had to write a page rule bypass just to let my own tooling work on my website after enabling it.

account42

How many of those "bots" you are filtering are actually bots and how many are regular users buttflare has misidentified as bots?

hombre_fatal

Pretty simple to see this if you've run a website: compare your analytics pre-bot to post-bot to post-bot-blocker.

There is a clear moment where you land on AI bot radar. For my large forum, it was a month ago.

Overnight, "72 users are viewing General Discussion" turned into "1720 users".

40% of requests being cached turned into 3% of requests being cached.

fluidcruft

Cloudflare already knows how to make the web hell for people they don't like.

I read the robots.txt entries as those AI bots that will be not marked as "malicious" and that will have the opportunity to be allowed by websites. The rest will be given the Cloudflare special.

colechristensen

>doesn't punishing respectful and transparent bots only incentivize obfuscation?

They're cloudflare and it's not like it's particularly easy to hide a bot that is scraping large chunks of the Internet from them. On top of the fact that they can fingerprint any of your sneaky usage, large companies have to work with them so I can only assume there are channels of communication where cloudflare can have a little talk with you about your bad behavior. I don't know how often lawyers are involved but I would expect them to be.

Sol-

Do the major AI companies actually honor robots.txt? Even if some of their publicly known crawlers might do it, surely they have surreptitious campaigns where they do some hidden crawling, just like how they illegally pirate books, images and user data to train on.

chasd00

My thought too, honoring robots.txt is just a convention. There's no requirement to follow robots.txt, or at least certainly no technical requirement. I don't think there's any automatic legal requirement either.

Maybe sites could add "you must honor policies set in robots.txt" to something like a terms of service but I have no idea if that would have enough teeth for a crawler to give up.

prmoustache

I don't think terms of service are applicable anyway. Terms of service aren't a signed contract, as you may never see them nor know they exist. This is true whether you visit the site interactively or fetch a page programmatically.

TechDebtDevin

Cloudflare and their customers have been desperately trying for years to kill scrapers in court. This is all meaningless, but they are probably gearing up for another legal battle to define robots.txt as a legal contract. They're going to use this marketplace they're scamming people with to do it. They will fail.

px43

There's a lack of clarity, but it seems likely to me that a majority of this traffic is actually people asking questions to the AI, and the AI going out and researching for answers. When the AI tools are being used like a web browser to do research, should they still be adhering to robots.txt, or is that only intended for search indexing?

deepsun

Hard to tell, because minor crawlers mimic major companies to avoid getting banned.

mschuster91

Cloudflare, for all I hate their role as a gatekeeper these days, actually has the leverage to force the AI companies to bend.

blakesterz

The list of bots is pretty short right now:

https://developers.cloudflare.com/bots/concepts/bot/#ai-bots

JimDabell

> AI bots

> You can opt into a managed rule that will block bots that we categorize as artificial intelligence (AI) crawlers (“AI Bots”) from visiting your website. Customers may choose to do this to prevent AI-related usage of their content, such as training large language models (LLM).

> CCBot (Common Crawl)

Common Crawl is not an AI bot:

https://commoncrawl.org

johneth

The data it collects is used by AI companies, though.

hennell

Cloudflare sees a lot of the web traffic. I assume these are the biggest bots they're seeing right now, and any new contenders would be added as they find them. Probably impossible to really block everything, but they've got the web-coverage to detect more than most.

TechDebtDevin

They are lying. They can't detect crawlers unless we tell them we are who we are.

ZiiS

Enough to more than halve the traffic to most sites if the blocks hold.

cmg

Archive link: https://archive.ph/ARnyu

joshdavham

How did you make that link?

zargath

Sounds very basic, sadly.

Anybody know why these web crawling/bot standards are not evolving? I believe robots.txt was invented in 1994(thx chatgpt). People have tried with sitemaps, RSS and IndexNow, but it's like huge $$ organizations are depending on HelloWorld.bas tech to control their entire platform.

I want to spin up endpoints/mcp/etc. and let intelligent bots communicate with my services. Let them ask for access, ask for content, pay for content, etc. I want to offer solutions for bots to consume my content, instead of having to choose between full or no access.

I am all for AI, but please try to do better. Right now the internet is about to be eaten up by stupid bot farms and served into chat screens. They don't want to refer back to their source, and when they do it's with insane error rates.

stereolambda

> I believe robots.txt was invented in 1994(thx chatgpt).

Not to pick on you, but I find it quicker to open a new tab and do "!w robots.txt" (for search engines supporting the bang notation) or "wiki robots.txt"<click> (for Google I guess). The answer is right there, no need to explain to an LLM what I want or verify [1].

[1] Ok, Wikipedia can be wrong, but at least it is a commonly accessible source of wrong I can point people to if they call me out. Plus my predictive model of Wikipedia wrongness gives me pretty low likelihood for something like this, while for ChatGPT it is more random.

reaperducer

> robots.txt was invented in 1994(thx chatgpt)

Thought of and discussed as a possibility in 1994.

Proposed as a standard in 2019.

Adopted as a standard in 2022.

Thanks, IETF.

Dylan16807

This phrasing is very misleading. To bullet point directly from "possibility" to "standard" implies the standardization was a turning point where it could start being used. But it was massively used long before that. The standard is a side note that's barely relevant.

reaperducer

It only became massively used in 2019, when Google recommended it.

TechDebtDevin

This comment seems like it comes from a Cloudflare employee.

This is clearly the first step in CF building out a marketplace where they will attempt (and fail) to be the middleman in a useless market between crawlers and publishers.

zargath

nah, disappointed cf customer

badlibrarian

Did they ever fix the auto-blocking of RSS feeds?

https://news.ycombinator.com/item?id=41864632