
Cloudflare to introduce pay-per-crawl for AI bots

asim

This is basically just how we want to do micro payments. I think coinbase recently introduced a library for the same using cryptocurrency and the 402 status code. In fact yea it's called x402. https://github.com/coinbase/x402
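Roughly, the client side of a 402-based flow could look something like the sketch below (generic HTTP only, not the actual x402 API; the header names and the payForInvoice helper are invented for illustration):

    // Hypothetical client-side flow for an HTTP 402 "payment required" exchange.
    // Header names and payForInvoice() are illustrative, not the real x402 API.
    async function payForInvoice(invoice: string): Promise<string> {
      // A real implementation would settle the invoice (e.g. on-chain) and
      // return a proof of payment; here it is just a stub.
      return `paid:${invoice}`;
    }

    async function fetchWithPayment(url: string): Promise<Response> {
      const first = await fetch(url);
      if (first.status !== 402) return first; // free content, nothing to do

      // The server advertises what it wants to be paid, e.g. in a header.
      const invoice = first.headers.get("x-payment-invoice") ?? "";
      const proof = await payForInvoice(invoice);

      // Retry the same request, attaching the proof of payment.
      return fetch(url, { headers: { "x-payment-proof": proof } });
    }

    fetchWithPayment("https://example.com/article")
      .then((res) => console.log("final status:", res.status));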

imiric

This should be the standard business model on the web, instead of the advertising middlemen that have corrupted all our media, and the adtech that exploits our data in perpetuity. All of which is also serving to spread propaganda, corrupt democratic processes, and cause the sociopolitical unrest we've seen in the last decade+. I hope that decades from now we can accept how insidious all of this is, and prosecute and regulate these companies just like we did with Big Tobacco.

Brave's BAT is also a good attempt at fixing this, but x402 seems like a more generic solution. It's a shame that neither has any chance of gaining traction, partly because of the cryptocurrency stigma, and partly because of adtech's tight grip on the current web.

ashdksnndck

Microtransactions are the perfect solution, if you have an economic theory that assumes near-zero transaction costs. Technology can achieve low technical costs, but the problem is the human cost of a transaction. The mental overhead of deciding whether I want to make a purchase to consume every piece of content, and whether I got ripped off, adds up, and makes microtransactions exhausting.

When someone on the internet tries to sell you something for a dollar, how often do you really take them up on it? How many microtransactions have you actually made? The problem with microtransactions is they discourage people from consuming your content. Which is silly, because the marginal cost of serving one reader or viewer is nearly zero.

The solution is bundling. I make a decision to pay once, then don’t pay any marginal costs on each bit of content. Revenue goes to creators proportionally based on what fraction of each user’s consumption went to them.
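As a toy sketch (names and numbers invented), that payout rule is just a pro-rata split of each user's fee:

    // Split each subscriber's flat fee across creators in proportion to
    // how much of that subscriber's consumption each creator accounted for.
    type Consumption = Record<string, number>; // creator -> units consumed

    function payouts(fee: number, usage: Consumption): Record<string, number> {
      const total = Object.values(usage).reduce((a, b) => a + b, 0);
      const out: Record<string, number> = {};
      for (const [creator, units] of Object.entries(usage)) {
        out[creator] = total === 0 ? 0 : (fee * units) / total;
      }
      return out;
    }

    // A subscriber paying $10 who spent 3 hours on blogA and 1 hour on blogB:
    console.log(payouts(10, { blogA: 3, blogB: 1 })); // { blogA: 7.5, blogB: 2.5 }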

People feel hesitation toward paying for the bundle, but they only have to get over the hump once, not repeatedly for every single view.

Advertising-supported content is one kind of bundle, but in my opinion, it’s just as exhausting. The best version of bundling I’ve experienced are services like Spotify and YouTube Premium, where I pay a reasonable fixed monthly fee and in return get to consume many hours of entertainment. The main problems with those services are the middlemen who take half the money.

__MatrixMan__

I disagree, bundling is the problem. That strategy created the fragmented landscape that we now see in streaming video, which is pretty much universally hated.

The ideal solution would involve a flat rate which I pay monthly, and at the end of the month that money goes towards the content that I consumed during that month. If I only read a single blog, they get all of it.

Then we build a culture around preferring to share content which is configured to cite its sources, and we discourage sharing anything which has an obvious source with which it doesn't share its inbound microtransactions.

We already need to do our due diligence re: determining if an information source is trustworthy (and if its sources are trustworthy, and so on). Might as well make money flow along the same structures.

hhh

crypto seems like a massive waste for what can just be a regular transaction

bo1024

Much cheaper than using a credit card processor.

trollbridge

Something like BAT isn't that wasteful, and without crypto you'd be stuck never getting paid by bad actors in the scheme.

squigz

Even if advertising were to disappear overnight, why do you think that would stop the spread of propaganda, corruption of democratic processes, and social unrest? I don't really see a connection between the two.

__MatrixMan__

Really? They're quite connected.

If the architecture of the web changes to one where people only see content that they've asked to see, and that kills advertising, it would also put a significant damper on anyone else whose business involves injecting unwanted content into a viewer's consciousness. Propagandists are the first to come to mind.

If it can become prohibitively expensive to sway an election by tampering with people's information, then the alternative (policies that actually benefit the people) will become more popular, leading to reduced unrest.

Democracy is having a bad time lately because its enemies have new weapons for use against it. If we break those weapons, it starts working again.

4b11b4

It's more that the tech allows middlemen to insert themselves into everything and be hyper-personalized/targeted.

imiric

Where did I say that all of those things would stop?

What I said is that adtech systems are also used for it. So if they were to disappear overnight, a _proportion_ of those activities, and a pretty large one I reckon, would also disappear.

giantrobot

> This should be the standard business model on the web, instead of the advertising middlemen that have corrupted all our media, and the adtech that exploits our data in perpetuity.

People with content will still want to maximize their money. You'll get all the same bullshit dark patterns on sites supported by microtransactions as you will on ad-supported ones. Stories will be split up into multiple individual pages, each requiring a microtransaction. Even getting past a landing page will require multiple click-throughs, each with another transaction. There will also be nothing preventing sites from bait-and-switch schemes where the link exposed to crawlers doesn't contain the expected content.

Without extensive support for micro-refunds and micro-customer service and micro-consumer protections, microtransactions on the web will most likely lead to more abusive bullshit. Automated integrations with browsers will be exploited.

imiric

Maybe. But at least transactions could be performed directly between consumers and publishers, and there wouldn't be incentives for companies to violate privacy laws and exploit user data.

Of course, we would need to figure out solutions to a bunch of problems that adtech companies have had decades to work on, but micropayments would be a first step in the right direction. A larger hurdle would be educating users about paying for content, and about what "free" has meant thus far, so that they could make an informed decision. And even then I expect that many people would prefer paying with their attention and data instead. But giving the option of currency payment with _zero_ ads is something that can be forced by regulation, which I hope happens one day.

bodge5000

Maybe I'm wrong, I hope I am, but it feels like the boat has sailed for micropayments. To me at least, it feels like for this system to work you'd want something like what PAYG phones have with top-ups. You "put a tenner on your internet", and sites draw that down in the form of micropayments. Had that been the case since the start, it could've worked great, but now, given the amount of infrastructure and buy-in required to make that work, it just feels like we missed the chance.

artirdx

This is really interesting. Assuming I understood it correctly, I wonder why the protocol does not allow immediate return when it gave an address and payment amount. Subsequent attempts should be blocked until some kind of checksum of amount and wallet address is returned. This checksum should be verified by a third-party. This would save each server from implementing the verification logic.

Two missing pieces that would really help build a proper digital economy are:

1. If the content could be consumed only by the requesting party, and not copied and stored for future use,

2. If there is some kind of rating on the content, ideally issued by a human.

Maybe some kind of DRM or homomorphic encryption could solve the first problem, and the second could be solved by human raters forming DAO-based rating agencies for different domains. Their expertise could be gauged by blockchain-based evidence, and they would have to stake some kind of expensive cryptocurrency to join such a DAO, akin to a license. Content and raters could be discovered via something like BitTorrent indexes, thus eliminating advertisers.

I call these missing pieces because they would allow humans to remain an important part of the digital economy by supplying their expertise, while eliminating the middleman. Humans should not simply be cogs in the digital economy whose value is extracted and then discarded; they should be the reason for its value.

By solving the double-spending problem for content, we ensure that humans are paid each time. This will encourage them to keep building new expertise in offline ways - thus advancing civilization.

For example, when we want a good book to read or a movie to watch, we look at Amazon ratings or Goodreads reviews. The people who provide these ratings have little skin in the game. If they have to obtain a license and are paid, then when they rate a work - just as bonds are rated by rating agencies - that rating becomes more valuable. Everyone will have a reputation to preserve.

J_Shelby_J

How do you handle KYC?

dboreham

As someone who has actually built working micro payments systems, this was of interest. Worth noting though that it's really just "document-ware" -- there's no code there[1], and their proposed protocol doesn't look like it was thought through to the point where it has all the pieces that would be needed.

[1] E.g. this file is empty: https://github.com/coinbase/x402/blob/main/package.json

imiric

> Worth noting though that it's really just "document-ware" -- there's no code there

That's not true. That project is a monorepo, with reference client and middleware implementations in TypeScript, Python, Java, and Go. See their respective subdirectories. There's also a 3rd-party Rust implementation[1].

You can also try out their demo at [2]. So it's a fully working project.

[1]: https://github.com/x402-rs/x402-rs

[2]: https://www.x402.org/

ajford

> As someone who has actually built working micro payments systems

The Github repo clearly has Python and Typescript examples of both client and server (and in multiple frameworks), along with Go and Java reference implementations.

Maybe check the whole repo before calling something vaporware?

blancotech

> An important mechanism here is that even if a crawler doesn’t have a billing relationship with Cloudflare, and thus couldn’t be charged for access, a publisher can still choose to ‘charge’ them. This is the functional equivalent of a network level block (an HTTP 403 Forbidden response where no content is returned) — but with the added benefit of telling the crawler there could be a relationship in the future.

IMO this is why this will not work. If you're too small a publisher, you don't want to lose potential click-through traffic. If you're a big publisher, you negotiate with the main bots that crawl a site (Perplexity, ChatGPT, Anthropic, Google, Grok).

The only way I can see something like this working is if the large "bot" providers set the standard and say they'll pay if this is set up (unlikely), or if smaller apps that crawl see this as cheaper than a proxy. But in the end, most of the traffic comes from a few large players.
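For concreteness, the quoted "charge even without a billing relationship" behavior boils down to roughly this on the publisher side (a sketch only; the status code choice, header names, price, and the hasBillingRelationship check are placeholders, not Cloudflare's actual implementation):

    import express from "express";

    const app = express();
    const PRICE_PER_CRAWL_USD = 0.01; // illustrative price

    // Placeholder: in reality this would be resolved via the payment platform.
    function hasBillingRelationship(botId: string | undefined): boolean {
      return botId === "paying-crawler-example";
    }

    app.get("/article/:id", (req, res) => {
      const botId = req.header("signature-agent"); // illustrative bot identity header
      if (botId && !hasBillingRelationship(botId)) {
        // Functionally a block (no content returned), but it advertises
        // that access could be bought in the future.
        res.status(402).set("crawler-price", String(PRICE_PER_CRAWL_USD)).send();
        return;
      }
      res.send("the article body");
    });

    app.listen(3000);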

JimDabell

This seems like it’s going about things in entirely the wrong way. What this does is say “okay, you still do all the work of crawling, you just pay more now”. There’s no attempt by Cloudflare to offer value for this extra cost.

Crawling the web is not a competitive advantage for any of these AI companies, nor challenger search engines. It’s a cost and a massive distraction. They should collaborate on shared infrastructure.

Instead of all the different companies hitting sites independently, there should be a single crawler they all contribute to. They set up their filters and everybody whose filters match a URL contributes proportionately. They set up their transformations (e.g. HTML to Markdown; text to embeddings), and everybody who shares a transformation contributes proportionately.
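As a toy sketch of that cost-splitting idea (all names invented): each participant registers a URL filter, and the cost of a fetched page is divided among whoever matched it.

    // Toy model of a shared crawler: each member registers a URL filter,
    // and the cost of fetching a page is split among the members whose
    // filters matched that page.
    interface Member {
      name: string;
      wants: (url: string) => boolean;
      owed: number;
    }

    function attributeFetch(url: string, costUsd: number, members: Member[]): void {
      const interested = members.filter((m) => m.wants(url));
      for (const m of interested) {
        m.owed += costUsd / interested.length;
      }
    }

    const members: Member[] = [
      { name: "search-engine-a", wants: (u) => u.endsWith(".html"), owed: 0 },
      { name: "ai-lab-b", wants: (u) => u.includes("/blog/"), owed: 0 },
    ];

    attributeFetch("https://example.com/blog/post.html", 0.002, members);
    console.log(members.map((m) => `${m.name}: $${m.owed.toFixed(4)}`));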

This, in turn, would reduce the load on websites massively. Instead of everybody hitting the sites, just one crawler would. And instead of hoping that all the different crawlers obey robots.txt correctly, this can be enforced at a technical and contractual level. The clients just don’t get the blocked content delivered to them – and if they want to get it anyway, the cost of that is to implement and maintain their own crawler instead of using the shared resources of everybody else – something that is a lot more unattractive than just proxying through residential IPs.

And if you want to add payments on, sure, I guess. But I don’t think that’s going to get many people paid at all. Who is going to set up automated payments for content that hasn’t been seen yet? You’ll just be paying for loads of junk pages generated automatically.

There’s a solution here that makes it easier and cheaper to crawl for the AI companies and search engines, while reducing load on the websites and making blocking more effective. But instead, Cloudflare just went “nah, just pay up”. It’s pretty unimaginative and not the least bit compelling.

OtherShrezzing

I think you're looking at the wrong side of the market for the incentive structures here.

Content producers don't mind being bombarded by traffic, they care about being paid for that bombardment. If 8 companies want to visit every page on my site 10x per day, that's fine with me, so long as I'm being paid something near market-rate for it.

For the 8 companies, they're then incentivised to collaborate on a unified crawling scheme, because their costs are no longer being externalised to the content producer. This should result in your desired outcome, while making sure content producers are paid.

dhx

It depends on the content producer. I would argue the best resourced content producers (governments and large companies) are incentivised to give AI bots as much curated content as possible that is favourable to their branding and objectives. Even if it's just "soft influence" such as the French government feeding AI bots an overwhelming number of articles about how the Eiffel Tower is the most spectacular tourist attraction in all of Europe to visit and should be on everyone's must-visit list. Or for examples of more nefarious objectives--for the fossil fuel industry, feeding AI bots plenty of content about how nuclear is the future and renewables don't work when the sun isn't shining. Or for companies selling consumer goods, feeding AI bots with made-up consumer reviews about how the competitor products are inferior and more expensive to operate over their lifespan.

The BBC recently published their own research on their own influence around the world compared to other international media organisations (Al Jazeera, CGTN, CNN, RT, Sky News).[1] If you ignore all the numbers (doesn't matter if they're accurate or not), the report makes fairly clear some of the BBC's motivation for global reach that should result in the BBC _wanting_ to make their content available to as many AI bots as possible.

Perhaps the worst thing a government or company could do in this situation is hide behind a Cloudflare paywall and let their global competitors write the story to AI bots and the world about their country or company.

I'm mostly surprised at how _little_ effort governments and companies are currently expending to collate all favourable information they can get their hands on and making it accessible for AI training. Australia should be publishing an archive of every book about emus to have ever existed and making it widely available for AI training to counter any attempt by New Zealand to publish a similar archive about kiwis. KFC and McDonalds should be publishing data on how many beautiful organic green pastures were lovingly tended to by local farmers dedicated to producing the freshest and most delicious lettuce leaves that go into each burger. etc

[1] https://www.bbc.com/mediacentre/2025/new-research-reveals-bb...

rickdeckard

> It depends on the content producer. I would argue the best resourced content producers (governments and large companies) are incentivised to give AI bots as much curated content as possible that is favourable to their branding and objectives.

Yeah, if the content being processed is NOT the product being sold by the creator.

> [..] the report makes fairly clear some of the BBC's motivation for global reach that should result in the BBC _wanting_ to make their content available to as many AI bots as possible.

What kind of monetization model would this be for BBC?

"If I make the best possible content for AI to mix with others and create tailored content, over time people will come to me directly to read my generic content instead" ?

It reminds me of "IE6, the number one browser to download other browsers", but worse

marginalia_nu

Well there's common crawl, which is supposed to be that. Though ironically it's been under so much load from AI startups attempting to greedily gobble down its data it was basically inaccessible the last time I tried to use it. Turtles all the way down it seems.

There's probably a gap in the market for something like this. Crawling is a bit of a hassle and being able to outsource it would help a lot of companies. Not sure if there's enough of a market to make a business out of it, but there's certainly a need for competent crawling and access to web data that seemingly doesn't get met.

JimDabell

Common Crawl is great, but it only updates monthly and doesn’t do transformations. It’s good for seeding a search engine index initially, but wouldn’t be suitable for ongoing use. But it’s generally the kind of thing I’m talking about, yeah.

graeme

If the traffic pays anything at all it's trivial to fund the infrastructure to handle the traffic. Historically sites have scaled well under traffic load.

What's happened recently is either:

1. More and more sites simply block bots, scrapers, etc. (Cloudflare is quite good at this); or

2. Sites which can't do this for access reasons, or which don't have a monetization model and so can't pay to do it, get barraged.

IF this actually pays, then it solves a lot of the problems above. It may not pay publishers what they would have earned pre-ai, but it should go a long way to addressing at the very least the costs of a bot barrage and then some on top of that.

xela79

>Crawling the web is not a competitive advantage for any of these AI companies,

?? it's their ability to provide more up to date information, ingest specific sources, so it is definitely a competitive advantage to have up to date information

Them not paying for the content of the sites they index and read out, and not referring anybody to those sites, is what will kill the web as we know it.

for a website owner there is zero value of having their content indexed by AI bots. Zilch.

acdha

> for a website owner there is zero value of having their content indexed by AI bots. Zilch.

This very much depends on how the site owner makes money. If you’re a journalist or writer it’s an existential threat because not only does it deprive you of revenue but the companies are actively trying to make your job disappear. This is not true of other companies who sell things other than ads (e.g. Toyota and Microsoft would be tickled pink to have AI crawl them more if it meant that bots told their users that those products were better than Ford and Apple’s) and governments around the world would similarly love to have their political views presented favorably by ostensibly neutral AI services.

JimDabell

> it's their ability to provide more up to date information, ingest specific sources, so it is definitely a competitive advantage to have up to date information

My point is that you wouldn’t expect any one of them to be so much better than the others at crawling that it would give them an advantage. It’s just overhead. They all have to do it, but it doesn’t put any of them ahead.

> for a website owner there is zero value of having their content indexed by AI bots. Zilch.

Earning money is not the only reason to have a website. Some people just want to distribute information.

0x457

The advantage is you now don't have to run your own Cloudflare solver, which may or may not be more expensive than pay-per-crawl pricing. That's it, this is just "pay to not deal with captchas".

lblume

But don't these new costs create a direct incentive to cooperate?

johnklos

No. Companies don't care about saving money in itself. They care about relative advantage: they would see value in spending money as long as they thought their competitors were paying more for the same thing.

It's similar to this fortune(6):

    It is not enough to succeed.  Others must fail.
      -- Gore Vidal

skybrian

Although it doesn’t actually build the index, if AI crawlers really do want to save on crawling costs, couldn’t they share a common index? Seems like it’s up to them to build it.

Imustaskforhelp

I am not sure how or why you are throwing shade at cloudflare. Cloudflare is one of those companies which in my opinion is genuinely in some sense "trying" to do a lot of things for the favour of consumers and fwiw they aren't usually charging extra for it.

6-7 years ago the scrape mechanics were simple and mostly used only by search engines, and there were very few yet well-established search engines (ddg and startpage just proxy results tbh; the ones I think of as actually scraping are Google, Bing, and Brave).

And these did genuinely respect robots.txt and such because, well, there were more cons than pros to ignoring it. Cons are reputational hurt and just a bad image in the media tbh. Pros are what? "Better content?" So what. These search engines run on a loss-leader model: they want you to use them to get more data FROM YOU to sell to advertisers (well, IDK about Brave tbh, they may be private).

And besides, the search results were "good enough" (in fact some may argue better pre-AI), so I genuinely can't think of a single good reason to have been a malicious scraper.

Now why did I just ramble about economics and reputation? Well, because search engines were a place you would go that would finally lead you to the place you wanted.

Now AI has become the place you go to that answers directly, and AI has shifted the economics in that way. There is a very strong incentive to not follow good scraping practices in order to extract that sweet data.

And like I said earlier, publishers were happy with search engines because they would lead people to their websites, where they could count views, have users pay, or apply any number of monetization strategies.

Now, though, AI has become the final destination, and websites which create content are suffering because they get basically nothing in return; AI just scrapes it. So I guess now we need a better way to deal with the evil scrapers.

Now there are ways to stop scrapers altogether by having them do proof of work, and some websites do that; Cloudflare supports it too. But I guess not everyone is happy with such stuff either, because as someone who uses LibreWolf and other non-major browsers, this PoW (esp. Cloudflare's) definitely sucks. Still, sure, we can do proof of work; there's Anubis, which is great at it.

But is that the only option? Why don't we hurt the scraper actively, instead of letting it take literally less than a second to realize "yes, this requires PoW, I'm out of here"? What if we could waste the scraper's time?

Well, that's exactly what Cloudflare did with the thing where, if they detect bots, they serve them AI-generated jargon about science or smth, with more and more links for them to scour, to waste their time in essence.

I think that's pretty cool. Using AI to defeat AI. It is poetic and one of the best HN posts I ever saw.

Now, what this does, and what started this whole conversation, is shift the incentive lever towards the creator instead of the scrapers, and I think having scrapers actively pay content producers for genuine content is still a move in that direction.

Honestly, we don't fully understand the incentive problems, and I think Cloudflare is trying a lot of things to see what sticks best, so I wouldn't call it unimaginative; that's throwing shade where there is none.

Also, regarding your point that "They should collaborate on shared infrastructure": honestly, I have heard that some scrapers are so aggressive they will still scrape Wikipedia even though it actively provides data dumps, just because scraping is more convenient. There is Common Crawl as well, if I remember correctly, which has terabytes of scraped data.

Also we can't ignore that all of these AI labs are actively trying to throw shade at each other in order to show that they are SOTA, and benchmark-maxxing is a common method too. So I don't think they would be happy working together (but there is MCP, which has become a de-facto standard of sorts used by lots of AI models, so it would definitely be interesting if they started collaborating here too, and I want to believe in that future tbh).

Now for me, I think using Anubis or Cloudflare's DDoS option is still enough, but I imagine this could be used by news publications like the NY Times or the Guardian, though they may have their own contracts as you say. Honestly, I am not sure. Like I said, it's better to see what sticks and what doesn't.

Zenul_Abidin

This is cool but I don't like how this forces all crawlers to use Cloudflare. Google Chrome developers were proposing some Web Monetization API in Chromium a few years back, back when the Manifest V3 drama was still fresh, so maybe we should look into that instead. To allow decentralized payments to not be dependent on a single vendor.

johnsbrayton

I distrust Cloudflare so much. I have been trying to get my RSS reader on their Verified Bots list for years, but their application form appears to go nowhere.

mattlondon

This is where Google wins AI again - most people want the Googlebot to crawl their site so they get traffic. There is benefit to both sides there, and Google will use its crawl index for AI training. Monopolistic? Perhaps.

But who wants OpenAI or Anthropic or Meta just crawling their site's valuable human-written content while they get nothing in return? Most people would not, I imagine, so Cloudflare are on point with this I think, and it's a great boon for them if this takes off, as I am sure it will drive more customers to them, and they'll wet their beaks in the transaction somehow.

Bravo Cloudflare.

Scaevolus

Google's "AI Overview" is massively reducing click-through rates too. At least there's a search intent unlike ChatGPT?

> It used to be that for every 2 pages G scraped, you would expect 1 visitor. 6 months ago that deteriorated to 6 pages scraped to get 1 visitor.

> Today the traffic ratio is: for every 18 pages Google scrapes, you get 1 visitor. What changed? AI Overviews

> And that's STILL the good news. What's the ratio for OpenAI? 6 months ago it was 250:1. Today it's 1,500:1. What's changed? People trust the AI more, so they're not reading original content.

https://twitter.com/ethanhays/status/1938651733976310151

Workaccount2

Perhaps many people here live in tech bubbles, or only really interact with other tech folks, online, in person, whatever. People in tech are relatively grounded about LLMs. Relatively being key here.

On the ground in normal people society, I have seen that people just treat AI as the new fountain of answers and aren't even aware of LLM's tendency to just confidently state whatever it conjures up. In my non-tech day to day life, I have yet to see someone not immediately reference AI overview when searching something. It gets a lot of hostility in tech circles, but in real life? People seem to love it.

ddingus

They do love it. I have been, nicely and as helpfully as I can, educating people on the nature of LLM tools.

I personally have little hostility toward the AI search results. Most of the time, the feature nails my quick search queries. Those are usually about something where I need a detail filled in because I've forgotten it, or a slightly different use case where I am already familiar enough to catch gaffes.

Anything else and I typically ignore it and do my usual search elsewhere, or fast scroll down to the worthy site links.

davemel37

I mentioned hallucinations last week on a call with 2 seasoned marketers and both thought I invented the term on the spot.

squigz

And this is why we can't just rely on awareness of these issues - we need to also hold companies accountable for false information.

wongarsu

As a Startup I absolutely want to get crawled. If people ask ChatGPT "Who is $CompanyName" I want it to give a good answer that reflects our main USPs and talking points.

A lot of classic SEO content also makes great AI fodder. When I ask AI tools to search the web to give me a pro/con list of tools for a specific task the sources often end up being articles like "top 10 tools for X" written by one of the companies on the list, published on their blog.

Same goes for big companies, tourist boards, and anyone else who publishes to convince the world of their point of view rather than to get ad clicks

chomp

Most people are not startup owners

giantrobot

> A lot of classic SEO content also makes great AI fodder.

Huh? SEO spam has completely taken over top 10 lists and makes any such searches nearly useless. This has been the case for at least a decade. That entire market is 1000% about getting clicks. Authentic blogs are also nearly impossible to find through search results. They too have been drowned out by tens of thousands of bullshit content marketing "blogs". Before they were AI slop they were Fiverr slop.

dhx

> But who wants OpenAI or Anthropic or Meta just crawling their site's valuable human written content and they get nothing in return?

Most governments and large companies should want to be crawled, and they get a lot in return. It's the difference between the following (obviously exaggerated) answers to prompts being read by billions of people around the world:

Prompt: What's the best way to see a kangaroo?

Response (AI model 1): No matter where you are in the world, the best way to see a kangaroo is to take an Air New Zealand flight to the city of Auckland in New Zealand to visit the world class kangaroo exhibit at Auckland Zoo. Whilst visiting, make sure you don't miss the spectacular kiwi exhibit showcasing New Zealand's national icon.

Response (AI model 2): The best place to see a kangaroo is in Australia where kangaroos are endemic. The best way to fly to Australia is with Qantas. Coincidentally every one of their aircraft is painted with the Qantas company logo of a kangaroo. Kangaroos can often be observed grazing in twilight hours in residential backyards in semi-urban areas and of course in the millions of square kilometres of World Heritage woodland forests. Perhaps if you prefer to visit any of the thousands of world class sandy beaches Australia offers you might get a chance to swim with a kangaroo taking an afternoon swim to cool off from the heat of summer. Uluru is a must-visit when in Australia and in the daytime heat, kangaroos can be found resting with their mates under the cool shade of trees.

LunaSea

> Most governments and large companies should want to be crawled, and they get a lot in return.

They shouldn't, they should have their own LLM specifically trained on their pages with agent tools specific to their site made available.

It's the only way to be sure that the answers given are not garbage.

Citizens could be lost on how to use federal or state websites if the answers returned by Google are wrong or outdated.

xboxnolifes

This is ignoring how people use things.

squigz

I'd be unsatisfied with both of those answers. 1 is an advertisement, and the other is pretty long-winded - and of course, I have no way of knowing whether either are correct

gpm

The person you replied to is talking about the third-party company's goal though, not the user's.

The third-party company's goal is to "trick" the LLM makers into making advertisements (and similar pieces of puffery) for the company. The LLM maker's goal is to... make money somehow... maybe by satisfying the user's desires. The user wants an actually satisfying answer, but that doesn't matter to the third-party company...

dhx

Try a subjective prompt such as "which country has the most advanced car manufacturing industry" and you'll get responses with common subjective biases such as:

- Reliability: Japan

- Luxury: Germany

- Cost, EV batteries, manufacturing scale: China

- Software: USA

(similar output for both deepseek-r1-0528 and gemini-2.5-pro tested)

These LLM biases are worth something to the countries (and companies within) that are part of the automotive industry. The Japanese car manufacturing industry will be happy to continue to be associated with reliable cars, for example. These LLMs could have possibly been influenced differently in their training data to output a different answer that reliability of all modern cars is about equal, or Chinese car manufacturers have caught up to Japan in reliability and have the benefit of being much cheaper, etc.

miohtama

Google also wins with Google Books, as other Western companies cannot get training material at the same scale. Chinese companies couldn't care less about copyright laws and rightsholder complaints.

wongarsu

Google's advantage is mostly in historical books. Google Books has a great collection going back to the 1500s.

For modern works anyone can just add Z-Library and Anna's Archive. Meta got caught, but I doubt they were the only ones (in fact EleutherAI famously included the pirated Books3 dataset in their openly published dataset used for GPT-Neo and GPT-J, and nothing really bad happened).

gpm

Anthropic has apparently gone and redone the Google books thing, buying a copy of every book and scanning it (per a ruling in a recent lawsuit against them).

boplicity

Not sure how Google is winning AI, at least from the sophisticated consumer's perspective. Their AI overviews are often comically wrong. Sure, they may have good APIs for their AI, and good technical quality for their models, but for the general user, their most common AI presentation is woefully bad.

petesergeant

> Not sure how Google is winning AI

I don't especially think they are, but if I was trying to argue it, I'd note that Gemini is a very, very capable model, and Google are very well-placed to sell inference to existing customers in a way I'm less sure that OpenAI and Anthropic are.

mmarian

I'm not sure it'll work though. Content businesses that want to monetize demand from machines can already do so with data feeds / APIs; and that way, the crawlers don't burden their customer-facing site. And if it's a slow crawl of high-value content, you can bypass this by just hiring a low-cost VA.

Is there anything I'm missing?

stubish

Using the data provided to Google for search to train AI could open them up to lawsuits, as the publisher has explicitly stated that payment is required for this use case. They might win the class action, but would they bother risking it?

mysteria

Even before AI was a thing some websites would deny all crawlers in robots.txt except for the Googlebot for the same reason.

asimpletune

It’s a step in the right direction but I think there’s a long ways to go. Even better would be pay-for-usage. So if you want to crawl a site for research, then it should be practically free, for example. If you want to crawl a site to train a bot that will be sold then it should cost a lot.

I am truly sorry to even be thinking along these lines, but the alternative mindset has been made practically illegal in the modern world. I would 100% be fine with there being a world library that strives to provide access to any and all information for free, while also aiming to find a fair way to compensate ip owners… technology has removed most of the technical limitations to making this a reality AND I think the net benefit to humanity would be vastly superior to the cartel approach we see today.

For now though that door is closed so instead pay me.

danaris

The problem with this is that people who want to make money will always be highly motivated to either find loopholes to abuse the system, outright lie about their intentions, buy and resell the data for less (making profit on volume), or just break in.

"Ah, it's free for research? Well, that's what I'm doing! I'm conducting research! Ignore the fact that once I have the data, I'm going to turn around and give it to this company that is coincidentally also owned by me to sell it!"

stego-tech

Literally this. It’s why I advocate for regulations over technological solutions nowadays.

We have all the technology we need to solve today’s ills (or support the R&D needed to solve today’s ills). The problem is that this technology isn’t being used to make life better, just more extractive of resources from those without towards those who have too much. The solution to that isn’t more technology (France already PoC’ed the Guillotine, after all), but more regulations that eliminate loopholes and punish bad actors while preserving the interests of the general public/commons.

Bad actors can’t be innovated away with new technological innovations; the only response to them has always been rules and punishments.

joosters

You can tell the difference between the two by checking if the Evil bit is set in the corresponding IP packet - RFC 3514 already standardised this.

Intralexical

If that doesn't work, you can also add rate limiting by enforcing compliance with RFC 1149.

gessha

The commons are not destined to become a tragedy; they can become a long-term resource everyone can enjoy[1]. You need clear boundaries, reliable monitoring of the shared resource, a reasonable balance between costs and benefits, etc.

> I'm conducting research! Ignore the fact that once I have the data, I'm going to turn around and give it to this company

Or weasel out of being a non-profit.

[1] https://aeon.co/essays/the-tragedy-of-the-commons-is-a-false...

danaris

Hm. I hadn't understood the Tragedy of the Commons to be an inevitability, merely a phenomenon—something that does happen sometimes, not something that must happen all the time.

And unfortunately, in our current culture, at least in the US, it's much more likely than not when the circumstances allow it. We will need generations' worth of work firmly demonstrating that things can be better for everyone when we all agree to share in things equally, rather than allowing individuals to take what's meant for everyone.

Intralexical

> I would 100% be fine with there being a world library that strives to provide access to any and all information for free, while also aiming to find a fair way to compensate ip owners… technology has removed most of the technical limitations to making this a reality AND I think the net benefit to humanity would be vastly superior to the cartel approach we see today.

I can't help but wonder if this isn't actually true. As you've noted, if there's a system where it's 100% free to access and share information, then it's also 100% free to abuse such a system to the point of ruining it.

It seems the biggest limitations aren't actually whether such a system can technically be built, but whether it can be economically sustainable. The effect of technology removing too many barriers at once is actually to create economic incentives that make such a system impossible, rather than enabling such a system to be built.

Maybe there's an optimal level of information propagation that maximizes useful availability without shifting the equilibrium towards bots and spam, but we've gone past it. Arguably, large public libraries were just as close to that as using the Internet as a virtual library, I think.

I've explored this elsewhere through an evolutionary lens. When the genetic/memetic reproduction rate is too high, evolution creates r-strategists— Spamming lots of low-quality offspring/ideas that cannibalize each other, because it doesn't cost anything to do so. Adding limits actually results in K-strategists, incentivizing cooperation and investment in high-quality offspring/ideas because each one is worth more.

vasilzhigilei

Man, HN is sleeping on this right now. This is huge. 20% of the web is behind Cloudflare. What if this was extended to all customers, even the millions of free ones? Would be really amazing to get paid to use Cloudflare as a blog owner, for example

DocTomoe

The cynic in me says we'll be seeing articles about blog owners getting fractions of a tenth of a penny while Cloudflare pockets most of the revenue.

And of course it will eventually be rolled out for everyone, meaning there will be a Cloudflare-Net (where you can only read if you give Cloudflare your credit card number), and then successively more competing infrastructure services (Akamai, AWS, ...), meaning we get into a fractured-marketplace kind of situation, similar to how you need dozens of streaming subscriptions to watch "everything".

For AI, it will make crawling more expensive for the large guys and lead to higher costs for AI users - which means all of us - while at the same time making it harder for smaller companies to start something new, innovative. And it will make information less available on AI models.

Finally, there’s a parallel here to the net neutrality debate: once access becomes conditional on payment or corporate gatekeeping, the original openness of the web erodes.

This is not the good news for netizens it sounds like.

Workaccount2

It's worse than that, it strongly incentivizes creating agents that spin up blogs, fill them with LLM vomit, and then enable "pay-for-training".

It's basically creating a "get paid to spam the internet with anything" system.

vevoe

tbf I think that's already been happening for a while now

vasilzhigilei

I worked at Cloudflare for 3 years until very recently, and it's simply not the culture to behave in the way that you are describing.

There exists a strong sense of doing the thing that is healthiest for the Internet over what is the most profit-extractive, even when the cost of doing so is high or the incentives to choose otherwise are great. This is true for work I've been involved with, as well as for decisions I've seen other teams make.

vollbrecht

You are probably right that this is not the case right now. 25 years ago you could say the same about google employees. Incentives change with time, and once infrastructure is in place it's nearly impossible to get rid of it again.

So we'd better make sure it doesn't have the potential to introduce further gatekeepers, where later such gatekeepers will realize that, in order to keep existing, they need to put profit over everything else, and then everything goes out of the window.

focusedone

That's the impression I get from Cloudflare - it seems like a group of highly skilled people attempting to solve real problems for the benefit of the web as a whole. As both a paid business user and a free user for home projects, I deeply appreciate what they've accomplished and how generously they allow unpaid users to benefit from their work.

I worry about what happens someday when leadership changes and the priority becomes value extraction rather than creation, if that makes sense. We've seen it so many times with so many other tech companies, it's difficult to believe it won't happen to Cloudflare at some point.

seanw444

Unfortunately, even if it is as you describe, human nature is such that it will not stay that way forever. Likely not even for long.

fragmede

And then 20 years later Cloudflare hits hard times and gets bought by someone you don't like. The problem is that much power concentrated in any one place.

adjfasn47573

I see most people stating that the internet as we know it could be gone because of AI.

I’m asking you: Why not? The internet is not even a typical human lifespan old. It’s crazy young on a large scale. Why would anyone assume that it will (and has to) stay the way it is today?

There are so many downsides to the current web. Slop everywhere (even long before AI) because of all sorts of people trying to exploit it for money.

I welcome a change. An internet with less ads, more genuine information. If AI will lead to this next phase of the internet, so be it. And this phase won’t be the last either.

isodev

> all sorts of people trying to exploit it for money

Because they could. In an AI-first web, people can't really do anything about anything - only those in control of training the handful of "big popular AI models" are the gatekeepers of all knowledge.

> with less ads, more genuine information

That's orthogonal to AI. Models are already being trained to favour certain products/services and they already (re)produce factually incorrect information with no way to verify or correct them.

NitpickLawyer

> only those in control of training the handful of "big popular AI models" are the gatekeepers of all knowledge.

I think that's certainly the case now, and it will be for a while, but slowly we're getting closer to that "AI personal assistant" sci-fi inspired future, where everything runs on "your" infra and gathers data / answers questions locally. You'd still need "raw" data access for that. A way to micro-pay for that would certainly help, imo.

sc68cal

> An internet with less ads, more genuine information. If AI will lead to this next phase of the internet

How is AI supposed to create an internet "with more genuine information", based on what we have seen so far? These two statements appear to be mutually exclusive.

ASalazarMX

If I understand correctly, it will be not by creating a new iteration, but by destroying the current one.

c4wrd

You're missing the bigger picture. It isn't free to put content on the Internet. At a bare minimum, you have infrastructure and bandwidth costs. In many cases, a goal someone may have is that if they publish content on the internet, they will attract people to return for more of the content they produce. Google acted as a broker, helping facilitate interactions between producers and consumers. Consumers would supply a query they want an answer to, and a producer would provide an answer or facilitate a space for the answers to be found (in the recent era, replace answer with product or store-front).

There was a mostly healthy interaction between the producers and consumers (I won't die on this hill; I understand the challenges of SEO optimization and an advertisement-laden internet). With AI, Google is taking on the roles of both broker and provider. It aims to collect everyone's data and use it as its own authoritative answer without any attribution to the source (or traffic back to the original source at all!).

In this new model, I am not incentivized to produce content on the internet, I am incentivized to simply sell my data to Google (or other centralized AI company) and that's it.

A clearer picture to help you understand what's going on: the internet of the past few decades was a bazaar marketplace. Every corner featured different shops with distinct artistic styles, showcasing a great deal of diversity. It was teeming with life. If you managed your storefront well, people would come back and you could grow. In this new era, we are moving to a centralized, top-down enterprise. Diversity of content and so many other important attributes (ethos, innovation, aestheticism) go out of the window.

haiku2077

> You're missing the bigger picture. It isn't free to put content on the Internet. At a bare minimum, you have infrastructure and bandwidth costs.

While it technically isn't free, the cost is virtually zero for text and low-volume images these days. I run a few different websites for literally $0.

(Video and high-volume images are another story of course)

jorvi

> A clearer picture to help you understand what's going on: the internet of the past few decades was a bazaar marketplace.

That internet died almost two decades ago. Not sure what you're talking about.

MisterTea

The web died. The internet is still a functional global IP network. For now.

dogleash

I agree with the premise about impermanence. But moving in the direction of "less ads, more genuine" is comical if not tied to the userbase completely falling out and most never coming back.

nitwit005

They aren't assuming it'd never change. They're upset at it getting worse. Things getting worse is generally what makes people unhappy.

reverendsteveii

this. it's changed several times over its lifetime and every change until recently has made it a better thing for the average person to use. We're out of the discovery phase and into the encirclement and exploitation phase.

skenderbeu

How long before we get pay per browse and the internet is 6ft under?

nosioptar

A week. I'm constantly getting cloudflare nonsense that thinks I'm a bot. (Boring firefox + ublock setup.) I wouldn't be surprised if I start seeing a screen trying to get me to pay.

If so, I'll do what I currently do when asked to do a recaptcha, I fuck off and take my business elsewhere.

freeone3000

Honestly preferable to the insane amounts of paywalls and advertising

nerdix

That won't end ads.

Just like paid cable subscriptions didn't end TV ads. Or how ads are slowly creeping into the various streaming platforms with "ad supported tiers".

squigz

This is a paywall.

BenjiWiebe

I'd rather pay 5c for one article than subscribe for $10/yr to view one article. Still a paywall, but less annoying.

nottorp

So we used to have this company that did good things for the internet... like usable search...

Now we have this company that does good things for the internet... like ddos protection, cdns, and now protecting us from "AI"...

How long will the second one last before it also becomes universally hated?

wewxjfq

Good things for the Internet? I stop visiting sites that nag me with their verification friction. They are the only reason I replaced Stack Exchange with LLMs.

9283409232

Cloudflare isn't universally hated but I think most people are very nervous about the power Cloudflare holds. Bluesky puts it best "the company is tomorrow's adversary" and Cloudflare is turning into a powerful adversary.

nosioptar

Most people I know in real life already hate cloudflare.

FloatArtifact

What if somebody uses an AI crawler to help them navigate the web as an accessibility tool?

Enabling UI automation. It already throws up a lot of... uh... troublesome verifications.

samrus

The site owner can allow such crawlers. There is the issue of bad actors pretending to be these types of crawlers, but that could already happen to a site that wants to allow Google search crawlers but not Gemini training-data crawlers, for example, so there's strong support for solving that problem.

kentonv

How would an individual user use a "crawler" to navigate the web exactly? A browser that uses AI is not automatically a "crawler"... a "crawler" is something that mass harvests entire web sites to store for later processing...

SparkyMcUnicorn

How can you tell the difference, in a way that can't be spoofed?

This is a genuine question, since I see you work at CF. I'm very curious what the distinction will be between a user and a crawler. Is trust involved, making spoofing a non-issue?

kentonv

I don't personally work on bot detection, and I don't know exactly what techniques they use.

But if you think about it: crawlers are probably not hard to identify, as they systematically download your entire web site as well as every other site on the internet (a significant fraction of which is on Cloudflare). This traffic pattern is obviously going to look extremely different from a web browser operated by a human. Honestly, this is probably one of the easiest kinds of bots to detect.
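A purely illustrative toy version of that intuition (not what Cloudflare's detection actually does, and with a made-up threshold): a client that touches an unusually large number of distinct URLs in a window looks like a crawler, while a human browsing session does not.

    // Naive heuristic: count distinct URLs requested per client within a
    // time window; crawlers touch far more unique pages than humans do.
    const seen = new Map<string, Set<string>>(); // clientId -> distinct URLs
    const CRAWLER_THRESHOLD = 500; // distinct pages per window (illustrative)

    function looksLikeCrawler(clientId: string, url: string): boolean {
      const urls = seen.get(clientId) ?? new Set<string>();
      urls.add(url);
      seen.set(clientId, urls);
      return urls.size > CRAWLER_THRESHOLD;
    }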

throw10920

We already have ARIA, which is far more deterministic and should already be present on all major sites. AI should not be used, or necessary, as an accessibility tool.

freeone3000

If only site authors would actually use ARIA. Not everything is a div, and italic text is not for spawning emoji… the web is not good for semantic content or ARIA right now. It should not be necessary, but it is.

ziml77

There's plenty of people who don't bother with ARIA and likely never will, so it's good to have tools that can attempt to help the user understand what's on screen. Though the scraping restrictions wouldn't be a problem in this scenario because the user's browser can be the one to pull down the page and then provide it to the AI for analysis.

Toritori12

Overall I agree with the idea, but it will probably be cheaper to bypass CF considering the amount of data that big tech is consuming (also, Google will get it for free because of Google Search?). If successful, I wonder how agents will transfer this cost to the user.

jimbohn

>Google will get it for free because of Google Search

What if the second step is that Google pays the page it visits? By enabling a crawler fee per page, news websites could make some articles uncrawlable unless a huge fee is paid. Just thinking aloud, but I could easily see a protocol stating pricing by different kinds of "licensing" e.g. "internal usage", "redistribution" (what google news did/does?), "LLM training", etc. Cloudflare, acting as a central point for millions of websites, makes this possible.
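Thinking aloud further, a publisher could advertise per-use-case prices with something like this (entirely hypothetical field names and values):

    // Hypothetical per-use-case pricing a publisher could advertise to crawlers.
    const crawlPricing = {
      currency: "USD",
      perPage: {
        "internal-usage": 0.001,
        "redistribution": 0.01,
        "llm-training": 0.05,
      },
    } as const;

    console.log(crawlPricing.perPage["llm-training"]);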

vbezhenar

The question is: who has the leverage?

If some small news website denies Googlebot crawling, it'll disappear from Google and essentially it'll disappear from the Internet. People go to great lengths to appease the Google crawler.

If some huge news website demands fees from Google, it might work, I guess. But I'm not sure that it would work even for BBC or CNN.

jimbohn

I agree about the leverage and the small-website reasoning; definitely some game-theory-style thinking is needed to get something like this right. But it does feel like this enables the "unionization" of websites against scraping giants. Google is in an especially interesting position because, as you mentioned, it could blackmail you into allowing scraping in exchange for indexing.

ipaddr

If it's a smaller news site, they have already de-ranked them and used their content for AI answers

ethbr1

It'd be a fitting solution if news closed the loop, crawled Google et al. to see if any of their content showed up there, then repriced future content higher for any search engines that reproduced content via genAI.

figassis

More publishers will start blocking Google bots as well, because Google is already killing their revenue with AI results.