
Abusive AI Web Crawlers: Get Off My Lawn

PeterStuer

You can monetize your app users by partnering with providers that offer SDKs for residential proxy networks. These services let users opt-in to share their internet connection, earning you revenue while they get benefits like ad-free experiences.

How It Works: Providers like Proxyrack, Live Proxies, Rayobyte, and Infatica allow you to integrate their SDKs into your app. Users who agree to join the proxy network contribute their device’s bandwidth, often used for web scraping, and you get paid based on their activity—typically per monthly or daily active user.

So it need not be "compromised Android set-top boxes", but just millions of free apps running on users' phones.

mateuszbuda

There are many different methods used by proxy providers to unethically source their IPs: https://scrapingfish.com/how-ips-for-web-scraping-are-source...

DarkPlayer

We observed the same behavior. Each request used a different IP address and a random user agent. In our case, most of the IP addresses belonged to Chinese ISPs. They went to great lengths to avoid being blocked, but at the same time used user agents such as Windows 95/98 or IE 5. Fortunately, the combination of the odd user agents and the fact that they still use HTTP/1.1 makes them somewhat easy to identify. So you can use a captcha on more expensive endpoints to block them.
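A minimal sketch of the kind of heuristic described above, assuming a generic request handler; the user-agent patterns, the HTTP/1.1 check, and the captcha hook are illustrative assumptions, not DarkPlayer's actual rules:

```python
import re

# Ancient user agents that no real modern client would send.
LEGACY_UA = re.compile(r"Windows 9[58]|MSIE [1-5]\.", re.IGNORECASE)

def looks_like_abusive_crawler(user_agent: str, http_version: str) -> bool:
    """Heuristic: odd legacy UA combined with plain HTTP/1.1 is suspicious."""
    return bool(LEGACY_UA.search(user_agent or "")) and http_version == "HTTP/1.1"

def handle_expensive_request(user_agent, http_version, serve_page, serve_captcha):
    # Only gate the costly endpoints; cheap pages are served normally.
    if looks_like_abusive_crawler(user_agent, http_version):
        return serve_captcha()
    return serve_page()
```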

intellectronica

I don't understand the current thing about "AI Crawlers". Maybe someone can help educate me.

How is it related to AI? Do AI crawlers do something different from traditional search index crawlers? Or is it simply a proliferation of crawlers because of the growth of AI products?

What makes AI special in this context?

hansvm

The proliferation of crawlers is part of the problem. They're also more aggressive and poorly behaved than typical search engines. Some issues:

- They request every resource, vastly increasing costs compared to a normal crawler.

- They not only don't respect robots.txt; they use it as an explicit source of more links to mine.

- They request resources frequently (many reports of 100x per day), sometimes from bugs and sometimes to ensure they have the latest copy.

- There's no rate limiting. It's trivial to build a crawler architecture that runs at full tilt overall while spreading load across millions of pages and staying polite to each individual site (see the sketch at the end of this comment), but they don't bother, so even if everything else were fine it starts looking like a DoS attack.

- They intentionally use pools of IPs and other resources to obfuscate their identities and make themselves harder to block.

How much of that is "baby's first crawler" not being written very well, and how much is actual malice? Who knows, but the net effect is huge jumps in costs to support the AI wave.
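A minimal sketch of the per-site throttling mentioned above: the crawler can run at full speed in aggregate while never hitting any single host more than once every `min_delay` seconds. The class name and the 10-second default are assumptions for illustration.

```python
import time
import threading
from urllib.parse import urlparse

class PoliteScheduler:
    def __init__(self, min_delay: float = 10.0):
        self.min_delay = min_delay
        self.next_allowed: dict[str, float] = {}  # host -> earliest next fetch time
        self.lock = threading.Lock()

    def wait_turn(self, url: str) -> None:
        """Block until this URL's host is allowed another request."""
        host = urlparse(url).netloc
        with self.lock:
            now = time.monotonic()
            ready_at = max(self.next_allowed.get(host, now), now)
            self.next_allowed[host] = ready_at + self.min_delay
        time.sleep(max(0.0, ready_at - time.monotonic()))
```

Worker threads call `wait_turn(url)` before each fetch; because the delay is per host, aggregate throughput across millions of sites stays high while any single site sees only a trickle of requests.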

klabb3

Speculation: if it doesn't exist already, a data broker market will emerge for public-ish data too. What I mean by that is a separation of entities: OpenAI and "legitimate" AI companies will buy data from brokers of shadily scraped data, and throw them under the bus if shit hits the fan to protect the mothership. This makes sense from a corporate risk perspective, creating a gray-area buffer of accountability. OpenAI and Anthropic have already pleaded with the government not to take away their fair-use hall pass (by invoking the magic spell "China"), but if that doesn't work and publishers win, they'll need to be prepared.

At the same time, publicly and easily available quality content is a race against time. Platforms like Reddit and Xitter already lock down with aggressive anti-bot measures and fingerprinting, and the cottage industries are following. Meanwhile, public data is being polluted by content farms producing garbage at an increasing rate using AI.

Together this creates a perfect storm of bad incentives: (1) the data hoarders are no longer just Google and Microsoft, but probably thousands of smaller entities and (2) they’re short on time, and try to scrape more invasively and at a fast rate.

intellectronica

The incompetence hypothesis makes sense (it is often a good explanation). Web indexers like Google have had decades to get really good at this, including hordes of people who work on crawlers full time. AI companies are often very young, execute with small teams, and don't consider web indexing their main activity, just something they do in support of pre-training (or maybe serving web results).

intellectronica

If the problem is really incompetence, then maybe a viable solution is for the community to create a really great (and well-behaved) OSS crawler. Make it easier for the AI people to do the right thing by making rolling their own crawler the more expensive, lower quality option.
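As one data point for what "well-behaved" means in practice, here is a minimal sketch of a crawler that honors robots.txt (including Crawl-delay) before every fetch, using only the standard library. The "CommunityCrawler" user agent and its URL are hypothetical; a real crawler would also cache and periodically refresh these files.

```python
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "CommunityCrawler/0.1 (+https://example.org/crawler)"  # hypothetical

def robots_for(url: str) -> urllib.robotparser.RobotFileParser:
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp

def allowed(url: str, rp: urllib.robotparser.RobotFileParser) -> bool:
    # Only fetch URLs the site explicitly permits for our user agent.
    return rp.can_fetch(USER_AGENT, url)

def crawl_delay(rp: urllib.robotparser.RobotFileParser) -> float:
    # Honor the site's requested delay; fall back to a conservative 1 second.
    return rp.crawl_delay(USER_AGENT) or 1.0
```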

brettzky

Vercel has made a few posts about how substantially traffic has increased with the rise of AI crawlers, or crawlers for AI training. https://vercel.com/blog/the-rise-of-the-ai-crawler

gmuslera

Search engine crawlers, aggregators, vertical-market crawlers and so on may give you visibility, are not that numerous, and are usually well-behaved (i.e. they respect robots.txt, announce themselves with a consistent user agent, etc.).

Security/vulnerability scans don't request too many pages, at least not existing ones, and usually come from a few IPs from time to time.

But AI crawlers can be really numerous, try to get all your pages, and are not always respectful of robots.txt or your site's performance. And they don't give you anything back. There may be exceptions, but the ones you notice end up having a negative impact.

intellectronica

Yes, I understand that, and I'm dismayed to learn about this.

But the question I'm asking is _why_ do AI crawlers behave in this different way.

gmuslera

Too many players

otikik

They don’t respect robots.txt at all and won’t hesitate to call all the endpoints they find, repeatedly, even when they’re costly for the host. That’s basically it.

intellectronica

Right, but how is it related to AI?

Is there just a correlation between crawlers that don't respect indexing norms and crawlers that operate in the service of AI products?

Or is there something about the sort of indexing people do for AI that makes this nasty behaviour more likely?

structural

One major difference is that while indexing, you're generating an internal data structure that represents that site. Once done, if the site doesn't change, you don't have any need to revisit it, and in fact, fetching the site multiple times just increases your own costs.

On the other hand, an unsupervised AI training algorithm may just need raw text, and as much of it as possible. It doesn't know, or much care, what site the text came from, and it's not building any index that links the content back to its original source. So fetching the site on each training epoch might actually be viable: why bother storing the entire internet when you can just fetch -> transform -> ingest into your model? If your crawler is distributed enough, it won't be the bottleneck, either.

If this is the architecture some companies are using, this also means that these crawlers won't ever stop, because they are finetuning some model by constantly updating over time based on the "current" internet, whatever that might mean.
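A minimal sketch of the fetch -> transform -> ingest loop described above, where nothing is indexed or stored long-term; `extract_text` and `train_step` are placeholders (assumptions) for an HTML-to-text pass and a model update:

```python
import urllib.request

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def extract_text(html: bytes) -> str:
    # Placeholder: a real pipeline would strip markup, boilerplate, and navigation.
    return html.decode("utf-8", errors="replace")

def train_step(text: str) -> None:
    # Placeholder: tokenize the text and feed it into the model being (fine-)tuned.
    pass

def training_epoch(urls):
    # Each page is fetched fresh every epoch instead of being read from a stored
    # index, which is why such crawlers never stop revisiting the same sites.
    for url in urls:
        train_step(extract_text(fetch(url)))
```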

blakesterz

If normal crawlers are a light rain, AI crawlers are a hurricane. Most sites can handle some rain, but they are not built to handle hurricanes. AI crawlers can look like DDOS attacks. The worst offenders will just crawl a site as fast as possible until it goes offline.

lostmsu

Why does the author of this post assume their increase in traffic has anything to do with "AI" specifically?
