OpenAI's bot crushed this seven-person company's web site 'like a DDoS attack'
71 comments
January 10, 2025
Hilift
People who have published books recently on Amazon have noticed that immediately there are fraud knockoff copies with the title slightly changed. These are created by AI, and are competing with humans. A person this happened to was recently interviewed about their experience on BBC.
ericholscher
This keeps happening -- we wrote about multiple AI bots that were hammering us over at Read the Docs for >10TB of traffic: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...
They really are trying to burn all their goodwill to the ground with this stuff.
PaulHoule
In the early 2000s I was working at a place that Google wanted to crawl so badly that they gave us a hotline number to call if their crawler was giving us problems.
We were told at the time that robots.txt enforcement was the one thing they had that wasn't fully distributed; it's a devilishly difficult thing to implement.
It boggles my mind that people with the kind of budgets some of these companies have are still struggling to implement crawling right 20 years later, though. It's nice those folks got a rebate.
One of the reasons people are testy today is that you pay by the GB with cloud providers; about 10 years ago I kicked out the sinosphere crawlers like Baidu because they were generating something like 40% of the traffic on my site, crawling it over and over again without sending even a single referrer.
jcgrillo
According to that weird fishy looking fascist[1] the AI models have run out of "natural" data to mine and they'll soon have to feed on their own synthetic excrement. Or it could just be the drugs talking.
[1] https://www.theguardian.com/technology/2025/jan/09/elon-musk...
jsheard
Judging by how often these scrapers keep pulling the same pages over and over again I think they're hoping that new data will just magically show up if they check enough times. Like those vuln scanners which ping your server for Wordpress exploits constantly just in case your not-Wordpress site turned into a Wordpress site since they last looked 5 minutes ago.
KTibow
I personally predict this won't be as bad as it sounds since training on synthetic data usually goes well (see Phi)
TuringNYC
Serious question - if robots.txt is not being honored, is there a risk of a class action from tens of thousands of small sites against both the companies doing the crawling and the individual directors/officers of those companies? It seems there would be some recourse if this is done at a large enough scale.
krapp
No. robots.txt is not in any way a legally binding contract, no one is obligated to care about it.
vasco
If I have a "no publicity" sign on my mailbox and you dump 500 lbs of flyers and magazines by my door every week for a month and cause me to lose money dealing with all the trash, I think I'd have reasonable grounds to sue even if there's no contract saying you need to respect my wishes.
At the end of the day, the claim is that someone's action caused someone else undue financial burden in a way that is not easily prevented beforehand, so I wouldn't say it's a 100% clear-cut case, but I'm also not sure a judge wouldn't entertain it.
ericmcer
You can sue over literally anything; the parent commenter could sue you if they could demonstrate your reply damaged them in some way.
jdenning
We need a way to apply a click-through "user agreement" to crawlers
huntoa
Did I read it right that you pay $62.50/TB?
Uptrenda
Hey man, I wanted to say good job on Read the Docs. I use it for my Python project and find it an absolute pleasure to use. Write my stuff in reStructuredText. Make lots of pretty diagrams (lol), slowly making my docs easier to use. Good stuff.
Edit 1: I'm surprised by the bandwidth costs. I use hetzner and OVH and the bandwidth is free. Though you manage the bare metal server yourself. Would readthedocs ever consider switching to self-managed hosting to save costs on cloud hosting?
exe34
can you feed them gibberish?
blibble
here's a nice project to automate this: https://marcusb.org/hacks/quixotic.html
couple of lines in your nginx/apache config and off you go
my content rich sites provide this "high quality" data to the parasites
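If you'd rather roll your own than run quixotic, here's a toy Python sketch of the same idea (a Markov-chain remix of your own content; corpus.txt, the port, and everything else here are placeholders sitting behind whatever nginx/apache rule routes the bots there):

    # toy_garbler.py - a rough stand-in for quixotic: remix your own content into
    # plausible-looking nonsense and serve it to crawlers routed here by nginx/apache.
    import random
    from collections import defaultdict
    from http.server import BaseHTTPRequestHandler, HTTPServer

    with open("corpus.txt") as f:                 # placeholder: your own site's text
        words = f.read().split()

    chain = defaultdict(list)                     # word -> list of words that follow it
    for a, b in zip(words, words[1:]):
        chain[a].append(b)

    def babble(n=300):
        word = random.choice(words)
        out = [word]
        for _ in range(n):
            word = random.choice(chain[word]) if chain[word] else random.choice(words)
            out.append(word)
        return " ".join(out)

    class Garbler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = babble().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", 8081), Garbler).serve_forever()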
Groxx
LLMs poisoned by https://git-man-page-generator.lokaltog.net/ -like content would be a hilarious end result, please do!
jcpham2
This would be my elegant solution, something like an endless recursion with a gzip bomb at the end if I can identify your crawler and it’s that abusive. Would it be possible to feed an abusing crawler nothing but my own locally-hosted LLM gibberish?
But then again, if you're in the cloud, egress bandwidth is going to cost you for playing this game.
Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non competitors alike. The winner will have the longest runway
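A minimal sketch of the gzip half of that idea (the sizes here are small and purely illustrative; a real decoy would be much larger, and you'd only serve it to traffic you've already flagged as abusive):

    # gzip_decoy.py - pre-compress a large block of filler once, then hand abusive
    # crawlers the small compressed blob with Content-Encoding: gzip.
    import gzip

    # ~10 MB of zeros compresses to roughly 10 KB; illustrative sizes only.
    BLOB = gzip.compress(b"\0" * 10_000_000)

    DECOY_HEADERS = [
        ("Content-Encoding", "gzip"),
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(BLOB))),
    ]
    # Return BLOB with DECOY_HEADERS from whatever server/framework you already run.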
actsasbuffoon
It wouldn’t even necessarily need to be a real GZip bomb. Just something containing a few hundred kb of seemingly new and unique text that’s highly compressible and keeps providing “links” to additional dynamically generated gibberish that can be crawled. The idea is to serve a vast amount of poisoned training data as cheaply as possible. Heck, maybe you could even make a plugin for NGINX to recognize abusive AI bots and do this. If enough people install it then you could provide some very strong disincentives.
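A rough Python sketch of that idea (not an NGINX plugin; the user-agent markers are illustrative, and each path is seeded so it looks like a stable, "unique" page):

    # gibberish_maze.py - serve cheap, highly compressible filler pages that only
    # link to more dynamically generated filler pages.
    import hashlib
    import random
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BOT_MARKERS = ("GPTBot", "CCBot", "Bytespider")      # illustrative, not exhaustive

    class Maze(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if not any(m in ua for m in BOT_MARKERS):
                self.send_response(404)
                self.end_headers()
                return
            # Seed from the path so every URL yields a stable, seemingly unique page.
            rng = random.Random(hashlib.sha256(self.path.encode()).hexdigest())
            filler = " ".join(["lorem", "ipsum", "dolor", "sit", "amet"] * 200)
            links = "".join(
                f'<a href="/page/{rng.getrandbits(64):x}">more</a> ' for _ in range(10)
            )
            body = f"<html><body><p>{filler}</p>{links}</body></html>".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", 8082), Maze).serve_forever()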
joelkoen
> “OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.
The IP addresses in the screenshot are all owned by Cloudflare, meaning that their server logs are only recording the IPs of Cloudflare's reverse proxy, not the real client IPs.
Also, the logs don't show any timestamps and there doesn't seem to be any mention of the request rate in the whole article.
I'm not trying to defend OpenAI, but as someone who scrapes data I think it's unfair to throw around terms like "DDoS attack" without providing basic request rate metrics. This seems to be based purely on the use of multiple IPs, which was actually caused by their own server configuration and has nothing to do with OpenAI.
mvdtnz
Why should web store operators have to be so sophisticated to use the exact right technical language in order to have a legitimate grievance?
How about this: these folks put up a website in order to serve customers, not for OpenAI to scoop up all their data for their own benefit. In my opinion data should only be made available to "AI" companies on an opt-in basis, but given today's reality OpenAI should at least be polite about how they harvest data.
jonas21
It's "robots.txt", not "robot.txt". I'm not just nitpicking -- it's a clear signal the journalist has no idea what they're talking about.
That, and the fact that they're using a log file with the timestamps omitted as evidence of "how ruthlessly an OpenAI bot was accessing the site," makes the claims in the article a bit suspect.
OpenAI isn't necessarily in the clear here, but this is a low-quality article that doesn't provide much signal either way.
peterldowns
Well said, I agree with you.
griomnib
I’ve been a web developer for decades, as well as doing scraping, indexing, and analysis of millions of sites.
Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.
This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.
As a site owner this pisses me off as a matter of decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.
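The golden rule is only a few lines of code; here's a minimal sketch of a per-host politeness delay (the two-second figure is just an example):

    # polite_fetch.py - never hit the same host more often than once every MIN_DELAY.
    import time
    import urllib.request
    from urllib.parse import urlparse

    MIN_DELAY = 2.0          # seconds between requests to the same host (example value)
    last_hit = {}            # host -> timestamp of the last request

    def polite_get(url):
        host = urlparse(url).netloc
        wait = MIN_DELAY - (time.monotonic() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.monotonic()
        return urllib.request.urlopen(url, timeout=30).read()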
PaulHoule
I worked at one place where it probably cost us 100x more (in CPU) to serve content the way we were doing it than the way most people would do it. We could afford it for ordinary traffic because it was still cheap, but we deferred the cost-reduction work for half a decade and went to war against webcrawlers instead. (Hint: who introduced the robots.txt standard?)
mingabunga
We've had to block a lot of these bots as they slowed our technical forum to a crawl, but new ones appear every now and again. Amazon's was the worst.
methou
I used to have problems with some Chinese crawlers. First I told them no with robots.txt; then I saw a swarm of non-bot user-agents from cloud providers in China, so I blocked their ASNs; then I saw another wave of IPs from some Chinese ISPs, so eventually I had to block the entire country_code = cn and just show them a robots.txt.
vzaliva
From the article:
"As Tomchuk experienced, if a site isn’t properly using robot.txt, OpenAI and others take that to mean they can scrape to their hearts’ content."
The takeaway: check your robots.txt.
The question of how much load robots can reasonably generate when they are allowed is a separate matter.
krapp
Also probably consider blocking them with .htaccess or your server's equivalent, such as here: https://ethanmarcotte.com/wrote/blockin-bots/
All this effort is futile because AI bots will simply send false user agents, but it's something.
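For sites not on Apache, here's a sketch of "your server's equivalent" as WSGI middleware (the user-agent list is illustrative, and as noted above it only catches bots that identify themselves honestly):

    # block_bots_wsgi.py - application-level analogue of the .htaccess user-agent block.
    BLOCKED_UA = ("GPTBot", "CCBot", "anthropic-ai", "Bytespider")   # illustrative only

    def block_ai_bots(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(marker.lower() in ua.lower() for marker in BLOCKED_UA):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return app(environ, start_response)
        return middleware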
Sesse__
I took my most bothered page IPv6-only, the AI bots vanished in the course of a couple days :-) (Hardly any complaints from actual users yet. Not zero, though.)
PaulHoule
The first time I heard this story it was '98 or so, and the perp was somebody in the overfunded CS department and the victim somebody in the underfunded math department on the other side of a short, fat pipe. (Probably running Apache httpd on an SGI workstation without enough RAM to even run Win '95.)
In years of running webcrawlers I've had very little trouble, but I've had more trouble in the last year than in the previous 25. (I wrote my first crawler in '99; funny how my crawlers have gotten simpler over time, not more complex.)
In one case I found a site got terribly slow although I was hitting it at much less than 1 request per second. Careful observation showed the wheels were coming off the site and it had nothing to do with me.
There's another site that I've probably crawled in its entirety at least ten times over the past twenty years. I have a crawl from two years ago; my plan was to feed it into a BERT-based system, not for training but to discover content that is like the content I like. I thought I'd get a fresh copy with httrack (polite, respects robots.txt, ...) and they blocked both my home IP addresses within 10 minutes. (Granted, I don't think the past two years of this site have been as good as what came before, so I'll just load what I have into my semantic search & tagging system and use that instead.)
I was angry about how unfair the Google Economy was back in 2013, in line with what this blogger has been saying ever since (I can say it's a strange way to market an expensive SEO community, but...), and it drives me up the wall that people looking in the rear-view mirror are only getting upset about it now.
Back in '98 I was excited about "personal webcrawlers" that could be your own web agent. On one hand, LLMs could provide so much utility in terms of classification, extraction, clustering, and otherwise drinking from that firehose; on the other, the fear that somebody is stealing their precious creativity is going to close the door forever... and entrench a completely unfair Google Economy. It makes me sad.
----
Oddly those stupid ReCAPTCHAs and Cloudflare CAPTCHAs torment me all the time as a human but I haven't once had them get in the way of a crawling project.
andrethegiant
I'm working on fixing this exact problem[1]. Crawlers are gonna keep crawling no matter what, so a solution to meet them where they are is to create a centralized platform that builds in an edge TTL cache, respects robots.txt and retry-after headers out of the box, etc. If there is a convenient and affordable solution that plays nicely with websites, the hope is that devs will gravitate towards the well-behaved solution.
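For anyone rolling their own instead, the "well-behaved" baseline looks roughly like this (a standard-library sketch; the user agent string is a placeholder, and caching and per-host queues are left out):

    # well_behaved_fetch.py - honor robots.txt, and back off when a server says 429.
    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot)"   # placeholder identity
    _robots = {}                                                   # site root -> parser

    def fetch(url):
        parts = urlparse(url)
        root = f"{parts.scheme}://{parts.netloc}"
        rp = _robots.get(root)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
            rp.read()
            _robots[root] = rp
        if not rp.can_fetch(USER_AGENT, url):
            return None                        # disallowed: don't fetch at all
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            return urllib.request.urlopen(req, timeout=30).read()
        except urllib.error.HTTPError as e:
            if e.code == 429:
                delay = e.headers.get("Retry-After", "60")
                time.sleep(int(delay) if delay.isdigit() else 60)  # date form not handled
            return None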
OutOfHere
Sites should learn to use HTTP error 429 to slow down bots to a reasonable pace. If the bots are coming from a subnet, apply it to the subnet, not to the individual IP. No other action is needed.
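A sketch of what applying the limit to the subnet could look like (the /24 and /64 grouping and the per-minute budget are arbitrary examples):

    # subnet_429.py - rate limit by subnet rather than by individual IP.
    import ipaddress
    import time
    from collections import defaultdict, deque

    WINDOW = 60.0        # seconds (example)
    LIMIT = 60           # requests allowed per subnet per window (example)
    hits = defaultdict(deque)

    def subnet_of(ip):
        # Group IPv4 by /24 and IPv6 by /64; crawlers often rotate within one block.
        prefix = "/64" if ":" in ip else "/24"
        return str(ipaddress.ip_network(ip + prefix, strict=False))

    def should_429(ip):
        now = time.monotonic()
        q = hits[subnet_of(ip)]
        while q and now - q[0] > WINDOW:
            q.popleft()
        q.append(now)
        return len(q) > LIMIT     # if True, answer 429 with a Retry-After header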
Sesse__
I've seen _plenty_ of user agents that respond to 429 by immediately trying again. Like, literally immediately; full hammer. I had to eventually automatically blackhole IP addresses that got 429 too often.
jcgrillo
It seems like it should be pretty cheap to detect violations of Retry-After on a 429 and just automatically blackhole that IP for idk 1hr.
It could also be an interesting dataset for exposing the IPs those shady "anonymous scraping" comp intel companies use.
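A sketch of that detection (the thresholds are made up, and the actual blackholing would happen in the firewall, not in application code):

    # retry_after_cop.py - if a client we just told to back off returns before its
    # Retry-After expires, put it on a blackhole list for an hour.
    import time

    RETRY_AFTER = 60         # seconds handed out with each 429 (example)
    BLACKHOLE_FOR = 3600     # the "idk 1hr" above

    backoff_until = {}       # ip -> time before which the client must stay away
    blackholed_until = {}    # ip -> time until which we drop it entirely

    def on_request(ip):
        now = time.monotonic()
        if blackholed_until.get(ip, 0) > now:
            return "drop"                                  # e.g. push into an ipset
        if backoff_until.get(ip, 0) > now:
            blackholed_until[ip] = now + BLACKHOLE_FOR     # came back too soon
            return "drop"
        return "serve"

    def on_429_sent(ip):
        # Call this whenever you answer 429 with Retry-After: RETRY_AFTER.
        backoff_until[ip] = time.monotonic() + RETRY_AFTER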
atleastoptimal
Stuff like this will happen to all websites soon due to AI agents let loose on the web
Recent and related:
AI companies cause most of traffic on forums - https://news.ycombinator.com/item?id=42549624 - Dec 2024 (438 comments)