Nepenthes is a tarpit to catch AI web crawlers

bflesch

Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.

Basically, a single HTTP request to the ChatGPT API can trigger 5000 HTTP requests from the ChatGPT crawler to a website.

The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd, but I really wonder what would happen when the ChatGPT crawler interacts with this tarpit several times per second. As the ChatGPT crawler uses various Azure IP ranges, I actually think the tarpit would crash first.

The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DoS/DDoS vulnerabilities, and companies always act like they are not a problem. But if their system goes dark and the CEO calls, then suddenly they accept it as a security vulnerability.

I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

I don't recommend exploiting this vulnerability, for legal reasons.

[1] https://github.com/bf/security-advisories/blob/main/2025-01-...

JohnMakin

Nice find. I think one of my sites actually got hit by something like this recently. And yeah, this kind of thing should be trivially preventable if they cared at all.

michaelbuckbee

What is the https://chatgpt.com/backend-api/attributions endpoint doing (or responsible for, when not crushing websites)?

bflesch

When ChatGPT cites web sources in its output to the user, it will call `backend-api/attributions` with the URL, and the API will return what the website is about.

Basically it does an HTTP request to fetch the HTML `<title/>` tag.

They don't check the length of the supplied `urls[]` array, and they also don't check whether it contains the same URL over and over again (with minor variations).

It's just bad engineering all around.
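
For illustration, a hedged sketch of the missing checks described above; the function name, the cap, and the error handling are assumptions of mine, not OpenAI's actual code:

    MAX_URLS = 10  # arbitrary cap, chosen for illustration only

    def sanitize_urls(urls):
        # Drop exact duplicates while keeping order; real code would also
        # normalize the "minor variations" (query strings, fragments, case).
        deduped = list(dict.fromkeys(urls))
        if len(deduped) > MAX_URLS:
            raise ValueError("too many URLs in one attributions request")
        return deduped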

hassleblad23

I am not surprised that OpenAI is not interested in fixing this.

bflesch

Their security.txt email address replies and asks you to go through BugCrowd. BugCrowd staff are unwilling (or too incompetent) to run a bash curl command to reproduce the issue, while also refusing to forward it to OpenAI.

The support@openai.com address waits an hour before answering with a ChatGPT-generated answer.

Issues raised on GitHub directly to their engineers were not answered.

Also, Microsoft CERT and the Azure security team do not reply or care to respond to such things (maybe due to lack of demonstrated impact).

soupfordummies

Try it and let us know :)

hartator

There are already “infinite” websites like these on the Internet.

Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

Unknown websites will get very few crawls per day, whereas popular sites get millions.

Source: I am the CEO of SerpApi.
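
A minimal sketch of such a per-domain crawl budget; the numbers, names, and popularity signal below are made up for illustration, not SerpApi's implementation:

    import urllib.parse

    budgets = {}  # domain -> requests remaining in this crawl cycle

    def allow_fetch(url, popularity):
        # popularity is assumed to be in [0, 1], e.g. derived from inbound links
        domain = urllib.parse.urlsplit(url).netloc
        if domain not in budgets:
            # unknown sites get a handful of fetches, popular ones get many
            budgets[domain] = int(10 + popularity * 1_000_000)
        if budgets[domain] <= 0:
            return False  # budget spent; a tarpit only burns its own allowance
        budgets[domain] -= 1
        return True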

palmfacehn

Even a brand new site will get hit heavily by crawlers. Amazonbot, Applebot, LLM bots, scrapers abusing FB's link preview bot, SEO metric bots and more than a few crawlers out of China. The desirable, well behaved crawlers are the only ones who might lose interest.

The typical entry point is a sitemap or RSS feed.

Overall I think the author is misguided in using the tarpit approach. Slow sites get fewer crawls. I would suggest using easily GZIP'd content and deeply nested tags instead. There are also tricks with XSL, but I doubt many mature crawlers will fall for that one.

dawnerd

Looking at my logs for all of my sites, this isn't a global truth. I see multiple AI crawlers hammering away, requesting the same pages many, many times. Perplexity and Facebook are basically nonstop.

qwe----3

This certainly violates the TOS for using Google.

pilif

> Unknown websites will get very few crawls per day, whereas popular sites get millions.

We're hosting some pretty unknown, very domain-specific sites and are getting hammered by Claude and others which, compared to old-school search engine bots, also get caught up in the weeds and request the same pages over and over.

They also seem not to care about the response time of the pages they are fetching: when they are caught in the weeds and hit some badly performing edge cases, they do not throttle at all and keep going at 30+ requests per second, even when a page takes more than a second to be returned.

We can of course handle this and make them go away, but in the end this behavior will only hurt them, both because they will face more and more opposition from webmasters and because they are wasting their own resources.

For decades, our solution for search engine bots was basically an empty robots.txt and letting the bots deal with our sites. Bots behaved reasonably and intelligently enough that this was a working strategy.

Now, in light of the current AI bots, which from an outside observer's viewpoint look like they were cobbled together with the least effort possible, this strategy is no longer viable, and we would have to resort to providing a meticulously crafted robots.txt to help each hacked-up AI bot individually not get lost in the weeds.

Or, you know, we just blanket ban them.

marginalia_nu

Yeah, I agree with this. These types of roach motels have been around for decades and are at this point well understood and not much of a problem for anyone. You basically need to be able to deal with them to do any sort of large scale crawling.

The reality of web crawling is that the web is already extremely adversarial and any crawler will get every imaginable nonsense thrown at it, ranging from various TCP tar pits, compression and XML bombs, really there's no end to what people will put online.

A more resource-effective technique for blocking misbehaving crawlers is to put a hidden link on each page to some path forbidden via robots.txt, perhaps randomly generated so the links are always unique. When that link is fetched, the server immediately drops the connection and blocks the IP for some time period.
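
A minimal sketch of that honeypot link handler, assuming Flask; the route name, ban duration, and returning 403 instead of actually dropping the connection are all illustrative choices:

    import time
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = {}         # client IP -> timestamp when the ban expires
    BAN_SECONDS = 3600  # arbitrary

    @app.before_request
    def reject_banned_clients():
        if time.time() < banned.get(request.remote_addr, 0):
            abort(403)

    # /trap/<token> is disallowed in robots.txt and only reachable via hidden
    # links, so only a misbehaving crawler should ever request it.
    @app.route("/trap/<token>")
    def trap(token):
        banned[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)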

diggan

> There are already “infinite” websites like these on the Internet.

Cool. And how much of the software driving these websites is FOSS that I can download and run for my own website (popular enough to be crawled more than daily by multiple scrapers)?

gruez

diggan

Aren't those finite lists? How is a scraper (normal or LLM) supposed to "get stuck" on those?

hartator

Every not-found page that doesn't return a 404 HTTP status is basically an infinite trap.

It's useless to do this, though, as all crawlers have a way to handle it. It's very crawler 101.

dspillett

Tarpits that slow down crawling may stop bots from crawling your entire site, but they'll not care unless a great many sites do this. Your site will be assigned a thread or two at most, and the rest of the crawling machine's resources will be off scanning other sites. There will be timeouts to stop a particular site from keeping even a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in, as it can be difficult to reliably distinguish these bots from others, and sometimes even from real users; and if things like this get good enough to be any hassle to the crawlers, they'll just start lying (more) and be even harder to detect.

People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non random) combinations will still bubble up obviously in the process.

I think it would be better to have less random pollution: use a small set of common texts to pollute the model. Something like "this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969". In fact these snippets could be Markov-generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on but a general intelligence like most humans would (perhaps a CSS-styled side note inlined in the main text, though that would likely have accessibility issues). You would also need to cycle them out regularly, or scrapers will get "smart" and easily filter them out. But these snippets appearing in full, numerous times, might have a more significant effect on the tokenising process than more entirely random text.

NathanKP

This looks extremely easy to detect and filter out. For example: https://i.imgur.com/hpMrLFT.png

In short, if the creator of this thinks that it will actually trick AI web crawlers: in reality it would take about 5 minutes to write a simple check that filters this out and bans the site from crawling. With modern LLM workflows it's actually fairly simple and cheap to burn just a little bit of GPU time to check whether the data you are crawling is decent.
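
For instance, a sketch of such a check that scores pages by a small language model's perplexity; the model choice and threshold are assumptions, and a real pipeline would tune both:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text):
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss
        return torch.exp(loss).item()

    # Markov-chain gibberish scores far worse than ordinary prose, so pages
    # above some tuned threshold can simply be dropped from the crawl queue.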

Only a really, really bad crawl bot would fall for this. The funny thing is that in order to make something an AI crawler bot would actually fall for, you'd have to use LLMs to generate realistic-enough-looking content. A Markov chain isn't going to cut it.

quchen

Unless this concept becomes a mass phenomenon with many implementations, isn't this pretty easy to filter out? And furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browsing GitHub and HN for software like this to prevent it from polluting their data lakes, I wonder whether this is a very efficient approach.

btilly

It would be more efficient for them to spin up a team to study this robots.txt thing. They've ignored that low hanging fruit, so they won't do the more sophisticated thing any time soon.

tgv

You can't make money out of studying robots.txt, but you can avoid costs by skipping bad websites.

marcus0x62

Author of a similar tool here[0]. There are a few implementations of this sort of thing that I know of. Mine is different in that the primary purpose is to slightly alter content statically using a Markov generator, mainly to make it useless for content reposters, and secondarily to make it useless to LLM crawlers that ignore my robots.txt file[1]. I assume the generated text is bad enough that the LLM crawlers just throw the result out. Other than the extremely poor quality of the text, my tool doesn't leave any fingerprints (like recursive nonsense links). In any case, it can be run on static sites with no server-side dependencies so long as you have a way to do content redirection based on User-Agent, IP, etc.

My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.
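
For reference, the core of a word-level Markov generator like these tools use fits in a few lines; this is a generic sketch, not Quixotic's or Nepenthes' actual code:

    import random
    from collections import defaultdict

    def build_chain(corpus, order=2):
        words = corpus.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, length=200):
        state = random.choice(list(chain))
        out = list(state)
        for _ in range(length):
            out.append(random.choice(chain.get(state, ["the"])))
            state = tuple(out[-len(state):])
        return " ".join(out)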

But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.

In addition to Quixotic (my tool) and Nepenthes, I know of:

* https://github.com/Fingel/django-llm-poison

* https://codeberg.org/MikeCoats/poison-the-wellms

* https://codeberg.org/timmc/marko/

0 - https://marcusb.org/hacks/quixotic.html

1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt

WD-42

Does it need to be efficient if it’s easy? I wrote a similar tool except it’s not a performance tarpit. The goal is to slightly modify otherwise organic content so that it is wrong, but only for AI bots. If they catch on and stop crawling the site, nothing is lost. https://github.com/Fingel/django-llm-poison

reedf1

The idea is that you place this in parallel to the rest of your website routes; that way, your entire server might get blacklisted by the bot.

focusedone

But it's fun, right?

Blackthorn

If it means it makes your own content safe when you deploy it on a corner of your website: mission accomplished!

TeMPOraL

Safe from what? Safe from doing something good, contributing to humanity? Safe from being touched by some value-generating process and making you feel like you're entitled to a share of the profits?

It's one thing to be worried about badly behaving crawlers, but that's a well-known and solved problem already. The AI angle here is meaningless, except for some petty motivations.

gruez

>If it means it makes your own content safe

Not really? As mentioned by others, such tarpits are easily mitigated by using a priority queue. For instance, crawlers can prioritize external links over internal links, which means if your blog post makes it to HN, it'll get crawled ahead of the tarpit. If it's discoverable and readable by actual humans, AI bots will be able to scrape it.
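
A sketch of that priority-queue mitigation, as a generic crawl frontier rather than any particular bot's implementation:

    import heapq
    from urllib.parse import urlsplit

    frontier = []  # (priority, url); lower numbers are crawled first

    def enqueue(url, found_on):
        # External discoveries (e.g. an HN link to your post) outrank a tarpit's
        # endless internal links, which starve behind them at priority 1.
        external = urlsplit(url).netloc != urlsplit(found_on).netloc
        heapq.heappush(frontier, (0 if external else 1, url))

    def next_url():
        return heapq.heappop(frontier)[1]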

grajaganDev

I am not sure. How would crawlers filter this?

captainmuon

Check if the response time, the length of the "main text", or other indicators are in the lowest few percentiles -> send to the heap for manual review.

Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.

Hire a bunch of student jobbers, have them search GitHub for tarpits, and let them write middleware to detect those.

If you are doing broad crawling, you already need to do this kind of thing anyway.

dylan604

> Hire a bunch of student jobbers,

Do people still do this, or do they just offshore the task?

marginalia_nu

You limit the crawl time or number of requests per domain for all domains, and set the limit proportional to how important the domain is.

There's a ton of these types of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone's put online.

taikahessu

We had our non-profit website drained of bandwidth and the site temporarily closed (!!) under our hosting deal because of the Amazon bot aggressively crawling things like ?page=21454 ... etc.

Thankfully SiteGround restored our site without any repercussions, as it was not our fault. Added the Amazon bot to robots.txt after that one.

I don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.

jsheard

For the "good" bots which at least respect robots.txt you can use this list to get ahead of them before they pummel your site.

https://github.com/ai-robots-txt/ai.robots.txt

There's no easy solution for bad bots which ignore robots.txt and spoof their UA though.

taikahessu

Thanks, will look into that!

anocendi

Similar concept to the SpiderTrap tool infosec folks use for active defense.

pera

Does anyone know if there is anything like Nepenthes, but that implements data poisoning attacks like https://arxiv.org/abs/2408.02946?

gruez

I skimmed the paper and the gist seems to be: if you fine-tune a foundation model on bad training data, the resulting model will produce bad outputs. That seems... expected? This makes as much sense as "if you add vulnerable libraries to your app, your app will be vulnerable". I'm not sure how this can turn into an actual attack though.

marckohlbrugge

OpenAI doesn’t take security seriously.

I reported a vulnerability to them that allowed you to get IP addresses of their paying customers.

OpenAI responded “Not applicable” indicating they don’t think it was a serious issue.

The PoC was very easy to understand and simple to replicate.

Edit: I guess I might as well disclose it here, since they don't consider it an issue. They were/are(?) hotlinking logo images of third-party plugins. When you open their plugin store, it loads a couple dozen of them instantly. This allows those plugin developers (of which there are many) to track the IP addresses, and possibly more, of whoever made these requests. It's straightforward to become a plugin developer and get included. The IP tracking is invisible to the user and to OpenAI. A simple fix is to proxy these images and/or cache them on the OpenAI server.
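
A sketch of that proxy-and-cache fix, assuming Flask; the route, cache policy, and the allowlist helper are hypothetical, not OpenAI's code:

    import requests
    from flask import Flask, Response, abort, request

    app = Flask(__name__)
    cache = {}  # logo URL -> (bytes, content type)

    def is_registered_plugin_logo(url):
        # Hypothetical allowlist check; a real implementation would consult the
        # plugin registry so this can't be abused as an open proxy.
        return url.startswith("https://")

    @app.route("/plugin-logo")
    def plugin_logo():
        url = request.args.get("url", "")
        if not is_registered_plugin_logo(url):
            abort(404)
        if url not in cache:
            r = requests.get(url, timeout=5)
            cache[url] = (r.content, r.headers.get("Content-Type", "image/png"))
        body, ctype = cache[url]
        # Browsers only ever talk to this server, so plugin developers never
        # see end users' IP addresses.
        return Response(body, content_type=ctype)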

grajaganDev

This keeps generating new pages to keep the crawler occupied.

Looks like this would tarpit any web crawler.

BryantD

It would indeed. Note the warning: "There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS."

jsheard

Real search engines respect robots.txt so you could just tell them not to enter Markov Chain Hell.
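
For example, if the maze lives under a dedicated path (the path below is made up), a robots.txt rule keeps the well-behaved crawlers out of it:

    User-agent: *
    Disallow: /maze/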

throwaway744678

I suspect AI crawler would also (quickly learn to) respect it also?

rvnx

It's actually a great way to spread malware without leaving traces, too: it makes content inspection very difficult and breaks view-source:, most debugging tools, saving to .har, etc.

bugtodiffer

How is view-source broken?

btbuildem

> ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS

Bug, or feature, this? Could be a way to keep your site public yet unfindable.

chaara-dev

You can already do this with a robots.txt file

btbuildem

Technically speaking, yes, but it's in no way enforced; as far as I understand, it's more of an honour system.

This malicious solution aligns with the incentives (or disincentives) of the parasitic actors, and might be more effective in practice.

mmaunder

To be truly malicious, it should appear to be valuable content but be rife with AI hallucinogenics. Best to generate it with a low-cost model and prompt the model to trip balls.