
Poisoning Well

90 comments · September 5, 2025

organicbits

The robots.txt used here only tells GoogleBot to avoid /nonsense/. It would be nice to tell other web crawlers too; otherwise you're poisoning everyone but Google, not just the crawlers that ignore robots.txt.
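A wildcard rule would steer every well-behaved crawler away from the poisoned pages, so that only bots which ignore robots.txt ingest them (a hypothetical sketch of the suggestion above, not the article's actual file):

```
# Tell ALL compliant crawlers to skip the poisoned section;
# anything that still crawls /nonsense/ is ignoring robots.txt.
User-agent: *
Disallow: /nonsense/
```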

nvader

A link to the poisoned version of the same article:

https://heydonworks.com/nonsense/poisoning-well/

1970-01-01

Crazy how close to coherent it reads, yet it clearly is gibberish.

kitku

This reminds me of the Nepenthes tarpit [1], which is an endless source of ad-hoc generated garbled mess which links to itself over and over.

Probably more effective at poisoning the dataset if one has the resources to run it.

[1]: https://zadzmo.org/code/nepenthes/

fleebee

I'm running Iocaine[1] which is essentially the same thing on my tiny $3/mo VPS and it's handling crawlers bombarding the honeypot with ~12 requests per second just fine. It's using about 30 MB of RAM.

[1]: https://iocaine.madhouse-project.org/

treetalker

Odorless, tasteless, and among the more deadly poisons known to crawlers!

BrenBarn

Unfortunately they will spend the next several years building up an immunity.

hosh

There is a project called Iocaine that does something similar while trying to minimize resource use.

neuroelectron

Kind of too late for this. The ground truth of models has already been established. That's why we see models converging. They will automatically reject this kind of poison.

nine_k

This will remain true only as long as the models don't need to ingest any new information. If most novel texts start appearing alongside slightly more insidious nonsense mirrors, LLMs will either have to do without that knowledge or start respecting "nofollow".

blagie

It's competition. Poison increases in toxicity over time.

I could generate subtly wrong information on the internet that LLMs would continue to swallow up.

latexr

> I could generate subtly wrong information on the internet

There’s already a website for that. It’s called Reddit.

wilg

In my opinion, colonialism was significantly worse than web crawlers being used to train LLMs.

deadbabe

Not every bot that ignores your robots.txt is necessarily using that data.

What some bots do is they first scrape the whole site, then look at which parts are covered by robots.txt, and then store that portion of the website under an “ignored” flag.

This way, if your robots.txt changes later, they don’t have to scrape the whole site again, they can just turn off the ignored flag.
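A minimal sketch of that bookkeeping, using Python's stdlib robots.txt parser (the URLs and storage shape here are hypothetical, just to illustrate the "ignored flag" idea described above):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical pages a crawler has already fetched and stored.
pages = {
    "https://example.com/": "<html>home</html>",
    "https://example.com/nonsense/page": "<html>poison</html>",
}

rp = RobotFileParser()
rp.modified()  # mark the rules as loaded so can_fetch() consults them
rp.parse([
    "User-agent: *",
    "Disallow: /nonsense/",
])

# Keep everything, but flag the disallowed portion instead of deleting it;
# if robots.txt changes later, only the flag needs updating.
store = {
    url: {"body": body, "ignored": not rp.can_fetch("*", url)}
    for url, body in pages.items()
}
```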

nvader

Not every intruder who enters your home is necessarily a burglar.

cyphar

Ah, so the NSA defence then -- "it's not bulk collection because it only counts as collection when we look at it".

imtringued

Your post-rationalization just doubles down on the stance that these crawlers are abusive and poorly developed.

You're also under the misconception that people are worried about their data, when they are actually worried about the load from a poorly configured crawler.

The crawler will scrape the whole website on a regular interval anyway, so what is the point of this "optimization" that optimizes for highly infrequent events?

bboygravity

I find this whole anti-LLM stance so weird. It kind of feels like building robot distractions into websites to throw off search engine indexers in the 2000s or something.

Like why? Don't you want people to read your content? Does it really matter that meat bags find out about your message to the world through your own website or through an LLM?

Meanwhile, the rest of the world is trying to figure out how to deliberately get their stuff INTO as many LLMs as fast as possible.

tpxl

> Does it really matter that meat bags find out about your message to the world through your own website or through an LLM?

Yes, it matters a lot.

You know of authors by name because you read their works under their name. This has allowed them to profit (not necessarily in direct monetary value) and publish more works. Chucking everything into an LLM takes the profit from individual authors and puts it into the pockets of gigacorporations.

Not to mention the fact that the current generation of LLMs will straight up hallucinate things, sometimes turning the message you're trying to send on its head.

Then there's the question of copyright. I can't pirate a movie, but Facebook can pirate whole libraries, create a LLM and sell it and it's OK? I'd have a lot less of an issue if this was done ethically.

kulahan

Does it really matter when, previously, the exact same problem existed in the form of Google Cards in your search results? ;)

nvader

The presence of an earlier problem does not solve, or make less severe, a later problem.

Why are you winking?

aucisson_masque

I think it’s obvious.

In simpler terms, it comes down to the "you made this? ... I made this" meme.

Now, if your 'content' is garbage that takes less time to write than to publish, I can see your point of view.

But for the authors who write articles that people actually want to read, because it’s interesting and well written, it’s like robbery.

Unlike humans, you can’t say that LLMs create new things from what they read. LLMs just sum up and repeat, using algorithms to evaluate which word should come next.

Meanwhile humans… Oscar Wilde — 'I have spent most of the day putting in a comma and the rest of the day taking it out.'

snowram

LLMs can create new things, since their whole purpose is to interpolate concepts in a latent space. Unfortunately they are mostly used to regurgitate verbatim what they learned; see the whole AI Ghibli craze. Blame people and their narrow imagination.

InsideOutSanta

I think at least partially, it's not an anti-LLM stance, it's an anti-"kill my website" stance. Many LLM crawlers behave very poorly, hurting websites in the process. One result of that is that website owners are forced to use things like Anubis, which has the side-effect of hurting everybody else, too.

I prefer this approach because it specifically targets problematic behavior without impacting clients who don't engage in it.

HankStallone

> Don't you want people to read your content?

People, yes. Well-behaved crawlers that follow established methods to prevent overload and obey established restrictions like robots.txt, yes. Bots that ignore all that and hammer my site dozens of times a second, no.

I don't see the odds of someone finding my site through an LLM being high enough to put up with all the bad behavior. In my own use of LLMs, they only occasionally include links, and even more rarely do I click on one. The chance that an LLM is going to use my content and include a link to it, and that the answer will leave something out so the person needs to click the link for more, seems very remote.

nottorp

> Don't you want people to read your content?

There's the problem right there. If all you produce is "content" your position makes sense.

kulahan

Can you elaborate on what this means? Because I’m not sure which alternative you’re suggesting exists to put on a website besides content.

wonger_

I can think of several anti-LLM sentiments right now:

- developers upset with the threat of losing their jobs or making their jobs more dreadful

- craftspeople upset with the rise in slop

- teachers upset with the consequences of students using LLMs irresponsibly

- and most recently, webmasters upset that LLM services are crawling their servers irresponsibly

Maybe the LLMs don't seem so hostile if you don't fall into those categories? I understand some pro-LLM sentiments, like content creators trying to gain visibility, or developers finding productivity gains. But I think that for many people here, the cons outweigh the pros, and little acts of resistance like this "poisoning well" resonate with them. https://chronicles.mad-scientist.club/cgi-bin/guestbook.pl is another example.

threetonesun

You forgot the big one, every head of the major AI companies had dinner with a fascist the other day, and we already know they have their thumbs on the scale of weighting responses. It's more reasonable to say that the well is already poisoned.

timdiggerm

If it's ad-supported or I'm seeking donations, I only want people reading it on my website. Why would I want people to access it through an LLM?

LtWorf

I don't want to have to pay extra money for vibe-coded LLM companies' bots to scrape my website constantly, ignoring cache headers and the like.

Every single person who has written a book is happy if others read it. They might be less enthusiastic about printing a million copies and shipping them to random people at their own expense.

simonw

There are two common misconceptions in this post.

The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots

Since LLM skeptics frequently characterize all LLM vendors as dishonest, mustache-twirling cartoon villains, there's little point trying to convince them that companies sometimes actually do what they say they are doing.

The bigger misconception though is the idea that LLM training involves indiscriminately hoovering up every inch of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.

Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.

vintermann

There are definitively scrapers that ignore your robots.txt file. Whether they're some "Enemy State" LLM outfit, an "Allied State" corporation outsourcing their dirty work a step or two, or just some data hoarder worried that the web as we know it is going away soon, everyone is saying they're a problem lately; I don't think everyone is lying.

But it's certainly also true that anyone feeding the scrapings to an LLM will filter them first. It's very naive of this author to think that his ad-lib-spun prose won't get detected and filtered out long before it's used for training. Even the pre-LLM internet had endless pages of this sort of thing from aspiring SEO spammers. Yes, you're wasting a bit of the scraper's resources, but you can bet they're already pricing in that waste.

simonw

> There are definitively scrapers that ignore your robots.txt file

Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"

The exact quote from the article that I'm pushing back on here is:

"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality."

Which appears directly below this:

  User-agent: GPTBot
  Disallow: /

spacebuffer

Your initial comment made sense after reading through the OpenAI doc page, so I opened up my site to add those to robots.txt. It turns out I had already added all 3 of those user-agents to my robots file [0]. Out of curiosity I asked ChatGPT about my site and it did scrape it; it even mentioned articles that were published after adding the robots file.

[0]: https://yusuf.fyi/robots.txt

simoncion

> But those aren't the ones that explicitly say "here is how to block us in robots.txt"

Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.

Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.

[0] because of the expected total cost of licensing fees

[1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons

Retric

> The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it:

That’s testable, and you can find content “protected” by robots.txt regurgitated by LLMs. In practice it doesn’t matter whether that happens through companies lying or through some third party scraping your content and then getting scraped themselves.

simonw

There's a subtle but important difference between crawling data to train a model and accessing data as part of responding to a prompt and then piping that content into the context in order to summarize it (which may be what you mean by "regurgitation" here, I'm not sure.)

I think that distinction is lost on a lot of people, which is understandable.

simonw

Do you have an example that demonstrates that?

whilenot-dev

User Agent "Perplexity-User"[0]:

> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.

[0]: https://docs.perplexity.ai/guides/bots

rozab

After I set up a self hosted git forge a little while ago, I found that within minutes it immediately got hammered by OpenAI, Anthropic, etc. They were extremely aggressive, grabbing every individual file from every individual commit, one at a time.

I hadn't backlinked the site anywhere and was just testing, so I hadn't thought to put up a robots.txt. They must have found me through my cert registration.

After I put up my robots.txt (with explicit UA blocks instead of wildcards, since I heard some crawlers ignore wildcards), I found after a day or so the scraping stopped completely. The only ones I get now are vulnerability scanners, or random spiders taking just the homepage.
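Explicit per-agent blocks look something like the following (the agent names are documented by their respective vendors, but treat the exact list as an assumption; it's not rozab's actual file):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```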

I know my site is of no consequence, but for those claiming OpenAI et al ignore robots.txt I would really like to see some evidence. They are evil and disrespectful and I'm gutted they stole my code for profit, but I'm still sceptical of these claims.

Cloudflare have done lots of work here and have never mentioned crawlers ignoring robots.txt:

https://blog.cloudflare.com/control-content-use-for-ai-train...

CrossVR

> Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains there's little point trying to convince them that companies sometimes actually do what they say they are doing.

Even if the large LLM vendors respect it, there's enough venture capital going around that plenty of smaller vendors are attempting to train their own LLMs and they'll take every edge they can get, robots.txt be damned.

simonw

Yeah this is definitely true.

flir

> The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots

So, uh... where's all the extra traffic coming from?

simonw

All of the badly behaved crawlers.

flir

Yeah, I read the rest of the conversation and tried to delete. I understand your point now. Apologies.

hooloovoo_zoo

Your link does not say they will obey it.

simonw

Direct quote from https://platform.openai.com/docs/bots

"OpenAI uses the following robots.txt tags to enable webmasters to manage how their sites and content work with AI."

Then for GPT it says:

"GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."

What are you seeing here that I'm missing?

hooloovoo_zoo

My read is that they are describing functionality for site owners to provide input about what the site owner thinks should happen. OpenAI is not promising that is what WILL happen, even in the narrow context of that specific bot.

charles_f

I somewhat agree with your viewpoint on copyright, but what terrifies me is VCs like a16z or Sequoia simultaneously backing large LLMs that profit from ignoring copyright, and media firms that will use whatever power and lobbying they have to protect it.

I don't think the content I produce is worth that much, and I'm glad if it can serve anyone, but I do find the idea of poisoning the well amusing.

fastball

> According to Google, it’s possible to verify Googlebot by matching the crawler’s IP against a list of published Googlebot IPs. This is rather technical and highly intensive

Wat. Blocklisting IPs is not very technical (for someone running a website who knows and cares about crawling) and is definitely not intensive. Fetch the IP list, add it to a blocklist, repeat daily with a cron job.

Would take an LLM (heh) 10 seconds to write you the necessary script.
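Google does publish its crawler IP ranges as JSON. A sketch of the parsing half of that cron job (the sample payload mimics the shape of Google's published file; in a real job you would fetch the live JSON instead of hardcoding it):

```python
import json

# Sample payload in the shape of Google's published Googlebot IP-range
# file; a cron job would download the live JSON here instead.
sample = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "66.249.64.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
""")

def extract_cidrs(payload):
    """Flatten every IPv4/IPv6 prefix into a list of CIDR strings."""
    return [
        entry[key]
        for entry in payload.get("prefixes", [])
        for key in ("ipv4Prefix", "ipv6Prefix")
        if key in entry
    ]

# One CIDR per line, ready to pipe into a firewall allow/deny list.
for cidr in extract_cidrs(sample):
    print(cidr)
```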

kinix

From how I read it, the author seems to be suggesting that this list of IPs go on an allowlist, as they see Google as "less nefarious". As such, sure, allowing Google IPs is as easy as allowing all IPs, but discerning who the "nefarious actors" are is probably harder.

A more tongue-in-cheek point: all scripts take an LLM ~10 seconds to write; that doesn't mean they're right, though.

simonw

Looks like OpenAI publish the IPs of their training data crawler too: https://openai.com/gptbot.json

Popeyes

What is this war about?

I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got the head start by taking a copy of Encyclopedia Britannica and everything else is a

And now that the corpus is collected, what difference does a blog post make? Does it nudge the dial of comprehension 0.001% in a better direction? How many blog posts over how many weeks would make the difference?

nvader

> they only got the head start by taking copy of Encyclopedia Britannica

Wikipedia used a version of Encyclopedia Britannica that was in the public domain.

Go thou and do likewise.

collinmcnulty

It is about imposing costs on poorly behaved scraping in an attempt to change the scrapers' behavior, under the assumption that the scrapers' creators are anti-social but economically rational. One blog doesn't make a huge difference, but if enough new blogs contain tarpits that cost a scraper as much as the benefit of 100 non-tarpit blogs, maybe the calculus for doing any new scraping changes and the scrapers start behaving.

simonw

This is the first I've heard of Wikipedia starting with a copy of Britannica. Where did you see that?

simonw

OK, found it: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Encyclop...

"Starting in 2006, much of the still-useful text in the 1911 Encyclopaedia was adapted and absorbed into Wikipedia. Special focus was given to topics that had no equivalent in Wikipedia at the time. The process we used is outlined at the end of this page."

Wikipedia started in 2001. Looks like they absorbed a bunch of out-of-copyright Britannica 1911 content five years later.

There are still 13,000 pages on Wikipedia today that are tagged as deriving from that project: https://en.m.wikipedia.org/wiki/Template:EB1911

tqwhite

I continue to have contempt for the "I'm not contributing to the enrichment of our newest and most powerful technology" gang. I do not accept the assertion that my AI should have any less access to the internet that we all pay for than I do.

If guys like this have their way, AI will remain stupid and limited and we will all be worse off for it.

nkrisc

> I continue to have contempt for the "I'm not contributing to the enrichment of our newest and most powerful technology" gang.

Ok, when do I get paid for my contribution?

thoroughburro

I do not accept the assertion that the content I host is anything other than my own to serve to whom I wish. Get your hands off my belongings, freeloader.

Or, shorter: I hold you in as much contempt as you hold me.

> If guys like this have their way, AI will remain stupid and limited

AI doesn’t have a right to my stuff just because it’ll suck without it.

IncreasePosts

Okay, what percent of content is poison like this? Way less than 1%? If AI is so smart, maybe it can figure out which content in a training set is gibberish and which isn't.

Anyway, a big problem people have isn't "AI bad", it's "AI crawlers bad", because they eat up a huge chunk of your bandwidth by behaving badly for content you intend to serve to humans.