You Should Feed the Bots
73 comments
·October 26, 2025
zkmon
Really cool. Reminds me of farmers in some third-world countries. Completely ignored by the government and exploited by commission brokers, farmers now use all sorts of tricks, including coloring and faking their farm produce, without regard for the health hazards to consumers. The city dwellers who thought they had gamed the system through higher education, jobs, and slick talk have to consume whatever the desperate farmers serve them.
righthand
The agricultural farmers did it to themselves; many are already very wealthy. Anything corporate America has taken over, it took over because the farmers didn't want to do the maintenance work. So they sell out to big corporations that will make it easier.
Same as any other consumer using Meta products. You sell out because it’s easier to network that way.
I am the son of a farmer.
Edit: added disclosure at the bottom and clarified as agricultural farming
zkmon
I'm a farmer myself. I was talking about farmers in some third-world countries. They have been extremely marginalized and have suffered for decades, even centuries. They still do.
Lord-Jobo
This is a very biased source discussing a very real perception issue, and it's worth a glance for the statistics:
https://www.farmkind.giving/the-small-farm-myth-debunked
TL;DR: the idea of American farms as small family farms has not been rooted in truth for a very long time.
righthand
This is about livestock farming; I was specifically discussing agricultural farming.
In general, though, the easy rule for eating non-mega-farmed food and living sustainably is to “eat aware”:
> My other advice is a one-size-fits-all food equation, which is, simply, to know where it came from. If you can't place it, trace it, or grow it/raise it/catch it yourself, don't eat it. Eat aware. Know your food. Don't wait on waiters or institutions to come up with ways to publicize it, meet your small fishmonger and chat him or her up at the farmer's market yourself. [0]
[0] https://www.huffpost.com/entry/the-pescatores-dilemma_b_2463...
markus_zhang
I have always recommended this strategy: flood the AI bots with garbage that looks like authentic information so that they need actual humans to filter it. Make sure every site does this so they collect more garbage than real content. Raise the proportion until even ordinary people figure out that using these AI products does more harm than good, because they just produce garbage. I just didn't know what the cost would be; now it looks pretty doable.
If you can't fight them, flood them. If they want to open a window, pull down the whole house.
peterlk
LLMs can now detect garbage much more cheaply than humans can. This might increase cost slightly for the companies that own the AIs, but it almost certainly will not result in hiring human reviewers
lcnPylGDnU4H9OF
> LLMs can now detect garbage much more cheaply than humans can.
Off the top of my head, I don't think this is true for training data. I could be wrong, but it seems very fallible to let GPT-5 be the source of ground truth for GPT-6.
markus_zhang
What about garbage that is difficult to tell apart from the truth?
For example, say I have an AD&D website: how does the AI tell whether a piece of FR history is canon or not? Yeah, I know it's a bit extreme, but you get the idea.
goodthink
I have yet to see any bots figure out how to get past the Basic Auth protecting all links on my (zero-traffic) website. Of course, any user following a link will be stopped by the same login dialog (I display the credentials on the home page).
The solution is to make the secrets public. ALL websites could implement the same credentials: User: nobots, Pass: nobots.
Can bot writers overcome this if they know the credentials?
CaptainOfCoit
> Can bot writers overcome this if they know the credentials?
Yes: instead of doing just an HTTP request, do an HTTP request with authentication; trivial, really. The reason they "can't" do it now is probably that they haven't come across "public content behind Basic Auth with known correct credentials", so the behavior hasn't been added. But it's literally loading http://username:password@example.com instead of http://example.com to use Basic Auth; it couldn't be simpler :)
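For illustration, a minimal libcurl sketch of such a request; the nobots/nobots credentials come from the comment upthread, and example.com and the /protected/ path are stand-ins (build with gcc fetch.c -lcurl):

    /* Sketch only: fetch a page that sits behind Basic Auth using credentials
     * the site itself publishes (here the hypothetical nobots/nobots pair). */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void) {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        /* Equivalent to loading https://nobots:nobots@example.com/protected/ */
        curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/protected/");
        curl_easy_setopt(curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
        curl_easy_setopt(curl, CURLOPT_USERPWD, "nobots:nobots");

        CURLcode res = curl_easy_perform(curl);  /* body goes to stdout by default */
        if (res != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }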
8organicbits
The technical side is straightforward, but the legal implications of trying passwords to scrape content behind authentication could pose a barrier. Using credentials that aren't yours, even if they are publicly known, is a crime in many jurisdictions. Doing it at scale as part of a company would be quite risky.
DrewADesign
The people in the mad dash to AGI are driven by either religious conviction or pure nihilism. Nobody doing this seriously considers the law a valid impediment. They justify (earnestly or not) companies doing things like scraping independent artists' bread-and-butter work to build commercial services that tank those artists' market with garbage knockoffs, by claiming we're moving into a post-work society. Meanwhile, the US government is moving at breakneck pace to dismantle the already insufficient safety nets we do have. None of them care. Ethical roadblocks seem to be a solved problem in tech now.
Macha
The legal implications of torrenting giant ebook collections didn't seem to stop them; not sure why this would.
CaptainOfCoit
> but the legal implications of trying passwords to scrape content behind authentication could pose a barrier
If you're doing something akin to cracking, then yeah. But if the credentials are right there on the landing page, visible to the public, it's not really cracking anymore: you already know the right password before you try it, and the website that put up the Basic Auth is freely sharing it. You aren't bypassing anything, just using the same access method as everyone else.
Again, if you stumble upon Basic Auth and try to crack it, I agree that's at least borderline illegal, but that was not the context of the parent comment.
Filligree
Sure, it's a crime for the bots, but it would also be a crime for the ordinary users you do want accessing the website.
Or, if you make it clear that access is allowed, I'm not sure you can stop the bots then.
sisizbzb
There are hundreds of billions of dollars behind these guys. Not only that, they also have institutional power backing them. The laws don't really matter to the worst offenders.
As with OP's article, trying to find a technical solution here is very inefficient and just a band-aid. The people running our society are, on the whole, corrupt and evil. It's much simpler (not easier) and more powerful to remove them.
morkalork
The bot protection on low-traffic sites can be hilarious in how simple and effective it is. Just click this checkbox. That's it. But because it's not a checkbox matching a specific pattern from a well-known service, it keeps working until a bot writer inspects the site and adds the case. A browser running OpenAI Operator or whatever it's called would immediately figure it out, though.
akoboldfrying
> A browser running OpenAI Operator or whatever it's called would immediately figure it out, though.
But running that costs money, which is a disincentive. (How strong a disincentive depends on what it costs versus the estimated value of a scraped page, but I think it would raise the per-page cost by at least 100x.)
lfkdev
Not sure I follow: why would credentials that anyone knows stop bots?
fainpul
This follow-up post has the details of the "Markov babbler":
isoprophlex
Very elegant and surprisingly performant. I hope the llm bros have a hard time cleaning this shit out of their scrapes.
tyfon
Thank you, I am now serving them garbage :)
For reference, I picked Frankenstein, Alice in Wonderland, and Moby Dick as sources, and I think they might be larger than necessary since they take some time to load. But they still work fine.
There also seems to be a bug in babble.c in the thread handling? I did "fix" it as gcc suggested by changing pthread_detach(&thread) to pthread_detach(thread). I probably broke something, but it compiles and runs now :)
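For anyone else hitting the same warning: pthread_detach() takes the pthread_t by value, so the compiler-suggested change is the correct call rather than a hack. A standalone sketch of the pattern (serve_babble here is just a stub, not the real babble.c worker; build with gcc detach.c -lpthread):

    /* Sketch of the detached-thread pattern; passing &thread was a type error. */
    #include <pthread.h>

    static void *serve_babble(void *conn) {   /* stand-in for the real worker */
        (void)conn;
        return NULL;
    }

    int main(void) {
        pthread_t thread;
        if (pthread_create(&thread, NULL, serve_babble, NULL) == 0)
            pthread_detach(thread);   /* was pthread_detach(&thread) */
        return 0;                     /* a detached thread cleans up after itself */
    }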
nodja
Why create the Markov text server-side? If the bots are running JavaScript, just have their client generate it.
bastawhiz
1. The bots have essentially unlimited memory and CPU. That's the cheapest part of any scraping setup.
2. You need to send the data for the Markov chain generator to the client, along with the code. This is probably bigger than the response you'd be sending anyway. (And good luck getting a bot to cache JavaScript)
3. As the author said, each request uses microseconds of CPU and just over a megabyte of RAM. This isn't taxing for anyone.
vntok
> 1. The bots have essentially unlimited memory and CPU. That's the cheapest part of any scraping setup.
Anyone crawling at scale would try to limit the per-request memory and CPU bounds, no? Surely you'd try to minimize resource contention at least a little bit?
blackhaj7
Can someone explain how this works?
Surely the bots are still hitting the pages they were hitting before, but now they hit the garbage pages too?
blackhaj7
Ah, it is explained in another post - https://maurycyz.com/projects/trap_bots/
Clever
wodenokoto
In the author's setup, sending Markov-generated garbage is much lighter on resources than serving the real static pages. Only bots will keep following links to the next piece of garbage, and thus he traps the bots in garbage. No need to detect bots: they reveal themselves.
But yes, all bots start out on an actual page.
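For readers who haven't seen one, here is a rough, self-contained sketch of a word-level Markov babbler. It is not the author's babble.c; the tiny built-in corpus, the fixed seed, and the 50-word output length are placeholders (the real thing is fed whole public-domain books, as tyfon describes above):

    /* Sketch: emit Markov babble by chaining word successors from a corpus. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WORDS 4096

    static const char *corpus =
        "the whale rose from the deep and the ship turned toward the whale "
        "while the crew watched the deep water and the ship creaked";

    int main(void) {
        char buf[4096];
        char *words[MAX_WORDS];
        int n = 0;

        strncpy(buf, corpus, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        /* Split the corpus into words. */
        for (char *tok = strtok(buf, " "); tok && n < MAX_WORDS; tok = strtok(NULL, " "))
            words[n++] = tok;

        srand(1234);              /* fixed seed for a repeatable demo page */
        int cur = rand() % n;

        /* Emit 50 words: after each word, jump to a random occurrence of that
         * word in the corpus and continue with whatever follows it there. */
        for (int out = 0; out < 50; out++) {
            printf("%s ", words[cur]);
            int candidates[MAX_WORDS], c = 0;
            for (int i = 0; i + 1 < n; i++)
                if (strcmp(words[i], words[cur]) == 0)
                    candidates[c++] = i + 1;
            cur = c ? candidates[rand() % c] : rand() % n;
        }
        putchar('\n');
        return 0;
    }

Per-request work is just a lookup-and-print loop, which is consistent with the microseconds-of-CPU figure mentioned above, though the author's implementation surely differs in the details.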
theturtlemoves
Does this really work, though? I know nothing about the inner workings of LLMs, but don't you want to break their word associations? Rather than generating "garbage" text based on which words tend to occur together (which is also how LLMs generate text, from words they have seen together), don't you want to feed them text that relates unrelated words?
wodenokoto
Why? The point is not to train the bots one way or another; it's to keep them busy with low-resource activity instead of high-resource activity.
krzyk
But why?
Do they do any harm? They do provide a source for the material if the user asks for it (I frequently do, because I don't trust them, so I check sources).
You still have to pay for the traffic, and serving static content (like the text on that website) is far less CPU- and disk-expensive than generating anything.
kaoD
What you're referring to are LLMs visiting your page via tool use. That's a drop in the ocean compared to the crawlers racing to slurp up as much of the internet as possible before it dries up.
blibble
if you want to be really sneaky make it so the web doesn't start off infinite
because an infinite site that has appeared out of nowhere will quickly be noticed and blocked
start it off small, and grow it by a few pages every day
and the existing pages should stay 99% the same between crawls to gain reputation
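One way to get that kind of stability (my own sketch, not something described in the article) is to seed the text generator from a hash of the requested path, so a given URL babbles the same page on every crawl while newly added paths yield new pages. FNV-1a and the example path are arbitrary choices:

    /* Sketch: derive the babbler's seed from the request path for stable pages. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint64_t fnv1a(const char *s) {
        uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
        while (*s) {
            h ^= (unsigned char)*s++;
            h *= 1099511628211ULL;               /* FNV prime */
        }
        return h;
    }

    /* Call once per request, before generating the page. */
    static void seed_for_path(const char *path) {
        srand((unsigned int)fnv1a(path));
    }

    int main(void) {
        seed_for_path("/babble/whales/42.html");       /* hypothetical garbage URL */
        printf("%d %d %d\n", rand(), rand(), rand());  /* same path, same stream */
        return 0;
    }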
hyperhello
Why not show them ads? Endless ads, with AI content in between them?
delecti
To what end? I imagine ad networks have pretty robust bot detection. I'd also be surprised if scrapers didn't have ad block functionality in their headless browsing.
eviks
How does this help protect the regular non-garbage pages from the bots?
codeduck
It does at a macroscopic level, by making scraping expensive. If every "valid" page is scattered at random among a tarpit of recursive nonsense pages, it becomes computationally and temporally expensive to scrape a site for "good" data.
A single site doing this does nothing, but many sites doing it have a severe negative impact on the utility of AI scrapers; at least until a countermeasure is developed.