Please stop externalizing your costs directly into my face

128 comments

·March 18, 2025

dalke

I have a small static site. I haven't touched it in a couple of years.

Even then, I see bot after bot, pulling down about 1/2 GB per day.

Like, I distribute Python wheels from my site, with several release versions X several Python versions.

I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the header shows it hasn't changed:

  Last-Modified: Thu, 25 May 2023 09:07:25 GMT
  ETag: "8c2f67-5fc80f2f3b3e6"

Well, I know the answer to the second question, as DeVault's title highlights - it's cheaper to re-read the data and re-process the content than set up a local cache.

Externalizing their costs onto me.

I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.

Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.

CGamesPlay

This is a pet peeve of Rachel by the Bay. She sets strict limits on her RSS feed based on not properly using the provided caching headers. I wonder if anyone has made a WAF that automates this sort of thing.

notacoward

Given that they're actively trying to obfuscate their activity (according to Drew's description), identifying and blocking clients seems unlikely to work. I'd be tempted to de-prioritize the more expensive types of queries (like "git blame") and set per repository limits. If a particular repository gets hit too hard, further requests for it will go on the lowest-priority queue and get really slow. That would be slightly annoying for legitimate users, but still better than random outages due to system-wide overload.

BTW isn't the obfuscation of the bots' activity a tacit admission by their owners that they know they're doing something wrong and causing headaches for site admins? In the copyright world that becomes wilful infringement and carries triple damages. Maybe it should be the same for DoS perpetrators.

CGamesPlay

Just to clarify, my understanding is that she doesn't block user agent strings, she blocks based on IP and not respecting caching headers (basically, "I know you already looked at this resource and are not including the caching tags I gave to you"). It's a different problem than the original article discusses, but perhaps more similar to @dalke's issue.

dalke

I'm pretty sure:

"I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. ... it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all."

means that no one has managed it.

socksy

I don't understand the thing about the cache. Presumably they have a model that they are training, that must be their cache? Are they retraining the same model on the same data on the basis that that will weigh higher page ranked pages higher or something? Or is this about training slightly different models?

If they are really just training the same model, and there's no benefit to training it multiple times on that data, then presumably they could use a statistical data structure like https://en.wikipedia.org/wiki/HyperLogLog to be check if they've trained on the site before based on the Last-Modified header + URI? That would be far cheaper than a cache, and cheaper than rescraping.

I was also under the impression that the name of the game with training was to get high quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?

MathMonkeyMan

Good rant!

The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?

> random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

future10se

There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)

They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)

Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.

I suspect some of these LLM companies (or the ones they outsource to capture data) some of their traffic through these residential proxy services. It's funny because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.) but for a number of reasons (e.g. legal) they would probably rather go through a third-party.

bayindirh

They can be local LLMs doing search, some SETI@Home style distributed work, or else.

I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.

IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share the sentiment with Drew. I never asked/consented for this.

stavros

> the stats show 2K hits for some pages on some days

This has been happening since long before LLMs. Fifteen years ago, my blog would see 3k visitors a day on server logs, but only 100 on Google Analytics. Bots were always scraping everything.

zild3d

> I never asked/consented for this.

You put it on the information super highway baby

bayindirh

> You put it on the information super highway baby

...with proper licenses.

Here, FTFY, baby.

Addenda: Just because you don't feel like obeying/honoring them doesn't make the said licenses moot and toothless. I mean, if you know, you know.

Applejinx

Between Microsoft and Google, my existence AND presence as a community open source developer is being scraped and stolen.

I've been trying to write a body of audio code that sounds better than the stuff we got used to in the DAW era, doing things like dithering the mantissa of floating-point words, just experimental stuff ignoring the rules. Never mind if it works: I can think it does, but my objection holds whether it does or not.

Firstly, if you rip my stuff halfway it's pointless: without the coordinated intention towards specific goals not corresponding with normally practiced DSP, it's useless. LLMs are not going to 'get' the intention behind what I'm doing while also blending it with the very same code I'm a reaction against, the code that greatly outnumbers my own contributions. So even if you ask it to rip me off it tries to produce a synthesis with what I'm actively avoiding, resulting in a fantasy or parody of what I'm trying to make.

Secondly, suppose it became possible to make it hallucinate IN the relevant style, perhaps by training exclusively on my output, so it can spin off variations. That's not so far-fetched: _I_ do that. But where'd the style come from, that you'd spend effort tuning the LLM to? Does it initiate this on its own? Would you let it 'hallucinate' in that direction in the belief that maybe it was on to something? No, it doesn't look like that's a thing. When I've played with LLMs (I have a Mac Studio set up with enough RAM to do that) it's been trying to explore what the thing might do outside of expectation, and it's hard to get anything interesting that doesn't turn out to be a rip from something I didn't know about, but it was familiar with. Not great to go 'oh hey I made it innovate!' when you're mistakenly ripping off an unknown human's efforts. I've tried to explore what you might call 'native hallucination', stuff more inherent to collective humanity than to an individual, and I'm not seeing much facility with that.

Not that people are even looking for that!

And lastly, as a human trying to explore an unusual position in audio DSP code with many years of practice attempting these things and sharing them with the world around me only to have Microsoft try to reduce me to a nutrient slurry that would add a piquant flavor to 'writing code for people', I turn around and find Google, through YouTube, repeatedly offering to speak FOR me in response to my youtube commenters. I'm sure other people have seen this: probably depends on how interactive you are with your community. YouTube clearly trains a custom LLM on my comment responses to my viewers, that being text they have access to (doubtless adding my very verbose video footnotes) to the point that they're regularly offering to BE ME and save me the trouble.

Including technical explanations and helpful suggestions of how to use my stuff, that's not infrequently lies and bizarro world interpretations of what's going on, plus encouraging or self-congratulatory remarks that seem partly drawn from known best practices for being an empty hype beast competing to win the algorithm.

I'm not sure whether I prefer this, or the supposed promise of the machines.

If it can't be any better than this, I can keep working as I am, have my intentionality and a recognizable consistent sound and style, and be full of sass and contempt for the machines, and that'll remain impossible for that world to match (whether they want to is another question… but purely in marketing terms, yes they'll want to because it'll be a distinct area to conquer once the normal stuff is all a gray paste)

If it follows the path of the YouTube suggestions, there will simply be more noise out there, driven by people trying to piggyback off the mindshare of an isolated human doing a recognizable and distinct thing for most of his finite lifetime, with greater and greater volume of hollow mimicry of that person INCLUDING mimicry of his communications and interpersonal tone, the better to shunt attention and literal money to, not the LLMs doing the mimicking, but a third party working essentially in marketing, trying to split off a market segment they've identified as not only relevant, but ripe for plucking because the audience self-identifies as eager to consume the output of something that's not usual and normal.

(I should learn French: that rant is structurally identical to an endlessly digressive French expostulation)

Today I'm doing a livestream, coding with a small audience as I try for the fourth straight day to do a particular sort of DSP (decrackling) that's previously best served by some very expensive proprietary software costing over two thousand dollars for a license. Ideally I can get some of the results while also being able to honor my intentions for preserving the aspects of the audio I value (which I think can be compromised by such invasive DSP). That's because my intention will include this preservation, these iconoclastic details I think important, the trade-offs I think are right.

Meanwhile crap is trained on my work so that a guy who wants money can harness rainforests worth of wasted electrical energy to make programs that don't even work, and a pretend scientist guru persona who can't talk coherently but can and will tell you that he is "a real audio hero who's worked for many years to give you amazing free plugins that really defy all the horrible rules that are ruining music"!

Because this stuff can't pay attention, but it can draw all the wrong conclusions from your tone.

And if you question your own work and learn and grow from your missteps to have greater confidence in your learned synthesis of knowledge, it can't do that either but it can simultaneously bluster with your confidence and also add 'but who knows maybe I'm totally wrong lol!'

And both are forms of lies, as it has neither confidence nor self-doubt.

I'm going on for longer than the original article. Sorry.

ilumanty

This sounds pretty interesting. Can you share a link to your work or livestream?

danaris

I run a small browser game—roughly 150 unique weekly active users.

Our Wiki periodically gets absolutely hammered by LLM scraper bots, rotating IP addresses like mad to avoid mitigations like fail2ban (which I do have in place). And even when they're not hitting it hard enough to crash the game (through the external data feeds many of the wiki pages rely on), they're still scraping pretty steadily.

There is no earthly way that my actual users are able to sustain ~400kbps outbound traffic round the clock.

zjehT

> The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not?

I think you cannot distinguish. But the issue is so large that Google now serves captchas on legitimate traffic, sometimes after the first search term if you narrow down the time window (less than 24 hours).

I wonder when real Internet companies will feel the hurt, simply because consumers will stop using the ruined Internet.

throwawayffffas

> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

After using Claude code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.

hobofan

I think there is a good chance of the VC/Startup side of the bubble bursting. However, I think we will never go back to a time without using LLMs, given that you can run useful open-weights models on a local laptop.

throwawayffffas

Yeah I totally agree, I also run phi4 mini locally and was thoroughly impressed. The genie is out of the bottle.

dingnuts

share your settings and system specs please, I haven't seen anything come out of a local LLM that was useful.

if you don't, since you're using a throwaway handle, I'll just assume you're paid to post. it is a little odd that you'd use a throwaway just to post LLM hype.

is that you, Sam?

sotix

That’s funny because after using Cursor with Claude for a month at work at the request of the CTO, I have found myself reverting to neovim and am more productive. I see the sliver of value but not for complex coding requirements.

XCSme

I had the same experience. Initially it was a fun and exciting, but in the end, the consistent small bugs and destructive agentic behaviour (deleting random code) made it less productive than simply writing the code yourself. And if you write yourself, you already understand it, and it's of a higher quality (unless you are a beginner).

mostlysimilar

What did you find it useful for in this afternoon?

easton

It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?

Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.

(Either that or make following robots.txt a legal requirement, but that feels also like stifling hobbyists that just want to scrape a page)

rakoo

Or stop defending individually and start attacking collectively ? This is an issue on scrapers side, they have to fix it

> make following robots.txt a legal requirement

Even better: make them publish their IP addresses so we know what to block, just like robocallers. Every scrape made from another IP is crime.

azornathogron

> Either that or make following robots.txt a legal requirement [...]

A legal requirement in what jurisdiction, and to be enforced how and by whom?

I guess the only feasible legislation here is something where the victim pursues a case with a regulating agency or just through the courts directly. But how does the victim even find the culprit when the origin of the crawling is being deliberately obscured, with traffic coming from a botnet running on exploited consumer devices?

socksy

It wouldn't have to go that deep. If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions, then there would presumably have to be an entity in those jurisdictions, such as a VPN provider, an illegal botnet, or a legal botnet, and you pursue legal action with those.

The VPNs and legal botnets would be heavily incentivized to not allow this to happen (and presumably already are doing traffic analysis), and illegal botnets should be shutdown anyway (some grace in the law about being unaware of it happening should of course be afforded, but once you are aware it is your responsibility to prevent your machine from committing crimes).

azornathogron

> illegal botnets should be shutdown anyway

Illegal botnets aren't new. Are they currently shutdown regularly? (I'm actually asking, I don't know)

> If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions

That sounds kinda like the balkanization of the internet. It's not without some cost. I don't mean financially, but in terms of eroding the connectedness that is supposed to be one of the internet's great benefits.

milch

Maybe people need to add deliberate traps on their websites. You could imagine a provider like Cloudflare injecting a randomly generated code phrase into thousands of sites and making sure to attribute it under a strict license, that is invisible so that no human sees it, and changes every few days. Presumably LLMs would learn this phrase and later be able to repeat it - getting a sufficiently high hit rate should be proof that they used illegitimately obtained data. Kinda like back in the old days when map makers included fake towns, rivers and so on in their maps so that if others copied it they could tell

null

[deleted]

ak38Hasg

Unsurprisingly, this submission is far removed from the front page.

The people stealing and ddos-ing literally do not care. Nobel prizes are handed out for this crap. Politicians support the steal.

Thanks Drew for being an honest voice.

pas

it was in the top 20 for the day

on https://hckrnews.com/

CGamesPlay

I feel like I recall someone recently built a simple proof of work CAPTCHA for their personal git server. Would something like that help here?

Alternatively, a technique like Privacy Pass might be useful here. Basically, give any user a handful of tokens based on reputation / proof of work, and then make each endpoint require a token to access. This gives you a single endpoint to rate limit, and doesn’t require user accounts (although you could allow known-polite user accounts a higher rate limit on minting tokens).

noosphr

Here we get to the original sin of packet networking.

The ARPANET was never meant to be commercial or private. All the protocols developed for it were meant to be subsidized by universities, the government or the military, with the names, phone numbers, and addresses of anyone sending a packet being public knowledge to anyone else in the network.

This made sense for the time since the IMPs used to send packets had less computing power than an addressable LED today.

Today the average 10 year old router has more computing power than was available in the whole world in 1970, but we've not made any push to move to protocols that incorporate price as a fundamental part of their design.

Worse is that I don't see anyway that we can implement this. So we're left with screeds by people who want information to be free, but get upset when they find out that someone has to pay for information.

gmemstr

You're likely thinking of Anubis from Techaro: https://github.com/TecharoHQ/anubis

newbie-02

Fresh versions of Firefox-ESR, Brave and Opera made it for the moment, however you need to find a way to allow them cookies while they persistently re-iterate, disabling our settings window. I find it absolutely unacceptable behavior to lock people out of their data - their property - without providing comprehensible reasons such as 'your browser is too old, only versions from xxx are supported'. And if this doesn't happen automatically, it should be communicated immediately in support requests. The way it is done, despite all good intentions, are malicious experiments with the users.

newbie-02

Anyone else unjustified blocked by TecharoHQ/anubis? Experience a silly 'Oh noes!' page and no help in the bug tracker ( 'we are new and experimental, pls. contact your sysadmin' ). Affected site: gitlab.gnome.org/GNOME .

CGamesPlay

Yes, thank you.

akoboldfrying

> This gives you a single endpoint to rate limit

Would you be rate-limiting by IP? Because the attacker is using (nearly) unique IPs for each request, so I don't see how that would help.

> someone recently built a simple proof of work CAPTCHA for their personal git server

As much as everyone hates CAPTCHAs nowadays, I think this could be helpful if it was IP-oblivious, random and the frequency was very low. E.g., once per 100 requests would be enough to hurt abusers making 10000 requests, but would cause minimal inconvenience to regular users.

CGamesPlay

I didn’t mention IP at all in my post. You can globally rate limit anonymous requests for everything (except maybe your login endpoint), if that’s the thing that makes sense for you.

The nice thing about the proof-of-work approach is that it can be backgrounded for users with normal browsers, just like link prefetching.

akoboldfrying

> You can globally rate limit anonymous requests for everything

Obviously you can do this, but that will start blocking everyone. How is that a solution?

Also what prevents attacker-controlled browsers from backgrounding the PoW too?

I took it as read that for something to qualify as a solution to this problem, it needs to affect regular users less badly than attackers.

pentaphobe

Is there currently any way to identify ethical (or I guess, semi-ethical or even not actively unethical..) AI companies?

Would be nice to see some effort on this front, a la "we don't scrape", or "we use this reputable dataset which scrapes while respecting robots.txt and reasonable caching", or heaven forbid "we train only on a dataset based on public domain content"

Even if it was an empty promise (or goodwill ultimately broken) it'd be _something_. If it exists, I'd certainly prioritise any products which proclaimed it

Do we need to make one?

pentaphobe

(Posting as separate comment because wall of text)

Also - Honestly don't even understand why so many people would need to scrape for training data.

Is it naïveté (thinking it necessary), arrogance (thinking it better than others, and thus justified)?

Aren't most advances now primarily focused on either higher level (agents, layered retrieval) or lower level (eg. alternatives to transformers, etc.. which would be easier to prove useful on existing datasets)?

Genuine questions, all of these - if I'm off the mark I'm keen to learn!

h4kor

I had to take offline my personal gitea instance as it was regularly spammed, crashing Caddy in the process.

myaccountonhn

If you don't have public repos you can use gitolite and have ssh only access.

h4kor

I used and am now using plain git over ssh. I only hosted gitea to have my code accessible for anyone interested.

My "main" projects I want to keep public are moved/cloned to GitHub, the rest I just don't bother with any longer.

petercooper

If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?

ransom_rs

At the top of the article:

> This blog post is expressing personal experiences and opinions and doesn’t reflect any official policies of SourceHut.

znpy

> All of my sysadmin friends are dealing with the same problems.

I'm seeing this as well. Some of the websites my company operates suddenly go through some 10-25x in requests per minute at night. Most often the AS where the ip comes from is from some Microsoft datacenter (meaning, it's most likely OpenAI).

At this point i'm starting to consider blocklisting Microsoft AS numbers and/or requiring login (or some kind of api key) when coming from a datacenter network. Reddit does this as well, already, for example.

> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

This sounds... childish (to put it politely) ?

me2too

He's just right

InDubioProRubio

All things unregulated eventually turn into a mogadishu style eco-system where parasitism is ammunition. Open source are the plants of this brave new disgusting world. To be grazed upon by the animals, that suppressed the protection system of the plants- aka the state.

Once the allmende, the grass runs out, things get interesting. We shall see Cyperpunk like computational parasitism plaguing companies and attempts to filter these work out. I guess that is the only way, to really prevent that sort of bot. You take arbitrary pieces of the UOW they want done- and reject them on principal. And depending on the cleverness of the batch algorithm, they will come back with that again and again, identifying themselves via repetition.

HN

Please stop externalizing your costs directly into my face

Please stop externalizing your costs directly into my face