Please stop externalizing your costs directly into my face
108 comments
March 18, 2025 · dalke
CGamesPlay
This is a pet peeve of Rachel by the Bay. She sets strict limits on her RSS feed for clients that don't properly use the provided caching headers. I wonder if anyone has made a WAF that automates this sort of thing.
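For context, those caching headers are just ordinary HTTP conditional requests. A minimal sketch of what a polite feed fetcher is expected to do, assuming Python with the requests library and a placeholder feed URL:

    # Sketch of a polite feed fetcher: send the validators from the last
    # fetch and skip the download entirely on a 304. The URL and cache
    # file are placeholders.
    import json
    import requests

    FEED_URL = "https://example.com/feed.xml"
    STATE_FILE = "feed_state.json"

    try:
        with open(STATE_FILE) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}

    headers = {}
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]
    if state.get("last_modified"):
        headers["If-Modified-Since"] = state["last_modified"]

    resp = requests.get(FEED_URL, headers=headers, timeout=30)
    if resp.status_code == 304:
        print("Feed unchanged; nothing to download.")
    else:
        resp.raise_for_status()
        # ...parse resp.content here...
        if resp.headers.get("ETag"):
            state["etag"] = resp.headers["ETag"]
        if resp.headers.get("Last-Modified"):
            state["last_modified"] = resp.headers["Last-Modified"]
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)

Clients that do this get a few hundred bytes back on an unchanged feed instead of the full document; the complaint is about clients that never send the validators at all.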
dalke
I'm pretty sure:
"I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. ... it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all."
means that no one has managed it.
notacoward
Given that they're actively trying to obfuscate their activity (according to Drew's description), identifying and blocking clients seems unlikely to work. I'd be tempted to de-prioritize the more expensive types of queries (like "git blame") and set per-repository limits. If a particular repository gets hit too hard, further requests for it go on the lowest-priority queue and get really slow. That would be slightly annoying for legitimate users, but still better than random outages due to system-wide overload.
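A rough sketch of that kind of per-repository budget, assuming a hook sitting in front of the expensive endpoints; the window and threshold below are made-up illustrative numbers:

    # Sketch: track expensive hits (git blame, log pages, ...) per repo in a
    # sliding window; repos over budget get routed to a low-priority queue
    # rather than rejected outright.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 300        # look at the last five minutes
    BUDGET_PER_WINDOW = 200     # expensive requests allowed per repo per window

    _hits = defaultdict(deque)  # repo name -> timestamps of expensive requests

    def priority_for(repo: str) -> str:
        """Return 'normal' or 'low' for an incoming expensive request."""
        now = time.monotonic()
        q = _hits[repo]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        return "low" if len(q) > BUDGET_PER_WINDOW else "normal"

Requests tagged "low" would be served by a slow worker pool, so a hammered repository gets sluggish for everyone instead of the whole instance falling over.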
BTW isn't the obfuscation of the bots' activity a tacit admission by their owners that they know they're doing something wrong and causing headaches for site admins? In the copyright world that becomes wilful infringement and carries triple damages. Maybe it should be the same for DoS perpetrators.
socksy
I don't understand the thing about the cache. Presumably they have a model that they are training; isn't that effectively their cache? Are they retraining the same model on the same data on the assumption that it will weight highly ranked pages more heavily, or something? Or is this about training slightly different models?
If they are really just training the same model, and there's no benefit to training it multiple times on that data, then presumably they could use a statistical data structure like https://en.wikipedia.org/wiki/HyperLogLog to check whether they've already trained on the page, keyed on the Last-Modified header + URI? That would be far cheaper than a full cache, and cheaper than rescraping.
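On that dedup idea: HyperLogLog only estimates how many distinct items were seen, so a membership sketch such as a Bloom filter is probably the closer fit for "have we already ingested this URL at this Last-Modified date". A self-contained sketch of that variant (sizes and error rate are illustrative):

    # Sketch: "have we already ingested this page?" using a Bloom filter
    # keyed on URL + Last-Modified. False positives (wrongly skipping a
    # page) are possible but rare at the chosen error rate.
    import hashlib
    from math import ceil, log

    class BloomFilter:
        def __init__(self, capacity: int, error_rate: float = 0.01):
            self.m = ceil(-capacity * log(error_rate) / log(2) ** 2)  # bits
            self.k = max(1, round(self.m / capacity * log(2)))        # hashes
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item: str):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    seen = BloomFilter(capacity=10_000_000)
    key = "https://example.com/post|Wed, 01 Jan 2025 00:00:00 GMT"
    if key not in seen:
        seen.add(key)
        # ...fetch and ingest the page...

Ten million URL+date pairs fit in roughly 12 MB this way, which is a lot cheaper than either a content cache or a re-crawl.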
I was also under the impression that the name of the game with training was to get high quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?
MathMonkeyMan
Good rant!
The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?
> random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
future10se
There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)
They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)
Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.
I suspect some of these LLM companies (or the ones they outsource data capture to) route some of their traffic through these residential proxy services. It's funny, because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.), but for a number of reasons (e.g. legal) they would probably rather go through a third party.
bayindirh
They could be local LLMs doing search, some SETI@home-style distributed work, or something else.
I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.
IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share the sentiment with Drew. I never asked/consented for this.
stavros
> the stats show 2K hits for some pages on some days
This has been happening since long before LLMs. Fifteen years ago, my blog would see 3k visitors a day on server logs, but only 100 on Google Analytics. Bots were always scraping everything.
zild3d
> I never asked/consented for this.
You put it on the information super highway baby
bayindirh
> You put it on the information super highway baby
...with proper licenses.
Here, FTFY, baby.
Addendum: Just because you don't feel like obeying/honoring them doesn't make said licenses moot and toothless. I mean, if you know, you know.
Applejinx
Between Microsoft and Google, my existence AND presence as a community open source developer is being scraped and stolen.
I've been trying to write a body of audio code that sounds better than the stuff we got used to in the DAW era, doing things like dithering the mantissa of floating-point words, just experimental stuff ignoring the rules. Never mind if it works: I can think it does, but my objection holds whether it does or not.
Firstly, if you rip my stuff halfway it's pointless: without the coordinated intention towards specific goals not corresponding with normally practiced DSP, it's useless. LLMs are not going to 'get' the intention behind what I'm doing while also blending it with the very same code I'm a reaction against, the code that greatly outnumbers my own contributions. So even if you ask it to rip me off it tries to produce a synthesis with what I'm actively avoiding, resulting in a fantasy or parody of what I'm trying to make.
Secondly, suppose it became possible to make it hallucinate IN the relevant style, perhaps by training exclusively on my output, so it can spin off variations. That's not so far-fetched: _I_ do that. But where'd the style come from, that you'd spend effort tuning the LLM to? Does it initiate this on its own? Would you let it 'hallucinate' in that direction in the belief that maybe it was on to something? No, it doesn't look like that's a thing. When I've played with LLMs (I have a Mac Studio set up with enough RAM to do that) it's been trying to explore what the thing might do outside of expectation, and it's hard to get anything interesting that doesn't turn out to be a rip from something I didn't know about, but it was familiar with. Not great to go 'oh hey I made it innovate!' when you're mistakenly ripping off an unknown human's efforts. I've tried to explore what you might call 'native hallucination', stuff more inherent to collective humanity than to an individual, and I'm not seeing much facility with that.
Not that people are even looking for that!
And lastly, as a human trying to explore an unusual position in audio DSP code with many years of practice attempting these things and sharing them with the world around me only to have Microsoft try to reduce me to a nutrient slurry that would add a piquant flavor to 'writing code for people', I turn around and find Google, through YouTube, repeatedly offering to speak FOR me in response to my youtube commenters. I'm sure other people have seen this: probably depends on how interactive you are with your community. YouTube clearly trains a custom LLM on my comment responses to my viewers, that being text they have access to (doubtless adding my very verbose video footnotes) to the point that they're regularly offering to BE ME and save me the trouble.
Including technical explanations and helpful suggestions of how to use my stuff, that's not infrequently lies and bizarro world interpretations of what's going on, plus encouraging or self-congratulatory remarks that seem partly drawn from known best practices for being an empty hype beast competing to win the algorithm.
I'm not sure whether I prefer this, or the supposed promise of the machines.
If it can't be any better than this, I can keep working as I am, have my intentionality and a recognizable consistent sound and style, and be full of sass and contempt for the machines, and that'll remain impossible for that world to match (whether they want to is another question… but purely in marketing terms, yes they'll want to because it'll be a distinct area to conquer once the normal stuff is all a gray paste)
If it follows the path of the YouTube suggestions, there will simply be more noise out there, driven by people trying to piggyback off the mindshare of an isolated human doing a recognizable and distinct thing for most of his finite lifetime, with greater and greater volume of hollow mimicry of that person INCLUDING mimicry of his communications and interpersonal tone, the better to shunt attention and literal money to, not the LLMs doing the mimicking, but a third party working essentially in marketing, trying to split off a market segment they've identified as not only relevant, but ripe for plucking because the audience self-identifies as eager to consume the output of something that's not usual and normal.
(I should learn French: that rant is structurally identical to an endlessly digressive French expostulation)
Today I'm doing a livestream, coding with a small audience as I try for the fourth straight day to do a particular sort of DSP (decrackling) that's previously best served by some very expensive proprietary software costing over two thousand dollars for a license. Ideally I can get some of the results while also being able to honor my intentions for preserving the aspects of the audio I value (which I think can be compromised by such invasive DSP). That's because my intention will include this preservation, these iconoclastic details I think important, the trade-offs I think are right.
Meanwhile crap is trained on my work so that a guy who wants money can harness rainforests worth of wasted electrical energy to make programs that don't even work, and a pretend scientist guru persona who can't talk coherently but can and will tell you that he is "a real audio hero who's worked for many years to give you amazing free plugins that really defy all the horrible rules that are ruining music"!
Because this stuff can't pay attention, but it can draw all the wrong conclusions from your tone.
And if you question your own work and learn and grow from your missteps to have greater confidence in your learned synthesis of knowledge, it can't do that either but it can simultaneously bluster with your confidence and also add 'but who knows maybe I'm totally wrong lol!'
And both are forms of lies, as it has neither confidence nor self-doubt.
I'm going on for longer than the original article. Sorry.
ilumanty
This sounds pretty interesting. Can you share a link to your work or livestream?
zjehT
> The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not?
I think you cannot distinguish. But the issue is so large that Google now serves captchas on legitimate traffic, sometimes after the first search term if you narrow down the time window (less than 24 hours).
I wonder when real Internet companies will feel the hurt, simply because consumers will stop using the ruined Internet.
danaris
I run a small browser game—roughly 150 unique weekly active users.
Our Wiki periodically gets absolutely hammered by LLM scraper bots, rotating IP addresses like mad to avoid mitigations like fail2ban (which I do have in place). And even when they're not hitting it hard enough to crash the game (through the external data feeds many of the wiki pages rely on), they're still scraping pretty steadily.
There is no earthly way that my actual users are able to sustain ~400kbps outbound traffic round the clock.
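Back-of-the-envelope, just to put that figure in context:

    # ~400 kbit/s sustained outbound, converted to volume.
    kbps = 400
    gb_per_day = kbps * 1000 / 8 * 86400 / 1e9    # ~4.3 GB/day
    gb_per_month = gb_per_day * 30                # ~130 GB/month
    print(f"{gb_per_day:.1f} GB/day, {gb_per_month:.0f} GB/month")

That's on the order of 130 GB of wiki pages a month for a game with roughly 150 weekly players.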
easton
It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?
Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.
(Either that or make following robots.txt a legal requirement, but that also feels like stifling hobbyists who just want to scrape a page)
azornathogron
> Either that or make following robots.txt a legal requirement [...]
A legal requirement in what jurisdiction, and to be enforced how and by whom?
I guess the only feasible legislation here is something where the victim pursues a case with a regulating agency or just through the courts directly. But how does the victim even find the culprit when the origin of the crawling is being deliberately obscured, with traffic coming from a botnet running on exploited consumer devices?
socksy
It wouldn't have to go that deep. If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions, then there would presumably have to be an entity in those jurisdictions — a VPN provider, an illegal botnet, or a legal botnet — and you could pursue legal action against that entity.
The VPNs and legal botnets would be heavily incentivized not to allow this to happen (and presumably are already doing traffic analysis), and illegal botnets should be shut down anyway (some grace in the law for being unaware of it happening should of course be afforded, but once you are aware, it is your responsibility to prevent your machine from committing crimes).
azornathogron
> illegal botnets should be shutdown anyway
Illegal botnets aren't new. Are they currently shut down regularly? (I'm actually asking, I don't know)
> If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions
That sounds kinda like the balkanization of the internet. It's not without some cost. I don't mean financially, but in terms of eroding the connectedness that is supposed to be one of the internet's great benefits.
milch
Maybe people need to add deliberate traps on their websites. You could imagine a provider like Cloudflare injecting a randomly generated code phrase into thousands of sites, attributed under a strict license, invisible so that no human sees it, and changing every few days. Presumably LLMs would learn this phrase and later be able to repeat it - a sufficiently high hit rate would be proof that they used illegitimately obtained data. Kinda like back in the old days when map makers included fake towns, rivers, and so on in their maps so that if others copied them, they could tell.
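A hedged sketch of what one site's version of that trap could look like; the word list, rotation period, and hiding technique are just one possible approach, not anything Cloudflare actually offers:

    # Sketch: generate a rotating canary phrase and emit it as markup that
    # ordinary visitors never see but a scraper ingesting raw HTML would.
    import datetime
    import hashlib
    import secrets

    SITE_SECRET = secrets.token_hex(16)   # placeholder; keep this private

    def canary_phrase(period_days: int = 3) -> str:
        """Deterministic nonsense phrase that changes every few days."""
        period = datetime.date.today().toordinal() // period_days
        digest = hashlib.sha256(f"{SITE_SECRET}:{period}".encode()).hexdigest()
        words = ["umbral", "quince", "ferrous", "lattice", "gannet", "sable",
                 "tallow", "breve", "osprey", "calyx", "rivet", "smalt"]
        picks = [words[int(digest[i:i + 2], 16) % len(words)]
                 for i in range(0, 8, 2)]
        return " ".join(picks)

    def hidden_markup() -> str:
        # aria-hidden plus off-screen positioning keeps the phrase away from
        # human readers and assistive tech while leaving it in the HTML that
        # scrapers consume.
        return (f'<span aria-hidden="true" style="position:absolute;left:-9999px">'
                f'{canary_phrase()}</span>')

    print(hidden_markup())

Keeping a log of which phrase was live on which dates is what would later tie a model's output back to a specific scraping window.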
throwawayffffas
> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
After using Claude code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.
hobofan
I think there is a good chance of the VC/Startup side of the bubble bursting. However, I think we will never go back to a time without using LLMs, given that you can run useful open-weights models on a local laptop.
throwawayffffas
Yeah I totally agree, I also run phi4 mini locally and was thoroughly impressed. The genie is out of the bottle.
dingnuts
share your settings and system specs please, I haven't seen anything come out of a local LLM that was useful.
if you don't, since you're using a throwaway handle, I'll just assume you're paid to post. it is a little odd that you'd use a throwaway just to post LLM hype.
is that you, Sam?
sotix
That’s funny because after using Cursor with Claude for a month at work at the request of the CTO, I have found myself reverting to neovim and am more productive. I see the sliver of value but not for complex coding requirements.
mostlysimilar
What did you find it useful for in this afternoon?
flowerthoughts
Have there been efforts to set up something like the SMTP block lists for web scraping bots? This seems like something that needs to be crowdsourced. It'll be much easier to find patterns in larger piles of data. A single real user is unlikely to do as much, as quickly, as a bot.
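For reference, the SMTP mechanism this alludes to (a DNSBL) is just a DNS query for the reversed IP under a blocklist zone; a sketch, with a made-up zone name since no widely adopted scraper blocklist of this shape seems to exist yet:

    # Sketch: DNSBL-style reputation check. "scrapers.dnsbl.example" is a
    # made-up zone; any A record under it means "listed".
    import socket

    def is_listed(ip: str, zone: str = "scrapers.dnsbl.example") -> bool:
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)   # resolves -> listed
            return True
        except socket.gaierror:           # NXDOMAIN -> not listed
            return False

    print(is_listed("203.0.113.7"))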
johnea
Those are largely ineffective in the modern world.
Most botnets, as he mentions in the blog post, use each IP address only once. With thousands to tens of thousands of IP addresses in the botnet, there is just no way to block by IP address anymore.
pentaphobe
Is there currently any way to identify ethical (or I guess, semi-ethical or even not actively unethical..) AI companies?
Would be nice to see some effort on this front, a la "we don't scrape", or "we use this reputable dataset which scrapes while respecting robots.txt and reasonable caching", or heaven forbid "we train only on a dataset based on public domain content"
Even if it was an empty promise (or goodwill ultimately broken) it'd be _something_. If it exists, I'd certainly prioritise any products which proclaimed it
Do we need to make one?
pentaphobe
(Posting as separate comment because wall of text)
Also - honestly, I don't even understand why so many people would need to scrape for training data.
Is it naïveté (thinking it necessary), or arrogance (thinking it better than others, and thus justified)?
Aren't most advances now primarily focused on either higher level (agents, layered retrieval) or lower level (eg. alternatives to transformers, etc.. which would be easier to prove useful on existing datasets)?
Genuine questions, all of these - if I'm off the mark I'm keen to learn!
CGamesPlay
I feel like I recall someone recently built a simple proof of work CAPTCHA for their personal git server. Would something like that help here?
Alternatively, a technique like Privacy Pass might be useful here. Basically, give any user a handful of tokens based on reputation / proof of work, and then make each endpoint require a token to access. This gives you a single endpoint to rate limit, and doesn’t require user accounts (although you could allow known-polite user accounts a higher rate limit on minting tokens).
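A minimal sketch of the proof-of-work side of that (hash-preimage style, similar in spirit to what the anti-scraper projects do; the difficulty and token handling here are illustrative, not any particular project's scheme):

    # Sketch: the server issues a random challenge; the client must find a
    # nonce whose SHA-256 hash has a given number of leading zero bits
    # before the expensive page is served. Difficulty is illustrative.
    import hashlib
    import os

    DIFFICULTY_BITS = 18   # ~260k hashes on average; tune for real use

    def new_challenge() -> str:
        return os.urandom(16).hex()

    def verify(challenge: str, nonce: int) -> bool:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    def solve(challenge: str) -> int:
        nonce = 0
        while not verify(challenge, nonce):
            nonce += 1
        return nonce

    challenge = new_challenge()
    nonce = solve(challenge)            # done client-side, e.g. in a web worker
    assert verify(challenge, nonce)     # done server-side before honoring the request

Verification is a single hash for the server, while solving costs the client a measurable amount of CPU per token, which is exactly the asymmetry you want against bulk scrapers.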
gmemstr
You're likely thinking of Anubis from Techaro: https://github.com/TecharoHQ/anubis
CGamesPlay
Yes, thank you.
noosphr
Here we get to the original sin of packet networking.
The ARPANET was never meant to be commercial or private. All the protocols developed for it were meant to be subsidized by universities, the government or the military, with the names, phone numbers, and addresses of anyone sending a packet being public knowledge to anyone else in the network.
This made sense for the time since the IMPs used to send packets had less computing power than an addressable LED today.
Today the average 10 year old router has more computing power than was available in the whole world in 1970, but we've not made any push to move to protocols that incorporate price as a fundamental part of their design.
Worse is that I don't see any way that we can implement this. So we're left with screeds by people who want information to be free, but get upset when they find out that someone has to pay for information.
akoboldfrying
> This gives you a single endpoint to rate limit
Would you be rate-limiting by IP? Because the attacker is using (nearly) unique IPs for each request, so I don't see how that would help.
> someone recently built a simple proof of work CAPTCHA for their personal git server
As much as everyone hates CAPTCHAs nowadays, I think this could be helpful if it was IP-oblivious, random and the frequency was very low. E.g., once per 100 requests would be enough to hurt abusers making 10000 requests, but would cause minimal inconvenience to regular users.
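The nice property is that the challenge rate is per request rather than per IP, so the cost scales with the abuser's volume no matter how many addresses they rotate through. A tiny sketch of that arithmetic:

    # Sketch: IP-oblivious, probabilistic challenges. Each request gets a
    # challenge with probability 1/100, so someone making 10,000 requests
    # solves ~100 of them while a typical visitor rarely sees one.
    import random

    CHALLENGE_RATE = 1 / 100

    def needs_challenge() -> bool:
        return random.random() < CHALLENGE_RATE

    # Expected challenges for an abuser vs. a casual visitor:
    print(10_000 * CHALLENGE_RATE)   # 100.0
    print(30 * CHALLENGE_RATE)       # 0.3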
CGamesPlay
I didn’t mention IP at all in my post. You can globally rate limit anonymous requests for everything (except maybe your login endpoint), if that’s the thing that makes sense for you.
The nice thing about the proof-of-work approach is that it can be backgrounded for users with normal browsers, just like link prefetching.
h4kor
I had to take my personal Gitea instance offline, as it was regularly spammed, crashing Caddy in the process.
myaccountonhn
If you don't have public repos, you can use gitolite and have SSH-only access.
h4kor
I used plain git over SSH before and am back to using it now. I only hosted Gitea to make my code accessible to anyone interested.
The "main" projects I want to keep public are moved/cloned to GitHub; the rest I just don't bother with any longer.
petercooper
> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?
ransom_rs
At the top of the article:
> This blog post is expressing personal experiences and opinions and doesn’t reflect any official policies of SourceHut.
me2too
He's just right
ak38Hasg
Unsurprisingly, this submission is far removed from the front page.
The people stealing and ddos-ing literally do not care. Nobel prizes are handed out for this crap. Politicians support the steal.
Thanks Drew for being an honest voice.
iinnPP
Assuming you can prove it's a company, doesn't the behavior equate to fraud? Seems to hit all the prongs, but I am no lawyer.
bigolkevin
It doesn't matter if it's fraud. "AI" is now considered an arms race: if we require them to fairly acquire their content, we'll fall behind China or another country, and then America might lose the WWIII it is constantly preparing for.
You might see a couple of small players or especially egregious executives get a slap on the wrist for bad behavior but in this political climate there's no chance that Republicans or Democrats will put a stop to it.
I have a small static site. I haven't touched it in a couple of years.
Even then, I see bot after bot, pulling down about 1/2 GB per day.
Like, I distribute Python wheels from my site, with several release versions X several Python versions.
I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less why they pull the full contents when the headers show nothing has changed.
Well, I know the answer to the second question, as DeVault's title highlights - it's cheaper to re-read the data and re-process the content than to set up a local cache. Externalizing their costs onto me. (A sketch of the conditional-request handling they skip is at the end of this comment.)
I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.
Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.
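For completeness, the server side of the exchange the bots are ignoring is tiny: a static host compares the request's validators against the file and answers 304 with no body when nothing has changed. A sketch as a bare WSGI app with a placeholder document root (error handling omitted; real static servers already do this):

    # Sketch: answer conditional requests for static files with 304 when
    # the client's ETag / If-Modified-Since still match the file on disk.
    import hashlib
    import os
    from email.utils import formatdate, parsedate_to_datetime

    STATIC_ROOT = "/var/www/wheels"   # placeholder

    def app(environ, start_response):
        path = os.path.join(STATIC_ROOT, environ["PATH_INFO"].lstrip("/"))
        stat = os.stat(path)
        etag = '"%s"' % hashlib.sha1(
            f"{stat.st_mtime}:{stat.st_size}".encode()).hexdigest()
        last_modified = formatdate(stat.st_mtime, usegmt=True)

        inm = environ.get("HTTP_IF_NONE_MATCH")
        ims = environ.get("HTTP_IF_MODIFIED_SINCE")
        unchanged = (inm == etag) or bool(
            ims and parsedate_to_datetime(ims).timestamp() >= int(stat.st_mtime))

        headers = [("ETag", etag), ("Last-Modified", last_modified)]
        if unchanged:
            start_response("304 Not Modified", headers)
            return [b""]            # a few hundred bytes instead of a whole wheel
        with open(path, "rb") as f:
            body = f.read()
        start_response("200 OK", headers + [("Content-Length", str(len(body)))])
        return [body]

The headers are already being sent; the crawlers simply never send the matching If-None-Match or If-Modified-Since back.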