AI scrapers request commented scripts
107 comments · October 31, 2025
bakql
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
jraph
When I open an HTTP server to the public web, I expect and welcome GET requests in general.
However,
(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has always had options to throttle repeated requests to servers. In real life, there are many instances of things being offered for free, and it's usually not okay to take it all. Yes, that would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.
(2) there's a difference between (a) a regular user reading my website, or even copying and redistributing my content as long as the license of the work / fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)
(3) well-behaved robots are expected to respect robots.txt. This is not the law; it's about being respectful. It is only fair that badly behaved robots get called out.
Well-behaved robots do not usually use millions of residential IPs through shady apps to "perform a GET request to an open HTTP server".
Cervisia
> robots.txt. This is not the law
In Germany, it is the law. § 44b UrhG says (translated):
(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.
(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.
(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
grayhatter
If you're lying in the requests you send to trick my server into returning the content you want, instead of what I would want to return to web scrapers, that's non-consensual.
You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable and valid GET requests.
But I have configured my server to spend its resources the way I want. You don't like how my server works, so you configure your bot to lie. If you only get what you want because you're willing to lie, where's the implied consent?
Calavar
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low-stakes situations, but if package security is of the utmost importance to you, you should arrange for certified delivery or pick the package up at the delivery center. Likewise, if enforcing a rule of no scraping is of the utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
kelnos
> robots.txt is a polite request to please not scrape these pages
People who ignore polite requests are assholes, and we are well within our rights to complain about them.
I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
That's a shitty world that we shouldn't have to live in.
bigbuppo
Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.
jMyles
Sexual consent is sacred. This metaphor is in truly bad taste.
When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
grayhatter
> I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."
This feels like the kind of argument someone would make as to why they aren't required to return their shopping cart to the bay.
> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
Well, no. That's an overly simplistic description which fits your argument but doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, a hint that was never expected to be non-binding. But the important detail, the one that explains why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.
You can tell your description is completely inaccurate and non-representative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way to both follow robots.txt and announce that it does so.
> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.
It's nothing like that. It's more like a note that says "no soliciting", or "please knock quietly because the baby is sleeping".
> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.
Or, people could not be assholes? Yes, I get it, in the reality we live in there are assholes. But the problem as I see it is not just the assholes, but the people who act as apologists for this clearly deviant behavior.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy-to-access content for humans, but someone with more money and more compute resources than I have gets to overwhelm my server because they don't care... And that's my fault, right?
I guess that's a take...
There's a huge difference between suggesting mitigations for dealing with someone abusing resources, and excusing the abuse of resources, or implying that I should expect my server to be abused instead of being frustrated about the abuse.
watwut
If you ignore a polite request, then it is perfectly OK to give you as much false data as possible. You have shown yourself uninterested in good-faith cooperation, which means other people can and should treat you as a jerk.
hsbauauvhabzb
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
Calavar
If you are serving web pages, you are soliciting GET requests, kind of like ordering a package is soliciting a delivery.
"Taking" versus "giving" is neither here nor there for this discussion. The question is are you expressing a preference on etiquette versus a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape
stray
You require something the bot won't have that a human would.
Anybody may watch the demo screen of an arcade game for free, but you have to insert a quarter to play — and you can have even greater access with a key.
> and you’ve explicitly left a sign saying ‘you are not welcome here’
And the sign said, "Long-haired freaky people need not apply"
So I tucked my hair up under my hat and I went in to ask him why
He said, "You look like a fine upstandin' young man, I think you'll do"
So I took off my hat and said, "Imagine that, huh, me workin' for you"
davsti4
It's simple, and I'll quote myself: "robots.txt isn't the law".
nkrisc
Put your content behind authentication if you don’t want it to be requested by just anyone.
bakql
Stop your http server if you do not wish to receive http requests.
mxkopy
The metaphor doesn’t work. It’s not the security of the package that’s in question, but something like whether the delivery person is getting paid enough or whether you’re supporting them getting replaced by a robot. The issue is in the context, not the protocol.
whimsicalism
There's an evolving morality around the internet that is very, very different from the pseudo-libertarian rule of the jungle I was raised with. Interesting to see things change.
sethhochberg
The evolutionary force is really just "everyone else showed up at the party". The Internet has gone from a capital-I thing that was hard to access, to a little-i internet that was easier to access and well known but still largely distinct from the real world, to now... just the real world in virtual form. Internet morality mirrors real world morality.
For the most part, everybody is participating now, and that brings all of the challenges of any other space with everyone's competing interests colliding - but fewer established systems of governance.
hdgvhicv
Based on the comments here, the polite world of the internet where people obeyed unwritten best practices is certainly over, in favour of "grab what you can, might makes right".
sdenton4
The problem is that serving content costs money. LLM scraping is essentially DDoSing content meant for human consumption. DDoSing sucks.
2OEH8eoCRo0
Scraping is legal. DDoSing isn't.
We should start suing these bad actors. Why do techies forget that the legal system exists?
ColinWright
There is no way that you can sue the people responsible for DDoSing your system. Even if you can find them ... and you won't ... they're as likely as not outside your jurisdiction (they might be in Russia, or China, or Bolivia, or anywhere), and they will have a lot more money than you.
People here on HN are laughing at the UK's Online Safety Act for trying to impose restrictions on people in other countries, and yet now you're implying that similar restrictions can be placed on people in other countries over whom you have neither power nor control.
j2kun
You should not have to ask for permission, but you should have to honestly set your user-agent. (In my opinion, this should be the law and it should be enforced)
arccy
Yeah, all open HTTP servers are fair game for DDoS because, well, they're open, right?
munk-a
I think there's a massive shift in what the letter of the law needs to be to match the intent. The letter hasn't changed and this is all still quite legal - but there is a significant difference between what webscraping was doing to impact creative lives five years ago and today. It was always possible for artists to have their content stolen and for creative works to be reposted - but there were enough IP laws around image sharing (which AI disingenuously steps around), and other creative work wasn't monetarily efficient to scrape.
I think there is a really different intent between reading something someone created (which is often a form of marketing) and reproducing-but-modifying someone's creative output (which competes against the creator and starves them of income).
The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.
rokkamokka
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.
embedding-shape
Not "probably": searching through plaintext (which they seem to be doing) vs. iterating over the DOM involve vastly different amounts of work in terms of resources used and performance, so "probably" is way underselling the difference :)
franktankbank
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
embedding-shape
Yes: strip out HTML comments, which is also kind of trivial if you've ever done any sort of parsing. Watch for "<!--" and, whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't put too much pressure on them to produce high-quality software.
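For what it's worth, a minimal sketch of that comment-skipping step in Python (assuming a scraper that works on raw text rather than a DOM; the names here are just illustrative):

    import re

    # Drop everything between "<!--" and the next "-->" before hunting for
    # URLs, so commented-out links never make it into the crawl queue.
    COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

    def strip_html_comments(html: str) -> str:
        return COMMENT_RE.sub("", html)

    page = 'see <a href="/real">this</a> <!-- <a href="/hidden">not this</a> -->'
    print(strip_html_comments(page))  # the commented-out link is gone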
stevage
The title is confusing, should be "commented-out".
sharkjacobs
Fun to see practical applications of interesting research[1]
mikeiz404
Two thoughts here when it comes to poisoning unwanted LLM training-data traffic:
1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.
latenightcoding
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
rightbyte
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine, and is more robust against things moving around on the page.
chaps
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks, but it is what it is.
horseradish7k
But not when crawling. You don't know the page format in advance - you don't even know what the page contains!
bigbuppo
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
aDyslecticCrow
A scraper sinkhole of randomly generated, inter-linked files filled with AI poison could work. No human would click that link, so it leads straight to the "exclusive club".
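A toy sketch of what such a sinkhole could look like, assuming a Python server generates the pages on demand (all names and paths here are made up):

    import hashlib
    import random

    def trap_page(token: str, n_links: int = 5) -> str:
        """Generate a garbage page whose links only lead to more garbage pages."""
        rng = random.Random(token)  # deterministic per URL, so pages look stable
        words = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]
        filler = " ".join(rng.choice(words) for _ in range(80))
        links = []
        for i in range(n_links):
            child = hashlib.sha256(f"{token}/{i}".encode()).hexdigest()[:12]
            links.append(f'<a href="/trap/{child}">{child}</a>')
        return f"<html><body><p>{filler}</p>\n{' '.join(links)}</body></html>"

    # Example: the entry page of the sinkhole; a crawler following its links
    # never reaches anything real.
    print(trap_page("entry"))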
Noumenon72
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
mostlysimilar
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.
EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
ang_cire
They call the scrapers "malicious", so they are definitely complaining about them.
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any python or go request to be a malicious scraper...)
pseudalopex
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]
woodrowbarlow
the first few words of the article are:
> Last Sunday I discovered some abusive bot behaviour [...]
foobarbecue
Yeah, but the abusive behavior is ignoring robots.txt and scraping to train AI. Following commented-out URLs was not the crime, just evidence inadvertently left behind.
mostlysimilar
> The robots.txt for the site in question forbids all crawlers, so they were either failing to check the policies expressed in that file, or ignoring them if they had.
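(For reference, forbidding all crawlers takes just two lines of robots.txt:)

    User-agent: *
    Disallow: /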
michael1999
Crawlers ignoring robots.txt is abusive. That they then start scanning all documents for commented-out URLs just adds to the pile of scummy behaviour.
tveyben
Human behavior is interesting - me, me, me…
OhMeadhbh
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.
Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file with garbage text, but put the link to it inside a comment.
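A rough sketch of that idea in Python (the path and page are hypothetical, not from the article):

    # Serve a page whose trap link exists only inside an HTML comment. Browsers
    # never render or follow it, so any request for TRAP_PATH came from
    # something that text-scanned the raw HTML - a good candidate for a ban.
    TRAP_PATH = "/honeypot-3f9a1c"  # hypothetical, unguessable path

    PAGE = f"""<!doctype html>
    <html><body>
      <p>Real content for real visitors.</p>
      <!-- <a href="{TRAP_PATH}">garbage for comment-scanning bots</a> -->
    </body></html>"""

    def looks_like_comment_scanner(request_path: str) -> bool:
        return request_path == TRAP_PATH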
mikeiz404
It’s been some time since I have dealt with web scrapers, but it takes fewer resources to run a regex than to parse the DOM (which may have syntactically incorrect parts anyway). This can add up when running many scraping requests in parallel. So, depending on your goals, using a regex can be much preferred.
ericmcer
How would you recommend doing it? If I were just trying to pull <a/> tag links out, I feel like treating it as text and using a regex would be way more efficient than a full-on HTML parser like JSDom or something.
singron
You don't need JavaScript to parse HTML. Just use an HTML parser. They are very fast. HTML isn't a regular language, so you can't parse it with regular expressions.
Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
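As a quick sketch of the difference, Python's standard-library parser pulls out real links without any regex, and it never reports URLs that live inside comments (illustrative only, not what the thread's scrapers actually run):

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect href values from <a> tags; comments never reach handle_starttag."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    extractor = LinkExtractor()
    extractor.feed('<a href="/real">x</a> <!-- <a href="/hidden">y</a> -->')
    print(extractor.links)  # ['/real'] - the commented-out link is ignored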
zahlman
The point is: if you're trying to find all the URLs within the page source, it doesn't really matter to you what tags they're in, or how the document is structured, or even whether they're given as link targets or in the readable text or just what.
vaylian
The people who do this type of scraping to feed their AI are probably also using AI to write their scraper.
tuwtuwtuwtuw
Do you think that if some CS programs taught parsing, the authors of the bot would parse the HTML to properly extract links, instead of just doing a plain-text search?
I doubt it.
winddude
I wish I could downvote.
Most web scrapers, even if illegal, are for... business. So they scrape Amazon, or shops. So yeah, most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.
I know a thing or two about web scraping.
Some sites return 404 status codes as protection, so that you skip the site, so my crawler, as a hammer, tries several faster crawling methods (curlcffi).
Zip bombs don't work on me either. Reading the Content-Length header is enough to avoid reading the page/file. I set a byte limit to check that the response is not too big for me; for other cases, a read timeout is enough.
Oh, and did you know that the requests timeout is not really a timeout for reading the whole page? A server can spoon-feed you bytes, one after another, and there will be no timeout.
That is why I created my own crawling system to mitigate these problems and have one consistent means of running Selenium.
https://github.com/rumca-js/crawler-buddy
Based on the library
https://github.com/rumca-js/webtoolkit
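For anyone who hasn't hit this: a rough sketch of the byte-limit and overall-deadline handling described above, assuming the Python requests library (illustrative only, not code from crawler-buddy; the limits are made-up values):

    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024   # assumed cap on response size
    MAX_SECONDS = 30              # assumed cap on total download time

    def bounded_get(url: str) -> bytes:
        # requests' timeout only bounds the connect phase and the gap between
        # received chunks, not the whole download, so a server trickling one
        # byte at a time never trips it. Enforce a byte cap and an overall
        # deadline ourselves.
        deadline = time.monotonic() + MAX_SECONDS
        body = b""
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:
            length = resp.headers.get("Content-Length")
            if length and length.isdigit() and int(length) > MAX_BYTES:
                return b""  # skip oversized responses without reading them
            for chunk in resp.iter_content(chunk_size=8192):
                body += chunk
                if len(body) > MAX_BYTES or time.monotonic() > deadline:
                    break  # bail out on zip bombs and spoon-fed responses
        return body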