I fear for the unauthenticated web
119 comments
March 20, 2025 · cxr
btown
But also ironically, it's almost heartwarming these days to see blogspam that's not machine-generated! A real live human cared enough about an article to write a brief (perhaps only barely substantial, but at least handwritten) take on it!
It's reminiscent, perhaps, of the feel and motivation for Tumblr reblogs - and Tumblr continues to be vibrant by virtue of this culture: https://www.tumblr.com/engineering/189455858864/how-reblogs-... (2019)
Now, is driving attention and reputation to their site (in the broadest senses) part of a blogspammer/reblogger's motivation? Absolutely!
But should we be concerned about rewarding their act of curation, as long as there is at least some level of genuine curation intent? A world where that answer is categorically "no" would be antithetical, I think, to the concept of the participatory web.
dkkergoog
"Heartwarming ... to see blogspam." The internet was a mistake.
wongarsu
The internet was great; everything we've done with it over the last 20 years was the mistake, culminating in a comment observing that blogspam can now be one of the positive notes in the hellscape we are building.
A very useful hellscape though, for all its flaws
MisterTea
I don't feel this is blogspam; it's more of a quick comment on the situation pointing to the actual article. I don't see anything wrong with writing a short post boosting or commenting on another article. There are no ads, so I don't see this as blogspam, which I associate with financial gain or clout.
Cheer2171
All the time I see links on the HN front page to Twitter and Mastodon posts with just as little text to them. Why does it upset you when it is in the medium of blogs, but not microblogs?
SethMLarson
Hehe, just participating in POSSE :) Funnily enough the story you're linking to quotes me with pictures of a story I wrote (https://sethmlarson.dev/slop-security-reports) about LLM-generated reports to open source projects.
tempfile
It also linked to https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali..., which is another worthwhile read.
hugs
I might be naive, but I think it's time we seriously start implementing "HTTP status code 402: Payment Required" across the board.
"L402" is an interesting proposal: paying a fraction of a penny per request. https://github.com/l402-protocol/l402
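The full L402 scheme, as I understand it, builds on Lightning invoices and macaroons, but the basic shape of a 402 gate is easy to sketch. This minimal stand-in uses a made-up `X-Payment-Token` header (not the real protocol's headers) purely for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

VALID_TOKENS = {"demo-token"}  # stand-in for a real payment/receipt store

def access_status(token):
    """Return 200 if the token is a recognized payment receipt, else 402."""
    return 200 if token in VALID_TOKENS else 402

class PaywallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "X-Payment-Token" is a hypothetical header, not part of any standard
        status = access_status(self.headers.get("X-Payment-Token"))
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        body = b"paid content\n" if status == 200 else b"402 Payment Required\n"
        self.wfile.write(body)

# To run locally:
#   HTTPServer(("localhost", 8402), PaywallHandler).serve_forever()
```

A crawler that wants the content then has a clear, machine-readable path: pay, get a token, retry.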
rambambram
I stumbled upon this status code last year - had never heard of it before - and I bookmarked it and then forgot about it. Thanks for the reminder.
cwmma
This is basically what they are doing, but instead of charging actual money they make visitors burn CPU on a proof-of-work problem, which has the same outcome from the crawler's perspective.
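A toy version of that handshake, assuming a simple "leading zero hex digits of SHA-256" difficulty rule (real deployments like Anubis differ in the details):

```python
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce whose hash has `difficulty` leading zero hex digits."""
    nonce = 0
    while True:
        h = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if h.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one cheap hash to check the client did the work."""
    h = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return h.startswith("0" * difficulty)
```

The server hands out a random challenge per visitor and raises `difficulty` until scraping at scale becomes expensive, while a single human page load stays fast; verification costs the server one hash.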
woah
This has existed for decades. The proof of CPU work is called "frontend frameworks"
fewsats
I've talked with tons of publishers, and they all say the same thing:
"Hey, we'd happily give these companies clean data if they just paid us instead of building these scrapers."
I think there is a psychological aspect that kept micropayments from ever working for humans, but machines may be better suited to them.
SoftTalker
This is ultimately the answer. If something has value, users should pay for it. We haven't had a good way to do that on the web, so it has resulted in the complete shitshow that most websites are.
fewsats
There's a real economic problem here: when someone scrapes your site, you're literally paying for them to use your stuff. That's messed up (and not sustainable).
It seems like a good fit for micropayments. They never took off with people, but machines may be better suited for them.
L402 can help here.
fewsats
The other obvious solution is a "web of trust" where Cloudflare just tells you "this request goes in, this one goes out".
I think the paying approach is superior (after all, you make money from people using your service), but the Cloudflare one is more straightforward/simpler.
tqwhite
Aren't you paying for me to use the site, too? Or Google? Isn't the point of paying for a web hosting service to distribute information?
fewsats
Yes, but there is a "free lunch" problem: I can run a script that hits your page, costing you X, at a fraction of that cost to me (the user).
tqwhite
I think the whole internet is a free lunch problem as far as that goes. I pay for web hosting because I consider the cost to be worth it to send my fabulous opinions into the ether.
The premise of this thread is that somehow the LLM builders are reading too much. I bet it's less than Google.
I continue to believe, if you don't want everyone in the world to see and use your stuff, don't put it on the internet.
Aurornis
Rate limiting is the first step before cutting everything off behind forced logins.
> This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly
FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.
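Even for a small self-hosted site, the rate limiting mentioned above can be as simple as a token bucket per client. A minimal sketch (parameters illustrative):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A real deployment would keep one bucket per client IP (or session) and return HTTP 429 when `allow()` is false.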
snerbles
Cloudflare also locks out non-Chrome/Firefox browsers, stifling the development of alternatives.
blibble
I get the feeling that I'm going to read a blog post in a few years telling us that the CDN companies have been selling everything pulled through their cache to the AI companies since 2022
Aurornis
CDNs are a cash cow. They’re not going to set their reputation on fire and violate all of their security guarantees for negligible amounts of money.
AshamedCaptain
I know a lot of companies that not only willingly send their most precious trade secrets (TM) freely to shady LLM operators (like OpenAI, Microsoft, etc.), but even pay for the privilege of doing it ... just out of fear of "missing out" on this Next Big Thing.
blibble
cloudflare continues to make a loss
meanwhile: "I'm proud of how our team continued to deliver ground-breaking innovation, especially in AI" (Matthew Prince, co-founder & CEO of Cloudflare)
koakuma-chan
Cloudflare is free
littlestymaar
What reputation?! Cloudflare has been known for its shady practices for more than a decade now, but people just don't care.
mystified5016
See absolutely every other sector of industry and economy for copious counter-examples.
If there's profit on the table, capitalism will not allow it to sit there at any cost.
nottorp
And even if they don't, is everything depending on Cloudflare to stay online a good thing?
sshine
It’s a terrible thing.
Cloudflare is the company I hate the most: I think (what I know of) their tech is done right, but they're just too big for me to put my eggs in their basket.
koakuma-chan
Why is nobody building a better product?
zwnow
Until they threaten you to pay a huge bill or they will shutdown your services. No thanks. Cloudflare has extremely questionable business practices.
sshine
Cloudflare took down our website: https://news.ycombinator.com/item?id=40481808
A user running an online casino claimed that Cloudflare abruptly terminated their service after they refused to upgrade to a $10,000/month enterprise plan. The user alleged that Cloudflare failed to communicate the reasons clearly and deleted their account without warning.
Quote: "Cloudflare wanted them to use the BYOIP features of the enterprise plan, and did not want them on Cloudflare's IPs. The solution was to aggressively sell the Enterprise plan, and in a stunning failure of corporate communication, not tell the customer what the problem was at all."
——
Tell HN: Don't Use Cloudflare: https://news.ycombinator.com/item?id=31336515
Summary: A user shared their experience of being forced to upgrade to a $3,000/month plan after using 200-300TB of bandwidth on Cloudflare's business plan. They criticized Cloudflare's lack of transparency regarding bandwidth limits and aggressive sales tactics.
Quote: "A lot of this stuff wasn't communicated when we signed up for the business plan. There was no mention of limits, nor any contracts nor fineprint."
——
Tell HN: Impassable Cloudflare challenges are ruining my browsing experience: https://news.ycombinator.com/item?id=42577076
Summary: A user expressed frustration with Cloudflare's bot protection challenges, which made it difficult for them to unsubscribe from emails or access websites. They highlighted how these challenges disproportionately affect privacy-conscious users with non-standard browser configurations.
Quote: "The 'unsubscribe' button in Indeed's job notification emails leads me to an impassable Cloudflare challenge. That's a CAN-SPAM act violation."
dougb5
What exactly should be rate-limited, though? See the discussion here -- https://news.ycombinator.com/item?id=43422413 -- the traffic at issue in that case (and in one that I'm dealing with myself) is from a large number of IPs making no more than a single request each.
layer8
Centralizing large parts of the web behind Cloudflare is something to be feared as well.
harha_
Screw Cloudflare, I'd rather host my own proxies.
parliament32
Linked in the article that this article links to is a project I found interesting for combatting this problem, a (non-crypto) proof-of-work challenge for new visitors https://github.com/TecharoHQ/anubis
Looks like the GNOME Gitlab instance implements it: https://gitlab.gnome.org/GNOME
kh_hk
For targeted scrapes, isn't proof of work trivial to bypass?
1. headless browser 2. get cookie 3. use cookie on subsequent plain requests
parliament32
It doesn't sound like the scrapers are that smart yet, but when they get there, presumably you'd just lower the cookie lifetime until the requests are down to an acceptable level. It takes a split-second in my browser so it shouldn't interfere much for human visitors.
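The "lower the cookie lifetime" idea can be done statelessly by signing an expiry timestamp, so the server can reject stale proof-of-work cookies without keeping a session table. A sketch with illustrative names:

```python
import hashlib
import hmac
import time

SECRET = b"rotate-me"  # hypothetical server-side secret

def issue_cookie(ttl_seconds: int, now=None) -> str:
    """Issue a cookie encoding its own expiry, authenticated with HMAC."""
    expires = int((now if now is not None else time.time()) + ttl_seconds)
    sig = hmac.new(SECRET, str(expires).encode(), hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"

def cookie_valid(cookie: str, now=None) -> bool:
    """Check the signature, then the expiry; reject anything malformed."""
    try:
        expires_s, sig = cookie.split(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, expires_s.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return (now if now is not None else time.time()) < int(expires_s)
```

Dropping the TTL from, say, a day to a few minutes forces a scraper to redo the proof of work constantly, while a human browsing a handful of pages barely notices.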
hubraumhugo
We should try separating good bots from bad bots:
Good bots: search engine crawlers that help users find relevant information. These bots have been around since the early days of the internet and generally follow established best practices like robots.txt and rate limits. AI agents like OpenAI's Operator or Anthropic's Computer Use probably also fit into that bucket, as they offer useful automation without negative side effects.
Bad bots: bots that negatively affect website owners by causing higher costs, spam, or downtime (automated account creation, ad fraud, or DDoS). AI crawlers fit into that bucket as they disregard robots.txt and spoof user agents. They create a lot of headaches for developers responsible for maintaining heavily crawled sites. AI companies don't seem to care about any of the crawling best practices the industry has developed over the past two decades.
So the actual question is how good bots and humans can coexist on the web while we protect websites against abusive AI crawlers. It currently feels like an arms race without a winner.
jsheard
Discriminating search engine bots is pretty straightforward: the big names provide bulletproof methods to validate whether a client claiming to be their bot really is their bot. It'll be an uphill battle for new search engines if everyone only trusts Googlebot and Bingbot, though.
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/verifying-that-bingbot-...
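The method those docs describe boils down to a reverse-DNS lookup on the claimed crawler IP, a check that the hostname ends in an official domain, and a forward-confirming lookup to defeat spoofed rDNS. A sketch, with the lookups injectable so the logic can be exercised offline (the suffix list is illustrative, not exhaustive):

```python
import socket

# Domains the official docs associate with Googlebot/Bingbot (illustrative)
OFFICIAL_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_official_crawler(ip,
                        reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                        forward=lambda host: socket.gethostbyname(host)):
    """Verify a claimed search-engine crawler via reverse + forward DNS."""
    try:
        host = reverse(ip)
    except OSError:
        return False
    if not host.endswith(OFFICIAL_SUFFIXES):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP
        return forward(host) == ip
    except OSError:
        return False
```

Anything claiming `Googlebot` in its User-Agent but failing this check can be treated as a plain (and probably hostile) scraper.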
kmeisthax
> How long until scrapers start hammering Mastodon servers?
Mastodon has AUTHORIZED_FETCH and DISALLOW_UNAUTHENTICATED_API_ACCESS which would at least stop these very naive scrapers from getting any data. Smarter scrapers could actually pretend to speak enough ActivityPub to scrape servers, though.
jmclnx
I would think all you need to do is add a copyright statement of some kind.
It's sad that things are getting to this point. Maybe I should add this to my site :)
(c) Copyright (my email), if used for any form of LLM processing, you must contact me and pay 1000USD per word from my site for each use.
jcranmer
The argument the AI companies are making is that training for LLMs is fair use which means a copyright statement means fuck all from their point of view. (Even if it does, assuming you're in the US, unless you register the copyright with the US copyright office, you can only sue for actual damages, which means the cost of filing a lawsuit against them--not even litigating, just the court fee for saying "I have a lawsuit"--would be more expensive than anything you could recover. Even if you did register and sued for statutory damages, the cost of litigation would probably exceed the recovery you could expect.)
Of course, the big AI companies are already trying to get the government to codify AI training as fair use and sidestep the litigation which doesn't seem to be going entirely their way on this matter (cf. https://arstechnica.com/google/2025/03/google-agrees-with-op...).
tqwhite
Fair use requires transformation. LLM is as transformative as it gets. If I'm on the jury, you're going to have to make new copyright law for me to convict.
I am personally happy to have everyone, people and LLM alike, learn from my wisdom.
jcranmer
> Fair use requires transformation.
No, it doesn't. There are four factors for fair use, and whether the use is transformative is part of one of them. And you don't need to win on all four factors.
> LLM is as transformative as it gets.
The current ruling precedent for "transformative" is the Warhol decision, which effectively says that to look at whether or not something is transformative, you kind of have to start by analyzing its impact on the market (and if you're going "doesn't that import the fourth factor into the first?" the answer is "yes, I don't like it, but it's what SCOTUS said"). By that definition, LLMs are nowhere near "transformative."
Even pre-Warhol, their role as "transformative" is sketchy, because you have to remember that this is using its legal definition, not its colloquial definition.
> If I'm on the jury
Fortunately, for this kind of question, the jury isn't going to be involved in determining fair use, so it doesn't matter what you think.
tsumnia
In addition, we need to start paying attention to the growing legislation about AI and copyright law. There was an article on HN I think this week (or last) specifically where a judge ruled AI cannot own copyright on its generated materials.
IANAL, but I do wonder how this ruling will be used as a point of reference whenever we finally ask the question "Does material produced by GenAI violate copyright laws?" Specifically if it cannot claim ownership, a right that we've awarded to trees and monkeys, how does it operate within ownership laws?
And don't even get me ranting about HUMAN digital rights or Personified AIs.
Aurornis
Copyright is for topics like redistribution of the source material. You can’t add arbitrary terms to a copyright claim that go beyond what copyright law supports.
I think you’re confusing copyright with a EULA. You would need users to agree to the EULA terms before viewing the material. You can’t hide contractual obligations in the footer of your website and call it copyright.
101008
What if my index page says "These are the EULA terms; by clicking 'Next' or 'Enter', you accept them," and an LLM scraper "clicks" Next to fetch the rest of the content?
aaronbaugher
That's how the big software companies have been doing it to us for years, so it does seem like turnabout would be fair play.
jefftk
It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license. This is what the https://githubcopilotlitigation.com/ class action (from 2022) is about, and it's still making its way through the courts. This prediction market has it at 12% likely to succeed, suggesting that courts will not agree with you: https://manifold.markets/JeffKaufman/will-the-github-copilot...
jcranmer
> It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license.
I would say it's not reasonably likely that LLM training is fair use. Because I've read the most recent SCOTUS decision on fair use (Warhol), and enough other decisions on fair use, to understand that the primary (and nearly only, in practice) factor is the effect on the market for the original. And AI companies seem to be going out of their way to emphasize that LLM training is only going to destroy the market for the originals, which weighs against fair use. Not to mention the existence of deals licensing content for LLM training which... basically concedes the point.
Of the various options, a ruling that LLM training is fair use I find the least likely. More likely is either that LLM training is not fair use, that LLM training is not infringing in the first place, or that the plaintiffs can't prove that the LLM infringed their work.
tqwhite
I do not read it that way at all. The Goldsmith decision mainly turns on the idea that an artist's protections include those for derivative works. Warhol produced a work that does substantially the same thing as Goldsmith's, i.e., is a picture that can be viewed.
When talking about parody, they note that the usage as the foundation for parody is always substantially different from the original and thereby allowed, even if it would otherwise infringe. LLMs are always substantially different from the original, too.
If I want to write software that draws that picture exactly, the code would not be a copyright violation. It is text and cannot be printed in a magazine as a picture. If I used it to print a picture that was a derivative work and sold that, it might be.
A large language model has no intersection with the picture or, for that matter, anything that it absorbs. It is possible that someone might figure out how to prompt it to do exactly the same picture as Goldsmith did but fairly unlikely.
Unless you could show that this was easy, common and part of the intent of the LLM creator, I can see no possibility that it is infringing.
maeln
> This prediction market has it at 12% likely to succeed
Randos on the internet with a betting addiction are distinctly different from a court of law. I wish people would stop talking about prediction markets as if they mattered.
eudhxhdhsb32
Participants in prediction market do not need to be experts for their collective input to be informative.
There's a long history of economic research on the "wisdom of crowds" that backs up their value.
waveringana
why are we pretending that these gambling sites have any weight on anything
eudhxhdhsb32
What do you mean by weights?
I'd certainly trust their predictions more than those given by most "experts".
dingnuts
this isn't about copyright but about computer access. the CFAA is extremely broad; if you ban LLM companies from access on grounds of purpose you have every legal right to do so
in theory that legislation has teeth, too. they are not allowed to access your system if you say they are not; authentication is irrelevant.
every GET request to a system that doesn't permit access for training data is a felony
JohnFen
Such a notice is legally meaningless, though. Doubly so if the courts rule that scraping for AI purposes counts as fair use.
jasperr1
The reality is that a lot of these small websites have very permissive licenses. I really hope we don't get to the point where we must all make our licenses stricter.
krapp
The reality is that none of these LLM scrapers give a damn about copyright, because the entire AI industry is built on flagrant copyright violation, and the premise that they can be stopped by a magic string is laughable.
You could sue, if you can afford it, meanwhile all of your data is already training their models.
jasonjayr
A class action, funded by their rivals could hurt quite a bit, especially for sites damaged monetarily by these LLM scrapers.
jeffwask
Sure, because Meta certainly followed copyright law to the letter when they torrented thousands of copyrighted books from hundreds of published and known authors to train Llama. Forgive me if I doubt a text disclaimer on the page will slow them down.
dspillett
Unfortunately copyright is no limit to these companies.
Meta is stating in court that knowingly downloading pirated content is perfectly fine (ref https://news.ycombinator.com/item?id=43125840), so they for one would have absolutely no issue completely ignoring your copyright notice and stated licensing costs. Good luck affording a legal team to try to force them to pay attention.
Copyright is something for them to beat us with, not the other way around, apparently.
kerkeslager
This is pretty naive.
The only reason copyright is so strong in the US is that there are big players (Disney, Elsevier) who benefit from it. But big tech is much bigger, and LLMs have created a situation where big tech has a vested interest in eroding copyright law. Both sides are gearing up for a war in the court systems, and it's definitely not a given who will win. But if you try to enter the fray as an individual or small company, you definitely aren't going to win.
renegat0x0
To be honest, I feel that Web 2.0 is overrated.
Most content, including blogs, could be static sites.
For Mastodon and forums, I think user validation is fine and a good way to go.
0x1ceb00da
Do I need to be worried about my bill if I've rented a simple EC2 instance without any fancy autoscaling stuff?
simonw
Probably not. Keep an eye on bandwidth usage since you'll be charged for that but you would need to attract an incredible amount of bot traffic for that to add up to anything meaningful.
The thing to watch out for is platforms like Vercel or Google Cloud Run where you get charged more for compute if you attract crawlers, potentially unbounded (make sure to set up spending limits if you can.)
MontgomeryPy
Could an answer here be for smaller websites to convert their sites into chatbots which could prevent AI scrapers from slurping up all their content/drive up their hosting costs?
cwmma
no
Perversely, this submission is essentially blogspam. The article linked in the second paragraph, to which this "1 minute" read adds almost nothing of value, is the main story:
<https://thelibre.news/foss-infrastructure-is-under-attack-by...>
394 comments. 645 points. Submitted 3 hours ago: <https://news.ycombinator.com/item?id=43422413>