A thought on JavaScript "proof of work" anti-scraper systems

220 comments

·May 26, 2025

myself248

If the proof-of-work system is actually a crypto miner, such that visitors end up paying the site for the content they host, have we finally converged on a working implementation of the micropayments-for-websites concepts of decades ago?

diggan

> If the proof-of-work system is actually a crypto miner, such that visitors end up paying the site for the content they host

Unsure how that would work. If the proof you generate could be used for blockchain operations, so that the website operator could be paid by using that proof as generated by the website visitor, why shouldn't the visitor keep that proof to themselves and use it instead? Then they'd get the full amount, and the website operator gets nothing. So then there is no point for it, and the visitor might as well just run a miner locally :)

lurkshark

This system actually existed for awhile, it was called Coinhive. Each visitor would be treated like a node in a mining pool with “credit” for the resources going to the site owner. Somewhat predictably it became primarily used by hackers who would inject the code on high profile sites or use advertising networks.

https://krebsonsecurity.com/2018/03/who-and-what-is-coinhive...

xd1936

The domain is now owned by Troy Hunt!

https://www.troyhunt.com/i-now-own-the-coinhive-domain-heres...

viraptor

Have a look at how mining pools are implemented. The client only gets to change some part of the block and does the hashing from there. You can't go back from that to change the original data - you wouldn't get paid. Otherwise you could easily scam the mining pool and always keep the winning numbers to yourself while getting paid for the partials too.

mistercow

Just to make sure I understand this (and maybe clarify for others too), because my understanding of proof-of-work systems is very high level:

When you mine a block, you’re basically bundling up a bunch of meaningful data, and then trying to append some padding data that will e.g. result in a hash that has N leading 0 bits. One of the pieces of meaningful data in the block is “who gets the reward?”

If you’re mining alone, you would put data on that block that says “me” as who gets the reward. But if you’re mining for a pool, you get a block that already says “the pool” for who gets the reward.

So then I’m guessing the pool gives you a lesser work factor to hit, so some value smaller than N? You’ll be basically saying “Well, here’s a block that doesn’t have N leading zeroes, but does have M, leading zeroes”, and that proves how much you’re working for the pool, and entitles you to a proportion of the winnings.

If you changed the “who gets the reward?” from “the pool” to “me”, that would change the hash. So you can’t come in after the fact, say “Look at that! N leading zeroes! Let me just swap myself in to get the reward…” because that would result in an invalid block. And if you put yourself as the reward line in advance, the pool just won’t give you credit for your “partial” answers.

Is that about right?

odo1242

The company Coinhive used to do this before they shut down. Basically, in order to enter a website, you have to provide the website with a certain number of Monero hashes (usually around 1,024) that the website would send to Coinhive’s miner pool before letting the user through.

It kinda worked, except for the fact that hackers would try to “cryptojack” random websites by hacking them and inserting Coinhive’s miner into their pages. This caused everyone to block Coinhive’s servers. (Also you wouldn’t get very much money out of it - even the cryptojackers who managed to get tens of millions of page views out of hacked websites reported they only made ~$40 from the operation)

kbenson

If attackers only made ~$40 fora good amount of work, seems like it would have resolved itself if the scheme was left to run itself to conclusion before people started blocking coinhive in (what sounds like from your description) a knee-jerk reaction.

Then again, I'm sure there's quite a bit of tweaking that could be done to make clients submit far more hashes, but that would make it much more noticeable.

hoppp

That $40 now coud be in the thousands if they didn't spend it Xmr was cheaper back then.

SilasX

From my understanding: to pose the problem for miners, you hash the block you're planning to submit (which includes all the proposed transactions). Miners only get the hash. To claim the reward, you need the preimage (i.e. block to be submitted), which miners don't have.

In theory, you could watch the transactions being broadcast, and guess (+confirm) the corresponding block, but that would require you to see all the transactions the pool owner did, and put them in the same order (the possibilities of which scale exponentially with the number of transactions). There may be some other randomness you can insert into a block too -- someone else might know this.

Edit: oops, I forgot: the block also contains the address that the fees should be sent to. So even if you "stole" your solution and broadcast it with the block, the fee is still going to the pool owner. That's a bigger deal.

csense

> why shouldn't the visitor keep that proof to themselves and use it instead

Because they can't, if the website operator designs their JavaScript correctly. In detail:

If Alice goes to Bob's website, Bob tells Alice to find a hash with 20 leading zeros for a Bitcoin block that says "Send the newly printed bitcoins and transaction fees to Bob." (It will take Alice ~2^20 guesses to find such a block, so Bob picked the number 20 such that those ~2^20 guesses happen in a couple seconds for normal humans with normal web browsers on normal devices.)

Supposing the actual Bitcoin blockchain needs a hash with 50 leading zeros, one in every 2^30 Alice's will mine a valid block (worth ~$300k at current Bitcoin prices).

If Alice finds a block with 50 leading zeros and then tries to change the block to say "Send the newly printed bitcoins and transaction fees to Alice," her new block will have a different hash (that is very unlikely to have 2^50 leading zeros), and neither the website nor the blockchain will accept it.

Sure, Alice could change the block at the beginning before starting the search. But if she finds a block with 20 leading zeros that says "Send the newly printed bitcoins and transaction fees to Alice," Bob won't accept it for access to his website. The only way Alice gets anything for a block that sends the coins to herself is from the Bitcoin blockchain if she finds a 2^50 block -- at that point, Alice is just mining Bitcoins for herself and not interacting with Bob's website at all.

(If a 1 in a billion chance to win $300k is too risky for Bob's liking, he can get a lower payout with a higher probability by using a different proof-of-work blockchain and/or a mining pool.)

Retr0id

If the user mined it themselves and then paid the site owner before accessing the site, they'd have to pay a transaction fee and wait for a high-latency transaction to commit. The transaction fee could dwarf the actual payment value.

Mining on behalf of the site owner negates the need for a transaction entirely.

viraptor

(unnecessary)

null

[deleted]

moralestapia

Yeah, but then they wouldn't get your content? Duh.

odo1242

Not really, because it takes a LOT of hashes to actually get any crypto out of the system. Yes, you’re technically taking the user’s power and getting paid crypto, but unless you’re delaying the user for a long time, you’re only really being paid about a ten thousandth of a cent for a page visit.

Also virus scanners and corporate networks would hate you, because hackers are probably trying to embed whatever library you’re using into other unsuspecting sites.

jfengel

What does one actually get per page impression from Google Ads? I gather that it's more than a ten thousandth of a cent, but perhaps not all that much more.

20after4

It used to be a few pennies per 1000 page views. But it's been more than 10 years since I had any insight into the details of internet ad revenue. (Also it's highly dependent on all kinds of contextual details about the content, the keywords, and the specific ads you wind up serving)

bdcravens

There were some Javascript-based embedded miners in the early days of Bitcoin

https://web.archive.org/web/20110603143708/http://www.bitcoi...

msgodel

It would be nice if this could get standardized http headers so bots could still use sites but they effectively pay for use. That seems like the best of all possible worlds to me, the whole point of HTML is that robots can read it, otherwise we'd just be emailing eachother pdfs.

theamk

We already have a standardized system - robots.txt - and AI bots are already ignoring it. Why would more standardized headers matter? Bots will ignore them just as they do today, pretend to be regular users, and get content without paying.

(A secondary thing is that AI bots have basically zero benefit for most websites, so unless you are some low-cost crappy content farm, it'll be in your interest to raise the prices to the max so the bots are simply locked out. Which will bring us back to point 1, bots ignoring the headers)

msgodel

Being indexed in search engines has zero benefit?

Also robots.txt is a suggestion but hashcash is enforced server side. I agree it's a tragedy people have started to completely ignore it but you can't ignore server side behavior.

DrillShopper

They should have to set the evil bit

ramses0

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

...and the requisite checklist: https://trog.qgl.org/20081217/the-why-your-anti-spam-idea-wo...

msgodel

Formalizing it doesn't change that it's being used. If it doesn't work it shouldn't be done, if it does it should be formalized.

overfeed

> bots could still use sites but they effectively pay for use. That seems like the best of all possible worlds to me

This would make the entire internet a maze of AI-slop content primarily made for other bots to consume. Humans may have to resort to emailing handwritten PDFs to avoid the thoroughly enshittified web.

dandelany

As opposed to what we have now?

kmeisthax

The problem with micropayments was fourfold:

1. Banner ads made more money. This stopped being true a while ago, it's why newspapers all have annoying soft paywalls now.

2. People didn't have payment rails set up for e-commerce back then. Largely fixed now, at least for adults in the US.

3. Transactions have fixed processing costs that make anything <$1 too cheap to transact. Fixed with batching (e.g. buy $5 of credit and spend it over time).

4. Having to approve each micropurchase imposes a fixed mental transaction cost that outweighs the actual cost of the individual item. Difficult to solve ethically.

With the exception of, arguably[0], Patreon, all of these hurdles proved fatal to microtransactions as a means to sell web content. Games are an exception, but they solved the problem of mental transaction costs by drowning it in intensely unethical dark patterns protected by shittons of DRM[1]. You basically have to make someone press the spend button without thinking.

The way these proof-of-work systems are currently implemented, you're effectively taking away the buy button and just charging someone the moment they hit the page. This is ethically dubious, at least as ethically dubious as 'data caps[2]' in terms of how much affordance you give the user to manage their spending: none.

Furthermore, if we use a proof-of-work system that's shared with an actual cryptocurrency, so as to actually get payment from these hashes, then we have a new problem: ASICs. Cryptocurrencies have to be secured by a globally agreed-upon hash function, and changing that global consensus to a new hash function is very difficult. And those hashes have economic value. So it makes lots of sense to go build custom hardware just to crack hashes faster and claim more of the inflation schedule and on-chain fees.

If ASICs exist for a given hash function, then proof-of-work fails at both:

- Being an antispam system, since spammers will have better hardware than legitimate users[3]

- Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money

If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.

"Don't roll your own crypto" is usually good security advice, but in this case, we're not doing security, we're doing DRM. The same fundamental constants of computing that make stopping you from copying a movie off Netflix a fool's errand also make stopping scrapers theoretically impossible. The only reason why DRM works is because of the gap between theory and practice: technically unsophisticated actors can be stopped by theoretically dubious usages of cryptography. And boy howdy are LLM scrapers unsophisticated. But using the tried-and-true solutions means they don't have to be: they can just grab off-the-shelf solutions for cracking hashes and break whatever you use.

[0] At least until Apple cracked Patreon's kneecaps and made them drop support for any billing mode Apple's shitty commerce system couldn't handle.

[1] At the very least, you can't sell microtransaction items in games without criminalizing cheat devices that had previously been perfectly legal for offline use. Half the shit you sell in a cash shop is just what used to be a GameShark code.

[2] To be clear, the units in which Internet connections are sold should be kbps, not GB/mo. Every connection already has a bandwidth limit, so what ISPs are doing when they sell you a plan with a data cap is a bait and switch. Two caps means the lower cap is actually a link utilization cap, hidden behind a math problem.

[3] A similar problem has arisen in e-mail, where spammy domains have perfect DKIM/SPF, while good senders tend to not care about e-mail bureaucracy and thus look worse to antispam systems.

jaredwiener

Point 4 is often overlooked and I think the biggest issue.

Once there is ANY value exchanged, the user immediately wonders if it is worth it -- and if the payment/token/whatever is sent prior to the pageload, they have no way of knowing.

hakfoo

Point 4 is solvable by selling a broad subscription rather than individual articles.

Streaming proves this. When people spend $10 per month on Netflix/Hulu/Crunchyroll they don't have to further ask "do I want to pay 7.5 cents for another episode" every 22 minutes. The math for who's getting paid how much for how many streams is entirely outside the customer's consideration, and the range is broad enough that it discourages one-and-done binging.

For individual content providers, you might need to form some sort of federated project. Media properties could organize through existing networks as an obvious framework ("all AP newspapers for $x per month") but we'd probably need new federations for online-first and less news-centric publishers.

wahern

> Once there is ANY value exchanged

There's always value exchanged--"If you're not paying for the product, you are the product".[1] For ads we've established the fiction that everybody knowingly understands and accepts this quid pro quo. For proof of work we'd settle on a similar fiction, though perhaps browsers could add a little graphic showing CPU consumption.

[1] This is true even for personal blogs, albeit the monetary element is much more remote.

bee_rider

This is most true of books and other types of media (well, you can flip through a book at the store, but it isn’t a perfect thing…).

I dunno. Brands and other quality signals (imperfect as they tend to be, they still aren’t completely useless) could develop.

x-complexity

> Furthermore, if we use a proof-of-work system that's shared with an actual cryptocurrency, so as to actually get payment from these hashes, then we have a new problem: ASICs. Cryptocurrencies have to be secured by a globally agreed-upon hash function, and changing that global consensus to a new hash function is very difficult. And those hashes have economic value. So it makes lots of sense to go build custom hardware just to crack hashes faster and claim more of the inflation schedule and on-chain fees.

> If ASICs exist for a given hash function, then proof-of-work fails at both:

> - Being an antispam system, since spammers will have better hardware than legitimate users[3]

> - Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money

Monero/XMR & Zcash break this part of the argument, along with ASIC/GPU-resistant algorithms in general (Argon2 being most well-known & recommended as a KDF).

Creating an ASIC-resistant coin is not impossible, as shown by XMR. The difficult part comes from creating & sustaining the network surrounding the coin, and those two are amongst the few that have done both. Furthermore, there's little actual need to create another coin to do so when XMR fulfills that niche.

------

> If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.

Counterpoint: People (and devs) want a pre-packaged solution that solves the mentioned antispam problem. For almost everyone, Anubis does its job as intended.

https://github.com/TecharoHQ/anubis

DocTomoe

[flagged]

rcxdude

It's not actually a good idea, though. It's basically just banditry: the cost to the users it much more than the value to the benefactor, and there's not much they can do about it. (to be fair, the super invasive tracking ad systems that now exist have the same problem, but it's not obvious that they're worse).

shkkmo

The doxxing is questinable, but much less questionable than your presentation of events.

Coinhive earned 35% of everything mined on any site, not just the image board. They had no means stopping malicious installations from stealing from users. This provided hackers financial incenctive to compromise as many sites as possible and Coinhive's incentives were aligned with this. The choice on Monero as the base blockchain made it pretty clear what the intentions were.

> Lessons learned: Good ideas in paying for your content needs to pass the outrage culture test. And Krebs is ... not a honest news source.

Don't create tools clearly intended to facilitate criminal activity, make money off of it, and expect every to be OK with it.

ChocolateGod

I'm glad after spending all this time trying to increase power efficiency people have come up with JavaScript that serves no purpose other than to increase power draw.

I feel sorry for people with budget phones who now have to battle with these PoW systems and think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

jeroenhd

This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.

There are alternatives, of course. Several attempts at standardising remote attestation have been done. Apple has included remote attestation into Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.

In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.

shiomiru

> Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.

ndiddy

> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

The reason why Anubis was created was that the author's public Gitea instance was using a ton of compute because poorly written LLM scraper bots were scraping its web interface, making the server generate a ton of diffs, blames, etc. If the AI companies work around proof-of-work blocks by not constantly scraping the same pages over and over, or by detecting that a given site is a Git host and cloning the repo instead of scraping the web interface, I think that means proof-of-work has won. It provides an incentive for the AI companies to scrape more efficiently by raising their cost to load a given page.

cesarb

> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire.

AFAIK, Anubis does not work alone, it works together with traditional per-IP-address rate limiting; its cookies are bound to the requesting IP address. If the scraper uses a new IP address for each request, it cannot reuse the cookies; if it uses the same IP address to be able to reuse the cookies, it will be restricted by the rate limiting.

immibis

At some point it must become cheaper to pay the people running the site for a copy of the site than to scrape it.

ChocolateGod

> Scrapers, on the other hand, keep throwing out their session cookies

This isn't very difficult to change.

> but the way Anubis works, you will only get the PoW test once.

Not if it's on multiple sites, I see the weab girl picture (why?) so much it's embedded into my brain at this point.

viraptor

> (why?)

So you can pay the developers for the professional version where you can easily change the image. It's a great way of funding the work.

alpaca128

> I see the weab girl picture (why?)

As far as I know the creator of Anubis didn't anticipate such a widespread use and the anime girl image is the default. Some sites have personalized it, like sourcehut.

null

[deleted]

kokanee

Attestation is a compelling technical idea, but a terrible economic idea. It essentially creates an Internet that is only viewable via Google and Apple consumer products. Scamming and scraping would become more expensive, but wouldn't stop.

It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause. Proof of work is just another way to burn more coal on every web request, and the LLM oligarchs will happily burn more coal if it reduces competition from upstart LLMs.

Sam Altman's goal is to turn the Internet into an unmitigated LLM training network, and to get humans to stop using traditional browsing altogether, interacting solely via the LLM device Jony Ive is making for him.

Based on the current trajectory, I think he might get his way, if only because the web is so enshittified that we eventually won't have another way to reach mainstream media other than via LLMs.

jerf

"It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause."

Ah, but this isn't doing that. All this is doing is raising friction. Taking web pages from 0.00000001 cents to load to 0.001 at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.

All this really does is bring the costs into some sort of alignment. Right now it is too cheap to access web pages that may be expensive to generate. Maybe the page has a lot of nontrivial calculations to run. Maybe the server is just overwhelmed by the sheer size of the scraping swarm and the resulting asymmetry of a huge corporation on one side and a $5/month server on the other. A proof-of-work system doesn't change the server's costs much but now if you want to scrape the entire site you're going to have to pay. You may not have to pay the site owner, but you will have to pay.

If you want to prevent bots from accessing a page that it really wants to access, that's another problem. But, that really is a different problem. The problem this solves is people using small amounts of resources to wholesale scrape entire sites that take a lot of resources to provide, and if implemented at scale, would pretty much solve that problem.

It's not a perfect solution, but no such thing is on the table anyhow. "Raising friction" doesn't mean that bots can't get past it. But it will mean they're going to have to be much more selective about what they do. Even the biggest server farms need to think twice about suddenly dedicating hundreds of times more resources to just doing proof-of-work.

It's an interesting economic problem... the web's relationship to search engines has been fraying slowly but surely for decades now. Widespread deployment of this sort of technology is potentially a doom scenario for them, as well as AI. Is AI the harbinger of the scrapers extracting so much from the web that the web finally finds it economically efficient to strike back and try to normalize the relationship?

ChocolateGod

People are using LLMs because search results (due to SEO overload, Google's bad algorithm etc) are terrible, Anubis makes these already bad search results even worse by trying to block indexing, meaning people will want to use LLMs even more.

So the existence of Anubis will mean even more incentive for scraping.

account42

> This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Actually I will get it zero times because I refuse to enable javashit for sites that shouldn't need it and move on to something run by someone competent.

RHSeeger

> sites that shouldn't need it

There's lots of ways to define "shouldn't" in this case

- Shouldn't need it, but include it to track you

- Shouldn't need it, but include it to enhance the page

- Shouldn't need it, but include it to keep their costs down (for example, by loading parts of the page dynamically / per person and caching the rest of the page)

- Shouldn't need it, but include it because it help stop the bots that are costing them more than the site could reasonably expected to make

I get it, JS can be used in a bad way, and you don't like it. But the pillar of righteousness that you seem to envision yourself standing on it not as profound as you seem to think it is.

odo1242

Well, everything’s a tradeoff. I know a lot of small websites that had to shut down because LLM scraping was increasing their CPU and bandwidth load to the point where it was untenable to host the site.

jgalt212

> I feel sorry for people with budget phones who now have to battle with these PoW systems and think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

I dunno. How much work do you really need in PoW systems to make the scrapers go after easier targets? My guess is not so much that you impair a human's UX. And if you do, then you have not fine-tuned your PoW algo, or you have very determined adversaries / scrapers.

ChocolateGod

Any PoW that doesn't impact end users is not going to impact LLM scrapers.

MyPasswordSucks

"Any" is a pretty mighty word to throw around.

As has been stated multiple times in this thread and basically any thread involving conversation on the topic, a PoW with a negligible cost (either of time/money/pain-in-the-ass factor) will not impact end users, but will affect LLM scrapers due to the scales involved.

The problem is trying to create a PoW that actually fits that model, is economical to implement, and can't easily be gamed.

But saying "any" seems to imply that it's a theoretical impossibility ("any machine that moves will encounter friction and lose energy to heat conversion, ergo perpetual motion machines are impossible"), when in fact it's a theoretical possibility, just not yet a practical reality.

eric__cartman

My phone is a piece of junk from 8 years ago and I haven't noticed any degradation in browsing experience. A website takes like two extra seconds to load, not a big deal.

sznio

I'd really like this, since it wouldn't impact my scraping stuff.

I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.

diggan

Yeah, to me PoW makes a lot of sense in this way too. Captchas are hard for (some) people to solve, and very annoying to fill out, but easy for vision-enabled LLMs to solve (or even use 3rd party services where you pay for N/solves, available for every major captcha service). PoW instead are hard to deal with in a distributed/spammy design, but very easy for any user to just sit and wait a second or two. And all personal scraping tooling just keeps working, just slightly slower.

Sounds like an OK solution to a shitty problem that has a bunch of other shitty solutions.

benregenspan

At a media company, our web performance monitoring tool started flagging long-running clientside XHR requests, which I couldn't reproduce in a real browser. It turned out that an analytics vendor was injecting a script which checked if it looked like the client was a bot. If so, they would then essentially use the client as a worker to perform their own third-party API requests (for data like social share counts). So there's definitely some prior art for this kind of thing.

apitman

This is really interesting. One naive thought that immediately came to mind is that bots might be capable of making cross site requests. The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers. Not sure that fact will appreciably reduce their scaping abilities though.

dragonwriter

> The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers.

I thought bots using (headless) browsers was an existing workaround for a number of existing issues with simpler bots, so this doesn't seem to be a big change.

benregenspan

> The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers

I think this is almost already the case now. Services like Cloudflare do a pretty good job of classifying primitive bots and if site operators want to block all (or at least vast majority), they can. The only reliable way through is a real browser. (Which does put a floor on resource needs for scraping)

DaSHacka

Surprised there hasn't been a fork of Anubis that changes the artificial PoW into a simple Monero mining PoW yet.

Would be hilarious to trap scraper bots into endless labyrinths of LLM-generated mediawiki pages, getting them to mine hashes with each progressive article.

At least then we would be making money off these rude bots.

g-b-r

It would be a much bigger incentive to add them with little care for the innocents impacted.

Although admittedly millions of sites already ruined themselves with cloudflare without that incentive

albrewer

There a company awhile back that did almost exactly this called CoinHive

xnorswap

The bots could check if they've hit the jackpot themselves and keep the valid hashes for themselves and only return when they're worthless.

Then it's the bots who are making money from work they need to do for the captchas.

gus_massa

IIRC the mined block has an instruction like

fake quote > Please add the reward and fees to: 187e6128f96thep00laddr3s9827a4c629b8723d07809

And if you make a fake block that changes the address, then the fake block is not a good one.

This avoid the same problem with people stealing from pools, and also evil people listening to new mined blocks that pretend that they found it and send a fake one.

nssnsjsjsjs

1. The problem is the bot needs to understand the program it is running to do that. Akin to the halting problem.

2. There is no money in mining on the kinda hardware scrapers will run on. Power costs more that they'd earn.

immibis

Realistically, the bot owner could notice you're running MoneroAnubis and then would specifically check for MoneroAnubis, for example with a file hash, or a comment saying /* MoneroAnubis 1.0 copyright blah blah GPL license blah */. The bot wouldn't be expected to somehow determine this by itself automatically.

Also, the ideal Monero miner is a power-efficient CPU (so probably in-order). There are no Monero ASICs by design.

forty

We need an oblivious crypto currency mining algorithm ^^

hypeatei

> Then it's the bots who are making money from work they need to do for the captchas.

Wouldn't it be easier to mine crypto themselves at that point? Seems like a very roundabout way to go about mining crypto.

kmeisthax

This is a good idea for honeypotting scrapers, though as per [0] I hope nobody actually tries to use it on a real website anyone would want to use.

[0] https://news.ycombinator.com/item?id=44117591

avastel

Reposting a similar point I made recently about CAPTCHA and scalpers, but it’s even more relevant for scrapers.

PoW can help against basic scrapers or DDoS, but it won’t stop anyone serious. Last week I looked into a Binance CAPTCHA solver that didn’t use a browser at all, just a plain HTTP client. https://blog.castle.io/what-a-binance-captcha-solver-tells-u...

The attacker had fully reverse engineered the signal collection and solved-state flow, including obfuscated parts. They could forge all the expected telemetry.

This kind of setup is pretty standard in bot-heavy environments like ticketing or sneaker drops. Scrapers often do the same to cut costs. CAPTCHA and PoW mostly become signal collection protocols, if those signals aren’t tightly coupled to the actual runtime, they get spoofed.

And regarding PoW: if you try to make it slow enough to hurt bots, you also hurt users on low-end devices. Someone even ported PerimeterX’s PoW to CUDA to accelerate solving: https://github.com/re-jevi/PerimiterXCudaSolver/blob/main/po...

persnickety

> An LLM scraper is operating in a hostile environment [...] because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. [..] for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or [...] want to waste as much of your CPU as possible).

That's a valid reason to serve JS-based PoW systems scares LLM operators: there's a chance the code might actually be malicious.

That's not a valid reason to serve JS-based PoW systems to human users: the entire reason those proofs work against LLMs is the threat that the code is malicious.

In other words, PoW works against LLM scrapers not because of PoW, but because they could contain malicious code. Why would you threaten your users with that?

And if you can apply the threat only to LLMs, then why don't you cut the PoW garbage start with that instead?

I know, it's because it's not so easy. So instead of wielding the Damocles sword of malware, why not standardize on some PoW algorithm that people can honestly apply without the risks?

pjc50

I don't think this is "malicious" so much as it is "expensive" (in CPU cycles), which is already a problem for ad-heavy sites.

captainmuon

I don't know, Sandbox escape from a browser is a big deal, a million dollars bounty kind of deal. I feel safe to put an automated browser in a container or a VM and let it run with a timeout.

And if a site pulls something like that on me, then I just don't take their data. Joke is on them, soon if something is not visible to AI it will not 'exist', like it is now when you are delisted from Google.

berkes

> Why would you threaten your users with that?

Your users - we, browsing the web - are already threatened with this. Adding a PoW changes nothing here.

My browser already has several layers of protection in place. My browser even allows me to improve this protection with addons (ublock etc) and my OSes add even more protection to this. This is enough to allow PoW-thats-legit but block malicious code.

account42

Not safety-conscious users who disable javascript.

berkes

Those aren't threatened by PoW or malicious versions thereof either.

dannyw

This is a poor take. All the major LLM scrapers already run and execute JavaScript, Googlebot has been doing it for probably a decade.

Simple limits on runtime atop crypto mining from being too big of a problem.

jeroenhd

And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.

Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.

account42

Real users also have a limit where they will close the tab.

TZubiri

"Googlebot has been doing it for probably a decade."

This is why Google developed a browser, turns out scraping the web requires one to pretty much develop a V8 engine, so why not publish it as a browser .

motoxpro

This is so obvious when you say it, but what an awesome insight.

nssnsjsjsjs

Except it doesn't make sense. Why not just use Firefox. Or improve the JS engine of Firefox.

I reckon they made the browser to control the browser market.

rkangel

It's not quite that simple. I think that having that skillset and knowledge in house already probably led to it being feasible, but that's not why they did it. They created Chrome because it was in their best interests for rich web applications to run well.

rob_c

You don't work anywhere near the as industry then, people have been grumbling about this for the whole 10 years now

mschuster91

... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.

chrisco255

Was it really a success though in that regard? HTML5 was great and all, but it never did replace Flash. Websites mainly just became more static. I suspect the lack of mobile integration had more to do with Flash dying than HTML5 getting better. It's a shame in some sense, because Flash was a lot of fun.

maeln

But it is the whole point of the article ? Big scrapers can hardly tell if the JS that takes their runtimes is a crypto miner or an anti-scrapping system, and so they will have to give up "useful" scrapping, so PoW might just work.

rob_c

No they point is there's really advanced PoW challenges out there to prove you're not a bot (those websites that take >3s to fingerprint you are doing this!)

The idea is to abuse the abusers and if you suspect it's a bot change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.

Sounds useless, but you forget 0.5s of that on their farm x1e4 scraping nodes and you're into something.

The catch is not getting caught out by impacting the 0.1% of tor running anti ad "users" out there who will try and decompile your code when their personal chrome build fails to work. I say "users" because they will be visiting a non free site espousing their perceived right to be there, no different to a bot for someone paying the bills.

nitwit005

> Simple limits on runtime atop crypto mining from being too big of a problem.

If they put in a limit, you've won. You just make your site be above that limit, and the problem is gone.

dxuh

I always thought that JavaScript cryptomining is a better alternative to ads for monetizing websites (as long as people don't depend on those websites and website owners don't take it too far). I'd much rather give you a second of my CPU instead of space in my brain. Why is this so frowned upon? And in the same way I thought Anubis should just mine crypto instead of wasting power.

captainbland

I'd imagine it's pretty much impossible to make a crypto system which doesn't introduce unreasonable latency/battery drain on low-end mobile devices which is also sufficiently difficult for scrapers running on bleeding edge hardware.

If you decide that low end devices are a worthy sacrifice then you're creating e-waste. Not to mention the energy burden.

thedanbob

> Why is this so frowned upon?

Maybe because while ad tech these days is no less shady than crypto mining, the concept of ads is something people understand. Most people don't really understand crypto so it gets lumped in with "hackers" and "viruses".

Alternatively, for those who do understand ad tech and crypto, crypto mining still subjectively feels (to me at least) more like you're being stolen from than ads. Same with Anubis, wasting power on PoW "feels" more acceptable to me than mining crypto. One of those quirks of the human psyche I guess.

matheusmoreira

Running proof of work on user machines without their consent is theft of their computing and energy resources. Any site doing so for any purpose whatsoever is serving malware and should be treated as such.

Advertising is theft of attention which is extremely limited in supply. I'd even say it's mind rape. They forcibly insert their brands and trademarks into our minds without our consent. They deliberately ignore and circumvent any and all attempts to resist. It's all "justified" though, business interests excuse everything.

x-complexity

> Advertising is theft of attention which is extremely limited in supply. I'd even say it's mind rape. They forcibly insert their brands and trademarks into our minds without our consent. They deliberately ignore and circumvent any and all attempts to resist.

(1): Attention from any given person is fundamentally limited. Said attention has an inherent value.

(2): Running *any* website costs money, doubly so for video playback. This is not even mentioning the moderation & copyright mechanisms that a video sharing platform like YouTube has to have in order to keep copyright lawsuits away from YouTube itself.

(3): Products do not spawn in with their presence known to the general population. For the product to be successful, people have to know it exists in the first place.

Advertising is the consequence of wanting attention to be drawn to (3), and willing to pay for said attention on a given platform (1). (2)'s costs, alongside any payouts to videographers that garner attention to their videos, can be paid for with the money in (1), by placing ads around/before the video itself.

You're allowed to not have advertising shown to you, but in exchange, the money to pay for (2) & the people who made the video have to come from somewhere.

ge96

I think some sites that stream content (illegally) do this

thunderfork

[dead]

bob1029

I think this is not a battle that can be won in this way.

Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.

CGamesPlay

That is one view of the problem, but the one people are fixing with proof of work systems is the (unintentional) DDoS that LLM scrapers are operating against these sites. Just reducing the amount of traffic to manageable levels lets me get back to the work of doing whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open source repo, because he could have just cloned my git repo and gotten the same result.

bob1029

I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

spiffyk

> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

It is very real and the reason why Anubis has been created in the first place. It is not plain hostility towards LLMs, it is *first and foremost* a DDoS protection against their scrapers.

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

https://social.kernel.org/notice/AsgziNL6zgmdbta3lY

https://xeiaso.net/notes/2025/amazon-crawler/

xena

I've set up a few honeypot servers. Right now OpenAI alone accounts for 4 hours of compute for one of the honeypots in a span of 24 hours. It's not hypothetical.

2000UltraDeluxe

25k+ hits/minute here. And that's just the scrapers that doesn't just identify themselves as a browsers.

Not sure why you believe massive repeated scraping isn't a problem. It's not like there is just one single actor out there, and ignoring robits.txt seems to be the norm nowadays.

heinrich5991

> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

Yes, there are sites being DDoSed by scrapers for LLMs.

> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

This isn't about one request per week or per month. There were reports from many sites that they're being hit by scrapers that request from many different IP addresses, one request each.

lelanthran

> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

There are already a few dozens of thousands of scrapers right now trying to get even more training data.

It will only get worse. We all want more training data. I want more training data. You want more training data.

We all want the most up to date data there is. So, yeah, it will only get worse as time goes on.

null

[deleted]

fl0id

For the model it’s not. But I think many of these bots are also from tool usage or ‚research‘ or whatever they call it these days. And for that it doesnatter

berkes

I doubt any "anti-scraper" system will actually work.

But if one is found, it will pave the way for a very dangerous counter-attack: Browser vendors with need for data (i.e. Google) simply using the vast fleet of installed browsers to do this scraping for them. Chrome, Safari, Edge, sending the pages you visit to their data-centers.

reginald78

This feels like it already was half happening anyway so it isn't to big of a leap.

I also think this is the endgame of things like Recall in windows. Steal the training data right off your PC, no need to wait for the sucker to upload it to the web first.

lionkor

This is why we need https://ladybird.org/

forty

Can we have proof of work algorithm that compute something actually useful? Like finding large prime numbers or something like this that have distributed computation programs. This way all this power wasted is at least not completely lost.

xena

Anubis author here. I looked into protein folding. The problem is that protein folding requires scientific data, which can easily get into the gigabyte range. That is more data than I want to serve to clients. Unless there's a way to solve the big data problem, a lot of "compute for good" schemes are frankly unworkable.

mvid

Zero knowledge proofs are basically arbitrary proof of work models. There is some interesting work being done with MPC and ZK proving, so only a small part needs to live on the client. I wonder if this would make it feasible again

g-b-r

Unfortunately useful things usually require much more computation to find a useful result, they require distributing the search, and so can't reliably verify that you performed the work (most searches will not find anything, and you can just pretend to not have found anything without doing any work).

If a service had enough concurrent clients to reliably hit useful results quickly, you could verify that most did the work by checking if a hit was found, and either let everyone in or block everyone according to that; but then you're relying on the large majority being honest for your service to work at all, and some dishonest clients would still slip through.

HN

A thought on JavaScript "proof of work" anti-scraper systems

A thought on JavaScript "proof of work" anti-scraper systems