It seems like the AI crawlers learned how to solve the Anubis challenges
104 comments
August 15, 2025 · xena
logicprog
That sucks. Keep fighting the good fight, and I wish you all the best. We need people working on this problem (unfortunately).
xena
Thanks! I just wish I could afford to work on this full time, or at least even part time. It would help me a lot and prevent me from having to work what is effectively two full time jobs. Rent and food keep getting more expensive in Canada.
veqq
Good luck, I'm sorry for all of this speculation and people attacking your solution instead of suggesting concrete improvements to help fight the problem.
xena
Thanks. It means a lot. Today has not been a good day for me. It will be fixed. Things will get better, but this has to rank up there in terms of the worst ways to find out about security issues. It sucks lol.
grayhatter
I'll double down on what veqq said: those that can, do. Those who have no idea where to start complain on internet threads.
There will always be bots, they were here before anubis, they'll be there long after you block them again. Take care of yourself first. There's no need to make a bad day worse trying to sprint down a marathon.
rapnie
You are doing a tremendous job, and we are really thankful for the great work you've done. Personal matters come first though imho. Take care <3
ziml77
I saw you touching grass, so I hope that's at least helping you get through the day <3
dkiebd
Is it a security issue? I thought it was just the crawlers spending the energy in solving the challenges?
zahlman
I actually don't understand how Anubis is supposed to "make sure you're not a bot". It seems to be more of a rate limiter than anything else. It self-describes:
> Anubis sits in the background and weighs the risk of incoming requests. If it asks a client to complete a challenge, no user interaction is required.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums. Anubis has a customizable difficulty for this proof-of-work challenge, but defaults to 5 leading zeroes.
When I go to Codeberg or any other site using it, I'm never asked to perform any kind of in-browser task. It just has my browser run some JavaScript to do that calculation, or uses a signed JWT to let me have that process cached.
Why shouldn't an automated agent be able to deal with that just as easily, by just feeding that JavaScript to its own interpreter?
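(Concretely, the challenge boils down to a brute-force search like the sketch below. The challenge string, encoding, and "leading hex zeroes" acceptance rule here are simplifications of Anubis's actual wire format, but nothing in the work needs a human or even a real browser, which is exactly the question being asked.)

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Brute-force search for a nonce such that SHA-256(challenge + nonce),
// hex-encoded, starts with `difficulty` zeroes. Illustrative only; the
// real Anubis challenge format differs in detail.
func solve(challenge string, difficulty int) (int, string) {
	target := strings.Repeat("0", difficulty)
	for nonce := 0; ; nonce++ {
		sum := sha256.Sum256([]byte(challenge + fmt.Sprint(nonce)))
		digest := hex.EncodeToString(sum[:])
		if strings.HasPrefix(digest, target) {
			return nonce, digest
		}
	}
}

func main() {
	nonce, digest := solve("example-challenge", 5)
	fmt.Printf("nonce=%d digest=%s\n", nonce, digest)
}
```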
yabones
My understanding is that it just increases the "expense" of mass crawling just enough to put it out of reach. If it costs fractional pennies per page scrape with just a python or go bot, it costs nickels and dimes to run a headless chromium instance to do the same thing. The purpose is economical - make it too expensive to scrape the "open web". Whether it achieves that goal is another thing.
blibble
what do AI companies have more than everyone else? compute
anubis directly incentivises the adversary, at expense of everyone else
it's what you would deploy if you want to exclude everyone else
(conspiracy theorists note that the author worked for an AI firm)
solid_fuel
Anubis increases the minimum amount of compute required to request and crawl a page. How does that incentivize the adversary?
jerf
"what do AI companies have more than everyone else? compute"
"Everyone else" actually has staggering piles of compute, utterly dwarfing the cloud, utterly dwarfing all the AI companies, dwarfing everything. It's also generally "free" on the margin. That is, if your web page takes 10 seconds to load due to an Anubis challenge, in principle you can work out what it is costing me but in practice it's below my noise floor of life expenses, pretty much rolled in to the cost of the device and my time. Whereas the AI companies will notice every increase of the Anubis challenge strength as coming straight out of their bottom line.
This is still a solid and functional approach. It was always going to be an arms race, not a magic solution, but this approach at least slants the arms race in the direction the general public can win.
(Perhaps tipping it in the direction of something CPUs can do but not GPUs would help. Something like an scrypt-based challenge instead of a SHA-256 challenge. https://en.wikipedia.org/wiki/Scrypt Or some sort of problem where you need to explore a structure in parallel but the branches have to cross-talk all the time and the RAM is comfortably more than a single GPU processing element can address. Also I think that "just check once per session" is not going to make it but there are ways you can make a user generate a couple of tokens before clicking the next link so it looks like they only have to check once per page, unless they are clicking very quickly.)
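(A rough sketch of what an scrypt-based challenge could look like, using the golang.org/x/crypto/scrypt package. The parameters N=1<<15, r=8 (roughly 32 MiB per attempt) and the leading-zero-byte acceptance rule are illustrative guesses, not anything Anubis ships.)

```go
package main

import (
	"encoding/binary"
	"fmt"

	"golang.org/x/crypto/scrypt"
)

// Memory-hard variant: each attempt runs scrypt with N=1<<15, r=8, which
// needs roughly 32 MiB of RAM, so a GPU/ASIC can't cheaply run millions of
// attempts in parallel. A nonce is accepted when its derived key starts
// with `difficulty` zero bytes. All parameters here are illustrative.
func solveMemoryHard(challenge []byte, difficulty int) (uint64, error) {
	salt := make([]byte, 8)
	for nonce := uint64(0); ; nonce++ {
		binary.BigEndian.PutUint64(salt, nonce)
		key, err := scrypt.Key(challenge, salt, 1<<15, 8, 1, 32)
		if err != nil {
			return 0, err
		}
		ok := true
		for i := 0; i < difficulty; i++ {
			if key[i] != 0 {
				ok = false
				break
			}
		}
		if ok {
			return nonce, nil
		}
	}
}

func main() {
	nonce, err := solveMemoryHard([]byte("example-challenge"), 1)
	fmt.Println("nonce:", nonce, "err:", err)
}
```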
xboxnolifes
"Everyone else" (individually) isn't going to millions of webpages per day.
dathinab
it's indeed not a "bot/crawler protection"
it's a "I don't want my server to be _overrun_ by crawlers" protection which works by
- taking advantage that many crawlers are made very badly/cheaply
- increasing the cost of crawling
that's it: simple, but good enough to shake off the dumbest crawlers and to make it worth it for AI agents to e.g. cache site crawling, so that they don't crawl your site a 1000 times a day but instead just once
Spivak
You have it right. The problem Anubis is intended to solve isn't bots per se, the problem is that bot networks have figured out how to bypass rate limits by sending requests from newly minted, sometimes residential, ip addresses/ranges for each request. Anubis tries to help somewhat by making each (client, address) perform a proof-of-work. For normal users this should be an infrequent inconvenience but for those bot networks they have to do it every time. And if they solve the challenge and keep the cookie then the server "has them" so to speak and can apply ip rate limits normally.
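(A minimal sketch of that flow, assuming hypothetical verifyToken/serveChallenge placeholders rather than Anubis's real internals, with golang.org/x/time/rate providing the per-(token, address) limit.)

```go
package main

import (
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// One token bucket per (token, remote address) pair; 2 req/s with a burst
// of 10 is an arbitrary illustrative policy.
func limiterFor(key string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	if l, ok := limiters[key]; ok {
		return l
	}
	l := rate.NewLimiter(rate.Limit(2), 10)
	limiters[key] = l
	return l
}

// Placeholders: a real deployment would verify a signed JWT and serve the
// actual proof-of-work page here.
func verifyToken(v string) bool { return v != "" }
func serveChallenge(w http.ResponseWriter) {
	http.Error(w, "solve the proof-of-work challenge first", http.StatusUnauthorized)
}

func protect(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("pow-token")
		if err != nil || !verifyToken(c.Value) {
			serveChallenge(w) // no valid token yet: hand out the challenge
			return
		}
		if !limiterFor(c.Value+"|"+r.RemoteAddr).Allow() {
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	content := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	http.ListenAndServe(":8080", protect(content))
}
```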
homebrewer
I think the only requests it was able to block are plain http requests made over curl or Go's stdlib http client. I see enough of both in httpd logs. Now the cancer has adapted by using a fully featured headless web browser that can complete challenges just like any other client.
As other commenters say, it was completely predictable from the start.
joe_the_user
Near as I can guess, the idea is that the code is optimized for what browsers can do and gpus/servers/crawlers/etc can't do as easily (or relatively as easily; just taking up the whole server for a bit might be a big cost). Indeed it seems like only a matter of time before something like that would be broken.
yogorenapan
I've seen a lot of traffic from Huawei bypassing Anubis on some of the things I host as well. The funny thing is, I work for Huawei... Asking around, it seems most of it is coming from Huawei Cloud (like AWS) but their artifactory cache also shows a few other captcha bypassing libraries for Arkose/funcaptcha so they're definitely doing it themselves too.
Anonymous account for obvious reasons.
electroly
Presumably they just finally decided they were willing to spend ($) the CPU time to pass the Anubis check. That was always my understanding of Anubis--of course a bot can pass it, it's just going to cost them a bunch of CPU time (and therefore money) to do it.
zelphirkalt
I think so too. Maybe the compute cost needs to be upped some more. I am OK with waiting a bit longer when I access the site.
delusional
If I worked at a billion dollar firm, where doing this was actually a profitable endeavor, I'd reimplement the Anubis algorithm in optimized native code and run that. I wouldn't be surprised if you could lower the cost of generating the proof by a couple of orders of magnitude, enough to make it trivial. If you then batch it, or distribute it across your GPU farm, well now it's practically free.
zeropointsh
How about using on-chain proof-of-work? It flips the script.
If a bot wants access, let it earn it—and let that work be captured, not discarded. Each request becomes compensation to the site itself. The crawler feeds the very system it scrapes. Its computational effort directly funds the site owner's wallet, joining the pool to complete its proof.
The cost becomes the contract.
viraptor
The check has to apply to people and bot visitors the same. If you're expecting a blockchain registered spend before the content is visible, basically nobody will visit your website.
apetresc
I don’t think OP meant you pay directly, just that you volunteer to do some part of the PoW (of some chain designed for this purpose) on behalf of the site, to its credit.
That’s not much of a different ask from Anubis. It just commandeers the compute for some useful purpose.
xena
If you can tell me how to implement protein folding without having to have gigabytes of scientific data involved, I'll implement it today.
delusional
Making a proof of work algorithm do some actually useful work is very much an unsolved problem.
akoboldfrying
Which blockchain? An existing mainstream one like Bitcoin?
Because if so, I don't yet see how to "smooth out" the wins. If the crawler manages to solve the very-high-difficulty puzzle for you and get you 1BTC, great, but it will be a long time between wins.
If you're proposing a new (or non-mainstream) blockchain: What makes those coins valuable?
recursivecaveat
Isn't it the same problem as public mining pools? I remember I ran my little desktop in one for a day or 2 and got paid some micro-coins despite not personally winning a block. I'm not sure how they verify work and prevent cheating, but they appear to do so. I don't know if it scales down to a small enough size to be appropriate for 1 webpage though.
Retr0id
Last time I checked, Anubis used SHA256 for PoW. This is very GPU/ASIC friendly, so there's a big disparity between the amount of compute available in a legit browser vs a datacentre-scale scraping operation.
A more memory-hard "mining" algorithm could help.
jsnell
A different algorithm would not help.
Here's the basic problem: the fully loaded cost of a server CPU core is ~1 cent/hour. The most latency you can afford to inflict on real users is a couple of seconds. That means the cost of passing a challenge the way the users pass it, with a CPU running Javascript, is about 1/1000th of a cent. And then that single proof of work will let them scrape at a minimum hundreds, but more likely thousands, of pages.
So a millionth of a cent per page. How much engineering effort is worth spending on optimizing that? Basically none, certainly not enough to offload to GPUs or ASICs.
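(Spelling that estimate out with the assumed inputs above, which are stated assumptions rather than measurements:)

```go
package main

import "fmt"

func main() {
	const (
		dollarsPerCoreHour = 0.01 // assumed fully loaded server CPU cost
		secondsPerSolve    = 2.0  // assumed user-tolerable challenge latency
		pagesPerSolve      = 1000 // assumed pages scraped per solved challenge
	)
	perSolve := dollarsPerCoreHour / 3600 * secondsPerSolve
	fmt.Printf("cost per challenge: $%.7f\n", perSolve)                // ~ $0.0000056
	fmt.Printf("cost per page:      $%.10f\n", perSolve/pagesPerSolve) // ~ $0.0000000056
}
```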
Retr0id
No matter where the bar is there will always be scrapers willing to jump over it, but if you can raise the bar while holding the user-facing cost constant, that's a win.
jsnell
No, but what I'm saying is that these scrapers are already not using GPUs or ASICs. It just doesn't make any economical sense to do that in the first place. They are running the same Javascript code on the same commodity CPUs and the same Javascript engine as the real users. So switching to an ASIC-resistant algorithm will not raise the bar. It's just going to be another round of the security theater that proof of work was in the first place.
Havoc
Really feels like this needs some sort of unified possibly legal approach to get these fkers to behave.
The search era clearly proved it is possible to crawl respectfully - the AI crawlers have just decided not to. They need to be disincentivized from doing this.
dathinab
the problem in many cases is that even if such a law is made it likely
- is hard to enforce
- lacks teeth, i.e. it makes you more money to break it than any penalty costs
but in general, yes: a site that indicates it doesn't want to be crawled by AI bots but still gets crawled should be handled similarly to someone with a house ban from a shop forcing their way into the shop
given how severely messed up some millennium-era cyber security laws are, I wonder if crawlers bypassing Anubis could be interpreted as "circumventing digital access controls/protections" or similar, especially given that it's done to make copies of copyrighted material ;=)
jjangkke
I really don't get this type of hostility
If you put something in the public domain, people are going to access it unless you put it behind a paywall - but you don't want to do that, because it would limit access, or because people wouldn't pay for it to begin with (e.g. your blog that nobody wants to pay for).
There's no law against scraping, and we're already past the CFAA argument.
myaccountonhn
Look at it from a lens of harm rather than legality. The hostility comes from people having to pay thousands in bandwidth costs and having services degraded. These AI companies incur huge costs from their wasteful negligence. It's not reasonable.
bargainbin
It’s not quite as simple as “putting something in public domain”. The problem is the server costs to keep that thing in the public domain.
TJSomething
The problem here is that some websites, in this case independent open source code forges, can't be put behind paywalls and cannot maintain availability under the load of scrapers.
hollow-moe
Really looks like the last solution is a legal one, using the DMCA against them using the digital protection or access control circumvention clause or smth.
jjangkke
DMCA only applies to hosted content, and we've established that LLMs aren't hosting copyrighted content, as there is significant transformation, which you would otherwise need to prove yourself by training and replicating their entire model.
There is no legal recourse here. If you don't want AI crawlers accessing your content: 1) put it behind a paywall, or 2) remove it from public access.
varenc
Are AI crawlers equipped to get past reCAPTCHA or hCAPTCHA? This seems like exactly the thing these services were meant to stop.
black_puppydog
So the problem is a bunch of AI companies mining our web content for training data without asking and without regard for hosters' effort/bandwidth and the users' service quality.
And the proposed remedy is to give them human-labeled data directly in the form of captchas, even more severely degrading the user experience and thus website viability?
Color me unconvinced.
logicprog
I'm not anti-the-tech-behind-AI, but this behavior is just awful, and makes the world worse for everyone. I wish AI companies would instead, I don't know, fund common crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it, instead of having a bunch of different AI companies doing duplicated work and resulting in a swath of duplicated requests. Also, I don't understand why they have to make so many requests so often. Why wouldn't like one crawl of each site a day, at a reasonable rate, be enough? It's not like up to the minute info is actually important since LLM training cutoffs are always out of date anyway. I don't get it.
barbazoo
Greed. It's never enough money, never enough data, we must have everything all the time and instantly. It's also human nature it seems, looking at how we consume like there's no tomorrow.
logicprog
Which is why internalizing externalities is so important, but that's also extremely hard to do right (leads to a lot of "nerd harder" problems).
msgodel
It doesn't even make sense to crawl this way. It's just destructive for almost no benefit.
barbazoo
Maybe they assume there'll be only one winner and think, "what if this gives me an edge over the others". And money is no object. Imagine if they cared about "the web".
logicprog
That's what's annoying and confusing about it to me.
oortoo
The time to regulate tech was like 15 years ago, and we didn't. Why would any tech company expect to have to start following "rules" now?
logicprog
Yeah, I don't think we can regulate this problem away personally. Because whatever regulations will be made will either be technically impossible and nonsensical products of people who don't understand what they're regulating that will produce worse side effects (@simonw extracted a great quote from recent Doctorow post on this: https://simonwillison.net/2025/Aug/14/cory-doctorow/) or just increase regulatory capture and corporate-state bonds, or even facilitate corp interests, because the big corps are the ones with economic and lobbying power.
thewebguyd
> fund common crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it
That, or, they could just respect robots.txt and we could put enforcement penalties for not respecting the web service's request to not be crawled. Granted, we probably need a new standard but all these AI companies are just shitting all over the web, being disrespectful of site owners because who's going to stop them? We need laws.
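(For reference, the opt-out being described already has a syntax; the snippet below groups the crawler user-agent names the major vendors publish for their training crawlers. The thread's point stands: honoring it is entirely voluntary.)

```
# robots.txt: ask AI training crawlers to stay away (advisory only)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```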
logicprog
> That, or, they could just respect robots.txt
IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non-rivalrous resources that are literally already public.
> we could put enforcement penalties for not respecting the web service's request to not be crawled... We need laws.
How would that be enforceable? A central government agency watching network traffic? A means of appealing to a bureaucracy like the FCC? Setting it up so you can sue companies that do it? All of those seem like bad options to me.
thewebguyd
> IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non-rivalrous resources that are literally already public.
I disagree. Whether or not content should be available to be crawled is dependent on the content's license, and what the site owner specifies in robots.txt (or, in the case of user submitted content, whatever the site's ToS allows)
It should be wholly possible to publish a site intended for human consumption only.
> How would that be enforceable?
Making robots.txt or something else a legal standard instead of a voluntary one. Make it easy for site owners to report violations along with logs, legal action taken against the violators.
notatoad
laws are inherently national, which the internet is not. by all means write a law that crawlers need to obey robots.txt, but how are you going to make russia or china follow that law?
superkuh
This isn't AI. This is corporations doing things because they have a profit motive. The issue here is the non-human corporations and their complete lack of accountability even if someone brings legal charges against them. Their structure is designed to abstract away responsibility and they behave that way.
Same old problem. Corps are gonna corp.
logicprog
Yeah, that's why I said I'm not against AI as a technology, but against the behavior of the corporations currently building it. What I'm confused by (not really confused, I understand it's just negligence and not giving a fuck, but frustrated, and confused in a sort of helpless sense of not being able to get into the mindset) is that while there isn't a profit motive against doing this (obviously), there's also not clearly a profit motive to do it: it seems like they're wasting their own resources on unnecessarily frequent data collection, and it would also be cheaper to pool data collection efforts.
hyghjiyhu
Crazy thought, but what if you made the work required to access the site equal the work required to host the site? Host the public part of the database on something like webtorrent and render the website from the db locally. You want to run expensive queries? Suit yourself. Not easy, but maybe possible?
nine_k
Why not ask it to directly mine some bitcoin, or do some protein folding? Let's make proof-of-work challenges proof-of-useful-work challenges. The server could even directly serve status 402 with the challenge.
SkiFire13
Note that the work needs to produce a result that's quickly verifiable by the server.
jjangkke
Do you want people to mine bitcoin or do protein folding to read your blog or access your web application?
More importantly, do you want to compete with sites that don't add this bottleneck, and lose your traffic to them?
This is the paradox: the lengths you go to in protecting your content only increase costs for everybody who isn't an AI crawler.
nine_k
People, no! Robots which can't pass for people, yes.
xena
I just found out about this when it came to the front page of Hacker News. I really wish I was given advance notice. I haven't been able to put as much energy into Anubis as I've wanted because I've been incredibly overwhelmed by life and need to be able to afford to make this my full time job. Support contracts are being roadblocked, and I just wish I had the time and energy to focus on this without having to worry about being the single income for the household.