
You Don't Need Anubis


89 comments · November 2, 2025

notpushkin

My favourite thing about Anubis is that (in the default configuration) it bypasses the actual challenge altogether if your User-Agent is curl's.

E.g. if you open this in browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...

But if you run this, you get the page content straight away:

  curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better yet, put together something that’s less annoying for your visitors, like the OP did.

rezonant

By design, it only challenges user agents with Mozilla in their name, because clients that send any other User-Agent are already identifiable. If Anubis makes the bots change their user agents, it has done its job, as that traffic can now be addressed directly.
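To make that filtering rule concrete, here is a minimal sketch of the idea (an Express-style illustration, not Anubis's actual code; renderChallengePage and the cookie name are placeholders, and cookie parsing is assumed):

    // Sketch: only clients claiming to be browsers ("Mozilla/...") get challenged.
    // Anything else has already identified itself and can be rate limited directly.
    // Assumes Express with cookie-parser installed.
    function challengeBrowsersOnly(req, res, next) {
      const ua = req.headers["user-agent"] || "";
      if (!ua.includes("Mozilla")) {
        return next(); // curl, declared bots, etc. pass through to normal handling
      }
      if (req.cookies && req.cookies["challenge-passed"]) {
        return next(); // already solved the challenge this session
      }
      res.status(401).send(renderChallengePage()); // placeholder interstitial page
    }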

samlinnfer

This has basically been Wikipedia's bot policy for a long, long time: if you run a bot, you should identify it via the User-Agent.

https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...

hshdhdhehd

What if every request from the bot has a different UA?

skylurk

Success. The goal is to differentiate users and bots who are pretending to be users.

trenchpilgrim

Then you can tell the bots apart from legitimate users through normal WAF rules, because browsers froze the UA a while back.

hsbauauvhabzb

Can you explain what you mean by this? Why Mozilla specifically and not WebKit or similar?

gucci-on-fleek

Due to weird historical reasons [0] [1], every modern browser's User-Agent starts with "Mozilla/5.0", even if it has nothing to do with Firefox.

[0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...

[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

yellow_lead

Anubis should be something that doesn't inconvenience all the real humans that visit your site.

I work with ffmpeg so I have to access their bugtracker and mailing list site sometimes. Every few days, I'm hit with the Anubis block. And 1/3 - 1/5 of the time, it fails completely. The other times, it delays me by a few seconds. Over time, this has turned me sour on the Anubis project, which was initially something I supported.

opan

I only had issues with it on GNOME's bug tracker and could work around it with a UA change; meanwhile, Cloudflare challenges are often unpassable in qutebrowser no matter what I do.

throwaway290

I understand why ffmpeg does it. No one is expected to pay for it. Until this age of LLMs, when bot traffic became dominant on the web, the ffmpeg site was probably an acceptable expense. But they probably don't want to be an unpaid data provider for big LLM operators who get to extract a few bucks from their users.

It's like airplane check-in. Are we inconvenienced? Yes. Who is there to blame? Probably not the airline or the company that provides the services. Probably the people who want to fly without a ticket or bring in explosives.

As long as the Anubis project and the people on it don't try to play both sides and don't make the LLM situation worse (mafia-racket style), I think if it works, it works.

TJSomething

I know it's beside the point, but I think a chunk of the reason for many airport security measures is that creating the appearance of security increases people's willingness to fly.

mariusor

I don't understand the hate when people look at a countermeasure against unethical shit and complain about it instead of being upset at the unethical shit. And it's funny when it's the other way around, like cookie banners being blamed on GDPR not on the scumminess of some web operators.

elashri

I don't understand why some people don't realize that you can be upset about a status quo in which both sides of the equation suck. You can hate a thing and also hate the countermeasure someone deploys against it. These are not mutually exclusive.

mariusor

I didn't see the parent being upset about both sides on this one. I don't see it implied anywhere that they even considered it.

bakql

[flagged]

trenchpilgrim

Unfortunately in countries like Brazil and India, where a majority of humans collectively live, better computers are taxed at extremely high rates and are practically unaffordable.

bakql

[flagged]

uqers

> Unfortunately, the price LLM companies would have to pay to scrape every single Anubis deployment out there is approximately $0.00.

The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.

drum55

The "cost" of executing the JavaScript proof of work is fairly irrelevant, the whole concept just doesn't make sense with a pessimistic inspection. Anubis requires the users to do an irrelevant amount of sha256 hashes in slow javascript, where a scraper can do it much faster in native code; simply game over. It's the same reason we don't use hashcash for email, the amount of proof of work a user will tolerate is much lower than the amount a professional can apply. If this tool provides any benefit, it's due to it being obscure and non standard.

When reviewing it, I noticed that the author carried the common misunderstanding that "difficulty" in proof of work is simply the number of leading zero bytes in a hash, which limits the granularity to powers of 256. I realize that some of this is the cost of working in JavaScript, but the hottest code path seems to be written extremely inefficiently.

    for (; ;) {
        const hashBuffer = await calculateSHA256(data + nonce);
        const hashArray = new Uint8Array(hashBuffer);

        let isValid = true;
        for (let i = 0; i < requiredZeroBytes; i++) {
          if (hashArray[i] !== 0) {
            isValid = false;
            break;
          }
        }
It wouldn’t be an exaggeration to say that a native implementation of this, with even a hair of optimization, could reduce the “proof of work” to less time than the SSL handshake.
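For a sense of scale, here is a rough solver sketch (my own illustration, not Anubis or any scraper's code) using Node's crypto module, which calls into native hashing; the function name and difficulty value are assumptions:

    // Rough sketch of grinding a "leading zero bytes" puzzle with native hashing.
    const crypto = require("crypto");

    function solve(challenge, zeroBytes) {
      for (let nonce = 0; ; nonce++) {
        const hash = crypto.createHash("sha256").update(challenge + nonce).digest();
        // Accept the first nonce whose hash starts with the required zero bytes.
        if (hash.subarray(0, zeroBytes).every((b) => b === 0)) {
          return nonce;
        }
      }
    }

With two zero bytes of difficulty (roughly 65,000 attempts on average), this typically finishes in a small fraction of a second on commodity hardware.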

jsnell

That is not a productive way of thinking about it, because it will lead you to the conclusion that all you need is a smarter proof of work algorithm. One that's GPU-resistant, ASIC-resistant, and native code resistant. That's not the case.

Proof of work can't function as a counter-abuse challenge even if you assume that the attackers have no advantage over the legitimate users (e.g. both are running exactly the same JS implementation of the challenge). The economics just can't work. The core problem is that the attackers pay in CPU time, which is fungible and incredibly cheap, while the real users pay in user-observable latency which is hellishly expensive.

aniviacat

They do use SubtleCrypto digest [0] in secure contexts, which does the hashing natively.

Specifically for Firefox [1] they switch to the JavaScript fallback because that's actually faster [2] (probably because of overhead):

> One of the biggest sources of lag in Firefox has been eliminated: the use of WebCrypto. Now whenever Anubis detects the client is using Firefox (or Pale Moon), it will swap over to a pure-JS implementation of SHA-256 for speed.

[0] https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypt...

[1] https://github.com/TecharoHQ/anubis/blob/main/web/js/algorit...

[2] https://github.com/TecharoHQ/anubis/releases/tag/v1.22.0
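For reference, the native path is roughly this (a minimal browser-side sketch of the WebCrypto call, not the exact code in the repo linked above):

    // Minimal sketch: hash a string with the native WebCrypto API (secure contexts only).
    async function sha256(text) {
      const bytes = new TextEncoder().encode(text);
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      return new Uint8Array(digest);
    }
    // Usage: const hash = await sha256(challenge + nonce);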

tptacek

Right, but that's the point. It's not that the idea is bad. It's that PoW is the wrong fit for it. Internet-wide scrapers don't keep state? Ok, then force clients to do something that requires keeping state. You don't need to grind SHA2 puzzles to do that; you don't need to grind anything at all.
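One possible shape for such a stateful check, sketched purely as an illustration (an HMAC-signed token the client must store and replay; the secret, cookie format, and function names are assumptions, not anything Anubis or the comment specifies):

    // Sketch: the first response sets a signed, time-stamped token; later requests
    // must replay it. Browsers keep cookies for free; stateless crawlers have to
    // start tracking per-site state to get through.
    const crypto = require("crypto");
    const SECRET = process.env.CHALLENGE_SECRET; // assumed server-side secret

    function issueToken() {
      const payload = String(Date.now());
      const sig = crypto.createHmac("sha256", SECRET).update(payload).digest("hex");
      return `${payload}.${sig}`;
    }

    function verifyToken(token) {
      const [payload, sig] = (token || "").split(".");
      if (!payload || !sig) return false;
      const expected = crypto.createHmac("sha256", SECRET).update(payload).digest("hex");
      if (sig.length !== expected.length) return false;
      return crypto.timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
    }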

echelon

This whole thing is pointless.

OpenAI Atlas defeats all of this by being the user's web browser. It sits between you and the user you're trying to serve content to, and it slurps up everything the user browses to send back for training.

The firewall is now moot.

The bigger AI company, Google, has already been doing this for decades. They were the middlemen between your reader and you, and that position is unassailable. Without them, you don't have readers.

At this point, the only people you're keeping out with LLM firewalls are the smaller players, which further entrenches the leaders.

OpenAI and Google want you to block everybody else.

happyopossum

> Google, has already been doing this for decades

Do you have any proof, or even circumstantial evidence, that points to this being the case?

If Chrome actually scraped every site you ever visited and sent it off to Google, it'd be trivially simple to find some indication of that in network traffic, or heck, even in the Chromium code.

Dylan16807

Is it confirmed that site loads go into the training database?

But for anyone whose main concern is their server staying up, Atlas isn't a problem. It's not doing a million extra loads.

valicord

The point is that the scrapers can easily bypass this if they cared to do so

uqers

How so?

tecoholic

Hmm… by setting the verified=1 cookie on every request to the website?

Am I missing something here? All this does is set an unencrypted cookie and reload the page, right?
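If the check really is just an unencrypted cookie as described, the bypass would be a one-liner along these lines (example URL; verified=1 is the cookie named above):

  curl -b 'verified=1' https://example.com/protected-page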

null

[deleted]

agnishom

Exactly. I don't understand what computation you can afford to do in 10 seconds on a small number of cores that bots running in large data centers cannot.

juliangmp

The point of Anubis isn't to make scraping impossible, but to make it more expensive.

gbuk2013

The Caddy config in the parent article uses status code 418. This is cute, but wouldn’t this break search engine indexing? Why not use 307 code?

paweladamczuk

The internet in its current form, where I can theoretically ping any web server on earth from my bedroom, doesn't seem sustainable. I think it will have to end at some point.

I can't fully articulate it but I feel like there is some game theory aspect of the current design that's just not compatible with the reality.

weinzierl

"Unfortunately, Cloudflare is pretty much the only reliable way to protect against bots."

With footnote:

"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."

That's the crux. Cloudflare is the default; no one seems to bother taking the risk with a competitor for some reason. Competitors seem to exist, but when asked, people can't even name them.

(For what it's worth I've been using AWS Cloudfront but I had to think a moment to remember its name.)

praptak

There are reasons to choose the slightly annoying solution on purpose, though. I'm thinking of a political statement along the lines of "We have a problem with asshole AI companies, and here's how they make everyone's life slightly worse."

geokon

Big picture, why does everyone scrape the web?

Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?

utopiah

My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.

Jackson__

Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.

E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.

jchw

I was briefly messing around with Pangolin, which is supposed to be a self-hosted Cloudflare Tunnels sort of thing. Pretty cool.

One thing I noticed, though, was that the DigitalOcean Marketplace image asks you if you want to install something called CrowdSec, which is described as a "multiplayer firewall". While it is a paid service, it appears there is a community offering that is well liked. I was really wondering what downsides it has (except for the obvious one, which is that you are definitely trading some user privacy in service of security), but at least in principle the idea seems like a nice middle ground between Cloudflare and nothing, if it works and the business model holds up.

bootsmann

Not sure CrowdSec is fit for this purpose. It's more a fail2ban replacement than a DDoS challenge.

jchw

One of the main ways that Cloudflare is able to avoid presenting CAPTCHAs to a lot of people while still filtering tons of non-human traffic is exactly that, though: just having a boatload of data across the Internet.

tptacek

This came up before (and this post links to the Tavis Ormandy post that kicked up the last firestorm about Anubis) and without myself shading the intent or the execution on Anubis, just from a CS perspective, I want to say again that the PoW thing Anubis uses doesn't make sense.

Work functions make sense in password hashes because they exploit an asymmetry: attackers will guess millions of invalid passwords for every validated guess, so the attacker bears most (really almost all) of the cost.

Work functions make sense in antispam systems for the same reason: spam "attacks" rely on the cost of an attempt being so low that it's efficient to target millions of victims in the expectation of just one hit.

Work functions make sense in Bitcoin because they function as a synchronization mechanism. There's nothing actually valorous about solving a SHA2 puzzle, but the puzzles give the whole protocol a clock.

Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.

None of this is to say that a serious anti-scraping firewall can't be built! I'm fond of pointing to how Youtube addressed this very similar problem, with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.

The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
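As a toy illustration of that probing idea (not YouTube's actual scheme, and nothing Anubis ships): gather runtime artifacts in the browser and fold them into a fingerprint the server can check against the claimed user agent.

    // Toy sketch: collect runtime artifacts that naive headless setups often fake
    // inconsistently, then hash them into a fingerprint for the server to inspect.
    async function fingerprint() {
      const artifacts = [
        navigator.userAgent,
        navigator.language,
        String(navigator.webdriver), // true in many automation setups
        `${screen.width}x${screen.height}`,
        Intl.DateTimeFormat().resolvedOptions().timeZone,
      ].join("|");
      const digest = await crypto.subtle.digest(
        "SHA-256",
        new TextEncoder().encode(artifacts)
      );
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }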

Gander5739

But youtube can still be scraped with yt-dlp, so apparently it wasn't enough.

mariusor

With all due respect, almost all I see in this thread is people looking down their noses at a proven solution and giving advice instead of doing the work. I can see that you are a _very important person_ with bills to pay and money to make, but at least have the humility to understand that the solution we got is better than the solution that could exist if only someone else would think of it and build it.

gucci-on-fleek

> Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.

Agreed, residential proxies are far more expensive than compute, yet the bots seem to have no problem obtaining millions of residential IPs. So I'm not really sure why Anubis works—my best guess is that the bots have some sort of time limit for each page, and they haven't bothered to increase it for pages that use Anubis.

> with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.

> The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.

They did [0], but it doesn't work [1]. Of course, the Anubis implementation is much simpler than YouTube's, but (1) Anubis doesn't have dozens of employees who can test hundreds of browser/OS/version combinations to make sure that it doesn't inadvertently block human users, and (2) it's much trickier to design an open-source program that resists reverse-engineering than a closed-source program, and I wouldn't want to use Anubis if it went closed-source.

[0]: https://anubis.techaro.lol/docs/admin/configuration/challeng...

[1]: https://github.com/TecharoHQ/anubis/issues/1121

tptacek

Google's content-protection system didn't simply make sure you could run client-side Javascript. It implemented an obfuscating virtual machine that, if I'm remembering right (I may be getting some of the details blurred with Blu-ray's BD+ scheme), built up a hash input of runtime artifacts. As I understand it, it was one person's work, not the work of a big team. The "source code" we're talking about here is clientside Javascript.

Either way: what Anubis does now --- just from a CS perspective, that's all --- doesn't make sense.

indrora

The problem is that increasingly, they are running JS.

In the ongoing arms race, we're likely to see simple things like this sort of check result in a handful of detection systems that look for "set a cookie" or at least "open the page in headless chrome and measure the cookies."
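That kind of JS-executing scraper is cheap to write. A minimal Puppeteer sketch (illustrative only, not any particular crawler) that loads the page, lets the challenge script run, and keeps the resulting cookies for reuse:

    // Sketch: load the page in headless Chrome, let any challenge JS run, then
    // keep the rendered HTML and the cookies (including any anti-bot token).
    const puppeteer = require("puppeteer");

    async function scrape(url) {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle0" });
      const html = await page.content();
      const cookies = await page.cookies();
      await browser.close();
      return { html, cookies };
    }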

moebrowne

> increasingly, they are running JS.

Does anyone have any proof of this?

utopiah

> increasingly, they are running JS.

I mean, they have access to a mind-blowing amount of computing resources, so if using a fraction of that improves the quality of the data (because they have this fundamental belief, convenient for their situation, that scale is everything), why not use JS too? Heck, if they have to run a full browser in a container, not even headless, they will.

typpilol

Chrome even released a DevTools MCP that gives any LLM full tool access to do anything in the browser.

Navigate, take screenshots, etc. It has like 30 tools in it alone.

Now we can just run real browsers with LLMs attached. Idk how you even think about defeating that.