Anubis Works

159 comments · April 12, 2025

raggi

It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

UNESCO surprised me some. The sub-site in question is pretty big, with thousands of documents of content, but the content is static, so it should be trivial to serve. What's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.

Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

adrian17

> but of course doing so requires some expertise, and expertise still isn't cheap

Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: to even be aware that it exists) is surely even lower than that of people with WordPress administration experience, so I'm still surprised.

If I were to guess, the reasons for not touching WordPress were unrelated, like not wanting to touch a brittle instance, or organizational permissions, or maybe the admins just assumed that WP was already configured well.

mrweasel

> I am kind of surprised how many sites seem to want/need this.

The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.

jtbayly

My site that I’d like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don’t respect the robots file, so they just constantly request URLs, getting the posts over and over in different numbers and combinations. It’s a pain.

cedws

PoW anti-bot/scraping/DDoS protection was already being done a decade ago; I'm not sure why it's only catching on now. I even recall a project that tried to make the PoW useful.

xena

Xe here. If I had to guess in two words: timing and luck. As the G-man said: the right man in the wrong place can make all the difference in the world. I was the right shitposter in the right place at the right time.

And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

underdeserver

Squeeze that lemon as far as it'll go, mate. Godspeed, and may the good luck continue.

gyomu

If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works

This is pretty cool, I have a project or two that might benefit from it.
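
The core trick is small enough to sketch. Here's a rough Go version of the concept (my own illustration, not Anubis's actual code; the challenge string and the "difficulty = leading zero hex digits" reading are assumptions): keep hashing challenge + nonce until the digest has enough leading zeroes, then hand the winning nonce back to the server.

    // Rough sketch of a hashcash-style proof of work: find a nonce such that
    // sha256(challenge + nonce) starts with `difficulty` leading zero hex digits.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strconv"
        "strings"
    )

    func solve(challenge string, difficulty int) (nonce int, digest string) {
        prefix := strings.Repeat("0", difficulty)
        for {
            sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
            digest = hex.EncodeToString(sum[:])
            if strings.HasPrefix(digest, prefix) {
                return nonce, digest
            }
            nonce++
        }
    }

    func main() {
        nonce, digest := solve("example-challenge", 4)
        fmt.Println(nonce, digest)
    }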

x3haloed

I’ve been wondering to myself for many years now whether the web is for humans or machines. I personally can’t think of a good reason to specifically try to gate bots when it comes to serving content. Trying to post content or trigger actions could obviously be problematic under many circumstances.

But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?

xboxnolifes

> As long as a given client is not abusing your systems, then why do you care if the client is a human?

Well, that's the rub. The bots are abusing the systems. The bots are accessing the content at rates thousands of times faster and more often than humans. The bots also have access patterns unlike your expected human audience (downloading gigabytes or terabytes of data multiple times, over and over).

And these bots aren't some being with rights. They're tools unleashed by humans. It's humans abusing the systems. These are anti-abuse measures.

immibis

Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

Actual ISPs and hosting providers take abuse reports extremely seriously, mostly because they're terrified of getting kicked off by their own ISP. And there's no end to that: there's just a chain of ISPs from them to you, and you might end up convincing your ISP or some intermediary to block traffic from them. However, as we've seen recently, rules don't apply if enough money is involved. But I'm not sure if these shitty interim solutions come from ISPs ignoring abuse when money is involved, or from people not knowing that abuse reporting is taken seriously to begin with.

Anyone know if it's legal to return a never-ending stream of /dev/urandom based on the user-agent?

bbor

Well, that's the meta-rub: if they're abusing, block abuse. Rate limits are far simpler, anyway!

In the interest of bringing the AI bickering to HN: I think one could accurately characterize "block bots just in case they choose to request too much data" as discrimination! Robots of course don't have any rights so it's not wrong, but it certainly might be unwise.
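
To be concrete about "simpler": a per-IP token bucket really is just a few lines. A sketch using golang.org/x/time/rate (the 5 req/s limit and the lack of map eviction are purely illustrative). The catch, as others note downthread, is that these scrapers rotate through residential IPs, which is exactly what a per-IP limit can't see.

    // Sketch of a per-IP token-bucket rate limiter as HTTP middleware.
    // Limits and (missing) eviction policy are illustrative, not a recommendation.
    package main

    import (
        "net"
        "net/http"
        "sync"

        "golang.org/x/time/rate"
    )

    var (
        mu       sync.Mutex
        limiters = map[string]*rate.Limiter{}
    )

    func limiterFor(ip string) *rate.Limiter {
        mu.Lock()
        defer mu.Unlock()
        l, ok := limiters[ip]
        if !ok {
            l = rate.NewLimiter(rate.Limit(5), 20) // ~5 req/s, burst of 20
            limiters[ip] = l
        }
        return l
    }

    func rateLimit(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ip, _, _ := net.SplitHostPort(r.RemoteAddr)
            if !limiterFor(ip).Allow() {
                http.Error(w, "too many requests", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        http.ListenAndServe(":8080", rateLimit(http.FileServer(http.Dir("."))))
    }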

praptak

The good thing about proof of work is that it doesn't specifically gate bots.

It may have some other downsides: for example, I don't think Google is possible in a world where everyone requires proof of work (some may argue that's a good thing). But it doesn't specifically gate bots; it gates mass scraping.

t-writescode

> I personally can’t think of a good reason to specifically try to gate bots

There have been numerous posts on HN about people getting slammed by bots, especially LLM scrapers, to the tune of many, many dollars and terabytes of data, burning bandwidth and increasing server-running costs.

ronsor

I'm genuinely skeptical that those are all real LLM scrapers. For one, a lot of content is in CommonCrawl and AI companies don't want to redo all that work when they can get some WARC files from AWS.

I'm largely suspecting that these are mostly other bots pretending to be LLM scrapers. Does anyone even check if the bots' IP ranges belong to the AI companies?

gbear605

The issue is not whether it’s a human or a bot. The issue is whether you’re sending thousands of requests per second for hours, effectively DDOSing the site, or if you’re behaving like a normal user.

laserbeam

The reason is: bots DO spam you repeatedly and increase your network costs. Humans don’t abuse the same way.

starkrights

Example problem that I've seen posted about a few times on HN: LLM scrapers (or at least, an explosion of new scrapers) mindlessly crawling every single HTTP endpoint of a hosted git service instead of just cloning the repo, while entirely ignoring robots.txt.

The point of this is that there has recently been a massive explosion in the number of bots that blatantly, aggressively, and maliciously ignore and attempt to bypass (mass IP/VPN switching, user-agent swapping, etc.) anti-abuse gates.

mieses

There is hope for misguided humans.

gnabgib

Related: Anubis: Proof-of-work proxy to prevent AI crawlers (100 points, 23 days ago, 58 comments) https://news.ycombinator.com/item?id=43427679

namanyayg

"It also uses time as an input, which is known to both the server and requestor due to the nature of linear timelines"

A funny line from his docs

xena

OMG lol I forgot that I left that in. Hilarious. I think I'm gonna keep it.

didgeoridoo

I didn’t even blink at this, my inner monologue just did a little “well, naturally” in a Redditor voice and kept reading.

mkl

BTW Xe, https://xeiaso.net/pronouns is 404 since sometime last year, but it is still linked to from some places like https://xeiaso.net/blog/xe-2021-08-07/ (I saw "his" above and went looking).

xena

I'm considering making it come back, but it's just gotten me too much abuse so I'm probably gonna leave it 404-ing until society is better.

AnonC

Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute! (I’ve always found all the art and the characters in Xe’s blog very beautiful)

Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]

> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

[1]: https://github.com/TecharoHQ/anubis/

snvzz

>Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute!

Love them too, and abhor knowing that someone is bound to eventually remove them because they're found to be "problematic" in one way or another.

mentalgear

Seems like a great idea, but it'd be nice if the project had a simple description (and didn't use so much anime, as it gives an unprofessional impression).

This is what it actually does: instead of only letting the provider bear the cost of content hosting (traffic, storage), the client also bears a cost when accessing, in the form of computation. Basically it runs an additional expensive computation on the client, which makes accessing thousands of your webpages at a high rate expensive for crawlers.

> Anubis uses a proof of work in order to validate that clients are genuine. The reason Anubis does this was inspired by Hashcash, a suggestion from the early 2000's about extending the email protocol to avoid spam. The idea is that genuine people sending emails will have to do a small math problem that is expensive to compute, but easy to verify such as hashing a string with a given number of leading zeroes. This will have basically no impact on individuals sending a few emails a week, but the company churning out industrial quantities of advertising will be required to do prohibitively expensive computation.
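
The asymmetry is the whole point: the client burns roughly 16^difficulty hashes on average, while the server verifies with a single hash. A minimal sketch of the verify side (again my own Go illustration; the field layout is not Anubis's actual wire format):

    // Cheap verification side of the challenge: one SHA-256 per submission,
    // versus roughly 16^difficulty hashes to find a valid nonce.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strings"
    )

    func verify(challenge, nonce string, difficulty int) bool {
        sum := sha256.Sum256([]byte(challenge + nonce))
        digest := hex.EncodeToString(sum[:])
        return strings.HasPrefix(digest, strings.Repeat("0", difficulty))
    }

    func main() {
        // A real server would also bind the challenge to the client and expire it.
        fmt.Println(verify("example-challenge", "123456", 4))
    }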

roenxi

I like the idea but this should probably be something that is pulled down into the protocol level once the nature of the challenge gets sussed out. It'll ultimately be better for accessibility if the PoW challenge is closer to being part of TCP than implemented in JavaScript individually by each website.

pona-a

There's Cloudflare PrivacyPass that became an IETF standard [0], but it's rather weird, and the reference implementation is a bug nest.

[0] https://datatracker.ietf.org/wg/privacypass/about/

prologic

I've read about Anubis, cool project! Unfortunately, as pointed out in the comments, it requires your site's visitors to have Javascript™ enabled. This is totally fine for sites that require Javascript™ anyway to enhance the user experience, but not so great for static sites and such that require no JS at all.

I built my own solution that blocks these "Bad Bots" at the network level. I block several large "Big Tech / Big LLM" networks entirely at the ASN (BGP) level by utilizing MaxMind's database and a custom WAF and reverse proxy I put together.
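
The gist is just an ASN lookup in front of the proxy. A stripped-down sketch of the idea (using the github.com/oschwald/geoip2-golang reader against a GeoLite2-ASN database; the denylist and paths here are placeholders, and my actual setup is Caddy-based, so this standalone Go version only shows the lookup-and-deny step):

    // Sketch of ASN-level blocking with MaxMind's GeoLite2-ASN database.
    package main

    import (
        "net"
        "net/http"

        "github.com/oschwald/geoip2-golang"
    )

    // Example entries only; a real denylist would be loaded from bad_asns.txt.
    var badASNs = map[uint]bool{4134: true, 4837: true, 32934: true}

    func blockBadASNs(db *geoip2.Reader, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            host, _, _ := net.SplitHostPort(r.RemoteAddr)
            rec, err := db.ASN(net.ParseIP(host))
            if err == nil && badASNs[rec.AutonomousSystemNumber] {
                http.Error(w, "forbidden", http.StatusForbidden)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        db, err := geoip2.Open("GeoLite2-ASN.mmdb")
        if err != nil {
            panic(err)
        }
        defer db.Close()
        http.ListenAndServe(":8080", blockBadASNs(db, http.FileServer(http.Dir("."))))
    }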

xyzzy_plugh

A significant portion of the bot traffic TFA is designed to handle originates from consumer/residential space. Sure, there are ASN games being played alongside reputation fraud, but it's very hard to combat. A cursory investigation of our logs showed these bots (which make ~1 request from a given residential IP) are likely in ranges that our real human users occupy as well.

Simply put, you risk blocking legitimate traffic. This solution does as well, but for most humans the actual risk is much lower.

As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.

It is an incredibly vocal minority who disapprove of requiring JavaScript, the majority of whom, upon encountering a site for which JavaScript is required, simply enable it. I'd speculate that, even then, only a handful ever release a defeated sigh.

prologic

This is true. I had some bad actors from the Comcast network at one point, and unfortunately also valid human users of some of my "things", so I opted not to block the Comcast ASN at that point.

prologic

I would be interested to hear of any other solutions that guarantee to either identify or block non-human traffic. In the "small web" and self-hosting, we typically don't really want crawlers and other similar software hitting our services, because often the software is buggy (example: the runaway Claude bot), or we don't want our sites indexed by them in the first place.

xyzzy_plugh

Exactly. We've all been down this rabbit hole, collectively, and that's why Anubis has taken off. It works shockingly well.

Cyphase

For anyone wondering, Oracle holds the trademark for "JavaScript": https://javascript.tm/

prologic

Which arguably they should let go of

runxiyu

Do you have a link to your own solution?

prologic

Not yet unfortunately. But if you're interested, please reach out! I currently run it in a 3-region GeoDNS setup with my self-hosted infra.

jadbox

How do you know it's an LLM and not a VPN? How do you use MaxMind's database to isolate LLMs?

prologic

I don't distinguish actually. There are two things I do normally:

- Block Bad Bots. There's a simple text file called `bad_bots.txt`
- Block Bad ASNs. There's a simple text file called `bad_asns.txt`

There's also another for blocking IP(s) and IP ranges called `bad_ips.txt`, but it's often more effective to block a much larger range of IPs (at the ASN level).

To give you a concrete idea, here are some examples:

    $ cat etc/caddy/waf/bad_asns.txt
    # CHINANET-BACKBONE No.31,Jin-rong Street, CN
    # Why: DDoS
    4134

    # CHINA169-BACKBONE CHINA UNICOM China169 Backbone, CN
    # Why: DDoS
    4837

    # CHINAMOBILE-CN China Mobile Communications Group Co., Ltd., CN
    # Why: DDoS
    9808

    # FACEBOOK, US
    # Why: Bad Bots
    32934

    # Alibaba, CN
    # Why: Bad Bots
    45102

    # Why: Bad Bots
    28573

deknos

I wish there was also tunnel software (client+server) where:

* the server appears on the outside as an HTTPS server/reverse proxy
* the server supports self-signed certificates or Let's Encrypt
* when a client goes to a certain (sub)site or route, HTTP auth can be used
* after HTTP auth, all traffic tunneled over that subsite/route is protected against traffic analysis, for example like obfsproxy does it

Does anyone know something like that? I am tempted to ask xeiaso to add such features, but I do not think his tool is meant for that...

rollcat

Your requirements are quite specific, and HTTP servers are built to be generic and flexible. You can probably put something together with nginx and some Lua, aka OpenResty: <https://openresty.org/>

> his

I believe it's their.

deknos

Oops, yes, sorry, their.

immibis

Tor's Webtunnel?

deknos

But I do not want to go OVER Tor, I just want a service over clearnet? Or is this something else? Do you have a URL?

tripdout

The bot detection takes 5 whole seconds to solve on my phone, wow.

bogwog

I'm using Fennec (a Firefox fork on F-Droid) and a Pixel 9 Pro XL, and it takes around ~8 seconds at difficulty 4.

Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.
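
Back of the envelope, if difficulty 4 means four leading zero hex digits: expected work is about 16^4 ≈ 65,536 SHA-256 hashes. At a few tens of thousands of hashes per second in a mobile browser that's several seconds; at hundreds of thousands per second on a desktop it's well under a second, which roughly matches the spread people are reporting in this thread. (The hash rates are guesses; the 16^difficulty expectation is just how leading-zero PoW works.)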

Hakkin

Much better than infinite Cloudflare captcha loops.

gruez

I've never had that, even with something like Tor Browser. You must be doing something extra suspicious, like using a user-agent spoofer.

praisewhitey

Firefox with Enhanced Tracking Protection turned on is enough to trigger it.

megous

Proper response here is "fuck cloudflare", instead of blaming the user.

xena

Apparently user-agent switchers don't work for fetch() requests, which means that Anubis can't work for people who do that. I know of someone who set up a version of Brave from 2022 with a user agent saying it's Chrome 150 and then complained about it not working for them.

oynqr

Lucky. Took 30s for me.

nicce

For me it is like 0.5s. Interesting.

pabs3

It works to block users who have JavaScript disabled, that is for sure.

udev4096

Exactly, it's a really poor attempt to make it appealing to the larger audience. Unless they roll out a version for no-JS, they are the same as the "AI" scrapers in enshittifying the web.

udev4096

PoW captchas are not new. What's different with Anubis? How can it possibly prevent "AI" scrapers if the bots have enough compute to solve the PoW challenge? AI companies have quite a lot of GPUs at their disposal, and I wouldn't be surprised if they used them to get around PoW captchas.

relistan

The point is to make it expensive to crawl your site. Anyone determined to do so is not blocked. But why would they be determined to do so for some random site? The value to the AI crawler likely does not match the cost to crawl it. It will just move on to another site.

So the point is not to be faster than the bear. It’s to be faster than your fellow campers.
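
Rough numbers: at difficulty 4 a page costs on the order of 16^4 ≈ 65k hashes, plus a JS-capable browser environment per client. That's unnoticeable for one human visit, but it adds up fast for a crawler re-fetching millions of URLs (think the faceted-search explosion mentioned upthread) across every site that deploys this. The per-page figures are estimates, but that asymmetry is the economics being relied on.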

genewitch

Why not have them hash PoW for BTC then?