
Humanely dealing with humungus crawlers

33 comments · September 12, 2025

bobbiechen

>We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.

+1000. I feel like so much bot detection (and fraud prevention against human actors, too) is emotionally driven. Some people hate these things so much, they're willing to cut off their nose to spite their face.
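For what it's worth, the approach in the quote (serve the cached render to everyone who clicks shortly after the first visitor, rather than challenging them) is cheap to build. A minimal sketch in Go; the render callback and the 10-second TTL are illustrative assumptions, not the article's actual code:

```go
// Minimal sketch: a tiny in-memory cache in front of an expensive page
// render, so repeat visitors within the TTL never trigger new work.
// The render callback and 10-second TTL are illustrative assumptions.
package main

import (
	"net/http"
	"sync"
	"time"
)

type entry struct {
	body    []byte
	expires time.Time
}

var (
	mu    sync.Mutex
	cache = map[string]entry{}
)

// cached wraps a render function with a per-path cache.
func cached(render func(path string) []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		e, ok := cache[r.URL.Path]
		if !ok || time.Now().After(e.expires) {
			e = entry{body: render(r.URL.Path), expires: time.Now().Add(10 * time.Second)}
			cache[r.URL.Path] = e
		}
		mu.Unlock()
		w.Write(e.body)
	}
}

func main() {
	http.HandleFunc("/", cached(func(path string) []byte {
		return []byte("rendered: " + path) // stand-in for the expensive render
	}))
	http.ListenAndServe(":8080", nil)
}
```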

bayindirh

My view on this is simple:

If you're a bot that will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.

No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.

That's all, thanks.

beeflet

I think the problem is that despite the effort, you will still end up in the dataset, so it's futile.

warkdarrior

How can you tell a bot will ignore all your content licenses?

bayindirh

Currently, all AI companies argue that the content they use falls under fair use, and they disregard all licenses. This means any future ones that respect these licenses need to be whitelisted.
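In practice that stance amounts to default-deny for crawlers, with a small allowlist. A rough sketch in Go; the bot heuristic and the allowlist entry are placeholders, and a real deployment would verify identity via reverse DNS or published IP ranges rather than trusting the User-Agent:

```go
// Default-deny sketch: anything that self-identifies as a crawler is
// refused unless it appears on a whitelist of license-respecting bots.
// "GoodBot/1.0" is a hypothetical name; User-Agent strings are also
// trivially forged, so this shows only the shape of the policy.
package main

import (
	"net/http"
	"strings"
)

var allowedBots = []string{"GoodBot/1.0"} // hypothetical license-respecting crawler

func looksLikeBot(ua string) bool {
	ua = strings.ToLower(ua)
	return strings.Contains(ua, "bot") || strings.Contains(ua, "crawl") || strings.Contains(ua, "spider")
}

func gate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := r.UserAgent()
		if looksLikeBot(ua) {
			for _, allowed := range allowedBots {
				if strings.Contains(ua, allowed) {
					next.ServeHTTP(w, r)
					return
				}
			}
			http.Error(w, "crawling this content is not licensed", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})
	http.ListenAndServe(":8080", gate(hello))
}
```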

Vegenoid

I think it’s better viewed through a lens of effort. Implementing systems that try harder to not challenge humans takes more work than just throwing up a catch-all challenge wall.

The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.

andy99

Also, Google and Cloudflare have been able to cement their monopolies by pushing for challenges. Google uses reCAPTCHA primarily to punish people who use alternate browsers or don't allow tracking, and Cloudflare wants to be the gatekeeper of the internet. So they present themselves as providing "protection" and convince website owners it's needed. Seems like a familiar racket...

jitl

Really? If I'm running an unsophisticated blog without a CDN and I get a $1000 bill for bandwidth overage or something, I'm gonna google a solution and slap it on there, because I don't want to pay another $1000 for Big Basilisk. I don't think that's an emotional response; it's common sense.

marginalia_nu

Seems like you've made profoundly questionable hosting or design choices for that to happen. Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

Misbehaving crawlers are a huge problem, but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries, due to the rapidly mutating nature of their datasets.

Git forges, like the one TFA is discussing, are also fairly expensive, especially as crawlers traverse historical states. A poorly implemented crawler will get stuck doing this basically forever. Because of this, detecting and dealing with git hosts is an absolute must for any web crawler.
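From the crawler side, a simple mitigation is to recognize repository-history URLs and skip them before fetching. A sketch in Go; the path patterns are assumptions, since every forge (gitea, cgit, and so on) has its own URL layout:

```go
// Crawler-side sketch: skip URLs that point into a repository's
// historical state, where a naive crawl can wander effectively forever.
// The patterns below are illustrative, not an exhaustive forge list.
package main

import (
	"fmt"
	"regexp"
)

// Matches common history-view path segments, or a bare 40-hex-digit
// object id anywhere in the path.
var gitHistory = regexp.MustCompile(`/(commit|commits|blame|raw|tree|blob)/|[0-9a-f]{40}`)

func shouldCrawl(path string) bool {
	return !gitHistory.MatchString(path)
}

func main() {
	for _, p := range []string{
		"/project/docs/index.html",
		"/project/commit/3f2a9c1",
	} {
		fmt.Println(p, "->", shouldCrawl(p))
	}
}
```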

mtlynch

>Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

I actually find this surprisingly hard to come by.

I just want static hosting (like Netlify or Firebase Hosting), but there aren't many hosts that offer that.

There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.

phantompeace

Wouldn't it be easier to put the unsophisticated blog behind Cloudflare?

mhuffman

As much as I like to shit on Cloudflare at every opportunity, it would obviously be easier to put it behind CF than to install bot detection plugins.

michaeljx

For some reason I thought this would be about dealing with very large insects

hyperman1

I've been wondering about how to make a challenge that AI won't do. Some possibilities (a rough sketch follows the list):

* Type this sentence, taken from a famous copyrighted work.

* Type Tiananmen protests.

* Type this list of swear words or sexual organs.
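As a rough sketch of how such a challenge could work (entirely the commenter's premise; the phrase list and the assumption that a model refuses to echo it are both speculative), in Go:

```go
// Sketch of a "type this back" challenge built on phrases a hosted
// model might refuse to reproduce verbatim. The phrases and the
// refusal premise are assumptions, not a tested mechanism.
package main

import (
	"fmt"
	"math/rand"
)

var phrases = []string{
	"a sentence from a famous copyrighted work",
	"Tiananmen protests",
	"a list of swear words",
}

func challenge() string {
	return phrases[rand.Intn(len(phrases))]
}

func verify(expected, typed string) bool {
	return expected == typed
}

func main() {
	c := challenge()
	fmt.Printf("Type exactly: %q\n", c)
	fmt.Println(verify(c, c)) // a correct echo passes
}
```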

dweinus

> Type this list of swear words

1998: I swear at the computer until the page loads

2025: I swear at the computer until the page loads

nektro

it's sad we've gotten to the point where mitigations against this have to be such a consideration when hosting a site

zkmon

Sorry, what's the point of this blog? I wish people would write a quick abstract/summary in the first few lines and then elaborate. Or at least put the summary at the end, in the old-fashioned way.

politelemon

The point of a blog is whatever the author would like it to be. It doesn't have to follow a structure or expectations. We just happen to be consuming it.

bayindirh

> Sorry, what's the point of this blog?

Being a blog the way the author dreamed of it.

> I wish people would write a quick abstract/summary in the first few lines and then elaborate.

I hope people continue doing what makes them happy. It's their site, they owe nothing to anyone (maybe except hosting / network fees, but that's not my business, either).

> Or at least put the summary at the end, in the old-fashioned way.

Or maybe people can spend a couple of minutes reading and understanding it, with the MSI (MeatSpaceIntelligence) that comes bundled with all human beings.

It's free, too!

zkmon

Maybe, maybe... you get some pleasure from forcing people to read every bit of what you write just to figure out what the heck it is. But unfortunately, this is the age of AI summaries and short attention spans, not the times when you read half-foot-thick novels end-to-end multiple times. TL;DR!

bayindirh

I write my digital garden and blog for myself. They just happen to be public. The pleasurable part is putting it out there, not forcing people to read it.

If people prefer to have short attention spans and leave what I put out after 30 seconds, that's their own choice. My blog has minimal analytics (provided by the platform), and my digital garden has no analytics whatsoever, so I don't care and don't get bothered about what humans do with my site.

I personally don't use any AI tools whatsoever, and I still prefer to read half-foot-thick novels end-to-end. The Hyperion Cantos (4 x 700 pages) was great. My next target is Asimov's Foundation (7 volumes, including expansions).