
Blocking LLM crawlers without JavaScript

16 comments

November 15, 2025

DeepYogurt

Has anyone done a talk/blog/whatever on how LLM crawlers are different from classical crawlers? I'm not up on the difference.

klodolph

The only real difference is that LLM crawlers tend not to respect /robots.txt, and some of them hammer sites with pretty heavy traffic.

The trap in the article has a link. Bots are instructed not to follow the link, and it is normally invisible to humans. A client that visits the link is therefore probably a poorly behaved bot.
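In server terms the idea is roughly this (a minimal Flask sketch, not the article's actual implementation; the /trap path and the "blocked" cookie name are made up):

```python
# Sketch of the trap: robots.txt disallows a URL, and any client that
# visits that URL anyway gets marked as blocked via a cookie.
from flask import Flask, Response, request, make_response

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers read this and never request /trap
    return Response("User-agent: *\nDisallow: /trap\n", mimetype="text/plain")

@app.route("/trap")
def trap():
    # Anything that reaches this URL either ignored robots.txt or followed
    # a link humans can't see, so mark the client as blocked with a cookie.
    resp = make_response("Blocked.", 403)
    resp.set_cookie("blocked", "1")
    return resp

@app.before_request
def refuse_blocked_clients():
    # Refuse to serve any client that previously hit the trap
    if request.cookies.get("blocked") == "1":
        return Response("Forbidden", status=403)

if __name__ == "__main__":
    app.run()
```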

daveoc64

Seems pretty easy to cause problems for other people with this.

If you follow the link at the end of my comment, you'll be flagged as an LLM.

You could put this in an img tag on a forum or similar and cause mischief.

Don't follow the link below:

https://www.owl.is/stick-och-brinn/

If you do follow that link, you can just clear cookies for the site to be unblocked.

kazinator

You do not have a meta refresh timer that skips your entire comment and redirects to the good page in a fraction of a second, too short for a person to react.

You also have not used <p hidden> to conceal the paragraph with the link from human eyes.
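The markup being described would look roughly like this (a sketch; the paths, timing, and wording are assumptions, not taken from the article):

```python
# Rough sketch of the decoy page: a near-instant meta refresh whisks human
# visitors to the real page, while the trap link sits in a paragraph hidden
# from human eyes but still present in the HTML a crawler parses.
DECOY_PAGE = """<!doctype html>
<html>
  <head>
    <!-- redirect browsers to the real content before a person can react -->
    <meta http-equiv="refresh" content="0; url=/the-real-page">
  </head>
  <body>
    <!-- invisible to humans, but a link-following crawler will still see it -->
    <p hidden>
      Do not follow this link: <a href="/trap">trap</a>
    </p>
  </body>
</html>"""
```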


SquareWheel

That may work for blocking bad automated crawlers, but an agent acting on behalf of a user wouldn't follow robots.txt. It would run the risk of hitting the bad URL when trying to understand the page.

klodolph

That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.
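For the first option, an agent can consult robots.txt before fetching any link it finds on a page; a minimal sketch using Python's standard library (the user-agent string is a placeholder):

```python
# Check robots.txt before following a discovered link.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.owl.is/robots.txt")
robots.read()  # fetch and parse the live robots.txt

candidate = "https://www.owl.is/stick-och-brinn/"
if robots.can_fetch("MyAgent/1.0", candidate):
    print("allowed to fetch", candidate)
else:
    print("disallowed by robots.txt, skipping", candidate)
```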

Springtime

I wonder what the Venn diagram of end users who disable JavaScript and also block cookies by default looks like. Since the former is already something users have to do very deliberately, I'd guess the likelihood of the latter is higher among such users.

There's no cookies-disabled error handling on the site, so the page just reloads infinitely in such cases (Cloudflare's check, for comparison, informs the user that cookies are required, even if JS is also disabled).
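The reload loop can be avoided by marking the retry, roughly like this (a sketch, not how the site or Cloudflare actually handles it; the cookie name and query parameter are made up):

```python
# Cookies-disabled handling: set the cookie, redirect once with a marker,
# and if the cookie is still missing on the marked request, show an error
# instead of redirecting again.
from flask import Flask, request, redirect, make_response

app = Flask(__name__)

@app.route("/")
def index():
    if request.cookies.get("verified"):
        return "Welcome, human."
    if request.args.get("retried"):
        # The cookie set on the previous response didn't come back,
        # so cookies are disabled; stop the loop and explain.
        return "Cookies are required to view this site.", 403
    resp = make_response(redirect("/?retried=1"))
    resp.set_cookie("verified", "1")
    return resp
```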

behnamoh

Any ideas on how to block LLMs from reading/analyzing a PDF? I don't want to submit a paper to journals only for them to use ChatGPT to review it...

(it has happened before)

Edit: I'm starting to get downvoted. Perhaps by the lazy-ass journal reviewers?

cortesoft

If someone can read it, they can put it through an LLM. There is no possible way to prevent that. Even with crazy DRM, you could take a picture of your screen and OCR it.

They are trying to block automated LLM scraping, which at least has some chance of success.

jadbox

Short answer is no. There are PDF black-magic DRM tricks that could be used, but most PDF libraries used for AI will decode them, making the point moot. It's better just to add a note for the humans, something like "This PDF is best enjoyed by humans" or words to that effect.

nektro

nice post

petesergeant

I wish blockers would distinguish between crawlers that index and agentic crawlers serving an active user's request. npm blocking Claude Code is irritating.

klodolph

Of those two, I think agentic crawlers are worse.

specialp

Agentic crawlers are worse. I run a primary-source site, and the AI "thinking" user agents will hit your site 1000+ times a minute at any time of day.

superkuh

I thought this was cool because it worked even in my old browser. So cool I went to add their RSS feed to my feed reader. But then my feed reader got blocked by the system. So now it doesn't seem so cool.

If the site author reads this: make an exception for https://www.owl.is/blogg/index.xml

This is a common mistake, and the author is in good company. Science.org once blocked all of their hosted blogs' feeds for 3 months when they deployed a default Cloudflare setup across all their sites.
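Carving out such an exception can be as small as an allowlist checked before the blocking logic (a sketch continuing the earlier Flask example; only the feed path above comes from this thread, the rest is assumed):

```python
# Exempt feed URLs from the anti-crawler check so feed readers keep working.
from flask import Flask, request, Response

app = Flask(__name__)

# Paths that feed readers and crawlers legitimately need.
FEED_ALLOWLIST = {"/blogg/index.xml", "/robots.txt"}

@app.before_request
def maybe_block():
    if request.path in FEED_ALLOWLIST:
        return None  # never challenge feed readers or robots.txt fetchers
    if request.cookies.get("blocked") == "1":
        return Response("Forbidden", status=403)
```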