Amazon's AI crawler is making my Git server unstable

Animats

It's time for a lawyer letter. See the Computer Fraud and Abuse Act prosecution guidelines.[1] In general, the US Justice Department will not consider any access to open servers that's not clearly an attack to be "unauthorized access". But,

"However, when authorizers later expressly revoke authorization—for example, through unambiguous written cease and desist communications that defendants receive and understand—the Department will consider defendants from that point onward not to be authorized."

So, you get a lawyer to write an "unambiguous cease and desist" letter. You have it delivered to Amazon by either registered mail or a process server, as recommended by the lawyer. Probably both, plus email.

Then you wait and see if Amazon stops.

If they don't stop, you can file a criminal complaint. That will get Amazon's attention.

[1] https://www.justice.gov/jm/jm-9-48000-computer-fraud

xena

Honestly, I figure that being on the front page of Hacker News like this is more than shame enough to get a human from the common sense department to read and respond to the email I sent politely asking them to stop scraping my git server. If I don't get a response by next Tuesday, I'm getting a lawyer to write a formal cease and desist letter.

amarcheschi

It's computer science; nothing changes on the corpo side until they get a lawyer letter.

And even then, it's probably not going to be easy

gazchop

No one gives a fuck in this industry until someone turns up with bigger lawyers. This behaviour gets written off as OK, with no ethical concerns, until that bigger fish comes along.

Really puts me off it.

DrBenCarson

Lol you really think an ephemeral HN ranking will make change?

usefulcat

It's not unheard of. But neither would I count on it.

xena

There's only one way to find out!

Aurornis

> Then you wait and see if Amazon stops.

That’s if the requests are actually coming from Amazon, which seems very unlikely. The Amazon bot should come from known Amazon IP ranges and respect robots.txt. An Amazon engineer confirmed it in another comment: https://news.ycombinator.com/item?id=42751729

The blog post mentions things like changing user agent strings, ignoring robots.txt, and residential IP blocks. If the only thing that matches Amazon is the “AmazonBot” User Agent string but not the IP ranges or behavior then lighting your money on fire would be just as effective as hiring a lawyer to write a letter to Amazon.

armchairhacker

I like the solution in this comment: https://news.ycombinator.com/item?id=42727510.

Put a link somewhere in your site that no human would visit, disallow it in robots.txt (under a wildcard, since apparently OpenAI's crawler specifically ignores wildcards and will walk right into it), and when an IP address visits the link, ban it for 24 hours.
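A minimal sketch of that idea, assuming a Flask front end; the /trap-for-bots path, the in-memory ban table, and the 24-hour window are all illustrative, not a recommendation for any particular stack:

    # Hidden trap link plus a 24h IP ban, sketched with Flask. The path name
    # and the in-memory dict are assumptions; a real setup would persist bans
    # and sit behind whatever reverse proxy you already run.
    import time
    from flask import Flask, abort, request

    app = Flask(__name__)
    BAN_SECONDS = 24 * 60 * 60
    banned = {}  # ip -> unix timestamp when the ban expires

    @app.before_request
    def reject_banned_ips():
        expiry = banned.get(request.remote_addr)
        if expiry and expiry > time.time():
            abort(403)  # still banned

    @app.route("/trap-for-bots")  # linked invisibly, disallowed in robots.txt
    def trap():
        banned[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)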

Szpadel

I had to deal with some bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24h,

but due to the number of IPs involved this did not have any noticeable impact on the amount of traffic.

My suggestion is to look very closely at the headers you receive (varnishlog is very nice for this), and if you stare at them long enough you might spot something that all those requests have in common that would let you easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.).
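As a rough illustration of that approach, something like this tallies header combinations from an access log and surfaces the suspiciously over-represented ones; the tab-separated log format (IP, User-Agent, Accept-Language) is an assumption, so adjust it to whatever your server actually logs:

    # Count (User-Agent, Accept-Language) combinations and print the top 20.
    # Assumed input: tab-separated lines cut from your access log, piped in.
    import sys
    from collections import Counter

    combos = Counter()
    for line in sys.stdin:
        try:
            ip, user_agent, accept_language = line.rstrip("\n").split("\t")
        except ValueError:
            continue  # skip malformed lines
        combos[(user_agent, accept_language)] += 1

    for (ua, lang), hits in combos.most_common(20):
        print(f"{hits:8d}  {ua!r}  {lang!r}")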

conradev

My favorite example of this was how folks fingerprinted the active probes of the Great Firewall of China. It has a large pool of IP addresses to work with (i.e. all ISPs in China), but the TCP timestamps were shared across a small number of probing machines:

"The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"

https://censorbib.nymity.ch/pdf/Alice2020a.pdf

superjan

Why work hard… Train a model to recognize the AI bots!

js4ever

Because you have to decide in less than 1 ms; using AI is too slow in that context.

aaomidi

Maybe ban ASNs /s

koito17

This was indeed one mitigation used by a site to prevent bots hosted on AWS from uploading CSAM and generating bogus reports to the site's hosting provider.[1]

In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.

[1] https://news.ycombinator.com/item?id=26865236

to11mtm

Uggh, web crawlers...

8ish years ago, at the shop I worked at, we had a server taken down. It was an image server for vehicles. How did it go down? Well, the crawler in question somehow had access to vehicle image links we had due to our business. Unfortunately, the perfect storm of the image not actually existing (can't remember why, mighta been one of those weird cases where we did a re-inspection without issuing a new inspection ID) resulted in them essentially DoSing our condition report image server. Worse, there was a bug in the error handler somehow, such that the server process restarted when this condition happened. This had the -additional- disadvantage of invalidating our 'for .NET 2.0, pretty dang decent' caching implementation...

It comes to mind because I'm pretty sure we started doing some canary techniques just to be safe. (Ironically, even the simple ones were still cheaper than adding a different web server... yes, we also fixed the caching issue... yes, we also added a way to 'scream' if we got too many bad requests on that service.)

shakna

When I was writing a crawler for my search engine (now offline), I found almost no crawler library actually compliant with the real world. So I ended up going to a lot of effort to write one that complied with Amazon and Google's rather complicated nested robots files, including respecting the cool off periods as requested.

... And then found their own crawlers can't parse their own manifests.

bb010g

Could you link the source of your crawler library?

trebor

Upvoted because we’re seeing the same behavior from all AI and SEO bots. They’re BARELY respecting robots.txt, and they're hard to block. And when they crawl, they spam and drive up load so high they crash many servers for our clients.

If AI crawlers want access they can either behave, or pay. The consequence will be almost universal blocks otherwise!

herpdyderp

> The consequence will be almost universal blocks otherwise!

How? The difficulty of doing that is the problem, isn't it? (Otherwise we'd just be doing that already.)

ADeerAppeared

> (Otherwise we'd just be doing that already.)

Not quite what the original commenter meant but: WE ARE.

A major consequence of this reckless AI scraping is that it turbocharged the move away from the web and into closed ecosystems like Discord. Away from the prying eyes of most AI scrapers ... and the search engine indexes that made the internet so useful as an information resource.

Lots of old websites & forums are going offline as their hosts either cannot cope with the load or send a sizeable bill to the webmaster who then pulls the plug.

gundmc

What do you mean by "barely" respecting robots.txt? Wouldn't that be more binary? Are they respecting some directives and ignoring others?

unsnap_biceps

I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcards in user agents.

That counts as barely imho.

I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
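For reference, the resulting robots.txt presumably looks something like this; the bot names (GPTBot, ChatGPT-User, OAI-SearchBot) are the ones OpenAI documents at the time of writing and should be checked against their current docs:

    # Wildcard deny that some crawlers reportedly ignore
    User-agent: *
    Disallow: /

    # Explicit entries for bots that only honor their own name
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /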

joecool1029

Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

noman-land

This is highly annoying and rude. Is there a complete list of all known bots and crawlers?

LukeShu

Amazonbot doesn't respect the `Crawl-Delay` directive. To be fair, Crawl-Delay is non-standard, but it is claimed to be respected by the other 3 most aggressive crawlers I see.

And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.

Animats

Here's Google, complaining of problems with pages they want to index but I blocked with robots.txt.

    New reason preventing your pages from being indexed

    Search Console has identified that some pages on your site are not being indexed 
    due to the following new reason:

        Indexed, though blocked by robots.txt

    If this reason is not intentional, we recommend that you fix it in order to get
    affected pages indexed and appearing on Google.
    Open indexing report
    Message type: [WNC-20237597]

emmelaich

If they're AI bots it might be fun to feed them nonsense. Just send back megabytes of "Bezos is a bozo" or something like that. Even more fun if you could cooperate with many other otherwise-unrelated websites, e.g. via time settings in a modified tarpit.
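As a toy sketch of that (the bot names, the filler text, and the response size are arbitrary, and Flask is used purely for illustration):

    # If the User-Agent matches a known scraper, return a few megabytes of
    # filler instead of real content. Everything here is illustrative.
    from flask import Flask, Response, request

    app = Flask(__name__)
    BOT_MARKERS = ("Amazonbot", "GPTBot", "ClaudeBot", "Bytespider")
    FILLER = ("Bezos is a bozo. " * 64 + "\n") * 4096  # roughly 4 MB of text

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def serve(path):
        ua = request.headers.get("User-Agent", "")
        if any(marker in ua for marker in BOT_MARKERS):
            return Response(FILLER, mimetype="text/plain")
        return "real content goes here"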

Vampiero

> The consequence will be almost universal blocks otherwise!

Who cares? They've already scraped the content by then.

jsheard

Bold to assume that an AI scraper won't come back to download everything again, just in case there's any new scraps of data to extract. OP mentioned in the other thread that this bot had pulled 3TB so far, and I doubt their git server actually has 3TB of unique data, so the bot is probably pulling the same data over and over again.

xena

FWIW that includes other scrapers, Amazon's is just the one that showed up the most in the logs.

_heimdall

If they only needed a one-time scrape we really wouldn't be seeing noticeable bot traffic today.

ksec

Is there some way a website can sell its data to AI bots as a large zip file rather than being constantly DDoSed?

Or they could at least have the courtesy to scrape during nighttime / off-peak hours.

jsheard

No, because they won't pay for anything they can get for free. There's only one situation where an AI company will pay for data, and that's when it's owned by someone with scary enough lawyers to pressure them into paying up. Hence why OpenAI has struck licensing deals with a handful of companies while continuing to bulk-scrape unlicensed data from everyone else.

mschuster91

A global tarpit is the solution. It makes sense anyway, even without taking AI crawlers into account. Back when I had to implement that, I went the semi-manual route: parse the access log, and any IP address averaging more than X hits a second on /api gets a -j TARPIT with iptables [1].

Not sure how to implement it in the cloud though, never had the need for that there yet.

[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...
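A rough sketch of that log-parsing loop, assuming combined-format access logs; it prints the iptables rules for review rather than applying them, the threshold and the /api filter are arbitrary, and the TARPIT target needs xtables-addons:

    # Feed it a recent slice of the access log, e.g.:
    #   tail -n 100000 access.log | python3 find_heavy_hitters.py
    import sys
    from collections import Counter

    THRESHOLD = 300  # ~5 req/s averaged over a one-minute log slice

    hits = Counter()
    for line in sys.stdin:
        fields = line.split()
        # combined log format: ip ident user [time] "METHOD /path HTTP/x" ...
        if len(fields) > 6 and fields[6].startswith("/api"):
            hits[fields[0]] += 1

    for ip, count in hits.items():
        if count > THRESHOLD:
            # Print the rule instead of applying it; review, then pipe into sh.
            print(f"iptables -A INPUT -s {ip} -p tcp -j TARPIT")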

jks

One such tarpit (Nepenthes) was just recently mentioned on Hacker News: https://news.ycombinator.com/item?id=42725147

Their site is down at the moment, but luckily they haven't stopped Wayback Machine from crawling it: https://web.archive.org/web/20250117030633/https://zadzmo.or...

marcus0x62

Quixotic[0] (my content obfuscator) includes a tarpit component, but for something like this, I think the main quixotic tool would be better - you run it against your content once, and it generates a pre-obfuscated version of it. It takes a lot less of your resources to serve than dynamically generating the tarpit links and content.

0 - https://marcusb.org/hacks/quixotic.html

kazinator

How do you know their site is down? You probably just hit their tarpit. :)

bwfan123

I would think public outcry by influencers on social media (such as this thread) is a better deterrent, and it also establishes a public data point and exhibit for future reference... as it is hard to scale the tarpit.

idlewords

This doesn't work with the kind of highly distributed crawling that is the problem now.

Aurornis

I don’t think I’d assume this is actually Amazon. The author is seeing requests from rotating residential IPs and changing user agent strings.

> It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more.

Impersonating crawlers from big companies is a common technique for people trying to blend in. The fact that requests are coming from residential IPs is a big red flag that something else is going on.

cmeacham98

I work for Amazon, but not directly on web crawling.

Based on the internal information I have been able to gather, it is highly unlikely this is actually Amazon. Amazonbot is supposed to respect robots.txt and should always come from an Amazon-owned IP address (You can see verification steps here: https://developer.amazon.com/en/amazonbot).

I've forwarded this internally just in case there is some crazy internal team I'm not aware of pulling this stunt, but I would strongly suggest the author treats this traffic as malicious and lying about its user agent.
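For anyone who wants to check their own logs, forward-confirmed reverse DNS is the usual way to do that verification; the expected hostname suffix below is an assumption on my part and should be confirmed against the page linked above:

    # Reverse-resolve the IP, check the hostname suffix, then forward-resolve
    # the hostname and confirm it maps back to the same IP.
    import socket

    EXPECTED_SUFFIX = ".crawl.amazonbot.amazon"  # assumption; verify against the docs

    def is_genuine_amazonbot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
            if not hostname.endswith(EXPECTED_SUFFIX):
                return False
            _, _, addresses = socket.gethostbyname_ex(hostname)  # forward confirm
            return ip in addresses
        except (socket.herror, socket.gaierror):
            return False

    print(is_genuine_amazonbot("203.0.113.7"))  # documentation-range IP; prints False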

AyyEye

> The author is seeing requests from rotating residential IPs and changing user agent strings

This type of thing is commercially available as a service[1]. Hundreds of Millions of networks backdoored and used as crawlers/scrapers because of an included library somewhere -- and ostensibly legal because somewhere in some ToS they had some generic line that could plausibly be extended to using you as a patsy for quasi-legal activities.

[1] https://brightdata.com/proxy-types/residential-proxies

Aurornis

Yes, we know, but the accusation is that Amazon is the source of the traffic.

If the traffic is coming from residential IPs then it’s most likely someone using these services and putting “AmazonBot” as a user agent to trick people.

paranoidrobot

I wouldn't put it past any company these days doing crawling in an aggressive manner to use proxy networks.

smileybarry

With the amount of "if cloud IP then block" rules in place for many things (to weed out streaming VPNs and "potential" ddos-ing) I wouldn't doubt that at all.

Ndymium

I had this same issue recently. My Forgejo instance started to use 100 % of my home server's CPU as Claude and its AI friends from Meta and Google were hitting the basically infinite links at a high rate. I managed to curtail it with robots.txt and a user agent based blocklist in Caddy, but who knows how long that will work.

Whatever happened to courtesy in scraping?

jsheard

> Whatever happened to courtesy in scraping?

Money happened. AI companies are financially incentivized to take as much data as possible, as quickly as possible, from anywhere they can get it, and for now they have so much cash to burn that they don't really need to be efficient about it.

bwfan123

Not only money, but also a culture of "all your data belong to us", because our AI is going to save you and the world.

The hubris reminds me of the dot-com era. That bust left a huge wreckage; not sure how this one is going to land.

__loam

It's gonna be rough. If you can't make money charging people $200 a month for your service then something is deeply wrong.

nicce

They need to act fast before the copyright cases in court get handled.

to11mtm

> Whatever happened to courtesy in scraping?

Various companies got the signal that, at least for now, there's a huge Overton window of what it's acceptable for AI to ingest, so they're going to take all they can before regulation even tries to clamp down.

The bigger danger is that one of these companies, even (or especially) one that claims to be 'Open', does so but gets to the point of being considered 'too big to fail' from an economic/natsec standpoint...

baobun

Mind sharing a decent robots.txt and/or user-agent list to block the AI crawlers?

hooloovoo_zoo

Any of the big chat models should be able to reproduce it :)

Analemma_

The same thing that happened to courtesy in every other context: it only existed in contexts where there was no profit to be made in ignoring it. The instant that stopped being true, it was ejected.

ThinkBeat

The best way to fight this would not be to block them; that does not cost Amazon/others anything (clearly).

What if instead it was possible to feed the bots clearly damaging and harmful content?

If done on a larger scale, and Amazon discovers the poisoned pills, they would have to spend money rooting them out quickly and make attempts to stop their bots from ingesting them.

Of course nobody wants to have that stuff on their own site, though. That is the biggest problem with this.

ADeerAppeared

> What if instead it was possible to feed the bots clearly damaging and harmful content?

With all respect, you're completely misunderstanding the scope of AI companies' misbehaviour.

These scrapers already gleefully chow down on CSAM and all other likewise horrible things. OpenAI had some of their Kenyan data-tagging subcontractors quit on them over this. (2023, Time)

The current crop of AI firms do not care about data quality. Only quantity. The only thing you can do to harm them is to hand them 0 bytes.

You would go directly to jail for things even a tenth as bad as Sam Altman has authorized.

amarcheschi

If with damaging content you mean incorrect content, another comment in this thread has a user doing what you said: https://marcusb.org/hacks/quixotic.html

smeggysmeg

I've seen this tarpit recommended for this purpose. It creates endless nests of directories and endless garbage content as the site is being crawled. The bot can spend hours on it.

https://zadzmo.org/code/nepenthes/

idlewords

My site (Pinboard) is also getting hammered by what I presume are AI crawlers. It started out this summer with Chinese and Singapore IPs, but now I can't even block by IP range, and have to resort to captchas. The level of traffic is enough to immediately crash the site, and I don't even have any interesting text for them to train on, just links.

I'm curious how OP figured out it's Amazon's crawler to blame. I would love to point the finger of blame.

thayne

Are you sure it isn't a DDoS masquerading as Amazon?

Requests coming from residential ips is really suspicious.

Edit: the motivation for such a DDoS might be targeting Amazon, by taking down smaller sites and making it look like Amazon is responsible.

If it is Amazon, one place to start is blocking all the IP ranges they publish, although it sounds like there are requests outside those ranges...

OptionOfT

You should check out websites like grass dot io (I refuse to give them traffic).

They pay you for your bandwidth while they resell it to 3rd parties, which is why a lot of bot traffic looks like it comes from residential IPs.

Aurornis

Yes, but the point is that big company crawlers aren’t paying for questionably sourced residential proxies.

If this person is seeing a lot of traffic from residential IPs then I would be shocked if it’s really Amazon. I think someone else is doing something sketchy and they put “AmazonBot” in the user agent to make victims think it’s Amazon.

You can set the user agent string to anything you want, as we all know.

voakbasda

I wonder if anyone has checked whether Alexa devices serve as a private proxy network for AmazonBot’s use.

ninkendo

They could be using echo devices to proxy their traffic…

Although I’m not necessarily gonna make that accusation, because it would be pretty serious misconduct if it were true.

dafelst

I worked for Microsoft doing malware detection back 10+ years ago, and questionably sourced proxies were well and truly on the table

baobun

> Yes, but the point is that big company crawlers aren’t paying for questionably sourced residential proxies

You'd be surprised...

skywhopper

It’s not residential proxies. It’s Amazon using IPs they sublease from residential ISPs.

SOLAR_FIELDS

Wild. While I'm sure the service is technically legal since it can be used for non-nefarious purposes, signing up for a service like that seems like a guarantee that you are contributing to problematic behavior.

xena

I'd love it if Amazon could give me some AWS credit as a sign of good faith to make up for the egress overages their and other bots are causing, but the ads on this post are likely going to make up for it. Unblock ads and I come out even!

Aloisius

Using status code 418 (I'm a teapot), while cute, actually works against you, since even well-behaved bots don't know how to handle it and thus might not treat it as a permanent status, causing them to try to recrawl again later.

Plus you'll want to allow access to /robots.txt.

Of course, if they're hammering new connections, then automatically adding temporary firewall rules if the user agent requests anything but /robots.txt might be the easiest solution. Well or just stick Cloudflare in front of everything.

LukeShu

Before I configured Nginx to block them:

- Bytespider (59%) and Amazonbot (21%) together accounted for 80% of the total traffic to our Git server.

- ClaudeBot drove more traffic through our Redmine in a month than it saw in the combined 5 years prior to ClaudeBot.

neilv

Can demonstrable ignoring of robots.txt help the cases of copyright infringement lawsuits against the "AI" companies, their partners, and customers?

thayne

Probably not copyright infringement. But it is probably (hopefully?) a violation of CFAA, both because it is effectively DDoSing you, and they are ignoring robots.txt.

Maybe worth contacting law enforcement?

Although it might not actually be Amazon.

to11mtm

Big thing worth asking here. Depending on what 'Amazon' means here (i.e. known Amazon-specific IPs vs cloud IPs), it could just be someone running a crawler on AWS.

Or folks failing the 'shared security model' of AWS, with their stuff compromised by botnets running on AWS.

Or folks quasi-spoofing 'AmazonBot' because they think it will have a better not-blocked rate than anonymous or other requests...

thayne

From the information in the post, it sounds like the last one to me. That is, someone else spoofing an Amazonbot user agent. But it could potentially be all three.

adastra22

On what legal basis?

flir

In the UK, the Computer Misuse Act applies if:

* There is knowledge that the intended access was unauthorised

* There is an intention to secure access to any program or data held in a computer

I imagine US law has similar definitions of unauthorized access?

`robots.txt` is the universal standard for defining what is unauthorised access for bots. No programmer could argue they aren't aware of this, and ignoring it, for me personally, is enough to show knowledge that the intended access was unauthorised. Is that enough for a court? Not a goddamn clue. Maybe we need to find out.

pests

> `robots.txt` is the universal standard

Quite the assumption, you just upset a bunch of alien species.

readyplayernull

Terms of use contract violation?

hipadev23

Robots.txt is completely irrelevant. TOU/TOS are also irrelevant unless you restrict access to only those who have agreed to terms.

bdangubic

Good thought, but zippo chance this holds up in court.