The Web Is Broken – Botnet Part 2
243 comments
April 19, 2025
rollcat
Try Anubis: <https://anubis.techaro.lol>
It's a reverse proxy that presents a PoW challenge to every new visitor. It shifts the initial cost of accessing your server's resources back onto the client. Assuming your uplink can handle 300k clients requesting a single 70kb web page, it should solve most of your problems.
For science, can you estimate your peak QPS?
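For illustration, a minimal sketch of the kind of proof-of-work handshake such a proxy performs. This is not Anubis's actual protocol; the challenge format and difficulty are made up:

```python
import hashlib
import secrets

def issue_challenge() -> tuple[str, int]:
    # Server side: a random challenge plus a difficulty in leading zero bits.
    return secrets.token_hex(16), 16

def solve(challenge: str, difficulty: int) -> int:
    # Client side: brute-force a nonce whose hash clears the difficulty.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if int(digest, 16) >> (256 - difficulty) == 0:
            return nonce
        nonce += 1

def verify(challenge: str, difficulty: int, nonce: int) -> bool:
    # Server side: a single hash to check, so the cost is asymmetric.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return int(digest, 16) >> (256 - difficulty) == 0

challenge, difficulty = issue_challenge()
nonce = solve(challenge, difficulty)          # the visitor burns some CPU once
assert verify(challenge, difficulty, nonce)   # the server checks it cheaply
```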
luckylion
I've seen a few attacks where the operators placed malicious code on high-traffic sites (e.g. some government thing, larger newspapers), and then just let browsers load your site as an img. Did you see images, css, js being loaded from these IPs? If they were expecting images, they wouldn't parse the HTML or load other resources.
It's a pretty effective attack because you get large numbers of individual browsers to contribute. Hosts don't care, so unless the site owners are technical enough, these attacks can stay online for quite a while.
If they work with Referrer Policy, they should be able to mask themselves fairly well - the ones I saw back then did not.
hubraumhugo
We all agree that AI crawlers are a big issue as they don't respect any established best practices, but we rarely talk about the path forward. Scraping has been around for as long as the internet, and it was mostly fine. There are many very legitimate use cases for browser automation and data extraction (I work in this space).
So what are potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How can we enable beneficial automation while protecting against abusive AI crawlers?
mjaseem
I wrote an article about a possible proof of personhood solution idea: https://mjaseem.github.io/tech/2025/04/12/proof-of-humanity.....
The broad idea is to use zero knowledge proofs with certification. It sort of flips the public key certification system and adds some privacy.
To get this into place, the powers that be would need to be swayed.
CaptainFever
My pet peeve is that using the term "AI crawler" for this conflates things unnecessarily. There are some people who are angry at it due to anti-AI bias and not wishing to share information, while there are others who are more concerned about the large amount of bandwidth it uses and the servers it overloads.
Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.
It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". Then from there, it is easier to come to a solution than just "I hate AI". For example, one would realize that things like Anubis have existed forever, they are just called DDoS protection, specifically those using proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).
This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.
udev4096
Blame the "AI" companies for that. I am glad the small web is pushing hard against these scrapers, with the rise of Anubis as a starting point
lelanthran
> Blame the "AI" companies for that. I am glad the small web is pushing hard towards these scrapers, with the rise of Anubis as a starting point
Did you mean "against"?
udev4096
Corrected, thanks
0manrho
> So what are potential solutions?
It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.
Or at least better regulate it while also levying punitive damages that are significant enough to both dissuade bad actors and encourage entities to view data breaches (or the potential therein) and "leakage[0]" as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (eg, how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes that are incentivized by the ad-economy? etc).
Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data-privacy or personal-ownership of their data, if they ever even cared in the first place. So, in the absence of willingness to do something about the incentive for this malign behavior, we're left with few good options.
0: https://news.ycombinator.com/item?id=43716704 (see comments on all the various ways people's data is being leaked/leached/tracked/etc)
jeroenhd
The best solution I've seen is to hit everyone with a proof of work wall and whitelist the scrapers that are welcome (search engines and such).
Running SHA hash calculations for a second or so once every week is not bad for users, but with scrapers constantly starting new sessions, they end up spending most of their time running useless JavaScript, slowing them down significantly.
The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.
eastbound
But people don’t interact with your website anymore; they ask an AI. So the AI crawler is a real user.
I say we ask Google Analytics to count an AI crawler as a real view. Let’s see who’s most popular.
Quarrel
FWIW, Trend Micro wrote up a decent piece on this space in 2023.
It is still a pretty good lay-of-the-land.
https://www.trendmicro.com/vinfo/us/security/news/vulnerabil...
zahlman
> I am now of the opinion that every form of web-scraping should be considered abusive behaviour and web servers should block all of them. If you think your web-scraping is acceptable behaviour, you can thank these shady companies and the “AI” hype for moving you to the bad corner.
I imagine that e.g. Youtube would be happy to agree with this. Not that it would turn them against AI generally.
Centigonal
yeah, but you can't, that's the problem. Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures).
On a separate note, I believe open web scraping has been a massive benefit to the internet on net, and almost entirely positive pre-2021. Web scraping & crawling enables search engines, services like Internet Archive, walled-garden-busting (like Invidious, yt-dlp, and Nitter), mashups (Spotube, IFTTT, and Plaid would have been impossible to bootstrap without web scraping), and all kinds of interesting data science projects (e.g. scraping COVID-19 stats from local health departments to patch together a picture of viral spread for epidemiologists).
udev4096
We should have a way to verify the user agents of valid and useful scrapers such as the Internet Archive. Some kind of cryptographic signature of their user agents, which any reverse proxy could validate, seems like a good start.
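A sketch of that idea using Ed25519 signatures via the `cryptography` package; the header layout, timestamp binding, and key handling are assumptions, not an existing standard:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The scraper operator would generate this once and publish the public key.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

user_agent = b"archive.org_bot/1.0"
date_header = b"Sat, 19 Apr 2025 12:00:00 GMT"  # bind the signature to a timestamp
signature = private_key.sign(user_agent + b"\n" + date_header)

def reverse_proxy_accepts(ua: bytes, date: bytes, sig: bytes) -> bool:
    # What a reverse proxy would do with the published public key.
    try:
        public_key.verify(sig, ua + b"\n" + date)
        return True
    except InvalidSignature:
        return False

print(reverse_proxy_accepts(user_agent, date_header, signature))  # True
```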
nottorp
Self signed, I hope.
Or do you want a central authority that decides who can do new search engines?
lelanthran
> Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures)
Why are Anubis-type mitigations a half-measure?
Centigonal
Anubis, go-away, etc are great, don't get me wrong -- but what Anubis does is impose a cost on every query. The website operator is hoping that the compute will have a rate-limiting effect on scrapers while minimally impacting the user experience. It's almost like chemotherapy, in that you're poisoning everyone in the hope that the aggressive bad actors will be more severely affected than the less aggressive good actors. Even the Anubis readme calls it a nuclear option. In practice it appears to work pretty well, which is great!
It's a half-measure because:
1. You're slowing down scrapers, not blocking them. They will still scrape your site content in violation of robots.txt.
2. Scrapers with more compute than IP proxies will not be significantly bottlenecked by this.
3. This may lead to an arms race where AI companies respond by beefing up their scraping infrastructure, necessitating more difficult PoW challenges, and so on. The end result of this hypothetical would be a more inconvenient and inefficient internet for everyone, including human users.
To be clear: I think Anubis is a great tool for website operators, and one of the best self-hostable options available today. However, it's a workaround for the core problem that we can't reliably distinguish traffic from badly behaving AI scrapers from legitimate user traffic.
BlueTemplar
Yeah, also this means the death of archival efforts like the Internet Archive.
jeroenhd
Welcome scrapers (IA, maybe Google and Bing) can publish their IP addresses and get whitelisted. Websites that want to prevent being on the Internet Archive can pretty much just ask for their website to be excluded (even retroactively).
[Cloudflare](https://developers.cloudflare.com/cache/troubleshooting/alwa...) tags the internet archive as operating from 207.241.224.0/20 and 208.70.24.0/21 so disabling the bot-prevention framework on connections from there should be enough.
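For example, a whitelist check against the two ranges quoted above needs only the standard library:

```python
import ipaddress

# The two Internet Archive ranges quoted above.
ARCHIVE_RANGES = [
    ipaddress.ip_network("207.241.224.0/20"),
    ipaddress.ip_network("208.70.24.0/21"),
]

def is_internet_archive(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ARCHIVE_RANGES)

print(is_internet_archive("207.241.230.5"))   # True, inside the /20
print(is_internet_archive("198.51.100.7"))    # False, unrelated address
```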
areyourllySorry
a large chunk of internet archive's snapshots are from archiveteam, where "warriors" bring their own ips (and they crawl respectfully!). save page now is important too, but you don't realise what is useful until you lose it.
realusername
That's basically asking to close the market in favor of the current actors.
New actors have the right to emerge.
trinsic2
This sounds like it would be a good idea. Create a whitelist of IPs and block the rest.
aucisson_masque
It's interesting but so far there is no definitive proof it's happening.
People are jumping to conclusions a bit fast over here. Yes, technically it's possible, but this kind of behavior would be relatively easy to spot because the app would have to make direct connections to the website it wants to scrape.
Your calculator app for instance connecting to CNN.com ...
iOS has an App Privacy Report where one can check what connections are made by an app, how often, the last one, etc.
Android by Google doesn't have such a useful feature of course, but you can run a third-party firewall like PCAPdroid, which I highly recommend.
macOS (Little Snitch).
Windows (Fort Firewall).
Not everyone runs these apps, obviously, only the most nerdy like myself, but we're also the kind of people who would report an app using our devices to build what is, in fact, a zombie or bot network.
I'm not saying it's necessarily false but imo it remains a theory until proven otherwise.
andelink
This is a hilariously optimistic, naive, disconnected-from-reality take. What sort of "proof" would be sufficient for you? TFA includes of course data from the author's own server logs^, but it also references real SDKs and businesses selling this exact product. You can view the pricing page yourself, right next to stats on how many IPs are available for you to exploit. What else do you need to see?
^ edit: my mistake, the server logs I mentioned were from the author's prior blog post on this topic, linked to at the top of TFA: https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
jshier
> iOS has an App Privacy Report where one can check what connections are made by an app, how often, the last one, etc.
Privacy reports do not include that information. They include broad areas of information the app claims to gather. There is zero connection between those claimed areas and what the app actually does unless app review notices something that doesn't match up. But none of that information is updated dynamically, and it has never actually included the domains the app connects to. You may be confusing it with the old domain declarations for less secure HTTP connections. Once the connections met the system standards you no longer needed to declare it.
zargon
I wasn't aware of this feature. But apparently it does include that information. I just enabled it and can see the domains that apps connect to. https://support.apple.com/en-us/102188
hoc
Pretty neat, actually. Thanks for looking up that link.
Galanwe
There is already a lot of proof. Just ask for a sales pitch from companies selling these data and they will gladly explain everything to you.
Go to a data conference like Neudata and you will see. You can have scraped data from user devices, real-time locations, credit card, Google analytics, etc.
throwaway519
Given this is a thing even in browser plugins, and that so very few people analyse their firewalls, I'd not discount it at all. Much of the world's users have no clue, and app stores are notoriously bad at reacting even with publicised malware, e.g. 'free' VPNs in the iOS Store.
abaymado
> iOS has an App Privacy Report where one can check what connections are made by an app, how often, the last one, etc.
How often is the average calculator app user checking their Privacy Report? My guess: not many!
gruez
All it takes is one person to find out and raise the alarm. The average user doesn't read the source code behind openssl or whatever either, that doesn't mean there's no gains in open sourcing it.
nottorp
The real solution is to add a permission for network access, with the default set to deny.
dewey
The average user is also not reading these raised “alarms”. And if an app has a bad name, another one will show up with a different name on the same day.
CharlesW
Botnets as a Service are absolutely happening, but as you allude to, the scope of the abuse is very different on iOS than, say, Windows.
jeroenhd
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library into their apps that sells users network bandwidth
AKA "why do Cloudflare and Google make me fill out these CAPTCHAs all day"
I don't know why Play Protect/MS Defender/whatever Apple has for antivirus don't classify apps that embed such malware as such. It's ridiculous that this is allowed to go on when detection is so easy. I don't know a more obvious example of a trojan than an SDK library making a user's device part of a botnet.
dx4100
Cloudflare and Google use CAPTCHAs to sell web scrapers? I don't get your point. I was under the impression the data is used to train models.
aloha2436
The implication is that the users that are being constantly presented with CAPTCHAs are experiencing that because they are unwittingly proxying scrapers through their devices via malicious apps they've installed.
pentae
.. or that other people on their network/Shared public IP have installed
jeroenhd
When a random device on your network gets infected with crap like this, your network becomes a bot egress point, and anti bot networks respond appropriately. Cloudflare, Akamai, even Google will start showing CAPTCHAs for every website they protect when your network starts hitting random servers with scrapers or DDoS attacks.
This is even worse with CG-NAT if you don't have IPv6 to solve the CG-NAT problem.
I don't think the data they collect is used to train anything these days. Cloudflare is using AI generated images for CAPTCHAs and Google's actual CAPTCHAs are easier for bots than humans at this point (it's the passive monitoring that makes it still work a little bit).
cuu508
Trojans in your mobile apps ruin your IP's reputation which comes back to you in the form of frequent, annoying CAPTCHAs.
areyourllySorry
it's not technically malware, you agreed to it when you accepted the terms of service :^)
L-four
It's malware; it does something malicious.
Liftyee
I don't know if I should be surprised about what's described in this article, given the current state of the world. Certainly I didn't know about it before, and I agree with the article's conclusion.
Personally, I think the "network sharing" software bundled with apps should fall into the category of potentially unwanted applications along with adware and spyware. All of the above "tag along" with something the user DID want to install, and quietly misuse the user's resources. Proxies like this definitely have an impact for metered/slow connections - I'm tempted to start Wireshark'ing my devices now to look for suspicious activity.
There should be a public repository of apps known to have these shady behaviours. Having done some light web scraping for archival/automation before, it's a pity that it'll become collateral damage in the anti-AI-botfarm fight.
zzo38computer
I agree, this should be called spyware, and malware. There are many other kinds of software that also should be, but netcat and ncat (probably) aren't malware.
akoboldfrying
I agree, but the harm done to the users is only one part of the total harm. I think it's quite plausible that many users wouldn't mind some small amount of their bandwidth being used, if it meant being able to use a handy browser extension that they would otherwise have to pay actual dollars for -- but the harm done to those running the servers remains.
karmanGO
Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
lelanthran
> Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
I wouldn't mind reading a comprehensive report on SOTA with regard to bot-blocking.
Sure, there's Anubis (although someone elsethread called it a half-measure, and I'd like to know why), there are CAPTCHAs, there's relying on a monopoly (Cloudflare, etc.) who probably also wants to run their own bots at some point, but what else is there?
mzajc
In the case of Android, εxodus has one[1], though I couldn't find the malware library listed in TFA. Aurora Store[2], a FOSS Google Play Store client, also integrates it.
[1] https://reports.exodus-privacy.eu.org/en/trackers/ [2] https://f-droid.org/packages/com.aurora.store/
takluyver
That seems to be looking at tracking and data collection libraries, though, for things like advertising and crash reporting. I don't see any mention of the kind of 'network sharing' libraries that this article is about. Have I missed it?
arewethereyeta
No but here's the thing. Being in the industry for many years I know they are required to mention it in the TOS when using the SDKs. A crawler pulling app TOSs and parsing them could be a thing. List or not, it won't be too useful outside this tech community.
il-b
A good portion of free VPN apps sell their traffic. This was a thing even before the AI bot explosion.
__MatrixMan__
The broken thing about the web is that in order for data to remain readable, a unique sysadmin somewhere has to keep a server running in the face of an increasingly hostile environment.
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
areyourllySorry
there is no incentive for different companies to share data with each other, or with anyone really (facebook leeching books?)
Timwi
Are there any systems like that, even if experimental?
jevogel
IPFS
alakra
I had high hopes for IPFS, but even it has vectors for abuse.
See https://arxiv.org/abs/1905.11880 [Hydras and IPFS: A Decentralised Playground for Malware]
akoboldfrying
Assuming the right incentives can be found to prevent widespread leeching, a distributed content-addressed model indeed solves this problem, but introduces the problem of how to control your own content over time. How do you get rid of a piece of content? How do you modify the content at a given URL?
I know, as far as possible it's a good idea to have content-immutable URLs. But at some point, I need to make www.myexamplebusiness.com show new content. How would that work?
XorNot
Except no one wants content addressed data - because if you knew what it was you wanted, then you would already have stored it. The web as we know it is an index - it's a way to discover that data is available and specifically we usually want the latest data that's available.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
__MatrixMan__
Yes, for the reasons you describe, you can't be both a useful web-like protocol and also 100% immutable/hash-linked.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some javascript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving KB, not MB; we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing, they don't even have to be the ones who published the pair of hashes that your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
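A sketch of what such a signed "tuple of hashes" record could look like; the field names, encoding, and choice of Ed25519 are all assumptions, not an existing protocol:

```python
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

author_key = Ed25519PrivateKey.generate()  # the author's identity

def announce(old_content: bytes, new_content: bytes) -> dict:
    # "The thing you know by this hash is now associated with this other
    # hash", signed so peers can gossip the record without trusting each other.
    old_hash = hashlib.sha256(old_content).hexdigest()
    new_hash = hashlib.sha256(new_content).hexdigest()
    payload = f"{old_hash}->{new_hash}".encode()
    return {"old": old_hash, "new": new_hash, "sig": author_key.sign(payload).hex()}

record = announce(b"yesterday's page", b"today's page")
print(json.dumps(record, indent=2))  # a few hundred bytes, not megabytes
```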
XorNot
But we already have HEAD requests and etags.
It is entirely possible to serve a fully cached response that says "you already have this". The problem is...people don't implement this well.
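For example, a client that implements this well re-validates with the ETag it already holds instead of re-downloading the body (the URL is a placeholder):

```python
import requests

url = "https://example.org/page"  # placeholder

first = requests.get(url)
etag = first.headers.get("ETag")

# Re-validate instead of re-downloading: the server can answer 304 Not Modified.
second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
if second.status_code == 304:
    print("Not modified; reuse the cached copy")
else:
    print("Changed (or no ETag support); read the new body")
```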
akoboldfrying
> because if you knew what it was you wanted, then you would already have stored it.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
kastden
Are there any lists with known c&c servers for these services that can be added to Pihole/etc?
udev4096
You can use one of the lists from here: https://github.com/hagezi/dns-blocklists
Pesthuf
We need a list of apps that include these libraries, and every malware scanner - including Windows Defender, Play Protect and whatever Apple calls theirs - needs to put infected applications into quarantine immediately. Just because it's not directly causing damage to the device the malware is running on doesn't mean it's not malware.
philippta
Apps should be required to ask for permission to access specific domains. Similar to the tracking protection, Apple introduced a while ago.
Not sure how this could work for browsers, but the other 99% of apps I have on my phone should work fine with just a single permitted domain.
snackernews
My iPhone occasionally displays an interrupt screen to remind me that my weather app has been accessing my location in the background and to confirm continued access.
It should also do something similar for apps making chatty background requests to domains not specified at app review time. The legitimate use cases for that behaviour are few.
klabb3
On the one hand, yes this could work for many cases. On the other hand, good bye p2p. Not every app is a passive client-server request-response. One needs to be really careful with designing permission systems. Apple has already killed many markets before they had a chance to even exist, such as companion apps for watches and other peripherals.
nottorp
> On the other hand, good bye p2p.
You mean, good bye using my bandwidth without my permission? That's good. And if I install a bittorrent client on my phone, I'll know to give it permission.
> such as companion apps for watches and other peripherals
That's just apple abusing their market position in phones to push their watch. What does it have to do with p2p?
kmeisthax
P2P was practically dead on iPhone even back in 2010. The whole "don't burn the user's battery" thing precludes mobile phones doing anything with P2P other than leeching off of it. The only exceptions are things like AirDrop; i.e. locally peer-to-peer things that are only active when in use and don't try to form an overlay or mesh network that would require the phone to become a router.
And, AFAIK, you already need special permission for anything other than HTTPS to specific domains on the public Internet. That's why apps ping you about permissions to access "local devices".
Pesthuf
Maybe there could be a special entitlement that Apple's reviewers would only grant to applications that have a legitimate reason to require such connections. Then only applications granted that permission would be able to make requests to arbitrary domains / IP addresses.
That's how it works with other permissions most applications should not have access to, like accessing user locations. (And private entitlements third party applications can't have are one way Apple makes sure nobody can compete with their apps, but that's a separate issue.)
udev4096
Android is so fucking anti-privacy that they still don't have an INTERNET access revoke toggle. The one they have currently is broken and can easily be bypassed with google play services (another highly privileged process running for no reason other than to sell your soul to google). GrapheneOS has this toggle luckily. Whenever you install an app, you can revoke the INTERNET access at the install screen and there is no way that app can bypass it
mjmas
Asus added this to their phones which is nice.
vbezhenar
Do you suggest outright forbidding TCP connections for user software? Because you can compile OpenSSL or any other TLS library and make a TCP connection to port 443 which will be opaque to the operating system. They can do wild things like kernel-level DPI for outgoing connections to find out the host, but that quickly turns into a ridiculous competition.
internetter
> but that quickly turns into ridiculous competition.
Except the platform providers hold the trump card. Fuck around, if they figure it out you'll be finding out.
zzo38computer
I think capability based security with proxy capabilities is the way to do it, and this would make it possible for the proxy capability to intercept the request and ask permission, or to do whatever else you want it to do (e.g. redirections, log any accesses, automatically allow or disallow based on a file, use or ignore the DNS cache, etc).
The system may have some such functions built in, and asking permission might be a reasonable thing to include by default.
XorNot
Try actually using a system like this. OpenSnitch and LittleSnitch do it for Linux and MacOS respectively. Fedora has a pretty good interface for SELinux denials.
I've used all of them, and it's a deluge: it is too much information to reasonably react to.
Your broad choice is either deny or accept, but there's no sane way to reliably know which you should do.
This is not and cannot be an individual problem: the easy part is building high fidelity access control, the hard part is making useful policy for it.
tzury
The vast majority of revenue in the mobile app ecosystem is ads, which by design are pulled from 3rd parties (and are part of the broader problem discussed in this post).
I am waiting for Apple to enable /etc/hosts or something similar on iOS devices.
jay_kyburz
Oh, that's an interesting idea. A local DNS where I have to add every entry. A white list rather than Australia's national blacklist.
reconnecting
Residential IP proxies have some weaknesses. One is that they often change IP addresses during a single web session. Second, if the IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
We are working on an open‑source fraud prevention platform [1], and detecting fake users coming from residential proxies is one of its use cases.
andelink
The first blog post in this series[1], linked to at the top of TFA, offers an analysis of the potential of using ASNs to detect such traffic. Their conclusion was that ASNs are not helpful for this use case, showing that across the 50k IPs they've blocked, there are fewer than 4 IP addresses per ASN on average.
[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
reconnecting
What was done manually in the first blog post is exactly what tirreno helps to achieve by analyzing traffic; here is a live example [1]. Blocking an entire ASN should not be considered a strategy when real users are involved.
Regarding the first post, it's rare to see both datacenter network IPs and mobile proxy IP addresses used simultaneously. This suggests the involvement of more than one botnet. The main idea is to avoid using IP addresses as the sole risk factor. Instead, they should be considered as just one part of the broader picture of user behavior.
gruez
>One is that they ofter change IP addresses during a single web session. Second, if IP come from the same proxies provider, they are often concentrated within a sing ASN, making them easier to detect.
Both are pretty easy to mitigate with a geoip database and some smart routing. One "residential proxy" vendor even has session tokens so your source IP doesn't randomly jump between each request.
gbcfghhjj
At least here in the US most residential ISPs have long leases and change infrequently, weeks or months.
Trying to understand your product, where is it intended to sit in a network? Is it a standalone tool that you use to identify these IPs and feed into something else for blockage or is it intended to be integrated into your existing site or is it supposed to proxy all your web traffic? The reason I ask is it has fairly heavyweight install requirements and Apache and PHP are kind of old school at this point, especially for new projects and companies. It's not what they would commonly be using for their site.
reconnecting
Indeed, if it's a real user from a residential IP address, in most cases it will be the same network. However, if it's a proxy from residential IPs, there could be 10 requests from one network, the 11th request from a second network, and the 12th request back from the same network. This is a red flag.
Thank you for your question. tirreno is a standalone app that needs to receive API events from your main web application. It can work perfectly with 512GB Postgres RAM or even lower, however, in most cases we're talking about millions of events that request resources.
It's much easier to write a stable application without dependencies based on mature technologies. tirreno is fairly 'boring software'.
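A sketch of the network-flapping heuristic described above; the /24 grouping and the threshold are arbitrary choices for illustration:

```python
import ipaddress
from collections import defaultdict

session_ips: dict[str, list[str]] = defaultdict(list)

def record(session_id: str, ip: str) -> None:
    session_ips[session_id].append(ip)

def network_hops(session_id: str) -> int:
    # Count transitions where consecutive requests come from different /24s.
    nets = [ipaddress.ip_network(f"{ip}/24", strict=False)
            for ip in session_ips[session_id]]
    return sum(1 for a, b in zip(nets, nets[1:]) if a != b)

def looks_like_residential_proxy(session_id: str) -> bool:
    return network_hops(session_id) >= 3  # arbitrary threshold

for ip in ["203.0.113.10", "203.0.113.11", "198.51.100.5",
           "203.0.113.12", "192.0.2.44", "203.0.113.13"]:
    record("sess-1", ip)
print(looks_like_residential_proxy("sess-1"))  # True: four network changes
```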
sroussey
My phone will be on the home network until I walk out of the house and then it will change networks. This should not be a red flag.
aorth
In the last week I've had to deal with two large-scale influxes of traffic on one particular web server in our organization.
The first involved requests from 300,000 unique IPs in a span of a few hours. I analyzed them and found that ~250,000 were from Brazil. I'm used to using ASNs to block network ranges sending this kind of traffic, but in this case they were spread thinly over 6,000+ ASNs! I ended up blocking all of Brazil (sorry).
A few days later this same web server was on fire again. I performed the same analysis on IPs and found a similar number of unique addresses, but spread across Turkey, Russia, Argentina, Algeria and many more countries. What is going on?! Eventually I think I found a pattern to identify the requests, in that they were using ancient Chrome user agents. Chrome 40, 50, 60 and up to 90, all released 5 to 15 years ago. Then, just before I could implement a block based on these user agents, the traffic stopped.
In both cases the traffic from datacenter networks was limited because I already rate limit a few dozen of the larger ones.
Sysadmin life...
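For reference, a sketch of the user-agent heuristic described above; the version cutoff is a judgment call:

```python
import re

CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def is_ancient_chrome(user_agent: str, oldest_plausible: int = 100) -> bool:
    # Flag requests whose Chrome major version is implausibly old.
    m = CHROME_RE.search(user_agent)
    return bool(m) and int(m.group(1)) < oldest_plausible

ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36")
print(is_ancient_chrome(ua))  # True: Chrome 50 shipped roughly a decade ago
```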