The Web Is Broken – Botnet Part 2
300 comments · April 19, 2025
rollcat
Try Anubis: <https://anubis.techaro.lol>
It's a reverse proxy that presents a PoW challenge to every new visitor. It shifts the initial cost of accessing your server's resources back onto the client. Assuming your uplink can handle 300k clients requesting a single 70 kB web page, it should solve most of your problems.
For science, can you estimate your peak QPS?
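For illustration, a minimal sketch of the general proof-of-work idea behind tools like this, using a SHA-256 prefix-difficulty scheme (parameters and function names are made up; this is not Anubis's actual protocol):

```python
# Toy proof-of-work challenge in the spirit of Anubis: the server hands the
# client a random nonce and a difficulty; the client must find a counter such
# that sha256(nonce + counter) starts with `difficulty` zero hex digits.
# Illustrative sketch only, not Anubis's real scheme.
import hashlib
import os

def issue_challenge(difficulty: int = 5) -> dict:
    """Server side: generate a challenge to embed in the interstitial page."""
    return {"nonce": os.urandom(16).hex(), "difficulty": difficulty}

def solve_challenge(challenge: dict) -> int:
    """Client side: brute-force a counter (this is where the client pays)."""
    target = "0" * challenge["difficulty"]
    counter = 0
    while True:
        digest = hashlib.sha256(f"{challenge['nonce']}{counter}".encode()).hexdigest()
        if digest.startswith(target):
            return counter
        counter += 1

def verify_solution(challenge: dict, counter: int) -> bool:
    """Server side: a single hash, so verification is effectively free."""
    digest = hashlib.sha256(f"{challenge['nonce']}{counter}".encode()).hexdigest()
    return digest.startswith("0" * challenge["difficulty"])
```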
marginalia_nu
Anubis is a good choice because it whitelists legitimate and well behaved crawlers based on IP + user-agent. Cloudflare works as well in that regard but then you're MITM:ing all your visitors.
Imustaskforhelp
Also, I was just watching a Brodie Robertson video about how the United Nations has this random UNESCO search page which actually uses Anubis.
Crazy how I remember the HN post where Anubis's blog post was first shared. Though I always thought the anime was a bit funny, and it was born out of frustration with (I think AWS?) AI scrapers that wouldn't follow general rules and kept hitting his git server until it actually went down, I guess. I didn't expect it to blow up to ... the UN.
xena
Her*
It was frustration at AWS' Alexa team and their abuse of the commons. Amusingly if they had replied to my email before I wrote my shitpost of an implementation this all could have turned out vastly differently.
martin82
This looks very cool, but isn't it just a matter of months until all scrapers get updated and can easily beat this challenge and are able to compute modern JS stuff?
nodogoto
My company's site has also been getting hammered by Brazilian IPs. They're focused on a single filterable table of fewer than 100 rows, querying it with various filter combinations every second of every minute of every day.
luckylion
I've seen a few attacks where the operators placed malicious code on high-traffic sites (e.g. some government thing, larger newspapers), and then just let browsers load your site as an img. Did you see images, CSS, JS being loaded from these IPs? If they were loading your site as an image, the browsers wouldn't parse the HTML or load other resources.
It's a pretty effective attack because you get large numbers of individual browsers to contribute. Hosters don't care, so unless the site owners are technical enough to notice, the malicious code can stay online for quite a while.
If they work with Referrer Policy, they should be able to mask themselves fairly well - the ones I saw back then did not.
ninkendo
I seem to remember a thing china did 10 years back where they injected JavaScript into every web request that went through their Great Firewall to target GitHub… I think it’s known as the “Great Cannon” because they can basically make every Chinese internet user’s browser hit your website in a DoS attack.
Digging it up: https://www.washingtonpost.com/news/the-switch/wp/2015/04/10...
luckylion
Wow, that had passed me by completely, thanks for sharing!
Very similar indeed. The attacks I witnessed were easy to block once you identified the patterns (the referrer was visible and they used predictable ?_=... query parameters to try to bypass caches), but very effective otherwise.
I suppose in the event of a hot war, the Internet will be cut quickly to defend against things like the "Great Cannon".
hubraumhugo
We all agree that AI crawlers are a big issue as they don't respect any established best practices, but we rarely talk about the path forward. Scraping has been around for as long as the internet, and it was mostly fine. There are many very legitimate use cases for browser automation and data extraction (I work in this space).
So what are potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How can we enable beneficial automation while protecting against abusive AI crawlers?
marginalia_nu
Proof-of-work works in terms of preventing large-scale automation.
As for letting well-behaved crawlers in, I've had an idea for something like DKIM for crawlers. It should be possible to set up a fairly cheap cryptographic solution that gives crawlers a persistent identity that can't be forged.
Basically, put a header containing first a string including today's date, the crawler's IP, and a domain name, then a cryptographic signature of that string. The domain has a TXT record with a public key for verifying the identity. It's cheap because you really only need to verify the string once on the server side, and the crawler only needs to regenerate it once per day.
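A rough sketch of what generating and verifying such a header could look like (Ed25519, the field order, and the function names are illustrative assumptions, not an existing standard):

```python
# Illustrative sketch of a DKIM-style crawler identity header: the crawler
# signs "date;ip;domain" once per day, the server checks the signature against
# a public key it fetched from a TXT record on `domain`.
import base64
import datetime
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def make_crawler_header(key: Ed25519PrivateKey, crawler_ip: str, domain: str) -> str:
    """Crawler side: regenerate once per day, attach to every request."""
    claim = f"{datetime.date.today().isoformat()};{crawler_ip};{domain}"
    sig = base64.b64encode(key.sign(claim.encode())).decode()
    return f"{claim};{sig}"

def verify_crawler_header(header: str, pubkey_b64: str, observed_ip: str) -> bool:
    """Server side: pubkey_b64 would come from a TXT lookup on the claimed domain."""
    date_str, claimed_ip, domain, sig_b64 = header.split(";")
    if date_str != datetime.date.today().isoformat() or claimed_ip != observed_ip:
        return False
    pubkey = Ed25519PublicKey.from_public_bytes(base64.b64decode(pubkey_b64))
    try:
        pubkey.verify(base64.b64decode(sig_b64),
                      f"{date_str};{claimed_ip};{domain}".encode())
        return True
    except InvalidSignature:
        return False
```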
With that in place, crawlers can crawl with their reputation at stake. The big problem with these rogue scrapers is that they're basically impossible to identify or block, which means they don't have any incentive to behave well.
lesostep
> Proof-of-work works in terms of preventing large-scale automation.
It wouldn't work to prevent the type of behavior shown in the title story.
CaptainFever
My pet peeve is that using the term "AI crawler" for this conflates things unnecessarily. There are some people who are angry at it due to anti-AI bias and not wishing to share information, while others are more concerned about the large amount of bandwidth and server overloading.
Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.
It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". Then from there, it is easier to come to a solution than just "I hate AI". For example, one would realize that things like Anubis have existed forever; they are just called DDoS protection, specifically the kind using proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).
This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.
johnnyanmac
It's become unbearable in the "AI era". So it's appropriate to blame AI for it, in my eyes. Especially since so much of the defense is based around training LLMs.
It's just like how not all DDoSes are actually hackers or bots. Sometimes a server just can't take the traffic of a large site flooding in. But the result is the same until something is investigated.
queenkjuul
It's not a coincidence that this wasn't a major problem until everybody and their dog started trying to build the next great LLM.
udev4096
Blame the "AI" companies for that. I am glad the small web is pushing hard against these scrapers, with the rise of Anubis as a starting point
lelanthran
> Blame the "AI" companies for that. I am glad the small web is pushing hard towards these scrapers, with the rise of Anubis as a starting point
Did you mean "against"?
udev4096
Corrected, thanks
jeroenhd
The best solution I've seen is to hit everyone with a proof of work wall and whitelist the scrapers that are welcome (search engines and such).
Running SHA hash calculations for a second or so once every week is not bad for users, but with scrapers constantly starting new sessions, they end up spending most of their time running useless Javascript, slowing them down significantly.
The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.
ognarb
The issue is that it would require normal users to also do the same, which is suboptimal from a privacy point of view.
mjaseem
I wrote an article about a possible proof of personhood solution idea: https://mjaseem.github.io/tech/2025/04/12/proof-of-humanity.....
The broad idea is to use zero knowledge proofs with certification. It sort of flips the public key certification system and adds some privacy.
For this to get into place, the powers in charge need to be swayed.
0manrho
> So what are potential solutions?
It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.
Or at least better regulate it while also levying punitive damages that are significant enough to both dissuade bad actors and encourage entities to view data breaches (or the potential therein) and "leakage[0]" as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (e.g., how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes who are incentivized by the ad-economy? etc.).
Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data privacy or personal ownership of their data, if they ever even cared in the first place. So, in the absence of willingness to do something about the incentive for this malign behavior, we're left with few good options.
0: https://news.ycombinator.com/item?id=43716704 (see comments on all the various ways people's data is being leaked/leached/tracked/etc)
caelinsutch
CAPTCHAs are also quickly becoming irrelevant / not enough. Fingerprint-based approaches seem to be the only realistic way forward in the cat-and-mouse game.
CalRobert
I hate this but I suspect a login-only deanonymised web (made simple with chrome and WEI!) is the future. Firefox users can go to hell.
spookie
I'm still surprised by people everyday, after all these years. This is one of those times. Crazy how anyone would ever want a single point of identifying everything you do.
CalRobert
I don't want this - It's the exact opposite of what I want.
ArinaS
We won't.
CalRobert
To elaborate (if anyone sees this) I use Firefox on Linux. I don't LIKE this future! I just think it's where the web is headed.
zahlman
> I am now of the opinion that every form of web-scraping should be considered abusive behaviour and web servers should block all of them. If you think your web-scraping is acceptable behaviour, you can thank these shady companies and the “AI” hype for moving you to the bad corner.
I imagine that e.g. Youtube would be happy to agree with this. Not that it would turn them against AI generally.
Centigonal
yeah, but you can't, that's the problem. Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures).
On a separate note, I believe open web scraping has been a massive benefit to the internet on net, and almost entirely positive pre-2021. Web scraping & crawling enables search engines, services like Internet Archive, walled-garden-busting (like Invidious, yt-dlp, and Nitter), mashups (Spotube, IFTTT, and Plaid would have been impossible to bootstrap without web scraping), and all kinds of interesting data science projects (e.g. scraping COVID-19 stats from local health departments to patch together a picture of viral spread for epidemiologists).
udev4096
We should have a way to verify the user-agents of valid and useful scrapers such as the Internet Archive. Some kind of cryptographic signature of their user-agents that any reverse proxy can validate seems like a good start.
nottorp
Self signed, I hope.
Or do you want a central authority that decides who can do new search engines?
lelanthran
> Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures)
Why are Anubis-type mitigations a half-measure?
Centigonal
Anubis, go-away, etc are great, don't get me wrong -- but what Anubis does is impose a cost on every query. The website operator is hoping that the compute will have a rate-limiting effect on scrapers while minimally impacting the user experience. It's almost like chemotherapy, in that you're poisoning everyone in the hope that the aggressive bad actors will be more severely affected than the less aggressive good actors. Even the Anubis readme calls it a nuclear option. In practice it appears to work pretty well, which is great!
It's a half-measure because:
1. You're slowing down scrapers, not blocking them. They will still scrape your site content in violation of robots.txt.
2. Scrapers with more compute than IP proxies will not be significantly bottlenecked by this.
3. This may lead to an arms race where AI companies respond by beefing up their scraping infrastructure, necessitating more difficult PoW challenges, and so on. The end result of this hypothetical would be a more inconvenient and inefficient internet for everyone, including human users.
To be clear: I think Anubis is a great tool for website operators, and one of the best self-hostable options available today. However, it's a workaround for the core problem that we can't reliably distinguish traffic from badly behaving AI scrapers from legitimate user traffic.
BlueTemplar
Yeah, also this means the death of archival efforts like the Internet Archive.
jeroenhd
Welcome scrapers (IA, maybe Google and Bing) can publish their IP addresses and get whitelisted. Websites that want to prevent being on the Internet Archive can pretty much just ask for their website to be excluded (even retroactively).
[Cloudflare](https://developers.cloudflare.com/cache/troubleshooting/alwa...) tags the internet archive as operating from 207.241.224.0/20 and 208.70.24.0/21 so disabling the bot-prevention framework on connections from there should be enough.
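A minimal sketch of that kind of allowlist check, using the two ranges above (everything else in the snippet is illustrative):

```python
# Skip bot mitigation for requests coming from the Internet Archive's
# published ranges; all other traffic still goes through the challenge.
# The surrounding decision function is an illustrative sketch.
import ipaddress

ARCHIVE_RANGES = [
    ipaddress.ip_network("207.241.224.0/20"),
    ipaddress.ip_network("208.70.24.0/21"),
]

def is_welcome_scraper(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ARCHIVE_RANGES)

def needs_challenge(client_ip: str) -> bool:
    return not is_welcome_scraper(client_ip)
```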
realusername
That's basically asking to close the market in favor of the current actors.
New actors have the right to emerge.
areyourllySorry
a large chunk of internet archive's snapshots are from archiveteam, where "warriors" bring their own ips (and they crawl respectfully!). save page now is important too, but you don't realise what is useful until you lose it.
trinsic2
This sounds like it would be a good idea. Create a whitelist of IPs and block the rest.
Quarrel
FWIW, Trend Micro wrote up a decent piece on this space in 2023.
It is still a pretty good lay-of-the-land.
https://www.trendmicro.com/vinfo/us/security/news/vulnerabil...
aucisson_masque
It's interesting but so far there is no definitive proof it's happening.
People are jumping to conclusions a bit fast over here. Yes, technically it's possible, but this kind of behavior would be relatively easy to spot because the app would have to make direct connections to the website it wants to scrape.
Your calculator app for instance connecting to CNN.com ...
iOS has an app privacy report where one can check what connections are made by an app, how often, the last one, etc.
Android by Google doesn't have such a useful feature of course, but you can run a third-party firewall like PCAPdroid, which I highly recommend.
macOS has Little Snitch.
Windows has Fort Firewall.
Not everyone runs these apps obviously, only the most nerdy like myself, but we're also the kind of people who would report an app using our device to make what is, in fact, a zombie or bot network.
I'm not saying it's necessarily false but imo it remains a theory until proven otherwise.
jshier
> iOS has an app privacy report where one can check what connections are made by an app, how often, the last one, etc.
Privacy reports do not include that information. They include broad areas of information the app claims to gather. There is zero connection between those claimed areas and what the app actually does unless app review notices something that doesn't match up. But none of that information is updated dynamically, and it has never actually included the domains the app connects to. You may be confusing it with the old domain declarations for less secure HTTP connections. Once the connections met the system standards you no longer needed to declare it.
zargon
I wasn't aware of this feature. But apparently it does include that information. I just enabled it and can see the domains that apps connect to. https://support.apple.com/en-us/102188
hoc
Pretty neat, actually. Thanks for looking up that link.
Galanwe
There is already a lot of proof. Just ask for a sales pitch from companies selling these data and they will gladly explain everything to you.
Go to a data conference like Neudata and you will see. You can have scraped data from user devices, real-time locations, credit card, Google analytics, etc.
throwaway519
Given this is a thing even in browser plugins, and that so very few people analyse their firewalls, I'd not discount it at all. Much of the world's users have no clue, and app stores are notoriously bad at reacting even to publicised malware, e.g. 'free' VPNs in the iOS Store.
abaymado
> iOS has an app privacy report where one can check what connections are made by an app, how often, the last one, etc.
How often is the average calculator app user checking their Privacy Report? My guess: not many!
gruez
All it takes is one person to find out and raise the alarm. The average user doesn't read the source code behind openssl or whatever either, that doesn't mean there's no gains in open sourcing it.
dewey
The average user is also not reading these raised “alarms”. And if an app has a bad name, another one will show up with a different name on the same day.
nottorp
The real solution is to add a permission for network access, with the default set to deny.
CharlesW
Botnets as a Service are absolutely happening, but as you allude to, the scope of the abuse is very different on iOS than, say, Windows.
andelink
This is a hilariously optimistic, naive, disconnected-from-reality take. What sort of "proof" would be sufficient for you? TFA of course includes data from the author's own server logs^, but it also references real SDKs and businesses selling this exact product. You can view the pricing page yourself, right next to stats on how many IPs are available for you to exploit. What else do you need to see?
^ edit: my mistake, the server logs I mentioned were from the author's prior blog post on this topic, linked at the top of TFA: https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
jeroenhd
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library into their apps that sells users network bandwidth
AKA "why do Cloudflare and Google make me fill out these CAPTCHAs all day"
I don't know why Play Protect/MS Defender/whatever Apple has for antivirus don't classify apps that embed such malware as such. It's ridiculous that this is allowed to go on when detection is so easy. I don't know a more obvious example of a trojan than an SDK library making a user's device part of a botnet.
dx4100
Cloudflare and Google use CAPTCHAs to sell web scrapers? I don't get your point. I was under the impression the data is used to train models.
aloha2436
The implication is that the users that are being constantly presented with CAPTCHAs are experiencing that because they are unwittingly proxying scrapers through their devices via malicious apps they've installed.
pentae
.. or that other people on their network/Shared public IP have installed
jeroenhd
When a random device on your network gets infected with crap like this, your network becomes a bot egress point, and anti bot networks respond appropriately. Cloudflare, Akamai, even Google will start showing CAPTCHAs for every website they protect when your network starts hitting random servers with scrapers or DDoS attacks.
This is even worse with CG-NAT if you don't have IPv6 to solve the CG-NAT problem.
I don't think the data they collect is used to train anything these days. Cloudflare is using AI generated images for CAPTCHAs and Google's actual CAPTCHAs are easier for bots than humans at this point (it's the passive monitoring that makes it still work a little bit).
cuu508
Trojans in your mobile apps ruin your IP's reputation which comes back to you in the form of frequent, annoying CAPTCHAs.
areyourllySorry
it's not technically malware, you agreed to it when you accepted the terms of service :^)
L-four
It's malware; it does something malicious.
Liftyee
I don't know if I should be surprised about what's described in this article, given the current state of the world. Certainly I didn't know about it before, and I agree with the article's conclusion.
Personally, I think the "network sharing" software bundled with apps should fall into the category of potentially unwanted applications along with adware and spyware. All of the above "tag along" with something the user DID want to install, and quietly misuse the user's resources. Proxies like this definitely have an impact for metered/slow connections - I'm tempted to start Wireshark'ing my devices now to look for suspicious activity.
There should be a public repository of apps known to have these shady behaviours. Having done some light web scraping for archival/automation before, it's a pity that it'll become collateral damage in the anti-AI-botfarm fight.
akoboldfrying
I agree, but the harm done to the users is only one part of the total harm. I think it's quite plausible that many users wouldn't mind some small amount of their bandwidth being used, if it meant being able to use a handy browser extension that they would otherwise have to pay actual dollars for -- but the harm done to those running the servers remains.
zzo38computer
I agree, this should be called spyware, and malware. There are many other kinds of software that also should be, but netcat and ncat (probably) aren't malware.
karmanGO
Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
mzajc
In the case of Android, εxodus has one[1], though I couldn't find the malware library listed in TFA. Aurora Store[2], a FOSS Google Play Store client, also integrates it.
[1] https://reports.exodus-privacy.eu.org/en/trackers/ [2] https://f-droid.org/packages/com.aurora.store/
takluyver
That seems to be looking at tracking and data collection libraries, though, for things like advertising and crash reporting. I don't see any mention of the kind of 'network sharing' libraries that this article is about. Have I missed it?
arewethereyeta
No but here's the thing. Being in the industry for many years I know they are required to mention it in the TOS when using the SDKs. A crawler pulling app TOSs and parsing them could be a thing. List or not, it won't be too useful outside this tech community.
lelanthran
> Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
I wouldn't mind reading a comprehensive report on SOTA with regard to bot-blocking.
Sure, there's Anubis (although someone elsethread called it a half-measure, and I'd like to know why), there are CAPTCHAs, there's relying on a monopoly (Cloudflare, etc.) who probably also wants to run their own bots at some point, but what else is there?
il-b
A good portion of free VPN apps sell their traffic. This was a thing even before the AI bot explosion.
api
This is nasty in other ways too. What happens when someone uses these B2P residential proxies to commit crimes that get traced back to you?
Anything incorporating anything like this is malware.
reconnecting
Many years ago cybercriminals used to hack computers to use them as residential proxies, now they purchase them online as a service.
In most cases they are used for conducting real financial crimes, but the police investigators are also aware that there is a very low chance that sophisticated fraud is committed directly from a residential IP address.
kastden
Are there any lists with known c&c servers for these services that can be added to Pihole/etc?
udev4096
You can use one of the list from here: https://github.com/hagezi/dns-blocklists
__MatrixMan__
The broken thing about the web is that in order for data to remain readable, a unique sysadmin somewhere has to keep a server running in the face of an increasingly hostile environment.
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
akoboldfrying
Assuming the right incentives can be found to prevent widespread leeching, a distributed content-addressed model indeed solves this problem, but introduces the problem of how to control your own content over time. How do you get rid of a piece of content? How do you modify the content at a given URL?
I know, as far as possible it's a good idea to have content-immutable URLs. But at some point, I need to make www.myexamplebusiness.com show new content. How would that work?
__MatrixMan__
As for how to get rid of a piece of content... I think that one's a lost cause. If the goal is to prevent things that make content unavailable (e.g. AI scrapers), then you end up with a design that prevents things that make content unavailable (e.g. legitimate deletions). The whole point is that you're not the only one participating in propagating the content, and that comes with trade-offs.
But as for updating, you just format your URLs like so: {my-public-key}/foo/bar
And then you alter the protocol so that the {my-public-key} part resolves to the merkle-root of whatever you most recently published. So people who are interested in your latest content end up with a whole new set of hashes whenever you make an update. In this way, it's not 100% immutable, but the mutable payload stays small (it's just a bunch of hashes) and since it can be verified (presumably there's a signature somewhere) it can be gossiped around and remain available even if your device is not.
You can soft-delete something just by updating whatever pointed to it to not point to it anymore. Eventually most nodes will forget it. But you can't really prevent a node from hanging on to an old copy if they want to. But then again, could you ever do that? Deleting something on the web has always been a bit of a fiction.
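A toy sketch of the signed-pointer-over-immutable-content idea described above (Ed25519 keys and an in-memory dict standing in for the gossip layer are assumptions):

```python
# Toy model: content blobs are addressed by their SHA-256 hash; the only
# mutable piece is a small signed pointer from the publisher's public key to
# the latest set of root hashes. The whole layout is illustrative.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

blobs = {}     # content hash -> immutable bytes
pointers = {}  # publisher public key (hex) -> signed "latest roots" record

def put_blob(data: bytes) -> str:
    """Immutable side: anyone can store and serve this; the hash is the address."""
    digest = hashlib.sha256(data).hexdigest()
    blobs[digest] = data
    return digest

def publish(key: Ed25519PrivateKey, root_hashes: list[str]) -> None:
    """Mutable side: a few kB of hashes plus a signature, gossiped around."""
    payload = json.dumps(sorted(root_hashes)).encode()
    pub_hex = key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw).hex()
    pointers[pub_hex] = {"roots": sorted(root_hashes), "sig": key.sign(payload).hex()}

def resolve(pub_hex: str) -> list[str]:
    """Reader side: {my-public-key}/... resolves to the latest signed roots."""
    return pointers.get(pub_hex, {}).get("roots", [])
```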
akoboldfrying
> But then again, could you ever do that?
True in the absolute sense, but the effect size is much worse under the kind of content-addressable model you're proposing. Currently, if I download something from you and you later delete that thing, I can still keep my downloaded copy; under your model, if anyone ever downloads that thing from you and you later delete that thing, with high probability I can still acquire it at any later point.
As you say, this is by design, and there are cases where this design makes sense. I think it mostly doesn't for what we currently use the web for.
XorNot
Except no one wants content addressed data - because if you knew what it was you wanted, then you would already have stored it. The web as we know it is an index - it's a way to discover that data is available and specifically we usually want the latest data that's available.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
__MatrixMan__
Yes, for the reasons you describe, you can't be both a useful web-like protocol and also 100% immutable/hash-linked.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some javascript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving kB not MB, we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing, they don't even have to be the ones who published the pair of hashes that your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
XorNot
But we already have HEAD requests and etags.
It is entirely possible to serve a fully cached response that says "you already have this". The problem is...people don't implement this well.
akoboldfrying
> because if you knew what it was you wanted, then you would already have stored it.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
__MatrixMan__
But when you do a "git pull" you're not pulling from someplace identified by a hash, but rather a hostname. The learning-about-new-hashes part has to be handled differently.
It's a legit limitation on what content addressing can do, but it's one we can overcome by just not having everything be content addressed. The web we have now is like if you did a `git pull` every time you opened a file.
The web I'm proposing is like how we actually use git--periodically pulling new hashes as a separate action, but spending most of our time browsing content that we already have hashes for.
Timwi
Are there any systems like that, even if experimental?
jevogel
IPFS
alakra
I had high hopes for IPFS, but even it has vectors for abuse.
See https://arxiv.org/abs/1905.11880 [Hydras and IPFS: A Decentralised Playground for Malware]
areyourllySorry
there is no incentive for different companies to share data with each other, or with anyone really (facebook leeching books?)
__MatrixMan__
I figure we'd create that incentive by configuring our devices to only talk to devices controlled by people we trust. If they want the data at all, they have to gain our trust, and if they want that, they have to seed the data. Or you know, whatever else the agreement ends up being. Maybe we make them pay us.
reconnecting
Residential IP proxies have some weaknesses. One is that they often change IP addresses during a single web session. Second, if IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
We are working on an open‑source fraud prevention platform [1], and detecting fake users coming from residential proxies is one of its use cases.
andelink
The first blog post in this series[1], linked at the top of TFA, offers an analysis of the potential of using ASNs to detect such traffic. Their conclusion was that ASNs are not helpful for this use case, showing that across the 50k IPs they've blocked, there are fewer than 4 IP addresses per ASN on average.
[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
reconnecting
What was done manually in the first blog post is exactly what tirreno helps to achieve by analyzing traffic; here is a live example [1]. Blocking an entire ASN should not be considered a strategy when real users are involved.
Regarding the first post, it's rare to see both datacenter network IPs and mobile proxy IP addresses used simultaneously. This suggests the involvement of more than one botnet. The main idea is to avoid using IP addresses as the sole risk factor. Instead, they should be considered as just one part of the broader picture of user behavior.
gruez
>One is that they often change IP addresses during a single web session. Second, if IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
Both are pretty easy to mitigate with a geoip database and some smart routing. One "residential proxy" vendor even has session tokens so your source IP doesn't randomly jump between each request.
reconnecting
And this is the exact reason why IP addresses cannot be considered as the one and only signal for fraud prevention.
gbcfghhjj
At least here in the US most residential ISPs have long leases and change infrequently, weeks or months.
Trying to understand your product, where is it intended to sit in a network? Is it a standalone tool that you use to identify these IPs and feed into something else for blockage or is it intended to be integrated into your existing site or is it supposed to proxy all your web traffic? The reason I ask is it has fairly heavyweight install requirements and Apache and PHP are kind of old school at this point, especially for new projects and companies. It's not what they would commonly be using for their site.
reconnecting
Indeed, if it's a real user from a residential IP address, in most cases it will be the same network. However, if it's a proxy of residential IPs, there could be 10 requests from one network, the 11th request from a second network, and the 12th back from the first network. This is a red flag.
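A simplified sketch of that red-flag heuristic (grouping by /24 prefix is an illustrative assumption, not necessarily how tirreno does it; a real system would map IPs to ASNs and weigh other signals):

```python
# Flag sessions whose requests bounce between networks mid-session, which is
# typical of residential-proxy rotation. Grouping IPv4 addresses by /24 prefix
# here is a simplification for the sketch.
import ipaddress

def network_of(ip: str) -> str:
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

def is_suspicious_session(request_ips: list[str], max_switches: int = 1) -> bool:
    switches = 0
    for prev, curr in zip(request_ips, request_ips[1:]):
        if network_of(prev) != network_of(curr):
            switches += 1
    return switches > max_switches

# Example: 10 requests from one network, one from another, then back again.
ips = ["203.0.113.10"] * 10 + ["198.51.100.7", "203.0.113.24"]
print(is_suspicious_session(ips))  # True: two network switches in one session
```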
Thank you for your question. tirreno is a standalone app that needs to receive API events from your main web application. It can work perfectly with 512GB Postgres RAM or even lower, however, in most cases we're talking about millions of events that request resources.
It's much easier to write a stable application without dependencies based on mature technologies. tirreno is fairly 'boring software'.
sroussey
My phone will be on the home network until I walk out of the house and then it will change networks. This should not be a red flag.
aorth
In the last week I've had to deal with two large-scale influxes of traffic on one particular web server in our organization.
The first involved requests from 300,000 unique IPs in a span of a few hours. I analyzed them and found that ~250,000 were from Brazil. I'm used to using ASNs to block network ranges sending this kind of traffic, but in this case they were spread thinly over 6,000+ ASNs! I ended up blocking all of Brazil (sorry).
A few days later this same web server was on fire again. I performed the same analysis on IPs and found a similar number of unique addresses, but spread across Turkey, Russia, Argentina, Algeria and many more countries. What is going on?! Eventually I think I found a pattern to identify the requests, in that they were using ancient Chrome user agents. Chrome 40, 50, 60 and up to 90, all released 5 to 15 years ago. Then, just before I could implement a block based on these user agents, the traffic stopped.
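A rough sketch of that kind of user-agent cutoff (the regex and the version threshold are illustrative):

```python
# Flag requests whose Chrome major version is implausibly old (the traffic
# described above claimed Chrome 40-90). Regex and cutoff are examples only.
import re

CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def is_ancient_chrome(user_agent: str, min_major: int = 100) -> bool:
    match = CHROME_RE.search(user_agent)
    return bool(match) and int(match.group(1)) < min_major

ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/60.0.3112.90 Safari/537.36")
print(is_ancient_chrome(ua))  # True: Chrome 60 is long past end of life
```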
In both cases the traffic from datacenter networks was limited because I already rate limit a few dozen of the larger ones.
Sysadmin life...