Ban me at the IP level if you don't like me
257 comments
August 25, 2025 · organicbits
navane
That's like deterring burglars by hiding your doorbell
lwansbrough
We solved a lot of our problems by blocking all Chinese ASNs. Admittedly, not the friendliest solution, but there were so many issues originating from Chinese clients that it was easier to just ban the entire country.
It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.
sugarpimpdorsey
There's some weird ones you'd never think of that originate an inordinate amount of bad traffic. Like Seychelles. A tiny little island nation in the middle of the ocean inhabited by... bots apparently? Cyprus is another one.
Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent and the ASNs stretched well beyond PRC borders.
grandinj
There is a Chinese player that has taken effective control of various internet-related entities in the Seychelles. Various court cases are currently ongoing.
So the Seychelles traffic is likely really disguised Chinese traffic.
supriyo-biswas
I don't think these are "Chinese players"; this seems linked to [1], although it may be that hands have changed so many times that the IP addresses ended up leased or bought by Chinese entities.
[1] https://mybroadband.co.za/news/internet/350973-man-connected...
galaxy_gas
This is all from Cloud Innovation (the CN-linked Seychelles IP holder): VPNs, proxies, spam, bots.
sylware
I forgot about that: all the nice game binaries from them running directly on nearly all systems...
sylware
omg... that's why my self-hosted servers are getting nasty traffic from SC all the time.
The explanation is that easy??
lwansbrought
> So the seychelles traffic is likely really disguised chinese traffic.
Soon: chineseplayer.io
seanhunter
The Seychelles has a sweetheart tax deal with India such that a lot of corporations who have an India part and a non-India part will set up a Seychelles corp to funnel cash between the two entities. Through the magic of "Transfer Pricing"[1] they use this to reduce the amount of tax they need to pay.
It wouldn't surprise me if this is related somehow. Like maybe these are Indian corporations using a Seychelles offshore entity to do their scanning because then they can offset the costs against their tax or something. It may be that Cyprus has similar reasons. ISTR that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]
So Seychelles may be India-related bots and Cyprus Russia-related bots.
[1] https://taxjustice.net/faq/what-is-transfer-pricing/#:~:text...
[2] Yup. My memory originated in the "Panama Papers" leaks https://www.icij.org/investigations/cyprus-confidential/cypr...
sim7c00
It's not weird. It's companies putting themselves in places where regulations favor their business models.
It won't be all Chinese companies or people doing the scraping. It's well known that a lot of countries don't mind such traffic as long as it doesn't target themselves or, for the West, also some allies.
Laws aren't the same everywhere, so companies can get away with behavior in one place that would seem almost criminal in another.
And what better place to put your scrapers than somewhere with no copyright.
Russia used to be the same, but around 2012 they changed their laws and a lot of that traffic dropped off. Companies moved to small islands or small nation states (favoring them with their tax payouts; they don't mind if you bring them money) or to the few remaining places, like China, that don't care about copyright.
It's pretty hard to really get rid of such traffic. You can block stuff, but mostly that just changes the response your server gives; the flood is still knocking at the door.
I'd hope someday ISPs or similar get more creative, but maybe they don't have enough access, and it's hard to do this without the right (creepy) level of visibility into the traffic, or without accidentally censoring the whole thing.
johnisgood
Yeah, I am banning the whole of Singapore and China, for one.
I am getting down-voted for saying I ban the whole of Singapore and China? Oh lord... OK. Please, all the down-voters, list your public-facing websites. I do not care if people from China cannot access my website. They are not the target audience, and they are free to use VPNs if they so wish, or Tor, or whatever works for them; I have not banned those yet. This is for my OWN PERSONAL SHITTY WEBSITE, inb4 you want to moderate the fuck out of what I can and cannot do on my own server(s). Frankly, fuck off, or be a hero and die a martyr. :D
nkrisc
People get weird when you do what you want with your own things.
Want to block an entire country from your site? Sure, it’s your site. Is it fair? Doesn’t matter.
lmz
> be a hero and die a martyr
I believe it's "an hero".
pabs3
Which site is it?
sylware
ucloud ("based in HK") has been an issue (much less lately though), and I had to ban the whole digital ocean AS (US). google cloud, aws and microsoft have also some issues...
hostpapa in the US seems to become the new main issue (via what seems a 'ip colocation service'... yes, you read well).
adzicg
We solved a similar issue by blocking free-user traffic from data centres (and whitelisting crawlers for SEO). This eliminated most fraudulent usage over VPNs. Commercial users can still access, but free users just get a prompt to pay.
CloudFront is fairly good at marking whether someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate, and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligible level.
lxgr
Why stop there? Just block all non-US IPs!
If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
raffraffraff
And across the water, my wife has banned US IP addresses from her online shop once or twice. She runs a small business making products that don't travel well, and would cost a lot to ship to the US. It's a huge country with many people. Answering pointless queries, saying "No, I can't do that" in 50 different ways and eventually dealing with negative reviews from people you've never sold to and possibly even never talked to... Much easier to mass block. I call it network segmentation. She's also blocked all of Asia, Africa, Australia and half of Europe.
The blocks don't stay in place forever, just a few months.
silisili
Google Shopping might be to blame here, and I don't at all blame the response.
I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or another and treat me like a crazy person for asking.
lxgr
As long as your customer base never travels and needs support, sure, I guess.
The only way of communicating with such companies are chargebacks through my bank (which always at least has a phone number reachable from abroad), so I’d make sure to account for these.
silisili
I'm not precisely sure what point you're trying to make.
In my experience running rather low-traffic sites (thousands of hits a day), doing just that brought every single annoyance from thousands per day down to zero.
Yes, people -can- easily get around it via the various methods listed, but they don't seem to actually do that unless you're a high-value target.
lxgr
It definitely works, since you're externalizing your annoyance onto people you literally won't ever hear from, because you blanket-banned them by IP. Most of them will just think your site is broken.
motorest
> Why stop there? Just block all non-US IPs!
This is a perfectly good solution to many problems, if you are absolutely certain there is no conceivable way your service will be used from some regions.
> Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
Not a problem. Bad actors motivated enough to use VPNs or botnets are a different class of attack with different types of solutions. If you eliminate 95% of your problems with a single IP filter, then there is no good argument to make against it.
calgoo
This. If someone wants to target you, they will target you. What this does is remove the noise and 90%+ of crap.
Basically the same thing as changing the ssh port on a public facing server, reduce the automated crap attacks.
paulcole
> if you are absolutely certain there is no conceivable way your service will be used from some regions.
This isn’t the bar you need to clear.
It’s “if you’re comfortable with people in some regions not being able to use your service.”
runroader
Oddly, my bank has no problem with non-US IPs, but my city's municipal payments site blocks them. I always think it's broken for a moment before realizing I have my VPN turned on.
lwansbrough
Don't care, works fine for us.
yupyupyups
And that's perfectly fine. Nothing is completely bulletproof anyway. If you manage to get rid of 90% of the problem then that's a good thing.
mort96
The percentage of US trips abroad which are to China must be minuscule, and I bet nobody in the US regularly uses a VPN to get a Chinese IP address. So blocking Chinese IP addresses is probably going to have a small impact on US customers. Blocking all abroad IP addresses, on the other hand, would impact people who just travel abroad or use VPNs. Not sure what your point is or why you're comparing these two things.
mvdtnz
You think all streaming services have banned non US IPs? What world do you live in?
lxgr
This is based on personal experience. At least two did not let me unsubscribe from abroad in the past.
sylware
Won't help: I get scans and script-kiddy hack attempts from DigitalOcean, Microsoft cloud (Azure, stretchoid.com), Google Cloud, AWS, and lately "hostpapa" via its 'IP colocation service'. Of course it's an instant fail2ban (it is not that hard to perform a basic email delivery to an existing account...).
Traffic should be "privatized" as much as possible between IPv6 addresses (though you still have 'scanners' sweeping the whole internet all the time... "the nice guys scanning the whole internet for your protection", never to sell any scan data, of course).
Public-IP services are done for: it is going to be hell whatever you do.
The right answer seems to be significantly big 'security and availability teams' plus open and super-simple internet standards. Yep, the JavaScript internet has to go away, and the apps' private protocols have to as well. No more WHATWG-cartel web engines, or the worst of all: closed network protocols for "apps".
And most important: hardcore protocol simplicity that still does a good enough job. It is common sense, but the planned-obsolescence and kludgy-bloat lovers won't let you...
snickerbockers
Lmao I came here to post this. My personal server was making constant hdd grinding noises before I banned the entire nation of China. I only use this server for jellyfin and datahoarding so this was all just logs constantly rolling over from failed ssh auth attempts (PSA: always use public-key, don't allow root, and don't use really obvious usernames like "webadmin" or <literally just the domain>).
Xiol32
Changing the SSH port also helps cut down the noise, as part of a layered strategy.
dotancohen
Are you familiar with port knocking? My servers will only open port 22, or some other port, after two specific ports have been knocked on in order. It completely eliminates the log files getting clogged.
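For anyone curious what the client side of that looks like, here is a minimal sketch in Python; the knock ports, host, and delay are made-up placeholders, and real setups usually script this or use a knockd client rather than hand-rolled code:

    import socket
    import time

    KNOCK_PORTS = [7000, 8000]        # hypothetical secret knock sequence
    HOST = "server.example.com"       # placeholder host

    def knock(host, ports, delay=0.5):
        """Attempt a TCP connection to each knock port in order.

        The server's firewall watches the incoming SYNs and, if the
        sequence matches, opens the real SSH port for this source IP.
        The connection attempts themselves are expected to fail or
        time out, which is fine."""
        for port in ports:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(0.3)
            try:
                s.connect((host, port))
            except OSError:
                pass  # a refused/timed-out connect is expected; the SYN is what matters
            finally:
                s.close()
            time.sleep(delay)

    if __name__ == "__main__":
        knock(HOST, KNOCK_PORTS)
        print("Knock sent; ssh should be reachable for a short window.")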
azthecx
Did you really notice a significant drop off in connection attempts? I tried this some years ago and after a few hours on a random very high port number I was already seeing connections.
gessha
I have my jellyfin and obsidian couchdb sync on my Tailscale and they don’t see any public traffic.
johnisgood
Most of the traffic comes from China and Singapore, so I banned both. I might have to re-check and ban other regions who would never even visit my stupid website anyway. The ones who want to are free to, through VPN. I have not banned them yet.
thrown-0825
Block Russia too, that's where I see most of my bot traffic coming from.
debesyla
And usually hackers/malicious actors from that country are not afraid to attack anyone that is not russian, because their local law permits attacking targets in other countries.
(It sometimes comes to funny situations where malware doesn't enable itself on Windows machines if it detects that russian language keyboard is installed.)
imiric
Lately I've been thinking that the only viable long-term solution is allowlists instead of blocklists.
The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection are a losing strategy as bot behavior becomes more human-like.
Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.
I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.
This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.
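Not the commenter's actual implementation, but a minimal sketch of that gatekeeper idea, assuming an nftables set with timeouts; the set name, token store, listening port, and service URL are all placeholders:

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    VALID_TOKENS = {"example-short-lived-token"}   # in practice: issued out-of-band, with expiry
    SERVICE_URL = "https://service.example.com/"   # hypothetical protected service

    def allow_ip(ip: str) -> None:
        # Add the client IP to a pre-created nftables set that the real
        # service's firewall rule accepts from, e.g. created with:
        #   nft add set inet filter allowed '{ type ipv4_addr; flags timeout; }'
        # (requires root / CAP_NET_ADMIN)
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "allowed",
             "{", ip, "timeout", "1h", "}"],
            check=True,
        )

    class Gatekeeper(BaseHTTPRequestHandler):
        def do_GET(self):
            token = parse_qs(urlparse(self.path).query).get("token", [""])[0]
            if token not in VALID_TOKENS:
                self.send_error(403)
                return
            allow_ip(self.client_address[0])   # open the firewall for this IP only
            self.send_response(302)            # then hand the client off to the service
            self.send_header("Location", SERVICE_URL)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8443), Gatekeeper).serve_forever()

The service itself stays firewalled off from the world; only the gatekeeper port is public, and it only ever adds time-limited per-IP exceptions.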
seer
I guess this is what "Identity aware proxy" from GCP can do for you? Outsource all of this to google - where you can connect your own identity servers, and then your service will only be accessed after the identity has been verified.
We have been using that instead of VPN and it has been incredibly nice and performant.
bob1029
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
phito
My friend has a small public Gitea instance, only used by him and a few friends. He's getting thousands of requests an hour from bots. I'm sorry, but even if it does not impact his service, at the very least it feels like harassment.
dmesg
Yes and it makes reading your logs needlessly harder. Sometimes I find an odd password being probed, search for it on the web and find an interesting story, that a new backdoor was discovered in a commercial appliance.
In that regard reading my logs led me sometimes to interesting articles about cyber security. Also log flooding may result in your journaling service truncating the log and you miss something important.
wvbdmp
You log passwords?
rollcat
> Sometimes I find an odd password being probed, search for it on the web and find an interesting story [...].
Yeah, this is beyond irresponsible. You know the moment you're pwned, __you__ become the new interesting story?
For everyone else, use a password manager to pick a random password for everything.
wraptile
> thousounds of requests an hour from bots
That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.
q3k
Serving file content/diff requests from gitea/forgejo is quite expensive computationally. And these bots tend to tarpit themselves when they come across eg. a Linux repo mirror.
bob1029
Thousands of requests per hour? So, something like 1-3 per second?
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
themafia
The way I get a fast web product is to pay a premium for data. So, no, it's not "lost time" by banning these entities, it's actual saved costs on my bandwidth and compute bills.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.
threeducks
> The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product.
I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
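For context, the kind of per-IP limit being suggested is small enough to sketch in a few lines. This is a toy in-memory token bucket, not a recommendation; real deployments would more likely use the web server's built-in rate limiting (e.g. nginx's limit_req), and the rate/burst numbers here are arbitrary:

    import time
    from collections import defaultdict

    RATE = 5.0     # allowed requests per second per IP (arbitrary)
    BURST = 20.0   # how far a client may briefly exceed the rate (arbitrary)

    # per-IP bucket state: [tokens remaining, last refill timestamp]
    _buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow_request(ip: str) -> bool:
        """Classic token bucket: refill at RATE tokens/sec up to BURST,
        spend one token per request, reject when the bucket is empty."""
        tokens, last = _buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1.0:
            _buckets[ip] = [tokens, now]
            return False
        _buckets[ip] = [tokens - 1.0, now]
        return True

    # usage in a request handler (sketch):
    # if not allow_request(client_ip):
    #     return 429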
rollcat
> I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
Also: who's your webhost? $1/m sounds like a steal.
johnnyfaehell
While we may be smart, a lot of us are extremely pedantic about tech things. I think for many, doing nothing would drive them up the wall, while doing something makes the annoyance smaller.
ricardo81
>an ideological game of capture the flag
I prefer the whack a mole analogy.
I've seen forums where people spend an inordinate amount of time identifying 'bad bots' for blocking, there'll always be more.
phplovesong
We block China and Russia. DDoS attacks and other hack attempts went down by 95%.
We have no Chinese users/customers, so in theory this does not affect business at all. Also, Russia is sanctioned and our Russian userbase does not actually live in Russia, so blocking Russia did not affect users at all.
praptak
How did you choose where to get the IP addresses to block? I guess I'm mostly asking where this problem (i.e. "get all IPs for country X") is on the scale from "obviously solved" to "hard and you need to play catch up constantly".
I did a quick search and found a few databases but none of them looks like the obvious winner.
spc476
I used CYMRU <https://www.team-cymru.com/ip-asn-mapping> to do the mapping for the post.
preinheimer
MaxMind is very common, IPInfo is also good. https://ipinfo.io/developers/database-download
If you want to test your IP blocks, we have servers in both China and Russia; we can try to take a screenshot from there to see what we get (free, no signup) https://testlocal.ly/
tietjens
The common cloud platforms allow you to do geo-blocking.
bakugo
Maxmind's GeoIP database is the industry standard, I believe. You can download a free version of it.
If your site is behind cloudflare, blocking/challenging by country is a built-in feature.
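As an illustration, a country check against the free GeoLite2 database using MaxMind's geoip2 Python package looks roughly like this; the database path and the blocked-country list are assumptions, not anyone's actual setup:

    import geoip2.database   # pip install geoip2
    import geoip2.errors

    BLOCKED = {"CN", "RU"}   # example ISO country codes

    # GeoLite2-Country.mmdb is MaxMind's free database; path is an assumption.
    reader = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")

    def is_blocked(ip: str) -> bool:
        try:
            country = reader.country(ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            return False   # unknown IP: pick your own policy here
        return country in BLOCKED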
zelphirkalt
Which will ensure you never get customers from those countries. And so the circle closes ...
42lux
The regulatory burden of conducting business with countries like Russia or China is a critical factor that offhand comments like yours consistently overlook.
mavamaarten
Same here. It sucks. But it's just cost vs reward at some point.
firefoxd
Since I posted an article here about using zip bombs [0], I've been flooded with bots. I'm constantly monitoring and tweaking my abuse detector, but this particular bot mentioned in the article seemed to be pointing to an RSS reader, so I whitelisted it at first. Now that I've given it a second look, it's one of the most rampant bots on my blog.
dmurray
If I had a shady web crawling bot and I implemented a feature for it to avoid zip bombs, I would probably also test it by aggressively crawling a site that is known to protect itself with hand-made zip bombs.
sim7c00
Also protect yourself from sucking up fake generated content. I know some folks here like to feed them all sorts of 'data'. Fun stuff :D
JdeBP
One of the few manual deny-list entries that I have made was not for a Chinese company, but for the ASes of the U.S.A. subsidiary of a Chinese company. It just kept coming back again and again, quite rapidly, for a particular page that was 404. Not for any other related pages, mind. Not for the favicon, robots.txt, or even the enclosing pseudo-directory. Just that 1 page. Over and over.
The directory structure had changed, and the page is now 1 level lower in the tree, correctly hyperlinked long since, in various sitemaps long since, and long since discovered by genuine HTTP clients.
The URL? It now only exists in 1 place on the WWW according to Google. It was posted to Hacker News back in 2017.
(My educated guess is that I am suffering from the page-preloading fallout from repeated robotic scraping of old Hacker News stuff by said U.S.A. subsidiary.)
Jnr
Externally I use the Cloudflare proxy, and internally I put Crowdsec and ModSecurity CRS middlewares in front of Traefik.
After some fine-tuning and eliminating false positives, it is running smoothly. It logs all the temporarily banned and reported IPs (to Crowdsec) and posts them to a Discord channel. On average it blocks a few dozen different IPs each day.
From what I see, there are far more American IPs trying to access non-public resources and attempting to exploit CVEs than there are Chinese ones.
I don't really mind anyone scraping publicly accessible content and the rest is either gated by SSO or located in intranet.
For me personally there is no need to block a specific country, I think that trying to block exploit or flooding attempts is a better approach.
poisonborz
Crowdsec: the idea is tempting, but giving away all of the server's traffic to a for-profit is a huge liability.
Jnr
You pass all traffic through Cloudflare. You do not pass any traffic to Crowdsec, you detect locally and only report blocked IPs. And with Modsecurity CRS you don't report anything to anyone but configuring and fine tuning is a bit harder.
boris
Yes, I've seen this one in our logs. Quite obnoxious, but at least it identifies itself as a bot and, at least in our case (cgit host), does not generate much traffic. The bulk of our traffic comes from bots that pretend to be real browsers and that use a large number of IP addresses (mostly from Brazil and Asia in our case).
I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).
* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and thought surely most of the bots will not be willing to wait for 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.
* BTW, they also used keep-alive connections if offered. So another thing I did was disable keep-alive for the /cgit/ locations. Without that, enough bots would routinely hog all the available connections.
* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (and which I suggest legitimate users add in the custom error message for 403). I also currently only do this if the referrer is not present but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections to legitimate users, but the bots didn't go away. They still request, get 403, but keep coming back.
My conclusion from this experience is that you really only have two options: either do something ad hoc, very specific to your site (like the notbot in query string) that whoever runs the bots won't bother adapting to or you have to employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (like rate limit, Anubis, etc) is not going to work -- they have enough resources to eat up the cost and/or adapt.
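For illustration only (boris presumably does this in the web server configuration rather than in application code), the notbot rule described above amounts to something like this as a generic WSGI middleware:

    from urllib.parse import parse_qs

    def notbot_filter(app):
        """Deny requests whose query string contains id= unless they also
        carry a 'notbot' parameter or send a Referer; a generic sketch of
        the ad hoc rule described above, not the actual configuration."""
        def middleware(environ, start_response):
            params = parse_qs(environ.get("QUERY_STRING", ""))
            if "id" in params and "notbot" not in params \
                    and not environ.get("HTTP_REFERER"):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Add ?notbot=1 to the URL if you are a human.\n"]
            return app(environ, start_response)
        return middleware

The point is not the mechanism but the obscurity: it only works because no bot operator will bother adapting to one site's private convention.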
palmfacehn
Pick an obscure UA substring like MSIE 3.0 or HP-UX. Preemptively 403 these User Agents, (you'll create your own list). Later in the week you can circle back and distill these 403s down to problematic ASNs. Whack moles as necessary.
JdeBP
I (of course) use the djbwares descendent of Bernstein publicfile. I added a static GEMINI UCSPI-SSL tool to it a while back. One of the ideas that I took from the GEMINI specification and then applied to Bernstein's HTTP server was the prohibition on fragments in request URLs (which the Bernstein original allowed), which I extended to a prohibition on query parameters as well (which the Bernstein original also allowed) in both GEMINI and HTTP.
* https://geminiprotocol.net/docs/protocol-specification.gmi#r...
The reasoning for disallowing them in GEMINI pretty much applies to static HTTP service (which is what publicfile provides) as it does to static GEMINI service. They moreover did not actually work in Bernstein publicfile unless a site administrator went to extraordinary lengths to create multiple oddly-named filenames (non-trivial to handle from a shell on a Unix or Linux-based system, because of the metacharacter) with every possible combination of query parameters, all naming the same file.
* https://jdebp.uk/Softwares/djbwares/guide/publicfile-securit...
* https://jdebp.uk/Softwares/djbwares/guide/commands/httpd.xml
* https://jdebp.uk/Softwares/djbwares/guide/commands/geminid.x...
Before I introduced this, attempted (and doomed to fail) exploits against weak CGI and PHP scripts were a large fraction of all of the file not found errors that httpd had been logging. These things were getting as far as hitting the filesystem and doing namei lookups. After I introduced this, they are rejected earlier in the transaction, without hitting the filesystem, when the requested URL is decomposed into its constituent parts.
Bernstein publicfile is rather late to this party, as there are over 2 decades of books on the subject of static sites versus dynamic sites (although in fairness it does pre-date all of them). But I can report that the wisdom when it comes to queries holds up even today, in 2025, and if anything a stronger position can be taken on them now.
To those running static sites, I recommend taking this good idea from GEMINI and applying it to query parameters as well.
Unless you are brave enough to actually attempt to provide query parameter support with static site tooling. (-:
Etheryte
One starts to wonder, at what point might it be actually feasible to do it the other way around, by whitelisting IP ranges. I could see this happening as a community effort, similar to adblocker list curation etc.
bobbiechen
Unfortunately, well-behaved bots often have more stable IPs, while bad actors are happy to use residential proxies. If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches. Personally I don't think IP level network information will ever be effective without combining with other factors.
Source: stopping attacks that involve thousands of IPs at my work.
BLKNSLVR
Blocking a residential proxy doesn't sound like a bad idea to me.
My single-layer thought process:
If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.
throwawayffffas
> If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches.
Are you really? How likely is a legit customer/user to be on the same IP as a residential proxy? Sure, residential IPs get reused, but you can handle that by making the block last 6-8 hours, or a day or two.
richardwhiuk
In these days of CGNAT, a residential IP is shared by multiple customers.
jampa
The Pokémon Go company tried that shortly after launch to block scraping. I remember they had three categories of IPs:
- Blacklisted IP (Google Cloud, AWS, etc), those were always blocked
- Untrusted IPs (residential IPs) were given some leeway, but quickly got to 429 if they started querying too much
- Whitelisted IPs (IPv4 addresses legitimately shared by many people, i.e. anything behind a CGNAT); for example, my current data plan tells me my IP is from 5 states over.
You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.
I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.
aorth
I have an ad hoc system that is similar, comprised of three lists of networks: known good, known bad, and data center networks. These are rate limited using a geo map in nginx for various expensive routes in my application.
The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
There are a lot of problems with using ASNs, even for well-known data center operators. First, they update so often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr but found that it also caused larger overlaps that introduced false positives from adjacent networks where apparently some legitimate users are. Lastly, I had seen suspicious traffic from data center operators like CATO Networks Ltd and ZScaler that are some kind of enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...
And none of that accounts for the residential ISPs that bots use to appear like legitimate users https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....
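On the over-merging problem specifically: one way to deduplicate a block list without widening it is the standard library's ipaddress.collapse_addresses, which only joins networks that are contained in or exactly adjacent to one another. A small sketch (example networks only, not aorth's actual tooling):

    import ipaddress

    # Example inputs; in practice these would come from ASN-derived lists.
    networks = [
        ipaddress.ip_network("192.0.2.0/25"),
        ipaddress.ip_network("192.0.2.128/25"),   # adjacent: merges into 192.0.2.0/24
        ipaddress.ip_network("198.51.100.0/24"),  # isolated: stays as-is
    ]

    # collapse_addresses never produces a supernet covering addresses that
    # were not in the input, so deduplication cannot widen the block list.
    for net in ipaddress.collapse_addresses(networks):
        print(net)   # 192.0.2.0/24, 198.51.100.0/24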
gunalx
This really seems like they did everything they could and still got abused by borderline-criminal activity from scrapers. But I do think it had an impact on scraping. It's a matter of attrition and raising the cost so that it hurts more to scrape; the problem can never really go away, because at some point the scrapers can just start paying regular users to collect the data.
lxgr
Many US companies do it already.
It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.
withinboredom
I'm pretty sure I still owe T-Mobile money. When I moved to the EU, we kept our old phone plans for a while. Then, for whatever reason, the USD didn't make it to the USD account in time and we missed a payment. Then T-Mobile cut off the service, and you need to receive a text message to log in to the account. Obviously, that wasn't possible. So we lost the ability to even pay, even while using a VPN. We just decided to let it die, but I'm sure in T-Mobile's eyes I still owe them.
thenthenthen
This! Dealing with European services from China is also terrible. As is the other way around. Welcome to the intranet!
friendzis
It's never either/or: you don't have to choose between white and black lists exclusively, and most of the traffic is going to come from grey areas anyway.
Say you whitelist an address/range and some system detects "bad things". Now what? Do you remove that address/range from the whitelist? Do you distribute the removal to your peers? Do you communicate the removal to the owner of the un-whitelisted address/range? How does the owner communicate back that they have dealt with the issue? What if the owner of the range is a hosting provider that doesn't proactively control the content hosted, yet has robust anti-abuse mechanisms in place? And so on.
Whitelist-only is a huge can of worms, and whitelists work best with trusted partners you can maintain out-of-band communication with. Similarly, blacklists work best with trusted partners, but for determining which addresses/ranges are more trouble than they are worth. And somewhere in the middle are grey-zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address, or even a range of addresses, as strictly troublesome or strictly trustworthy by default.
Implement blacklists on known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.
partyguy
That's what I'm trying to do here, PRs welcome: https://github.com/AnTheMaker/GoodBots
aorth
Noble effort. I might make some pull requests, though I kinda feel it's futile. I have my own list of "known good" networks.
delusional
At that point it almost sounds like we're doing "peering" agreements at the IP level.
Would it make sense to have a class of ISPs that didn't peer with these "bad" network participants?
JimDabell
If this didn’t happen for spam, it’s not going to happen for crawlers.
shortrounddev2
Why not just ban all IP blocks assigned to cloud providers? It won't halt botnets, but the IP ranges owned by AWS, GCP, etc. are well known.
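AWS at least publishes its ranges in a well-known JSON feed, and other clouds have similar (differently formatted) feeds; a rough sketch of turning that into a block list:

    import json
    import urllib.request

    # AWS publishes its address ranges; other clouds have comparable feeds
    # (e.g. GCP's https://www.gstatic.com/ipranges/cloud.json, different format).
    AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    def aws_prefixes():
        with urllib.request.urlopen(AWS_RANGES_URL, timeout=30) as resp:
            data = json.load(resp)
        v4 = [p["ip_prefix"] for p in data.get("prefixes", [])]
        v6 = [p["ipv6_prefix"] for p in data.get("ipv6_prefixes", [])]
        return v4, v6

    if __name__ == "__main__":
        v4, v6 = aws_prefixes()
        # Emit something a firewall can ingest, e.g. one CIDR per line.
        for prefix in v4 + v6:
            print(prefix)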
aorth
Tricky to get a list of all cloud providers and all their networks, and then there are cases like CATO Networks Ltd and ZScaler, which are apparently enterprise security products that route clients' traffic through their clouds "for security".
jjayj
But my work's VPN is in AWS, and HN and Reddit are sometimes helpful...
Not sure what my point is here tbh. The internet sucks and I don't have a solution
hnlmorg
Because crawlers would then just use a different IP which isn’t owned by cloud vendors.
ygritte
Came here to say something similar. The sheer amount of IP addresses one has to block to keep malware and bots at bay is becoming unmanageable.
worthless-trash
I admin a few local business sites. I whitelist all of the country's ISPs, and the strangeness in the logs and the attack counts have gone down.
Google indexes from in-country, as do a few other search engines.
Would recommend.
coffee_am
Is there a public curated list of "good ips" to whitelist ?
partyguy
> Is there a public curated list of "good ips" to whitelist ?
worthless-trash
So, it's relatively easy because there are a limited number of ISPs in my country. I imagine it's a much harder option for the US.
I looked at all the IP ranges delegated by APNIC, along with every local ISP that I could find, and unioned this with
https://lite.ip2location.com/australia-ip-address-ranges
And so far I've not had any complaints, and I think I have most of them.
At some point in the future, I'll start including https://github.com/ebrasha/cidr-ip-ranges-by-country
zImPatrick
I run an instance of a downloader tool and had lots of Chinese IPs mass-downloading YouTube videos with the most generic UA. I started with „just“ blocking their ASNs, but they always came back with another one, until I decided to stop bothering and banned China entirely. I'm confused as to why some Chinese ISPs have so many different ASNs, while most major internet providers here have exactly one.
breve
> Alex Schroeder's Butlerian Jihad
That's Frank Herbert's Butlerian Jihad.
dirkc
Interesting to think that the answer to banning thinking computers in Dune was basically to indoctrinate kids from birth (mentats) and/or doing large quantities of drugs (guild navigators).
Hizonner
It's a simple translation error. They really meant "Feed me worthless synthetic shit at the highest rate you feel comfortable with. It's also OK to tarpit me."
teekert
These IP addresses being released at some point, and making their way into something else is probably the reason I never got to fully run my mailserver from my basement. These companies are just massively giving IP addresses a bad reputation, messing them up for any other use and then abandoning them. I wonder what this would look like when plotted: AI (and other toxic crawling) companies slowly consuming the IPv4 address space? Ideally we'd forced them into some corner of the IPv6 space I guess. I mean robots.txt seems not to be of any help here.
I've been working on a web crawler and have been trying to make it as friendly as possible: strictly checking robots.txt, crawling slowly, clear identification in the User-Agent string, a single source IP address. But I've noticed some anti-bot tricks being applied to the robots.txt file itself. The latest was a slow-loris approach where it takes forever for robots.txt to download. I accidentally treated this as a 404, which meant I continued to crawl that site. I had to change the code so a robots.txt timeout is treated like a Disallow /.
It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
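For what it's worth, that timeout-as-Disallow change might look roughly like this with Python's standard urllib.robotparser (a sketch, not the author's actual code); note that a drip-fed slow-loris response would also need an overall deadline on top of the per-read timeout:

    import urllib.request
    import urllib.robotparser
    from urllib.error import HTTPError

    def fetch_robots(base_url, timeout=10):
        """Fetch robots.txt with an explicit timeout. A genuine HTTP error
        (e.g. 404) is treated here as 'no rules', but a timeout or
        connection failure becomes 'Disallow: /' so anti-bot stalling
        can't trick the crawler into ignoring the file."""
        rp = urllib.robotparser.RobotFileParser()
        try:
            with urllib.request.urlopen(base_url + "/robots.txt", timeout=timeout) as resp:
                rp.parse(resp.read().decode("utf-8", errors="replace").splitlines())
        except HTTPError:
            rp.parse([])                                  # missing file: nothing disallowed
        except OSError:                                   # timeouts, stalls, connection failures
            rp.parse(["User-agent: *", "Disallow: /"])    # be conservative: stay away
        return rp

    # usage sketch:
    # rp = fetch_robots("https://example.com")
    # if rp.can_fetch("FriendlyCrawler", some_url):
    #     crawl(some_url)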