Download responsibly
60 comments · September 22, 2025 · teekert
marklit
Amazon, Esri, Grab, Hyundai, Meta, Microsoft, Precisely, Tripadvisor and TomTom, along with dozens of other businesses, got together to offer OSM data in Parquet on S3 free of charge. You can query it surgically and run analytics on it using only megabytes of bandwidth against what is, at this point, a multi-TB dataset (see the sketch below). https://tech.marksblogg.com/overture-dec-2024-update.html
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
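To make "surgical" concrete, here is a minimal DuckDB-from-Python sketch. The release string, S3 path layout and bounding box are placeholders, so check the Overture docs for the current values; the point is that only the Parquet row groups matching the filter get fetched, which is why bandwidth stays in the MB range.

```python
# Minimal sketch, not a drop-in recipe: filter a tiny bounding box out of the
# multi-TB Overture release on S3 without downloading whole files. The release
# string, path layout and bbox below are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-west-2';")
# Depending on your DuckDB version you may also need to configure anonymous
# access to the public bucket (e.g. an S3 secret with empty credentials).

release = "2024-12-18.0"  # placeholder release version
rows = con.execute(f"""
    SELECT id, bbox
    FROM read_parquet(
        's3://overturemaps-us-west-2/release/{release}/theme=places/type=place/*',
        hive_partitioning = 1)
    WHERE bbox.xmin BETWEEN 12.3 AND 12.6
      AND bbox.ymin BETWEEN 41.8 AND 42.0   -- roughly Rome
    LIMIT 100
""").fetchall()
print(len(rows), "places fetched")
```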
maeln
I can imagine a few things:
1. BitTorrent has a bad rep. Most people still associate it with illegal downloads.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrows because of reason 1. On very restrictive networks, they might not want to allow it at all, precisely because it opens the door to, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and isn't installed on most company computers / CI pipelines (for lack of need, and again because of reason 1). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scares the hell out of them.
Overall, I think it is mostly 1, plus the fact that you can simply `curl` stuff and have everything work. It does sadden me that people don't appreciate what a good file transfer protocol BT is and how underused it remains. I remember some video game clients using BT for updates under the hood, and PeerTube uses WebTorrent, but BT is sadly not very popular.
simonmales
At least the planet download offers BitTorrent. https://planet.openstreetmap.org/
Fokamul
Lol, bad rep? Interesting, in my country everybody is using it to download movies :D Even more so now, after this botched streaming war. (EU)
maeln
Which is exactly why it has a bad rep. In most people's minds, BitTorrent = illegal downloads.
_def
> A lot of people think they are required to seed, and for some reason that scare the hell of them.
Some of those reasons consist of lawyers sending out costly cease-and-desist letters, even to "legitimate" users.
dotwaffle
From a network point of view, BitTorrent is horrendous. It has no knowledge of network topology, which frequently means traffic flows from eyeball network to eyeball network where there is no "cheap" path available (potentially congesting transit ports and affecting everyone), and there is no reliable way of forecasting where the traffic will come from, which makes capacity planning a nightmare.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting knows, the excessive number of connections means the overall quality of non-torrent traffic on the network goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a Squid cache box before, but HTTPS makes that very difficult, as you would have to install MITM certs on all devices.
For container images, yes, you have pull-through registries etc., but not only are these non-trivial to set up (as a service and for each client), the cloud providers also charge quite a lot for storage, making them difficult to justify when not having a cache "works just fine".
The Linux distros (and CPAN and TeX Live etc.) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but that approach is not really sustainable for the wide variety of content that would need caching beyond video media, or for packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
rlpb
> From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available...
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
zaphodias
I remember seeing the concept of "torrents with dynamic content" a few years ago, but apparently it never became a thing[1]. I kind of wish it had, but I don't know if there are critical problems (e.g. security?).
nativeit
I assume it's simply the lack of the inbuilt "universal client" that HTTP enjoys, or that devs tend to have with ssh/scp. Not that such a client (even an automated/scripted CLI client) would be so difficult to set up, but then trackers are also necessary, and then the tooling for maintaining it all. Intuitively, none of this sounds impossible, or even necessarily that difficult apart from a few tricky spots.
I think it's more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file-syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
nativeit
I certainly take advantage of BitTorrent mirrors for downloading Debian ISOs, as they are generally MUCH faster.
nopurpose
All Linux ISOs collectors in the world wholeheartedly agree.
mschuster91
Trackers haven't been necessary for well over a decade now thanks to DHT.
trenchpilgrim
> Like container registries?
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler and cheaper
vaylian
> Like container registries? Package repos, etc.
I have had the same thought for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally, and we could just share it.
alluro2
People like Geofabrik are why we can (sometimes) have nice things, and I'm very thankful for them.
The level of irresponsibility/cluelessness you see from developers if you're hosting any kind of API is astonishing, so the downloads are not surprising at all... If someone had told me, a couple of years back, about the things I've now seen, I'd absolutely have dismissed them as making stuff up and grossly exaggerating...
By the same token, however, it's sometimes really surprising how rarely API developers think in terms of multiples of things - it's very often just endpoints for actions on single entities, even when the nature of the use case is almost never at that level - so you have no way other than to send 700 requests to perform "one action".
Gigachad
Sounds like some people are downloading it in their CI pipelines, probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a url they can `curl` and then either block or charge users who download too much.
userbinator
I'd consider CI one of the worst massive wastes of computing resources invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
stevage
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how, when you build the app, you want to download the Italian region data from Geofabrik and then process it to extract what you want into your app. You script it, you put the script in your CI... and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Gigachad
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset every single CI run.
mschuster91
CI itself doesn't have to be a waste. The problem is most people DGAF about caching.
marklit
I suspect web apps that "query" the GPKG files. Parquet can be queried surgically; I'm not sure if there is a way to do the same with GPKG.
aitchnyu
Can we identify requests from CI servers reliably?
IshKebab
You can reliably identify requests from GitHub's free CI, which probably covers 99% of such requests (see the sketch below).
For example, GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
ncruces
I try to stick to GitHub for GitHub CI downloads.
E.g. my SQLite project downloads code from the GitHub mirror rather than Fossil.
Gigachad
Sure, have a JS script involved in generating a temporary download URL.
That way someone manually downloading the file is not impacted, but if you try to put the URL in a script it won't work.
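A minimal sketch of that idea, assuming an HMAC-signed expiring URL; the secret, path and TTL here are illustrative, not how Geofabrik actually does (or should do) it.

```python
# Sketch: the download page's JS asks the server for a short-lived signed URL,
# and the file server only honours URLs whose signature and expiry check out.
# Everything here (secret, path, TTL) is illustrative.
import hashlib
import hmac
import time

SECRET = b"rotate-me-regularly"

def make_url(path: str, ttl: int = 300) -> str:
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def check_url(path: str, expires: int, sig: str) -> bool:
    if time.time() > expires:
        return False  # the link has expired; curl-ing a saved URL later fails here
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

print(make_url("/europe/italy-latest.osm.pbf"))
```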
stevage
>Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Whenever I have done something like that, it's usually because I'm writing a script that goes something like:
1. Download file
2. Unzip file
3. Process file
I'm working on step 3, but I keep running the whole script because I haven't yet built a way to just do step 3.
I've never done anything quite that egregious though. And these days I tend to be better at avoiding this situation, though I still commit smaller versions of this crime.
xmprt
My solution to this is to only download if the file doesn't exist. An additional bonus is that the script now runs much faster because it doesn't need to do any expensive networking/downloads.
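A sketch of that pattern (the URL is the usual Geofabrik path for the file mentioned in the article; verify it before relying on it). Downloading under a temporary name and renaming only on success means a half-finished file never masquerades as a cached copy.

```python
# Sketch: only hit the mirror if a local copy isn't already there.
import os
import urllib.request

URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"
DEST = "italy-latest.osm.pbf"

def fetch_once(url: str = URL, dest: str = DEST) -> str:
    if not os.path.exists(dest):
        tmp = dest + ".part"
        urllib.request.urlretrieve(url, tmp)  # download under a temporary name
        os.replace(tmp, dest)                 # atomic rename only on success
    return dest

path = fetch_once()
# steps 2 and 3 (unzip / process) can now be iterated on without re-downloading
```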
stanac
10,000 times a day is on average 8 times a second. No way someone has 8 fixes per second; this is more like someone wanting to download a new copy every day, or every hour, but messing up a milliseconds config or something. Or it's simply a malicious user.
edit: bad math, it's about one download every 8.6 seconds (86,400 s ÷ 10,000)
gblargg
When I write scripts like that, I modify them to skip the download step and keep the old file around, so I can test the rest without anything time-consuming.
Meneth
Some years ago I thought, no one would be stupid enough to download 100+ megabytes in their build script (which runs on CI whenever you push a commit).
Then I learned about Docker.
cjs_ac
I have a funny feeling that the sort of people who do these things don't read these sorts of blog posts.
aitchnyu
Do they email heavy users? We used the free Nominatim API for geocoding addresses in 2012, and our email address was a required parameter. They mailed us and asked us to cache results to reduce our request rate.
cadamsdotcom
Definitely a use case for bittorrent.
john_minsk
If the data changes, how would a torrent client pick it up and download changes?
hambro
Let the client curl latest.torrent from some central service and then download the big file through BitTorrent.
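A sketch of that flow, assuming a hypothetical latest.torrent endpoint and aria2c (one CLI client that can be driven non-interactively) on the PATH:

```python
# Sketch: grab the current .torrent over HTTP, then fetch the payload over
# BitTorrent. The latest.torrent URL is hypothetical; aria2c will also seed
# for a while (--seed-time is in minutes) before exiting.
import subprocess
import urllib.request

TORRENT_URL = "https://example.org/osm/latest.torrent"  # hypothetical endpoint

urllib.request.urlretrieve(TORRENT_URL, "latest.torrent")
subprocess.run(
    ["aria2c", "--seed-time=60", "--dir=data", "latest.torrent"],
    check=True,
)
```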
maeln
A lot of torrent clients support various APIs to automatically collect torrent files. The most common is to simply use RSS.
extraduder_ire
There's a BEP for updatable torrents.
Klinky
Pretty sure people used or even still use RSS for this.
globular-toast
Ah, responsibility... The one thing we hate teaching and hate learning even more. Someone is probably downloading files in some automated pipeline. Nobody taught them that with great power (being able to write programs and run them on the internet) comes great responsibility. It's similar to how people drive while intoxicated or on the phone etc. It's all fun until you realise you have a responsibility.
trklausss
I mean, at this point I wouldn't mind if they rate-limited downloads. A _single_ customer downloading the same file 10,000 times? Sorry, we need to provide for everyone, try again at some other point.
It is free, yes, but there is no need either to abuse it or for them to give away as many resources for free as they can.
k_bx
This. Maybe they could actually make some infra money out of this: offer a token-based free tier for downloads, and charge if you exceed it.
rossant
Can't the server detect and prevent repeated downloads from the same IP, forcing users to act accordingly?
jbstack
See: "Also, when we block an IP range for abuse, innocent third parties can be affected."
Although they refer to IP ranges, the same principle applies on a smaller scale to a single IP address: (1) dynamic IP addresses get reallocated, and (2) entire buildings (universities, libraries, hotels, etc.) might share a single IP address.
Aside from accidentally affecting innocent users, you also open up the possibility of a DOS attack: the attacker just has to abuse the service from an IP address that he wants to deny access to.
imiric
More sophisticated client identification can be used to avoid that edge case, e.g. TLS fingerprints. They can be spoofed as well, but if the client is going through that much trouble, then they should be treated as hostile. In reality it's more likely that someone is doing this without realizing the impact they're having.
imiric
It could be slightly more sophisticated than that. Instead of outright blocking an entire IP range, set quotas for individual clients and throttle downloads exponentially. Add latency, cap the bandwidth, etc. Whoever is downloading 10,000 copies of the same file in 24 hours will notice when their 10th attempt slows down to a crawl.
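A rough sketch of what that could look like server-side; the client key (IP plus TLS fingerprint, say), window and thresholds are all illustrative.

```python
# Sketch: per-client quota with exponential penalties instead of hard blocks.
# The first few downloads in a window are free; after that, each extra request
# waits longer and gets a smaller share of bandwidth. Numbers are illustrative.
import time
from collections import defaultdict

FREE_DOWNLOADS = 3           # downloads per window with no penalty
WINDOW = 24 * 3600           # seconds
history = defaultdict(list)  # client key -> request timestamps

def throttle(client_key: str) -> tuple[float, float]:
    """Return (delay_seconds, bandwidth_fraction) to apply to this request."""
    now = time.time()
    recent = [t for t in history[client_key] if now - t < WINDOW]
    recent.append(now)
    history[client_key] = recent

    over = max(0, len(recent) - FREE_DOWNLOADS)
    if over == 0:
        return 0.0, 1.0
    delay = min(2.0 ** over, 3600.0)        # exponential backoff, capped at an hour
    fraction = max(1.0 / (over + 1), 0.05)  # never below 5% of normal bandwidth
    return delay, fraction

# The 10th request from the same client in a day crawls:
for _ in range(10):
    delay, fraction = throttle("203.0.113.7|ja3:abc123")
print(delay, fraction)  # 128.0 seconds of delay, 12.5% of normal bandwidth
```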
holowoodman
Just wait until some AI dudes decide it is time to train on maps...
nativeit
I’m looking forward to visiting all of the fictional places it comes up with!
jbstack
AI models are trained relatively rarely, so it's unlikely this would be very noticeable among all the regular traffic. Just the occasional download-everything every few months.
holowoodman
One would think so. If AI bros were sensible, responsible and intelligent.
However, the practical evidence is to the contrary: AI companies are hammering every webserver out there, ignoring any kind of convention like robots.txt, and re-downloading everything at pointlessly short intervals, annoying everyone and killing services.
Just a few recent examples from HN: https://news.ycombinator.com/item?id=45260793 https://news.ycombinator.com/item?id=45226206 https://news.ycombinator.com/item?id=45150919 https://news.ycombinator.com/item?id=42549624 https://news.ycombinator.com/item?id=43476337 https://news.ycombinator.com/item?id=35701565
Waraqa
IMHO, in the long term this will lead to a closed web where you are required to log in to view any content.
M95D
Map slop? That's new!
Whenever I read about such issues I always wonder why we all don’t make more use of BitTorrent. Why is it not the underlying protocol for much more stuff? Like container registries? Package repos, etc.