Download responsibly
60 comments · September 22, 2025 · teekert
marklit
Amazon, Esri, Grab, Hyundai, Meta, Microsoft, Precisely, Tripadvisor and TomTom, along with dozens of other businesses, got together to offer OSM data in Parquet on S3 free of charge. You can query it surgically and run analytics on it using only megabytes of bandwidth against what is, at this point, a multi-TB dataset (see the sketch below). https://tech.marksblogg.com/overture-dec-2024-update.html
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
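To make "surgical" concrete, here is a minimal DuckDB-from-Python sketch. The release string, S3 path layout and bounding box are placeholders, so check the Overture docs for the current values; the point is that only the Parquet row groups matching the filter get fetched, which is why bandwidth stays in the MB range.

```python
# Minimal sketch, not a drop-in recipe: filter a tiny bounding box out of the
# multi-TB Overture release on S3 without downloading whole files. The release
# string, path layout and bbox below are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-west-2';")
# Depending on your DuckDB version you may also need to configure anonymous
# access to the public bucket (e.g. an S3 secret with empty credentials).

release = "2024-12-18.0"  # placeholder release version
rows = con.execute(f"""
    SELECT id, bbox
    FROM read_parquet(
        's3://overturemaps-us-west-2/release/{release}/theme=places/type=place/*',
        hive_partitioning = 1)
    WHERE bbox.xmin BETWEEN 12.3 AND 12.6
      AND bbox.ymin BETWEEN 41.8 AND 42.0   -- roughly Rome
    LIMIT 100
""").fetchall()
print(len(rows), "places fetched")
```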
maeln
I can imagine a few things:
1. BitTorrent has a bad rep. Most people still associate it with illegal downloads.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrows because of reason 1. On very restrictive networks, they might not want to allow it at all, precisely because it opens the door to, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and isn't installed on most company computers / CI pipelines (for lack of need, and again because of reason 1). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scares the hell out of them.
Overall, I think it is mostly 1, plus the fact that you can simply `curl` stuff and have everything work. It does sadden me that people don't appreciate what a good file transfer protocol BT is and how underused it remains. I remember some video game clients using BT for updates under the hood, and PeerTube uses WebTorrent, but BT is sadly not very popular.
simonmales
At least the planet download offers BitTorrent. https://planet.openstreetmap.org/
Fokamul
Lol, bad rep? Interesting, in my country everybody is using it to download movies :D Even more so now, after this botched streaming war. (EU)
maeln
Which is exactly why it has a bad rep. In most people's minds, BitTorrent = illegal downloads.
_def
> A lot of people think they are required to seed, and for some reason that scare the hell of them.
Some of those reasons consist of lawyers sending out costly cease-and-desist letters, even to "legitimate" users.
dotwaffle
From a network point of view, BitTorrent is horrendous. It has no knowledge of network topology, which frequently means traffic flows from eyeball network to eyeball network where there is no "cheap" path available (potentially congesting transit ports and affecting everyone), and there is no reliable way of forecasting where the traffic will come from, which makes capacity planning a nightmare.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting knows, the excessive number of connections means the overall quality of non-torrent traffic on the network goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a Squid cache box before, but HTTPS makes that very difficult, as you would have to install MITM certs on all devices.
For container images, yes, you have pull-through registries etc., but not only are these non-trivial to set up (as a service and for each client), the cloud providers also charge quite a lot for storage, making them difficult to justify when not having a cache "works just fine".
The Linux distros (and CPAN and TeX Live etc.) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but that approach is not really sustainable for the wide variety of content that would need caching beyond video media, or for packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
rlpb
> From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available...
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
zaphodias
I remember seeing the concept of "torrents with dynamic content" a few years ago, but apparently it never became a thing[1]. I kind of wish it had, but I don't know if there are critical problems (e.g. security?).
nativeit
I assume it's simply the lack of the inbuilt "universal client" that HTTP enjoys, or that devs tend to have with ssh/scp. Not that such a client (even an automated/scripted CLI client) would be so difficult to set up, but then trackers are also necessary, and then the tooling for maintaining it all. Intuitively, none of this sounds impossible, or even necessarily that difficult apart from a few tricky spots.
I think it's more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file-syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
nativeit
I certainly take advantage of BitTorrent mirrors for downloading Debian ISOs, as they are generally MUCH faster.
nopurpose
All Linux ISOs collectors in the world wholeheartedly agree.
mschuster91
Trackers haven't been necessary for well over a decade now thanks to DHT.
trenchpilgrim
> Like container registries?
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler and cheaper
vaylian
> Like container registries? Package repos, etc.
I have had the same thought for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally, and we could just share it.
alluro2
People like Geofabrik are why we can (sometimes) have nice things, and I'm very thankful for them.
The level of irresponsibility/cluelessness you see from developers if you're hosting any kind of API is astonishing, so the downloads are not surprising at all... If someone had told me, a couple of years back, about the things I've now seen, I'd absolutely have dismissed them as making stuff up and grossly exaggerating...
By the same token, however, it's sometimes really surprising how rarely API developers think in terms of multiples of things - it's very often just endpoints for actions on single entities, even when the nature of the use case is almost never at that level - so you have no way other than to send 700 requests to perform "one action".
Gigachad
Sounds like some people are downloading it in their CI pipelines, probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a url they can `curl` and then either block or charge users who download too much.
userbinator
I'd consider CI one of the worst massive wastes of computing resources invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
stevage
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how, when you build the app, you want to download the Italian region data from Geofabrik and then process it to extract what you want into your app. You script it, you put the script in your CI... and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Gigachad
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset every single CI run.
mschuster91
CI itself doesn't have to be a waste. The problem is most people DGAF about caching.
marklit
I suspect web apps that "query" the GPKG files. Parquet can be queried surgically; I'm not sure if there is a way to do the same with GPKG.
aitchnyu
Can we identify requests from CI servers reliably?
IshKebab
You can reliably identify requests from GitHub's free CI, which probably covers 99% of such requests (see the sketch below).
For example, GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
ncruces
I try to stick to GitHub for GitHub CI downloads.
E.g. my SQLite project downloads code from the GitHub mirror rather than Fossil.
Gigachad
Sure, have a JS script involved in generating a temporary download URL.
That way someone manually downloading the file is not impacted, but if you try to put the URL in a script it won't work.
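A minimal sketch of that idea, assuming an HMAC-signed expiring URL; the secret, path and TTL here are illustrative, not how Geofabrik actually does (or should do) it.

```python
# Sketch: the download page's JS asks the server for a short-lived signed URL,
# and the file server only honours URLs whose signature and expiry check out.
# Everything here (secret, path, TTL) is illustrative.
import hashlib
import hmac
import time

SECRET = b"rotate-me-regularly"

def make_url(path: str, ttl: int = 300) -> str:
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def check_url(path: str, expires: int, sig: str) -> bool:
    if time.time() > expires:
        return False  # the link has expired; curl-ing a saved URL later fails here
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

print(make_url("/europe/italy-latest.osm.pbf"))
```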
stevage
>Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Whenever I have done something like that, it's usually because I'm writing a script that goes something like:
1. Download file
2. Unzip file
3. Process file
I'm working on step 3, but I keep running the whole script because I haven't yet built a way to just do step 3.
I've never done anything quite that egregious though. And these days I tend to be better at avoiding this situation, though I still commit smaller versions of this crime.
xmprt
My solution to this is to only download if the file doesn't exist. An additional bonus is that the script now runs much faster because it doesn't need to do any expensive networking/downloads.
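A sketch of that pattern (the URL is the usual Geofabrik path for the file mentioned in the article; verify it before relying on it). Downloading under a temporary name and renaming only on success means a half-finished file never masquerades as a cached copy.

```python
# Sketch: only hit the mirror if a local copy isn't already there.
import os
import urllib.request

URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"
DEST = "italy-latest.osm.pbf"

def fetch_once(url: str = URL, dest: str = DEST) -> str:
    if not os.path.exists(dest):
        tmp = dest + ".part"
        urllib.request.urlretrieve(url, tmp)  # download under a temporary name
        os.replace(tmp, dest)                 # atomic rename only on success
    return dest

path = fetch_once()
# steps 2 and 3 (unzip / process) can now be iterated on without re-downloading
```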
stanac
10,000 times a day is on average 8 times a second. No way someone has 8 fixes per second; this is more like someone wanting to download a new copy every day, or every hour, but messing up a milliseconds config or something. Or it's simply a malicious user.
edit: bad math, it's about one download every 8.6 seconds (86,400 s ÷ 10,000)
gblargg
When I write scripts like that, I modify them to skip the download step and keep the old file around, so I can test the rest without anything time-consuming.
Meneth
Some years ago I thought, no one would be stupid enough to download 100+ megabytes in their build script (which runs on CI whenever you push a commit).
Then I learned about Docker.
cjs_ac
I have a funny feeling that the sort of people who do these things don't read these sorts of blog posts.
aitchnyu
Do they email heavy users? We used the free Nominatim API for geocoding addresses in 2012, and our email address was a required parameter. They mailed us and asked us to cache results to reduce our request rate.
cadamsdotcom
Definitely a use case for bittorrent.
john_minsk
If the data changes, how would a torrent client pick it up and download changes?
hambro
Let the client curl latest.torrent from some central service and then download the big file through BitTorrent.
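A sketch of that flow, assuming a hypothetical latest.torrent endpoint and aria2c (one CLI client that can be driven non-interactively) on the PATH:

```python
# Sketch: grab the current .torrent over HTTP, then fetch the payload over
# BitTorrent. The latest.torrent URL is hypothetical; aria2c will also seed
# for a while (--seed-time is in minutes) before exiting.
import subprocess
import urllib.request

TORRENT_URL = "https://example.org/osm/latest.torrent"  # hypothetical endpoint

urllib.request.urlretrieve(TORRENT_URL, "latest.torrent")
subprocess.run(
    ["aria2c", "--seed-time=60", "--dir=data", "latest.torrent"],
    check=True,
)
```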
maeln
A lot of torrent clients support various APIs to automatically collect torrent files. The most common is to simply use RSS.
extraduder_ire
There's a BEP for updatable torrents.
Klinky
Pretty sure people used or even still use RSS for this.
globular-toast
Ah, responsibility... The one thing we hate teaching and hate learning even more. Someone is probably downloading files in some automated pipeline. Nobody taught them that with great power (being able to write programs and run them on the internet) comes great responsibility. It's similar to how people drive while intoxicated or on the phone etc. It's all fun until you realise you have a responsibility.
trklausss
I mean, at this point I wouldn't mind if they rate-limited downloads. A _single_ customer downloading the same file 10,000 times? Sorry, we need to provide for everyone, try again at some other point.
It is free, yes, but there is no need either to abuse it or for them to give away as many resources for free as they can.
k_bx
This. Maybe they could actually make some infra money out of this: offer a token-based free tier for downloads, and charge if you exceed it.
rossant
Can't the server detect and prevent repeated downloads from the same IP, forcing users to act accordingly?
jbstack
See: "Also, when we block an IP range for abuse, innocent third parties can be affected."
Although they refer to IP ranges, the same principle applies on a smaller scale to a single IP address: (1) dynamic IP addresses get reallocated, and (2) entire buildings (universities, libraries, hotels, etc.) might share a single IP address.
Aside from accidentally affecting innocent users, you also open up the possibility of a DOS attack: the attacker just has to abuse the service from an IP address that he wants to deny access to.
imiric
More sophisticated client identification can be used to avoid that edge case, e.g. TLS fingerprints. They can be spoofed as well, but if the client is going through that much trouble, then they should be treated as hostile. In reality it's more likely that someone is doing this without realizing the impact they're having.
imiric
It could be slightly more sophisticated than that. Instead of outright blocking an entire IP range, set quotas for individual clients and throttle downloads exponentially. Add latency, cap the bandwidth, etc. Whoever is downloading 10,000 copies of the same file in 24 hours will notice when their 10th attempt slows down to a crawl.
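A rough sketch of what that could look like server-side; the client key (IP plus TLS fingerprint, say), window and thresholds are all illustrative.

```python
# Sketch: per-client quota with exponential penalties instead of hard blocks.
# The first few downloads in a window are free; after that, each extra request
# waits longer and gets a smaller share of bandwidth. Numbers are illustrative.
import time
from collections import defaultdict

FREE_DOWNLOADS = 3           # downloads per window with no penalty
WINDOW = 24 * 3600           # seconds
history = defaultdict(list)  # client key -> request timestamps

def throttle(client_key: str) -> tuple[float, float]:
    """Return (delay_seconds, bandwidth_fraction) to apply to this request."""
    now = time.time()
    recent = [t for t in history[client_key] if now - t < WINDOW]
    recent.append(now)
    history[client_key] = recent

    over = max(0, len(recent) - FREE_DOWNLOADS)
    if over == 0:
        return 0.0, 1.0
    delay = min(2.0 ** over, 3600.0)        # exponential backoff, capped at an hour
    fraction = max(1.0 / (over + 1), 0.05)  # never below 5% of normal bandwidth
    return delay, fraction

# The 10th request from the same client in a day crawls:
for _ in range(10):
    delay, fraction = throttle("203.0.113.7|ja3:abc123")
print(delay, fraction)  # 128.0 seconds of delay, 12.5% of normal bandwidth
```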
holowoodman
Just wait until some AI dudes decide it is time to train on maps...
nativeit
I’m looking forward to visiting all of the fictional places it comes up with!
jbstack
AI models are trained relatively rarely, so it's unlikely this would be very noticeable among all the regular traffic. Just the occasional download-everything every few months.
holowoodman
One would think so. If AI bros were sensible, responsible and intelligent.
However, the practical evidence is to the contrary: AI companies are hammering every webserver out there, ignoring any kind of convention like robots.txt, and re-downloading everything at pointlessly short intervals, annoying everyone and killing services.
Just a few recent examples from HN: https://news.ycombinator.com/item?id=45260793 https://news.ycombinator.com/item?id=45226206 https://news.ycombinator.com/item?id=45150919 https://news.ycombinator.com/item?id=42549624 https://news.ycombinator.com/item?id=43476337 https://news.ycombinator.com/item?id=35701565
Waraqa
IMHO, in the long term this will lead to a closed web where you are required to log in to view any content.
M95D
Map slop? That's new!
Whenever I read about such issues I always wonder why we all don’t make more use of BitTorrent. Why is it not the underlying protocol for much more stuff? Like container registries? Package repos, etc.