Stop Using Zip Codes for Geospatial Analysis (2019)
43 comments
· February 7, 2025 · jonas21
ericrallen
This is a tangent, but addresses are also way more complicated than most people realize - especially if you’re relying on a user to input a correct address or if you need to support multiple countries, somewhere with unique addresses like Queens[0], or you need to differentiate between units of a specific street address that uses something other than unit numbers for a unit designation.
At that point you need something like Smarty[1] to validate and parse addresses.
[0]: https://stackoverflow.com/questions/2783155/how-to-distingui...
nitwit005
Yes, unfortunately, their assertion that everyone knows their zip code is wrong. People often write a neighboring code, and the post office just delivers it.
Similar issues for city name, of course.
ellisv
There are point process models, but, yes, it's much more common to want to aggregate to a spatial area.
Another consideration is what kind of reference information is available at different spatial units. There are plenty of Census Bureau data available by ZCTA but some data may only be available at other aggregate units. Zip Codes are often used as political boundaries.
I'd also mention the "best" areal unit depends on the data. There is a well-known phenomenon called the modifiable areal unit problem, in which spatial effects appear and vanish at different spatial resolutions. It can sort of be thought of as a spatial variation of the ecological fallacy.
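A toy sketch of the modifiable areal unit problem, assuming made-up observations along a single street (the data and the `zone_means` helper are hypothetical, not from any real dataset). The same points are averaged into zones of the same width, once aligned at 0 and once shifted by one unit:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (position along a street in km, measured value).
observations = [(0, 1), (1, 9), (2, 9), (3, 1), (4, 1), (5, 9), (6, 9), (7, 1)]

def zone_means(points, width, offset=0):
    """Average the values inside zones of the given width, shifted by offset."""
    zones = defaultdict(list)
    for position, value in points:
        zone_id = (position + offset) // width
        zones[zone_id].append(value)
    # Return the zone averages ordered by zone id.
    return [mean(values) for _, values in sorted(zones.items())]

aligned = zone_means(observations, width=2)            # zones {0,1}, {2,3}, ...
shifted = zone_means(observations, width=2, offset=1)  # zones {0}, {1,2}, {3,4}, ...

print(aligned)  # every zone averages out to the same value
print(shifted)  # the same data now alternates between low and high zones
```

Nothing about the underlying points changed between the two calls; only the zone boundaries moved, which is the effect described above.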
JumpCrisscross
Would add that there are network effects with zip code data. If you collect H3 data, you have fewer sources with which to join.
o11c
Also, "use a different grid" is only masking the problem, not actually fixing it.
The real problem is ever using an average without also specifying some sort of bounds. For median-based data, this probably means the upper and lower quartiles (or possibly other percentiles); for mean-based data, this probably means standard deviation.
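A minimal sketch of reporting an average together with its bounds, using only Python's standard `statistics` module. The two income lists and the `summarize` helper are invented for illustration:

```python
from statistics import mean, median, quantiles, stdev

def summarize(values):
    """Return the median with its quartiles and the mean with its standard deviation."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation
    }

# Hypothetical household incomes (in thousands) for two areas.
zip_a = [48, 49, 50, 50, 50, 51, 52]
zip_b = [10, 20, 50, 50, 50, 80, 90]

summary_a = summarize(zip_a)
summary_b = summarize(zip_b)

for name, summary in (("zip_a", summary_a), ("zip_b", summary_b)):
    print(name, summary)
```

Both areas have the same median and mean, so an average alone cannot tell them apart; the quartiles and standard deviation do.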
mattforrest
Well, you hit on all the points that discuss the compromises that zip codes offer. Just because you have them in your data doesn't mean that they can produce anything useful. You are correct that no one knows what their census unit is (if you are thinking of someone entering this on a website), but collecting a location or address will be a lot better.
Fact is, a lot of web data contains a zip, but if you can collect something better it will usually render better results. The exception is analyzing shipments, where zip codes are fine.
walrus01
In terms of "good enough", a Canadian postal code, broadly equivalent to a zip code, is much more granular and can often identify an individual apartment building, or single city block. Plenty of large office buildings in major Canadian cities also have their own postal code.
Functionally, it is closer to "ZIP+4", the extension USPS uses for more granular routing of physical mail.
https://www.canadapost-postescanada.ca/cpc/en/support/articl...
jihadjihad
To put it in plain mathematical language, ZIP codes are not defined as polygons [0]. The consequence is that performing any analysis with an assumption that ZIP codes are polygons is bound to be error-prone.
0: https://manifold.net/doc/mfd8/zip_codes_are_not_areas.htm
mholt
Yeah. ZIP codes are sets in the abstract-dimensional space of carrier delivery points. I suppose you could think of them as lines, but definitely not polygons.
cogman10
Zip codes (in the US) are machine readable numbers a mail sorter can use to send a parcel to the right delivery truck for final delivery. In the US, they represent the hierarchy of postal centers with the most significant digit representing the primary hub for a region and the smallest number the actual post office that will be in charge of delivering the letter (or truck if you do the extended post code).
They don't represent geography at all, they represent the organizational structure of USPS.
They work by making the address on a letter almost meaningless. For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.
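A rough sketch of the digit hierarchy described above, assuming the commonly cited layout for 5-digit US ZIP codes (first digit = broad region, first three digits = sectional center, last two = local post office). The function names `split_zip` and `shared_prefix_length` are made up for illustration and are not part of any USPS tooling:

```python
def split_zip(zip_code: str) -> dict:
    """Split a 5-digit ZIP (or the first 5 digits of a ZIP+4) into routing levels."""
    digits = zip_code.strip()[:5]
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"not a 5-digit ZIP code: {zip_code!r}")
    return {
        "region": digits[0],             # most significant digit
        "sectional_center": digits[:3],  # regional sorting facility
        "post_office": digits[3:],       # least significant digits
    }

def shared_prefix_length(zip_a: str, zip_b: str) -> int:
    """Count how many leading digits two ZIP codes have in common."""
    count = 0
    for a, b in zip(zip_a[:5], zip_b[:5]):
        if a != b:
            break
        count += 1
    return count

print(split_zip("10001"))
print(shared_prefix_length("10001", "10458"))  # same region, different sectional center
print(shared_prefix_length("10001", "90210"))  # different regions
```

ZIP codes are kept as strings here because treating them as integers drops leading zeros (for example "02134").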
alsodumb
I agree that they weren't explicitly meant to represent geography, but implicitly they do, right? Are there cases where this is violated?
In other words, is it safe to assume that any entity in a zip code is less than x distance away from the closest entity in the same zip code?
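One way to check that assumption against your own data is to compute, for a single zip code, the largest distance from any point to its nearest neighbor. This is only a sketch: the coordinates are made up, `worst_nearest_neighbor_km` is a hypothetical helper, and the haversine formula treats the Earth as a sphere of radius 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def worst_nearest_neighbor_km(points):
    """Largest distance from any point to its closest other point (brute force)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    return max(
        min(haversine_km(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    )

# Hypothetical delivery points that share one zip code.
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
worst = worst_nearest_neighbor_km(points)

x_km = 50.0  # the "x" in the question above; pick whatever threshold matters to you
print(worst, worst <= x_km)
```

If `worst <= x_km` fails for a noticeable share of zip codes in your data, the assumption does not hold for that dataset.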
Spivak
Right but this ends up being a good approximation for geography because the reality of logistics is that you end up doing a cute n-ary search of the geography. When you know the regional hub you can say for certain a huge chunk of the US the zip code doesn't represent. And then you keep n-secting. Sometimes the land-mass you get at the end is specific enough for your uses.
You're not going to wind up with a situation where zip codes with the same regional marker end up on different coasts.
mcphage
> The consequence is that performing any analysis with an assumption that ZIP codes are polygons is bound to be error-prone.
Yeah, but any analysis you're likely to perform is approximate enough that the fact that ZIP codes aren't polygons is basically a rounding error.
Plus, it's a lot easier to get ZIP codes, and they're more reliably correct, so you might still get better results than you would with another indicator that is either (a) less reliable or (b) less available.
jpjoi
Zip codes are just weird to use for anything other than mail in general because they’re set up based off infrastructure.
CGP Grey has a great video on this: https://m.youtube.com/watch?v=1K5oDtVAYzk
diggan
I've noticed more and more super/hypermarkets have started asking for your zip/postal code sometime during self-checkout. I'm guessing they use these as approximations of where people travel from, so they can evaluate whether to open more stores closer to popular areas, or something like that. Pretty sure there are more use cases for postal codes too.
Spivak
Wait until you find out that this is the same way phones used to work. The number was the row/column for the operator needed to plug your line into.
eterevsky
ZIP codes are a simple approximation, which does the job well enough in most cases.
The alternatives that the author suggests are much more complicated, both in terms of the implementation and in terms of convincing the user to give you their full address.
throw0101c
CGP Grey recently posted a video on Zip codes, "The Hidden Pattern in Post Codes":
Cthulhu_
That's what I was thinking of earlier, the succinct version is "your address is where mail needs to go, the zip code is how to get it there". Or in other words, the zip code is the address(es) of the sorting centers and post offices to the destination.
hammock
Great article. Zip codes can be super expedient. But you have to be aware that for many use cases they function WORSE than a random grid, because they have built-in aggregation of a central post office (and its surroundings) with a certain radius of rural/less dense area around it.
So for example, if you are sorting “rural zips” vs “urban zips” it will only take you so far, and may actually be harmful.
Same goes with MSAs/DMAs (media markets). These have to be used for buying media, but for geospatial analysis they are suboptimal for the same reasons.
Easiest way to dip your toe into the water of something better is to start with A-D census counties.
dhunter_mn
I used to work for a company that basically merged USPS and Census Bureau data on a monthly basis. The output would be a roadbase that was optimized for address ranges on road segments. ZIP Codes were extra fun to work with.
serjester
H3 is awesome here! What I don't think many people realize is that H3 cells and normal geographic data (like zips) are not mutually exclusive. You can take zip outlines, and find all the h3 cells within them and allocate your metric accordingly (population, income, etc).
This makes joining disparate data sources quite easy. And this also lets you do all sorts of cool stuff like aggregations, smoothing, flow modeling, etc.
We do some geospatial stuff and I wrote a polars plugin to help with this a while back [1].
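A dependency-free sketch of the "find the cells inside an outline and allocate the metric" idea. To keep it runnable without the H3 library, a plain square grid stands in for H3 hexagons, the zip outline is a made-up square, and the metric is split evenly across cells; all names here (`cells_in_polygon`, `allocate`) are hypothetical. With real H3 cells the structure would be similar, with the library supplying the cell ids.

```python
from math import floor

def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def cells_in_polygon(polygon, cell_size):
    """Return (col, row) ids of grid cells whose centers fall inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    cells = []
    for col in range(floor(min(xs) / cell_size), floor(max(xs) / cell_size) + 1):
        for row in range(floor(min(ys) / cell_size), floor(max(ys) / cell_size) + 1):
            center_x = (col + 0.5) * cell_size
            center_y = (row + 0.5) * cell_size
            if point_in_polygon(center_x, center_y, polygon):
                cells.append((col, row))
    return cells

def allocate(total, cells):
    """Split a zip-level total evenly across its cells (a simplifying assumption)."""
    if not cells:
        raise ValueError("outline contains no cell centers; use a smaller cell size")
    share = total / len(cells)
    return {cell: share for cell in cells}

# Hypothetical zip outline (a 2 x 2 square) holding 1000 people.
outline = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
cells = cells_in_polygon(outline, cell_size=1.0)
population_by_cell = allocate(1000, cells)
print(population_by_cell)
```

Once every source is expressed per cell, joining them becomes a dictionary (or dataframe) join on the cell id. An even split is the crudest possible allocation; weighting by area or by a finer-grained population layer would be the usual refinement.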
funkaster
If you want to learn a bit more, there was a recent, really good Planet Money episode[1] about this exact same topic. They focus on the problems that you might face when using zip code for demographic analysis.
[1]: https://www.npr.org/2025/01/08/1223466587/zip-code-history
mattforrest
Funny to see this one pop up today (I wrote this one way back when) but I just refreshed it into a video on my channel: https://www.youtube.com/watch?v=x-opv4REEic
PLenz
I gave a talk at DataEngConf many years ago: https://www.datacouncil.ai/talks/zip-codes-and-other-lies-yo...
ZIP codes are an emergent property of the mail delivery system. While the author might consider this a bad thing, this makes them "good enough" on multiple axes in practice. They tend to be:
- Well-known (everybody knows their zip code)
- Easily extracted (they're part of every address, no geocoding required)
- Uniform-enough (not perfect, but in most cases close)
- Granular-enough
- Contiguous-enough by travel time
Notably, the alternatives the author proposes all fail on one or more of these:
- Census units: almost nobody knows what census tract they live in, and it can be non-trivial to map from address to tract
- Spatial cells: uneven distribution of population, and arbitrary division of space (boundaries pass right through buildings), and definitely nobody knows what S2 or H3 cell they live in.
- Address: this option doesn't even make sense. Yes, you can geocode addresses, but you still need to aggregate by something.