Calculate distance between Zip Codes… AND users.

Ok, for starters, you don’t really need to use the Haversine formula here. For large distances where a less accurate formula produces a larger error, your users don’t care if the match is plus or minus a few miles, and for closer distances, the error is very small. There are easier (to calculate) formulas listed on the Geographical Distance Wikipedia article.

Since zip codes are nothing like evenly spaced, any process that partitions them evenly is going to suffer mightily in areas where they are clustered tightly (east coast near DC being a good example). If you want a visual comparison, check out http://benfry.com/zipdecode and compare the zipcode prefix 89 with 07.

A far better way to deal with indexing this space is to use a data structure like a Quadtree or an R-tree. This structure allows you to do spatial and distance searches over data which is not evenly spaced.

Here’s what an Quadtree looks like:

Quadtree

To search over it, you drill down through each larger cell using the index of smaller cells that are within it. Wikipedia explains it more thoroughly.

Of course, since this is a fairly common thing to do, someone else has already done the hard part for you. Since you haven’t specified what database you’re using, the PostgreSQL extension PostGIS will serve as an example. PostGIS includes the ability to do R-tree spatial indexes which allow you to do efficient spatial querying.

Once you’ve imported your data and built the spatial index, querying for distance is a query like:

SELECT zip
FROM zipcode
WHERE
geom && expand(transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661), 16093)
AND
distance(
   transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661),
   geom) < 16093

I’ll let you work through the rest of the tutorial yourself.

Here are some other references to get you started.

Leave a Comment