H3: Uber’s Hexagonal Hierarchical Spatial Index (2018)

eng.uber.com

262 points by felipap 6 years ago

I did some indexing using elasticsearch and some home grown stuff based on geohashes about seven years ago. At the time Elasticsearch was just adding support for geoshape indexing as well. Initially this was also based on geohashes. Later they added proper quad tree support (instead of indexing the geohash as a term), and recently they revamped the implementation using BKD trees. The current implementation is way faster and scalable than what they had a few years back and allegedly quite nice for handling complex GIS data.

What's the advantage of Uber's approach over any of this. Even my primitive geohash solution worked pretty nicely and you can implement it on just about any type of DB. I had a simple algorithm to cover any shape with geohashes of a certain size as well as a quick way to generate a polygon for a circle. The two combined allowed me to do radius searches for any shape overlapping or contained by the circle with simple terms queries on the geohash prefixes. My main headache was keeping the number of terms in a query (i.e. number of geohashes) to a reasonable size (below 1024, which if I recall was a the default limit).

mtrovo 6 years ago

They go into some detail on this talk https://youtu.be/ay2uwtRO3QE?t=712.
What I get from their explanation is that hexagon is a better shape for map grids because they are the most complex shape that can tesselate (the other two are triangles and squares). As they are more close to a circle, distances within a cell are more stable, also computing the distance from a cell center to its neighbours is stable in hexagons as well.
I think the reason hex is not more common is that subdivisions are hard to create compared to triangles or squares. Uber solved this by subdividing in 7 smaller hex and tilting it so they cover the bigger shape with some small overlap.
Also a big problem is distortion, I never thought this would be that huge of a problem but it makes sense. They go into a lot of the details later on the same video.
- Pamar 6 years ago
  
  Which is why, btw, Hexagons have been used for decades in wargames (and - to a more limited extent, in other types of boardgames).
- twic 6 years ago
  
  > hexagon is a better shape for map grids because they are the most complex shape that can tesselate
  [Penrose tiling intensifies]
  
  CositaS 6 years ago
  
  I think it's clear what they meant was not actually 'the most complex shape' but the regular polygon with the most sides that can tessellate.
  
  gloflo 6 years ago
  
  Autocomplete of 'most compact'.
- AlchemistCamp 6 years ago
  
  The Techzing podcast talked about the origin of this several years ago, back when one of the hosts had a critical role in its creation. Good stuff.
sroussey 6 years ago

geohashes are cool, and lopping off a letter reduces precision.
But if you look at have they overlay London, you get quite a split higher up and two next to each other don’t look like they are and so I can see where the number of terms would get big.
The new ElasticSearch implementation is now the default and I think they are deprecating the prefix versions like geohash.
Too bad their stuff is not embeddable into an app. Know of lib for this?
- jillesvangurp 6 years ago
  
  This is all part of Lucene and not specific to Elasticsearch as far as I understand.
  Update with link: https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/...
  
  sroussey 6 years ago
  
  Nice! Thanks for the link. I was hoping more for an embeddable lib for mobile apps

Mr_P 6 years ago

I'd be curious to see a more thorough comparison to S2, which IMHO seems simpler (it's just quadtrees) and likely faster (it supports O(1) lookup vs. needing a hierarchical search).

jandrewrogers 6 years ago

It is complicated, and it depends on the use case. There are roughly three dimensions to what you are optimizing the representation for: presentation, computational geometry, and decomposition (sharding). S2 and H3 are both fundamentally cartography-driven representations, primarily optimizing for presentation. S2 focuses a bit more on sharding and H3 a bit more on computational geometry, there is quite a bit of literature on the tradeoffs of their characteristic designs. If the core application is not presentation driven, such as pure spatiotemporal analytics, neither of these representations are good choices.
Representation systems for geospatial data models is an amazingly deep theoretical rabbit hole. Common systems are almost always optimized for presentation as most were originally designed for cartographic use cases. If you were looking at representation systems optimized for fast, scalable geospatial analytics, for example, you'd use some type of 3-space embedding representation. There is a lot of diversity.
- nwlieb 6 years ago
  
  What is an example of a 3-space embedding or interesting literature? I'm having difficulties googling the term.
  
  jandrewrogers 6 years ago
  
  A 3-space embedding is a representation optimized for efficient decomposition and computational geometry, ideal for scale-out analytics. This is an interesting design problem in that you can't achieve both with a single surface and they are mathematically incompatible (one requires a discrete surface, the other requires a real surface). A 3-space embedding is a dual surface representation engineered to make it easy to move between the surfaces as required by code. As the name implies, you are logically embedding a standard 2-spheroid in a synthetic discrete 3-space and both coordinate systems can be used simultaneously. Presentation requires computing a projection of some sort.
  Unlike single-surface representations, these have the advantage of being essentially free of computational edge cases if you design them correctly. They are also amenable to implementations that are extremely computationally efficient to use, which is a bit of an afterthought for most presentation-optimized designs but important for high-scale geospatial analytics.
  A common reflexive criticism of these representations is that they use equal volume sharding, which means that sharding them is not a good approximation of equal area on the embedded surface. An equal area decomposition only makes sense in the context of presentation (e.g. tiling) because the underlying data distribution is naturally extremely and unpredictably skewed, leading to non-uniform cell loading no matter how you decompose it. The assumption that equal area decomposition helps to ensure uniform cell loading is trivially false in practice, making it a non-optimization. Therefore, any competent implementation always requires a separate mechanism for ensuring uniform loading independent of the decomposition model.
  The term of art for all of this is discrete global grid systems (commonly "DGGS"). The vast majority of the literature is focused on presentation optimized systems, and the design of single-surface representations, but other types of representations are discussed. It has a very rich taxonomy. I have an article I've been sporadically writing which I should probably finish that steps through the design of a state-of-the-art 3-space embedding representation system for scale-out analytics, based on a (currently stalled) effort to produce a formal standard for industry. A good 3-space embedding has a relatively simple description and implementation but there is much technical subtlety as to why it is designed a specific way.
  
  scottlocklin 6 years ago
  
  I'm guessing he means stuff like voronoi tesselation, which isn't limited to 3-space. Look at the books of Hanan Samet for more on this stuff: http://www.cs.umd.edu/~hjs/
pininja 6 years ago

H3 and S2 are kinda similar, except by using hexagons, H3 grid centroids are equidistant - rectangles have different distances from center to center.
H3 also has efficient means for finding a cell’s neighbors, and comes with some nice algorithms - like the “compact” fill. See https://uber.github.io/h3/#/documentation/overview/use-cases
The hierarchical search data is built into a grid’s ID in both H3 and S2, which helps when comparing ID’s to see how close they are to each other.
For visualization, I prefer the way H3 looks over S2. While that’s just an opinion, H3 grid is exposed directly to drivers and in lots of tools.
I work for Uber and have used H3, but don’t work on the library.
- Mr_P 6 years ago
  
  > For visualization, I prefer the way H3 looks over S2. While that’s just an opinion, H3 grid is exposed directly to drivers and in lots of tools.
  Ah, yes this makes a lot of sense. The video in the link actually does have a nice comparison at 17:30, where this is called out. It seems to me to be the most compelling argument for hexagons.
kndjckt 6 years ago

The critical advantage of h3 vs. s2 is that all neighbors are equidistant from the central cell. Also, the implementation of the projection means there is less distortion than s2.
I use this library every day and absolutely love it.
- stefco_ 6 years ago
  
  This is the same reason astronomers like HEALPix tiling for skymaps [0]. Equal areas-per-pixel (important for integrating over areas) with straightforward spherical harmonics calculations and hierarchical extensions that can store images at multiple resolutions by subdividing base cells (LIGO uses one such hierarchical strategy for our low-latency gravitational wave source direction estimates [1]).
  [0] https://healpix.sourceforge.io
  [1] https://arxiv.org/abs/1508.03634
- dpflan 6 years ago
  
  Thanks for sharing. What do you use it for? Did you use S2 previously?
seanhandley 6 years ago

>it supports O(1) lookup vs. needing a hierarchical search
Bear in mind that H3 only has 16 levels of resolution, so traversing from top level down to the square-metre level is still a constant time operation i.e. O(16)
erichocean 6 years ago

Seeing as Uber used to use S2, presumably it was worse for their needs.

Cactus2018 6 years ago

Previously https://news.ycombinator.com/item?id=16135302

kiwidrew 6 years ago

Uber really likes hexagons. Their internal name for driver promotions is even "hexcentive"! I've long wondered what compelled Uber to use hexagons to define their various promotion boundaries, because it's not a very user-friendly choice -- the lines on a map look very jagged, for instance.

This post goes some way towards explaining why hexagons:

* "minimize the quantization error introduced when users move through a city"

* "allow us to approximate radiuses easily"

* non-hexagonal regions (such as postal code boundaries) "have unusual shapes and sizes which are not helpful for analysis, and are subject to change"

traverseda 6 years ago

>Hexagons are just pentagons without one side

My favorite quote in the video

kjeetgill 6 years ago

Don't they have one more side?
- Yolta 6 years ago
  
  They do.
- traverseda 6 years ago
  
  Oops, typed that backwards

hamandcheese 6 years ago

How does this compare to S2 from google? It is not mentioned in the post at all.

http://s2geometry.io/

seanhandley 6 years ago

You get more aesthetic visuals from H3.
S2 cells distort quite heavily depending on which part of the globe you're mapping.
- phkahler 6 years ago
  
  I think this is the real answer. I've like hex tiles for a long time, but they don't nest properly like squares in a quadtree. I think someone at Uber decided to "make it work" because it looks cool. They traded one set of problems for another, and I think from a technical point of view they made a poor choice.
  
  malandrew 6 years ago
  
  > I think from a technical point of view they made a poor choice.
  based on what criteria? H3 vs S2 is a tradeoff.
  
  phkahler 6 years ago
  
  >> based on what criteria?
  Well for one, the fact that they don't nest properly. They mention that you can get the address of the parent hexagon by truncating that of the current one. The problem is that doesn't actually work all the way up because certain tiles on the border will flip back and forth in strange ways. They had to do something to deal with this, just like the would have had to do something else with squares having two kinds ot neighbors. The squares issues are more obvious though and the code to deal with it will probably be very localized.
  I mean, don't make a nested spatial index with spacial regions that don't actually nest. The structure fails at its primary claim.
  
  ClumsyPilot 6 years ago
  
  Agreed,it seems incomprehensible that you would accept such nesting problems.
  
  malandrew 6 years ago
  
  Maybe nesting wasn't considered important for the use case?
snek 6 years ago

There is a deep comparison in the video in the post.

dang 6 years ago

Thread from last year: https://news.ycombinator.com/item?id=17417545

scrappyjoe 6 years ago

If anyone wants to use this in R, the h3-r package works quite nicely. It’s in C which, in my tests, is an order of magnitude faster than the v8 implementation. We use it to chop up spatial data into equally sized chunks that we can use as observations for regressions and the like. https://github.com/crazycapivara/h3-r

seanhandley 6 years ago

I work for a delivery company. We currently use H3 to help calculate ETAs in realtime.

lvh 6 years ago

Does that work? I’d expect you want to do a fairly traditional road-based ETA? Otherwise you’ll be wildly off if you’re in a cul de sac next to a major highway or on said highway, or in massive traffic vs not at all.
- seanhandley 6 years ago
  
  It's more to figure out how many couriers are in a given cell at a given time, and track that data over time so it can be forecasted.
  Then we can say if you need a package collecting at address X, there are usually Y available couriers within Z distance. Combine that with a road mapping/traffic API and you're done.

no_identd 6 years ago

I hate Uber, but I now opened this issue suggesting a way to massively increase the performance of this anyway:

https://github.com/uber/h3/issues/258

(Note: by "massively" I mean pushing it to O(1) for the fundamental coordinate operations. So, perhaps read that as "most massively".)

lacker 6 years ago

I might be the only one pedantically annoyed here, but... the very first image on this page contains a mix of hexagons and pentagons.

Which is it - an inaccurate image, and the system is really hexagon-based, or does the system also use pentagons? My guess is that it is actually all hexagon-based since the overlap between granularity is already only approximate.

nexuist 6 years ago

Is UberCon going on or something? Why so many releases in the past few days? I'm not complaining; just curious.

kyruzic 6 years ago

This article is over a year old.

drewm1980 6 years ago

I wonder why healpix from NASA wasn't an option for them. Maybe the nonpermissive license...

evrydayhustling 6 years ago

Why does each cell contain seven finer cells instead of six, resulting in imperfect containment?

wlesieutre 6 years ago

You can't divide a hexagon into 6 regular hexagons, that would mean using triangular cells to subdivide the hexagon.
Which isn't to say that you couldn't tile the planet with triangles, but they point out that the consistent relationship between neighboring hexagon tiles is useful:
>Using a hexagon as the cell shape is critical for H3. As depicted in Figure 6, hexagons have only one distance between a hexagon centerpoint and its neighbors’, compared to two distances for squares or three distances for triangles. This property greatly simplifies performing analysis and smoothing over gradients.
As you noticed, you can't divide a hexagon into 7 regular hexagons either. But it's apparently close enough:
>H3 supports sixteen resolutions. Each finer resolution has cells with one seventh the area of the coarser resolution. Hexagons cannot be perfectly subdivided into seven hexagons, so the finer cells are only approximately contained within a parent cell.
- evrydayhustling 6 years ago
  
  Thanks! It turns out I am bad at hexagons.
flamtap 6 years ago

I wonder if it has something to do with the number of hexes it takes to cover a sphere with? It seems to me like the larger hexes/pentagons are mostly just illustrative borders.
pfortuny 6 years ago

They are not all hexagons on the top picture: there is no hexagonal polyhedron (with regular hexagons). You can see pentagons here and there.
the_seraphim 6 years ago

because you can't do that, that's not how shapes work.

Animats 6 years ago

There's a lot to be said for a hexagonal grid. I looked at this once for a collision detection system. It's convenient to get rid of the special cases needed where four cells meet. You have to invent some new bookkeeping, but it can be worth it.

jacobolus 6 years ago

The aliasing / moiré artifacts are also dramatically less objectionable with hexagonal grids.
Now that image sensors and e.g. smartphone displays are very high resolution, and most content ends up going through multiple resampling routines on its way from capture -> storage/processing -> display, I would love to see people start to experiment with hexagon-grid cameras and displays. The slightly more complicated resampling wouldn’t really be a big deal for modern DSP hardware, and the visual output could be significantly better for the same pixel count.

bayesian_horse 6 years ago

I love the aesthetics of hexagons.

It should be possible to turn this projection into quite a nice 3D printable model. Challenging, though. Probably one could try to split the sphere into equal hexagons for printing and then assemble.

jpab 6 years ago

You can't make a sphere entirely out of hexagons; the math doesn't work [1]. In the article they note that they include 12 pentagons too, to solve this problem. Neatly, they arrange to have all the pentagons in areas of water where they presumably won't have to analyse traffic patterns for a while.
(I mention it mostly because I think it's an interesting little mathematical factoid.)
[1] https://math.stackexchange.com/q/2121175
- jacobolus 6 years ago
  
  One thing you can do though is tile an octahedron by just hexagons. At each of the 6 corners you end up with 2 hexagons which border each-other along 2 edges (instead of the usual 1). If you blow this octahedron up into a sphere those hexagons appear to be pentagons, because two of their edges are colinear (i.e. the same great-circle arc).
  This can be nicer in some cases: the edge case your hexagon-grid algorithms have to deal with is having a hexagon with one of the same neighbors twice, instead of needing to worry about pentagons per se.
  
  alanbernstein 6 years ago
  
  Would you care to explain this? When you mention tiling an octahedron with hexagons, the first thing I imagine is a truncated octahedron, which is composed of hexagons and squares.
  
  jacobolus 6 years ago
  
  You can start with 4 hexagons, each one covering an octant and a third of the neighboring octants.
  Or another way to say this: if you start with 4 hexagons, with each glued together with each other along two adjacent edges, and you add the appropriate folds, you can make an octahedron.
  Then you can subdivide each of those starting hexagons into n hexagons for any of these numbers, https://oeis.org/A003136 (power-of-4 sizes may be the most convenient among these, so that the overall grid has 2^n by 2^n size)
- na85 6 years ago
  
  That link is a great example of why I stopped using stack exchange. The OP asked a question about tiling spheres with pentagons and the third response is basically "you can tile a sphere if it's not a sphere."
  I had analogous experiences every time I asked a question there. One would ask a very clear question like, say, "how do I print to stdout in C?" And the first or second answer you get is inevitably about taking input from stdin. Or polymorphism.
  
  ant6n 6 years ago
  
  Maybe people just like to hijack threads and change the topic to what they like to talk about.
  
  alluro2 6 years ago
  
  Ouch :) Thanks for that, nice one :)

starpilot 6 years ago

Python bindings on Windows working yet?

seanhandley 6 years ago

David Ellis is working on this https://github.com/dfellis/h3-py/pull/1

jeanlucas 6 years ago

Second time seeing it, would love to use it just for fun, but no idea what to do

peter_d_sherman 6 years ago

Utterly Brilliant!

thstart 6 years ago

H3 is working for the specific Ubers’ needs. Not much useful as S2.

kjeetgill 6 years ago

I wish you weren't downvoted, this is exactly right.
The nice thing about S2 is that is subdivides cleanly: A square can be composed of smaller squares while hexagons can't be. This property makes S2 much more broadly useful for a bigger range of applications.
In H3, each hierarchical level of hexagon doesn't fit cleanly in the one below. For Uber's uses, this is acceptable because hexagons have more uniform adjacency but the "zoom in and out" math is pretty gnarly.
But even S2 had the funkyness of first mapping a sphere to a cube. They're both fairly interesting to read up on.
mtrovo 6 years ago

Could you give more details? After reading the article I think the main reason is the subdivision algorithm which for S2 is easier but for everything else hexagon seems like a better approach.
- HelloNurse 6 years ago
  
  Subdividing tiles can be more important than everything else: knowing in which tile a point lies is a fundamental operation for a spatial indexing system and it has to be very easy and efficient.
  
  twic 6 years ago
  
  It's also vital for correct aggregation: if you sum up some quantity over all the subdivisions of some area, you really want that sum to be equal to the quantity in that area. With this hexagonal pseudo-subdivision, it isn't.
  
  mtrovo 6 years ago
  
  I think as long as you're adding the leaves on the same resolution level it should be fine. the problem is when your storing objects in different levels as it would be possible on a KDB tree.

ggg3 6 years ago

did not look at the code yet, but the algo for subdiving might be a nightmare.

square and triangle are weird for distance, maybe. but they can all hold precise subdivisions of the shape inside each other. hexagon "bleed", so why exactly they boast so many times in the article that h3 is so great for nesting different details levels? even their diagrams show very bad bleeding (worse than it should be if optimized) and they never mention it on this summary.

ThouYS 6 years ago

Beautiful blog post!