ST_Subdivide all the Things

This post originally appeared in the CARTO blog.

One of the things that makes managing geospatial data challenging is the huge variety of scales that geospatial data covers: areas as large as a continent or as small as a man-hole cover.

The data in the database also covers a wide range, from single points, to polygons described with thousands of vertices. And size matters! A large object takes more time to retrieve from storage, and more time to run calculations on.

The Natural Earth countries file is a good example of that variation. Load the data into PostGIS and inspect the object sizes using SQL:

SELECT admin, ST_NPoints(the_geom), ST_MemSize(the_geom) 
FROM ne_10m_admin_0_countries 
  • Coral Sea Islands are represented with a 4 point polygon, only 112 bytes.
  • Canada is represented with a 68159 point multi-polygon, 1 megabytes in size!

Countries by Size in KB

Over half (149) of the countries in the table are larger than the database page size (8Kb) which means they will take extra time to retrieve.

SELECT Count(*) 
FROM ne_10m_admin_0_countries 
WHERE ST_MemSize(the_geom) > 8192;

We can see the overhead involved in working with large data by forcing a large retrieval and computation.

Load the Natural Earth populated places into PostGIS as well, and then run a full spatial join between the two tables:

SELECT Count(*)
FROM ne_10m_admin_0_countries countries 
JOIN ne_10m_populated_places_simple places 
ON ST_Contains(countries.the_geom, places.the_geom)

Even though the places table (7322) and countries table (255) are quite small the computation still takes several seconds (about 30 seconds on my computer).

The large objects cause a number of inefficiencies:

  • Geographically large areas (like Canada or Russia) have large bounding boxes, so the indexes don’t work as efficiently in winnowing out points that don’t fall within the countries.
  • Physically large objects have large vertex lists, which take a long time to pass through the containment calculation. This combines with the poor winnowing to make a bad situation worse.

How can we speed things up? Make the large objects smaller using ST_Subdivide()!

First, generate a new, sub-divided countries table:

CREATE TABLE ne_10m_admin_0_countries_subdivided AS
SELECT ST_SubDivide(the_geom) AS the_geom, admin 
FROM ne_10m_admin_0_countries;

Now we have the same data, but no object is more than 255 vertices (about 4Kb) in size!

Subdivided Countries by Size in KB

Run the spatial join torture test again, and see the change!

SELECT Count(*)
FROM ne_10m_admin_0_countries_subdivided countries 
JOIN ne_10m_populated_places_simple places 
ON ST_Contains(countries.the_geom, places.the_geom)

On my computer, the return time about 0.5 seconds, or 60 times faster, even though the countries table is now 8633 rows. The subdivision has accomplished two things:

  • Each polygon now covers a smaller area, so index searches are less likely to pull up points that are not within the polygon.
  • Each polygon is now below the page size, so retrieval from disk will be much faster.

Subdividing big things can make map drawing faster too, but beware: once your polygons are subdivided you’ll have turn off the polygon outlines to avoid showing the funny square boundaries in your rendered map.

Happy mapping and querying!

GZip in PostgreSQL

I love PostgreSQL extensions.

Extensions are the truest expression of the second principle of the original “design of Postgres” vision, to

provide user extendibility for data types, operators and access methods.

Extensions allow users to do more with PostgreSQL than just basic storage and retrieval. PostgreSQL is a full-on integration environment, like Python or Perl, and you can build very complete data manipulation pipelines very close to the metal using native and extension features of PostgreSQL.

Even though I’m a contributor to one of the largest PostgreSQL extensions, I have particularly come to love small extensions, that do one simple thing, particularly one simple thing we maybe take for granted in other environments.

My old HTTP extension is just a binding of libcurl to a SQL interface, so users can do web queries inside the SQL environment.

And today I’ve finished up a GZIP extension, that is just a binding of zlib to SQL, so that users can… compress and decompress things.

It’s not a lot, but it’s a little.

The GZIP entension came about because of an email on the PostGIS development list, where Yuri noted

The amazing ST_AsMVT() has two common usage patterns: copy resulting MVTs to a tile cache (e.g. .mbtiles file or a materialized view), or serve MVT to the users (direct SQL->browser approach). Both patterns still require one additional data processing step – gziping.

Huh. And this use case also applies to people generating GeoJSON directly in the database and sending it out to web clients.

The PostgreSQL core has generally frowned on compression functions at the SQL level, because the database already does compression of over-sized tuples as necessary. The last thing we want is people manually applying compression to column values, and then stuffing them into rows where the database will then have to re-compress them internally. From the perspective of storage efficiency, just standing back and letting PostgreSQL do its work is preferable.

But from the perspective of an integration environment, where an application might be expected to emit or consume compressed data, having a tool in SQL to pack and unpack that data is potentially quite useful.

So I did the tiny binding to zlib and packed it up in an extension.

I hope lots of people find it useful.

Waiting for PostGIS 3: GEOS 3.8

While PostGIS includes lots of algorithms and functionality we have built ourselves, it also adds geospatial smarts to PostgreSQL by linking in specialized libraries to handle particular problems:

  • Proj for coordinate reference support;
  • GDAL for raster functions and formats;
  • GEOS for computational geometry (basic operations);
  • CGAL for more computational geometry (3D operations); and
  • for format support, libxml2, libjsonc, libprotobuf-c

Many of the standard geometry processing functions in PostGIS are actually evaluated inside the GEOS library, so updates in GEOS are very important to PostGIS – they add new functionality or smooth the behaviour of existing functions.

Functions backed by GEOS include:

These functions are all “overlay operation” functions – they take in geometry arguments and construct new geometries for output. Under the covers is an operation called an “overlay”, which combines all the edges of the inputs into a graph and then extracts new outputs from that graph.

While the “overlay operations” in GEOS are very reliable, they are not 100% reliable. When operations fail, the library throws the dreaded TopologyException, which indicates the graph is in an inconsistent and unusable state.

Because there are a lot of PostGIS users and they manage a lot of data, there are a non-zero number of cases that cause TopologyExceptions, and upset users. We would like take that number down to zero.

Update: Next-generation overlay did not make the 3.8 GEOS release and will be part of 3.9 instead.

With luck, GEOS 3.8 will succeed in finally bringing fully robust overlay operations to the open source community. The developer behind the GEOS algorithms, Martin Davis, recently joined Crunchy Data, and has spent this summer working on a new overlay engine.

Overlay failures are caused when intersections between edges result in inconsistencies in the overlay graph. Even using double precision numbers, systems have only 51 bits of precision to represent coordinates, and that fixed precision can result in graphs that don’t correctly reflect their inputs.

The solution is building a system that can operate on any fixed precision and retain valid geometry. As an example, here the new engine builds valid representations of Europe at any precision, even ludicrously coarse ones.

europe at different precisions

In practice, the engine will be used with a tolerance that is close to double precision, but still provides enough slack to handle tricky cases in ways that users find visually “acceptable”. Initially the new functionality should slot under the existing PostGIS functions without change, but in the future we will be able to expose knobs to allow users to explicitly set the precision domain they want to work in.

GEOS 3.8 may not be released in time for PostGIS 3, but it will be a close thing. In addition to the new overlay engine, a lot of work has been done making the code base cleaner, using more “modern” C++ idioms, and porting forward new fixes to existing algorithms.

Waiting for PostGIS 3: Parallelism in PostGIS

Parallel query has been a part of PostgreSQL since 2016 with the release of version 9.6 and in theory PostGIS should have been benefiting from parallelism ever since.

In practice, the complex nature of PostGIS has meant that very few queries would parallelize under normal operating configurations – they could only be forced to parallelize using oddball configurations.

With PostgreSQL 12 and PostGIS 3, parallel query plans will be generated and executed far more often, because of changes to both pieces of software:

  • PostgreSQL 12 includes a new API that allows extensions to modify query plans and add index clauses. This has allowed PostGIS to remove a large number of inlined SQL functions that were previously acting as optimization barriers to the planner.
  • PostGIS 3 has taken advantage of the removal of the SQL inlines to re-cost all the spatial functions with much higher costs. The combination of function inlining and high costs used to cause the planner to make poor decisions, but with the updates in PostgreSQL that can now be avoided.

Increasing the costs of PostGIS functions has allowed us to encourage the PostgreSQL planner to be more aggressive in choosing parallel plans.

PostGIS spatial functions are far more computationally expensive than most PostgreSQL functions. An area computation involves lots of math involving every point in a polygon. An intersection or reprojection or buffer can involve even more. Because of this, many PostGIS queries are bottlenecked on CPU, not on I/O, and are in an excellent position to take advantage of parallel execution.

One of the functions that benefits from parallelism is the popular ST_AsMVT() aggregate function. When there are enough input rows, the aggregate will fan out and parallelize, which is great since ST_AsMVT() calls usually wrap a call to the expensive geometry processing function, ST_AsMVTGeom().

Tile 1/0/0

Using the Natural Earth Admin 1 layer of states and provinces as an input, I ran a small performance test, building a vector tile for zoom level one.

parallel MVT tile performance

Spatial query performance appears to scale about the same as non-spatial as the number of cores increases, taking 30-50% less time with each doubling of processors, so not quite linearly.

Join, aggregates and scans all benefit from parallel planning, though since the gains are sublinear there’s a limit to how much performance you can extract from an operation by throwing more processors at at. Also, operations that do a large amount of computation within a single function call, like ST_ClusterKMeans, do not automatically parallelize: the system can only parallelize the calling of functions multiple times, not the internal workings of single functions.

Waiting for PostGIS 3: ST_Transform() and Proj6

Where are you? Go ahead and figure out your answer, I’ll wait.

No matter what your answer, whether you said “sitting in my office chair” or “500 meters south-west of city hall” or “48.43° north by 123.36° west”, you expressed your location relative to something else, whether that thing was your office layout, your city, or Greenwich.

A geospatial database like PostGIS has to have able to convert between these different reference frames, known as “coordinate reference systems”. The math for these conversions and the definition of the standards needed to align them all is called “geodesy”, a field with sufficient depth and detail that you can make a career out of it.

geographic crs

Fortunately for users of PostGIS, most of the complexity of geodesy is hidden under the covers, and all you need to know is that different common coordinate reference systems are described in the spatial_ref_sys table, so making conversions between reference system involves knowing just two numbers: the srid (spatial reference id) of your source reference system, and the srid of your target system.

For example, to convert (and display) a point from the local British Columbia Albers coordinate reference system to geographic coordinates relative to the North American datum (NAD83), the following SQL is used:

        ST_SetSRID('POINT(1195736 383004)'::geometry, 3005),
 POINT(-123.360004575121 48.4299914959586)

PostGIS makes use of the Proj library for coordinate reference system conversions, and PostGIS 3 will support the latest Proj release, version 6.

Proj 6 includes support for time-dependent datums and for direct conversion between datums. What does that mean? Well, one of the problems with the earth is that things move. And by things, I mean the ground itself, the very things we measure location relative to.

Highway fault

As location measurement becomes more accurate, and expectations for accuracy grow, reference shifts that were previously dealt with every 50 years have to be corrected closer to real time.

  • North America had two datums set in the twentieth century, NAD 27 and NAD 83. In 2022, North America will get new datums, fixed to the tectonic plates that make up the continent, and updated regularly to account for continental drift.
  • Australia is on fast moving plates that travel about 7cm per year, and will modernize its datums in 2020. 7cm/year may not sount like much, but that means that a coordinate captured to centimeter accuracy will be almost a meter out of place in a decade. For an autonomous drone navigating an urban area, that could be the difference between an uneventful trip and a crash.

Being able to accurately convert between local reference frames, like continental datums where static data are captured, to global frames like those used by the GPS/GLONASS/Galileo systems is critical for accurate and safe geo-spatial calculations.

Local vs global datum

Proj 6 combines updates to handle the new frames, along with computational improvements to make conversions between frames more accurate. Older versions used a “hub and spoke” system for conversions between systems: all conversions had WGS84 as a “neutral” frame in the middle.

A side effect of using WGS84 as a pivot system was increased error, because no conversion between reference systems is error free: a conversion from one frame to another would accrue the error associated with two conversions, instead of one. Additionally, local geodesy agencies–like NOAA in the USA and GeoScience Australia–published very accurate direct transforms between historical datums, like NAD27 and NAD83, but old versions of Proj relied on hard-coded hacks to enable direct transforms. Proj 6 automatically finds and uses direct system-to-system transforms where they exist, for the most accurate possible transformation.