5x Faster Spatial Join with this One Weird Trick
28 Sep 2018Update: As of PostGIS 3.1, spatial join is much faster due to some caching improvements, and this trick will mostly no longer yield large performance gains, though it still won’t hurt.
My go-to performance test for PostGIS is the point-in-polygon spatial join: given a collection of polygons of variables sizes and a collection of points, count up how many points are within each polygon. It’s a nice way of testing indexing, point-in-polygon calculations and general overhead.
Setup
First download some polygons and some points.
Load the shapes into your database.
shp2pgsql -s 4326 -D -I ne_10m_admin_0_countries.shp countries | psql performance
shp2pgsql -s 4326 -D -I ne_10m_populated_places.shp places | psql performance
Now we are ready with 255 countries and 7343 places.
One thing to note about the countries is that they are quite large objects, with 149 of them having enough vertices to be stored in TOAST tuples.
Baseline Performance
Now we can run the baseline performance test.
On my laptop, this query takes 25 seconds.
If you stick the process into a profiler while running it you’ll find that over 20 of those seconds are spent in the pglz_decompress
function. Not doing spatial algorithms or computational geometry, just decompressing the geometry before handing it on to the actual processing.
Among the things we talked about this week at our PostGIS code sprint have been clever ways to avoid this overhead:
- Patch PostgreSQL to allow partial decompression of geometries.
- Enrich our serialization format to include a unique hash key at the front of geometries.
These are cool have-your-cake-and-eat-too ways to both retain compression for large geometries and be faster when feeding them into the point-in-polygon machinery.
However, they ignore a more brutal and easily testable approach to avoiding decompression: just don’t compress in the first place.
One Weird Trick
PostGIS uses the “main” storage option for its geometry type. The main option tries to keep geometries in their original table until they get too large, then compresses them in place, then moves them to TOAST.
There’s another option “external” that keeps geometries in place, and if they get too big moves them to TOAST uncompressed. PostgreSQL allows you to change the storage on columns at run-time, so no hacking or code is required to try this out.
The spatial join now runs in under 4 seconds.
What’s the penalty?
- With a “main” storage the table+toast+index is 6MB.
- With a “external” storage the table+toast+index is 9MB.
Conclusion
For a 50% storage penalty, on a table that has far more large objects than most spatial tables, we achieved a 500% performance improvement. Maybe we shouldn’t apply compression to our large geometry at all?
Using “main” storage was mainly a judgement call back when we decided on it, it wasn’t benchmarked or anything – it’s possible that we were just wrong. Also, only large objects are compressed; since most tables are full of lots of small objects (short lines, points) changing to “external” by default wouldn’t have any effect on storage size at all.