Spatial Indexes and Bad Queries

So, this happened:

Tweet about Indexes

Basically a GeoDjango user posted some workarounds to some poor performance in spatial queries, and the original query was truly awful and the workaround not a lot better, so I snarked, and the GeoDjango maintainer reacted in kind.

Sometimes a guy just wants to be a prick on the internet, you know? But still, I did raise the red flag of snarkiness, so it it seems right and proper to pay the fine.

I come to this not knowing entirely the contract GeoDjango has with its users in terms of “hiding the scary stuff”, but I will assume for these purposes that:

  • Data can be in any coordinate system.
  • Data can use geometry or geography column types.
  • The system has freedom to create indexes as necessary.

So the system has to cover up a lot of variability in inputs to hide the scary stuff.

We’ll assume a table name of the_table a geometry column name of geom and a geography column name of geog.

Searching Geography

This is the easiest, since geography queries conform to the kind of patterns new users expect: the coordinates are in lon/lat but the distances are provided/returned in metres.

Hopefully the column has been spatially indexed? You can check in the system tables.

SELECT * 
FROM pg_indexes 
WHERE tablename = 'the_table';

Yes, there are more exact ways to query the system tables for this information, I give the simple example for space reasons.

If it has not been indexed, you can make a geography index like this:

CREATE INDEX the_table_geog_x 
  ON the_table
  USING GIST (geog);

And then a “buffered” query, that finds all objects within a radius of an input geometry (any geometry, though only a point is shown here) looks like this.

SELECT *
FROM the_table
WHERE ST_DWithin(
    the_table.geog,
    ST_SetSRID(ST_MakePoint(%lon, %lat), 4326),
    %radius
    );

Note that there is no “buffering” going on here! A radius search is logically equivalent and does not pay the cost of building up buffers, which is an expensive operation.

Also note that, logically, ST_DWithin(A, B, R) is the same as ST_Distance(A, B) < R, but in execution the former can leverage a spatial index (if there is one) while the latter cannot.

Indexable Functions

Since I mention that ST_DWithin() is indexable, I should list all the functions that can make use of a spatial index:

And for a bonus there are also a few operators that access spatial indexes.

  • geom_a && geom_b returns true if the bounding box of geom_a intersects the bounding box of geom_b in 2D space.
  • geom_a &&& geom_b returns true if the bounding box of geom_a intersects the bounding box of geom_b in ND space (an ND index is required for this to be index assisted),

Searching Planar Geometry

If the data are planar, then spatial searching should be relatively easy, even if the input geometry is not in the same coordinate system.

First, is your data planar? Here’s a quick-and-dirty query that returns true for geographic data and false for planar.

SELECT srs.srtext ~ '^GEOGCS' 
FROM spatial_ref_sys srs
JOIN geometry_columns gc
ON srs.srid = gc.srid
WHERE gc.f_table_name = 'the_table'

Again, do you have a spatial index on your geometry column? Check!

CREATE INDEX the_table_geom_x 
  ON the_table
  USING GIST (geom);

Now, assuming query coordinates in the same planar projection as the data, a fast query will look like this:

SELECT *
FROM the_table
WHERE ST_DWithin(
    geom,
    ST_SetSRID(ST_MakePoint(%x, %y), %srid)
    %radius
    );

But what if users are feeding in queries in geographic coordinates? No problem, just convert them to planar before querying:

SELECT *
FROM the_table
WHERE ST_DWithin(
    geom,
    ST_Transform(
        ST_SetSRID(ST_MakePoint(%lon, %lat), 4326), 
        %srid)
    %radius
    );

Note that here we are transforming the geography query to planar, not transforming the planar column to geographic!

Why? There’s only one query object, and there’s potentially 1000s of rows of table data. And also, the spatial index has been built on the planar data: it cannot assist the query unless the query is in the same projection.

Searching Lon/Lat Geometry

This is the hard one. It is quite common for users to load geographic data into the “geometry” column type. So the database understands them as planar (that’s what the geometry column type is for!) while their units (longitude and latitude) are in fact angular.

There are benefits to staying in the geometry column type:

  • There are far more functions native to geometry, so you can avoid a lot of casting.
  • If you are mostly working with planar queries you can get better performance from 2D functions.

However, there’s a huge downside: questions that expect metric answers or metric parameters can’t be answered. ST_Distance() between two geometry objects with lon/lat coordinates will return an answer in… “degrees”? Not really an answer that makes any sense, as cartesian math on anglar coordinates returns garbage.

So, how to get around this conundrum? First, the system has to recognize the conundrum!

  • Is the column type “geometry”?
  • Is the SRID a long/lat coordinate system? (See test query above.)

Both yes? Ok, this is what we do.

First, create a functional index on the geography version of the geometry data. (Since you’re here, make a standard geometry index too, for other purposes.)

CREATE INDEX the_table_geom_geog_x
ON the_table
USING GIST (geography(geom));

CREATE INDEX the_table_geom
ON the_table
USING GIST (geom);

Now we have an index that understands geographic coordinates!

All we need now is a way to query the table that uses that index efficiently. The key with functional indexes is to ensure the function you used in the index shows up in your query.

SELECT * 
FROM the_table
WHERE ST_DWithin(
    geography(geom),
    geography(ST_SetSRID(ST_MakePoint(%lon, %lat), 4326))
    %radius
    );

What’s going on here? By using the “geography” version of ST_DWithin() (where both spatial arguments are of type “geography”) I get a search in geography space, and because I have created a functional index on the geography version of the “geom” column, I get it fully indexed.

Random Notes

  • The user blog post asserts incorrectly that their best performing query is much faster because it is “using the spatial index”.
SELECT 
    "core_searchcriteria"."id", 
    "core_searchcriteria"."geo_location"::bytea,
     "core_searchcriteria"."distance",
     ST_DistanceSphere(
        "core_searchcriteria"."geo_location", 
        ST_GeomFromEWKB('\001\001\000\000 \346\020\000\000\000\000\000\000\000@U@\000\000\000\000\000\000@@'::bytea)) AS "ds" 
     FROM "core_searchcriteria" 
        WHERE ST_DistanceSphere(
            "core_searchcriteria"."geo_location", 
            ST_GeomFromEWKB('\001\001\000\000 \346\020\000\000\000\000\000\000\000@U@\000\000\000\000\000\000@@'::bytea)
        ) <= "core_searchcriteria"."distance";
  • However, the WHERE clause is just not using any of the spatially indexable functions. Any observed speed-up is just because it’s less brutally ineffecient than the other queries.

  • Why was the original brutally inefficient?

SELECT 
    "core_searchcriteria"."id", 
    "core_searchcriteria"."geo_location"::bytea, 
    "core_searchcriteria"."distance", 
    ST_Buffer(
        CAST("core_searchcriteria"."geo_location" AS geography(POINT,4326)), "core_searchcriteria"."distance"
        )::bytea AS "buff" FROM "core_searchcriteria" 
    WHERE ST_Intersects(
        ST_Buffer(
            CAST("core_searchcriteria"."geo_location" AS geography(POINT,4326)), "core_searchcriteria"."distance"), 
        ST_GeomFromEWKB('\001\001\000\000 \346\020\000\000\000\000\000\000\000@U@\000\000\000\000\000\000@@'::bytea)
    )
  • The WHERE clause converts the entire contents of the data column to geography and then buffers every single object in the system.
  • It then compares all those buffered objects to the query object (what, no index? no).
  • Since the column objects have all been buffered… any spatial index that might have been built on the objects is unusable. The index is on the originals, not on the buffered objects.

Dumping a ByteA with psql

Sometimes you just have to work with binary in your PostgreSQL database, and when you do the bytea type is what you’ll be using. There’s all kinds of reason to work with bytea:

  • You’re literally storing binary things in columns, like image thumbnails.
  • You’re creating a binary output, like an image, a song, a protobuf, or a LIDAR file.
  • You’re using a binary transit format between two types, so they can interoperate without having to link to each others internal format functions. (This is my favourite trick for creating a library with optional PostGIS integration, like ogr_fdw.)

Today I was doing some debugging on the PostGIS raster code, testing out a new function for interpolating a grid surface from a non-uniform set of points, and I needed to be able to easily see what the raster looked like.

Interpolated raster surface grid

There’s a function to turn a PostGIS raster into a GDAL image format, so I could create image data right in the database, but in order to actually see the image, I needed to save it out as a file. How to do that without writing a custom program? Easy! (haha)

Basic steps:

  • Pipe the query of interest into the database
  • Access the image/music/whatever as a bytea
  • Convert that bytea to a hex string using encode()
  • Ensure psql is not wrapping the return in any extra cruft
  • Pipe the hex return value into xxd
  • Redirect into final output file

Here’s what it would look like if I was storing PNG thumbnails in my database and wanted to see one:

echo "SELECT encode(thumbnail, 'hex') FROM persons WHERE id = 12345" \
  | psql --quiet --tuples-only -d dbname \
  | xxd -r -p \
  > thumbnail.png

Any bytea output can be pushed through this chain, here’s what I was using to debug my ST_GDALGrid() function.

echo "SELECT encode(ST_AsGDALRaster(ST_GDALGrid('MULTIPOINT(10.5 9.5 1000, 11.5 8.5 1000, 10.5 8.5 500, 11.5 9.5 500)'::geometry, ST_AddBand(ST_MakeEmptyRaster(200, 400, 10, 10, 0.01, -0.005, 0, 0), '16BSI'), 'invdist' ), 'GTiff'), 'hex')" \
  | psql --quiet --tuples-only grid \
  | xxd -r -p \
  > testgrid.tif 

Waiting for PostGIS 3.1: GEOS 3.9

This post originally appeared on the Crunchy Data blog.


While we talk about “PostGIS” like it’s one thing, it’s actually the collection of a number of specialized geospatial libraries, along with a bunch of code of its own.

  • PostGIS provides core functionality
    • bindings to PostgreSQL, the types and indexes,
    • format reading and writing
    • basic algorithms like distance and area
    • performance tricks like caching
    • simple geometry manipulations (add a point, dump rings, etc)
    • algorithms that don’t exist in the other libraries
  • Proj provides coordinate system transformations
  • GDAL provides raster algorithms and format supports
  • GEOS provides computational geometry algorithms
    • geometry relationships, like “intersects”, “touches” and “relate”
    • geometry operations, like “intersection”, “union”
    • basic algorithms, like “triangulate”

The algorithms in GEOS are actually a port to C++ of algoriths in the JTS Java library. The ecosystem of projects that depend on GEOS or JTS or one of the other language ports of GEOS is very large.

GEOS/JTS Ecosystem

Overlay NG

Over the past 12 months, the geospatial team at Crunchy Data has invested heavily in JTS/GEOS development, overhauling the overlay engine that backs the Intersection, Union, Difference and SymDifference functions in all the projects that depend on the library.

IntersectionUnion

The new overlay engine, “Overlay NG”, promises to be more reliable, and hopefully also faster for most common cases.

One use of overlay code is chopping large objects up, to find the places they have in common. This query summarizes climate zones (bec) by watershed (wsa).

SELECT 
    Sum(ST_Area(ST_Intersection(b.geom, w.geom))) AS area_zone, 
    w.wsd_id, 
    b.zone
FROM bec b
JOIN wsa w
ON ST_Intersects(b.geom, w.geom)
WHERE w.nwwtrshdcd like '128-835500-%'
GROUP BY 2, 3

Summarization

The new implementation for this query runs about 2 times faster than the original. Even better, when run on a larger area with more data, the original implementation fails – it’s not possible to get a result out. The new implementation runs to completion.

Another common use of overlay code is melting together areas that share an attribute. This query takes (almost) every watershed on Vancouver Island and melts them together.

SELECT ST_Union(geom)
FROM wsa
WHERE nwwtrshdcd like '920-%'
   OR nwwtrshdcd like '930-%'

At the start, there are 1384 watershed polygons.

Vancouver Island watersheds

At the end there is just one.

Vancouver Island

The new implementation takes about 50% longer currently, but it is more robust and less likely to fail than the original.

Fixed Precision Overlay

The way Overlay NG ensures robust results, is by falling back to more and more reliable noding approaches. “Noding” refers to how new vertices are introduced into geometries during the overlay process.

  • Initially a naive “floating point” noding is used, that just uses double precision coordinates. This works most of the time, but occasionally fails when noding “almost parallel” edges.
  • On failure, a “snapping” noding is used, which nudges nearby edges and nodes together within a tolerance. That works most of the time, but occasionally fails.
  • Finally, a “fixed precision” noding nudges all of the coordinates in both geometries into a fixed space, where edge collapses can be handled deterministically. This is the lowest performance approach, but it very very rarely occurs.

Sometimes, end users actually prefer to have their geometry forced into a fixed precision grid, and for overlay to use a fixed precision. For those users, with PostGIS 3.1 and GEOS 3.9 there are some new parameters in the intersection/union/difference functions.

Precision reduction

The new “gridSize” parameter determines the size of the grid to snap to when generating new outputs. This can be used both to generate new geometries, and also to precision reduce existing geometries, just by unioning a geometry with an empty geometry.

Inscribed Circle

As always, there are a few random algorithmic treats in each new GEOS release. For 3.9, there is the “inscribed circle”, which finds the largest circle that can be fit inside a polygon (or any other boundary).

Vancouver Island inscribed circle

In addition to making a nice picture, the inscribed circle functions as a measure of the “wideness” of a polygon, so it can be used for things like analyzing river polygons to determine the widest point, or placing labels.

Waiting for PostGIS 3.1: Grid Generators

This post originally appeared on the Crunchy Data blog.


Summarizing data against a fixed grid is a common way of preparing data for analysis. Fixed grids have some advantages over natural and administrative boundaries:

  • No appeal to higher authorities
  • Equal unit areas
  • Equal distances between cells
  • Good for passing data from the “spatial” computational realm to a “non-spatial” realm

Ideally, we want to be able to generate grids that have some key features:

  • Fixed origin point, so that grid can be re-generated and not move
  • Fixed cell coordinates for a given cell size, so that the same cell can be referred to just using a cell address, without having to materialize the cell bounds

ST_SquareGrid()

The ST_SquareGrid(size, bounds) function generates a grid with an origin at (0, 0) in the coordinate plane, and fills in the square bounds of the provided geometry.

SELECT (ST_SquareGrid(400000, ST_Transform(a.geom, 3857))).* 
FROM admin a  
WHERE name = 'Brazil';

So a grid generated using Brazil as the driving geometry looks like this.

Brazil square grid

ST_HexagonGrid()

The ST_HexagonGrid(size, bounds) function works much the same as the square grid function.

Hexagons are popular for some cartographic display purposes and modeling purposes. Surprisingly they can also be indexed using the same two-dimensional indexing scheme as squares.

The hexagon grid starts with a (0, 0) hexagon centered at the origin, and the gridding for a bounds includes all hexagons that touch the bounds.

Hexagon gridding

As with the square gridding, the coordinates of hexes are fixed for a particular gridding size.

SELECT (ST_HexagonGrid(100000, ST_Transform(a.geom, 3857))).* 
FROM admin a  
WHERE name = 'Germany';

Here’s a 100km hexagon gridding of Germany.

Germany hex grid

Summarizing with Grids

It’s possible to materialize grid-based summaries, without actually materializing the grids, using the generator functions to create the desired grids on-the-fly.

Here’s a summary of population points, using a hex grid.

SELECT sum(pop_max) as pop_max, hexes.geom
FROM
    ST_HexagonGrid(
        4.0,
        ST_SetSRID(ST_EstimatedExtent('places', 'geom'), 4326)
    ) AS hexes
    INNER JOIN
    places AS p
    ON ST_Intersects(p.geom, hexes.geom)
GROUP BY hexes.geom;

World population summary

It’s also possible to join up on-the-fly gridding to visualization tools, for very dynamic user experiences, feeding these dynamically generated grids out to the end user via pg_tileserv.

Waiting for PostGIS 3.1: Performance

This post originally appeared on the Crunchy Data blog.


Open source developers sometimes have a hard time figuring out what feature to focus on, in order to generate the maximum value for end users. As a result, they will often default to performance.

Performance is the one feature that every user approves of. The software will keep on doing all the same cool stuff, only faster.

For PostGIS 3.1, there have been a number of performance improvements that, taken together, might add up to a substantial performance gain for your workloads.

Large Geometry Caching

Spatial joins have been slowed down by the overhead of large geometry access for a very long time.

SELECT A.*, B.*
FROM A
JOIN B
ON ST_Intersects(A.geom, B.geom)

PostgreSQL will plan and execute spatial joins like this using a “nested loop join”, which means iterating through one side of the join, and testing the join condition. This results in executions that look like:

  • ST_Intersects(A.geom(1), B.geom(1))
  • ST_Intersects(A.geom(1), B.geom(2))
  • ST_Intersects(A.geom(1), B.geom(3))

So one side of the test repeats over and over.

Geometry Caches

Caching that side and avoiding re-reading the large object for each iteration of the loop makes a huge difference to performance. We have seen 20 times speed-ups in common spatial join workloads (see below).

The fixes are quite technical, but if you are interested we have a detailed write-up available.

Header-only Geometry Reads

The on-disk format for geometry includes a short header that includes information about the geometry bounds, the spatial reference system and dimensionality. That means it’s possible for some functions to return an answer after only reading a few bytes of the header, rather than the whole object.

However, not every function that could do a fast read, did do a fast read. That is now fixed.

Faster Text Generation

It’s pretty common for web applications and others to generate text formats inside the database, and the code for doing so was not optimized. Generating “well-known text” (WKT), GeoJSON, and KML output all now use a faster path and avoid unnecessary copying.

PostGIS also now uses the same number-to-text code as PostgreSQL, which has been shown to be faster, and also allows us to expose a little more control over precision to end users.

How Much Faster?

For the specific use case of spatially joining, here is my favourite test case:

Admin0 and Populated Places

Load the data into both versions.

shp2pgsql -D -s 4326 -I ne_10m_admin_0_countries admin | psql postgis30
shp2pgsql -D -s 4326 -I ne_10m_populated_places places | psql postgis30

Run a spatial join that finds the sum of populated places in each country.

EXPLAIN ANALYZE
SELECT Sum(p.pop_max) as pop_max, a.name
FROM admin a
JOIN places p
ON ST_Intersects(a.geom, p.geom)
GROUP BY a.name

Average time over 5 runs:

  • PostGIS 3.0 = 23.4s
  • PostGIS 3.1 = 0.9s

This test is somewhat of a “worst case”, in that there are lots of very large countries in the admin data, but it gives an idea of the kind of speed-ups that are available for spatial joins against collections that include larger (250+ coordinates) geometries.