PostGIS Code Sprint 2018 #2

An important topic of conversation this sprint was what kinds of core PostgreSQL features might make PostGIS better in the future?

PostGIS Code Sprint 2018 #2

Wider Typmod

The current attribue typmod column is a 32-bit integer. For geometry, we are already packing that 32 bits to the limit: a couple bits for dimensionality, some more for the type number, and the bit kahune, a bunch for the SRID number. Having even a 64-bit typmod number would allow even more interesting things, like declared coordinate precision, to fit in there. Maybe we are abusing typmod and there’s a better way to do what we want though?

Parallel GIST Scan

The PostGIS spatial index is built using the PostgreSQL GIST index infrastructure, so anything that makes GIST scans faster is a win for us. This would be a big win for folks with large tables (and thus deep trees) and who run scans that return a lot of indexed tuples.

Faster GIST Index Building

B-Tree index builds are accellerated by pre-sorting the inputs; could the same trick be used in building GIST indexes? Again, for large tables, GIST index building is slower than B-Tree and “faster” is the #1 feature all existing users want.

Multi-Threading in Functions

This isn’t a feature request, so much as a request for clarification and assurance: PostGIS calls out to other libraries, like GEOS, and it’s possible we could make some of our algorithms there faster via parallel processing. If we do our parallel processing within a function call, so the PostgreSQL thread of execution isn’t doing anything until we return, is it OK for us to do threading? We use pthreads, or maybe OpenMP.

Compressed Datum Slicing

“Detoast datum slice” doesn’t actually get around to the slicing step until after the datum is decompressed, which can make some queries quite slow. We already try to read boxes from the headers of our objects, and for large objects that means decompressing the whole thing: it would be nice to only decompress the first few bytes. I have an ugly patch I will be testing to try and get committed.

Forced Inlining

A problem we have with PostgreSQL right now is that we cannot effectively cost our functions due to the current inlining behaviour on our wrapper functions like ST_Intersects(). When raised on the list, the hackers came to a tentative conclusion that improving the value caching behaviour would be a good way to avoid having inlining in incorrect places. It sounded like subtle and difficult work, and nobody jumped to it.

We propose leaving the current inlining logic unchanged, but adding an option to SQL language functions to force inlining for those functions. That way there is no ambiguity: we always want our wrapper functions inlined; always, always, always. If we could use an INLINE option on our wrapper function definitions all the current careful logic can remain in place for more dynamic use cases.

Extension Version Dependency

PostGIS is growing a small forest of dependent extensions

  • some inside the project, like postgis_topology and now postgis_raster
  • some outside the project, like PgRouting and pgpointcloud

When a new version of PostGIS is installed, we want all the PostGIS extensions to be updated. When a third party extension is installed, it may require features from a particular recent version of PostGIS.

The extension framework supports dependency, but for safety, as the ecosystem grows, version dependencies are going to be required eventually.

Global Upgrade Paths

Right now extension upgrade paths have to explicitly state the start and end version of the path. So an upgrade file might be named postgis--2.3.4--2.3.5.sql. That’s great if you have four or five versions. We have way more than that. The number of upgrade files we have keeps on growing and growing.

Unlike upgrade files for smaller projects, we drop and recreate all the functions in our upgrade files. That means that actually our current version upgrade file is capable of upgrading any prior version. Nonetheless, we have to make a copy, or a symlink, many many version combinations.

If there was a global “version”, we could use our master upgrade script, and only ship one script for each new version: postgis--ANY--2.3.5.sql

Size Based Costing in the Planner

Right now costing in the planner is based heavily on the “number of rows” a given execution path might generate at each stage. This is fine when the cost of processing each tuple is fairly uniform.

For us, the cost of processing a tuple can vary wildly: calculating the area of a 4 point polygon is far cheaper than calculating the area of a 40000 point polygon. Pulling a large feature out of TOAST tuples is more expensive than pulling it from main storage.

Having function COST taken into more consideration in plans, and having that COST scale with the average size of tuples would make for better plans for PostGIS. It would also make for better plans for PostgreSQL types that can get very large, like text and tsvector.

The analysis hooks might have to be enriched to also ask for stats on average tuple size for a query key, in addition to selectivity, so a query that pulled a moderate number of huge objects might have a higher cost than one that pulled quite a few small objects.

We Are Not Unreasonable People

We just want our due, you know?

Thanks, PostgreSQL team!

PostGIS Code Sprint 2018 #1

When I tell people I am heading to an open source “code sprint”, which I try to do at least once a year, they ask me “what do you do there?”

When I tell them, “talk, mostly”, they are usually disappointed. There’s a picture, which is not unearned, of programmers bent over their laptops, quietly tapping away. And that happens, but the real value, even when there is lots of tapping, is in the high-bandwidth, face-to-face communication.

PostGIS Code Sprint 2018 #1

So, inevitably I will be asked what I coded, this week at the PostGIS Code Sprint and I will answer… “uhhhhh”. I did a branch of PostgreSQL that will do partial decompression of compressed tuples, but didn’t get around to testing it. I tested some work that others had done. But mostly, we talked.

PostGIS 3

Why move to PostGIS 3 for the next release? Not necessarily because we will have any earth-shattering features, but to carry out a number of larger changes. Unlike most PostgreSQL extensions, PostGIS has a lot of legacy from past releases and has added, removed and renamed functions over time. These things are disruptive, and we’d like to do some of the known disruptive things at one time.

Split Vector and Raster

When we brought raster into PostGIS, we included it in the “postgis” extension, so if you CREATE EXTENSION postgis you get both vector and raster features. The rationale was that if we left it optional, packagers wouldn’t build it, and thus most people wouldn’t have access to the functionality, so it wouldn’t get used, so we’d be maintaining unused garbage code.

Even being included in the extension, by and large people haven’t used it much, and the demand from packagers and other users to have a “thin” PostGIS with only vector functionality have finally prevailed: when you ALTER EXTENSION postgis UPDATE TO '3.0.0' the raster functions will be unbundled from the extension. They can then be re-bundled into a “postgis_raster” dependent package and either dropped or kept around depending on user preference.

Remove postgis.so Minor Version

For users in production, working with packaged PostgreSQL, in deb or rpm packages, the packaging system often forces you to have only one version of PostGIS installed at a time. When upgrading PostgreSQL and PostGIS the net effect is to break pg_upgrade, meaning PostGIS users are mandated to do a full dump/restore.

Removing the minor version will allow the pg_upgrade process to run through, and users can then run the sql ALTER EXTENSION postgis UPDATE command to synchronize their SQL functions with the new binary postgis.so library.

This is good for most users. It’s bad for users who expect to be able to run multiple versions of PostGIS on one server: they won’t easily be able to. There will be a switch to make it possible to build with minor versions again, but we expect it will mostly be used by us (developers) for testing and development.

Serialization

As I discussed recently, the compression of geometries in PostgreSQL can have a very large effect on performance.

A new serialization could:

  • use a more effective compression format for our kind of data, arrays of autocorrelated doubles
  • add space for more flag bits for things like
    • encoding a smaller point format
    • flagging empty geometries
    • noting the validity of the object
    • noting the presense of a unique hash code
    • extra version bits
    • optional on-disk internal indexes or summary shapes

It’s not clear that a new serialization is a great idea. The only really pressing problem is that we are starting to use up our flag space.

Validity Flag

Testing geometry validity is computationally expensive, so for workflows that require validity a lot of time is spent checking and rechecking things that have already been confirmed to be valid. Having a flag on the object would allow the state to be marked once, the first time the check is done.

The downside of a validity flag is that every operation that alters coordinates must then carefully make sure to turn the flag off again, as the object may have been rendered invalid by the processing.

Exception Policy

A common annoyance for advanced users of PostGIS is when a long running computation stops suddenly and the database announces “TopologyException”!

It would be nice to provide some ways for users to indicate to the database that they are OK losing some data or changing some data, if that will complete the processing for 99% of the data.

We discussed adding some standard parameters to common processing functions:

  • null_on_exception (true = return null, false = exception)
  • repair_on_exception (true = makevalid() the inputs, false = do the null_on_exception action)

Modern C

C is not a quickly changing langauge, but since the PostgreSQL project has moved to C99, we will also be moving to C99 as our checked standard language.

Named Parameters

A lot of our function definitions were written before the advent of default values and named parameters as PostgreSQL function features. We will modernize our SQL extension file so we’re using named parameters everywhere. For users this will mean that correct parameter order will not be required anymore, it will be optional if you use named parameters.

M Coordinate

PostGIS supports “4 dimensional” features, with X, Y, Z and M, but while everyone knows what X, Y and Z are, only GIS afficionados know what “M” is. We will document M and also try and bring M support into GEOS so that operations in GEOS are “M preserving”.

Project Actions

Web Site

The web site is getting a little crufty around content, and the slick styling of 7 years ago is not quite as slick. We agreed that it would be OK to modernize, with particular emphasis on:

  • thinking about new user onboarding and the process of learning
  • supporting mobile devices with a responsive site style
  • using “standard” static site infrastructure (jekyll probably)

Standard Data

We agreed that having some standard data that was easy to install would make a whole bunch of other tasks much easier:

  • writing standard workshops and tutorials, so all the examples lined up
  • writing a performance harness that tracked the performance of common queries over time
  • having examples in the reference documentation that didn’t always have to generate their inputs on the fly

News File Policy

It’s a tiny nit, but for developers back-porting fixes over our 4-5 stable branches, knowing where to note bugs in the NEWS files, and doing it consistently is important.

  • Bug fixes applied to stable branches always listed in NEWS for each branch applied to
  • Bug fixes applied to stable and trunk should be listed in NEWS for each branch and NEWS in trunk
  • Bug fixes applied to only trunk can be left out of trunk NEWS (these bugs are development artifacts, not bugs that users in production have reported)

Next…

We also discussed features and changes to PostgreSQL that would help PostGIS improve, and I’ll write about those in the next post.

5x Faster Spatial Join with this One Weird Trick

My go-to performance test for PostGIS is the point-in-polygon spatial join: given a collection of polygons of variables sizes and a collection of points, count up how many points are within each polygon. It’s a nice way of testing indexing, point-in-polygon calculations and general overhead.

Setup

First download some polygons and some points.

Load the shapes into your database.

shp2pgsql -s 4326 -D -I ne_10m_admin_0_countries.shp countries | psql performance
shp2pgsql -s 4326 -D -I ne_10m_populated_places.shp places | psql performance

Now we are ready with 255 countries and 7343 places.

Countries and Places

One thing to note about the countries is that they are quite large objects, with 149 of them having enough vertices to be stored in TOAST tuples.

SELECT count(*) 
  FROM countries 
  WHERE ST_NPoints(geom) > (8192 / 16);

Baseline Performance

Now we can run the baseline performance test.

SELECT count(*), c.name 
  FROM countries c 
  JOIN places p 
  ON ST_Intersects(c.geom, p.geom) 
  GROUP BY c.name;

On my laptop, this query takes 25 seconds.

If you stick the process into a profiler while running it you’ll find that over 20 of those seconds are spent in the pglz_decompress function. Not doing spatial algorithms or computational geometry, just decompressing the geometry before handing it on to the actual processing.

Among the things we talked about this week at our PostGIS code sprint have been clever ways to avoid this overhead:

  • Patch PostgreSQL to allow partial decompression of geometries.
  • Enrich our serialization format to include a unique hash key at the front of geometries.

These are cool have-your-cake-and-eat-too ways to both retain compression for large geometries and be faster when feeding them into the point-in-polygon machinery.

However, they ignore a more brutal and easily testable approach to avoiding decompression: just don’t compress in the first place.

One Weird Trick

PostGIS uses the “main” storage option for its geometry type. The main option tries to keep geometries in their original table until they get too large, then compresses them in place, then moves them to TOAST.

There’s another option “external” that keeps geometries in place, and if they get too big moves them to TOAST uncompressed. PostgreSQL allows you to change the storage on columns at run-time, so no hacking or code is required to try this out.

-- Change the storage type
ALTER TABLE countries
  ALTER COLUMN geom
  SET STORAGE EXTERNAL;

-- Force the column to rewrite
UPDATE countries
  SET geom = ST_SetSRID(geom, 4326);

-- Re-run the query  
SELECT count(*), c.name 
  FROM countries c 
  JOIN places p 
  ON ST_Intersects(c.geom, p.geom) 
  GROUP BY c.name;

The spatial join now runs in under 4 seconds.

What’s the penalty?

  • With a “main” storage the table+toast+index is 6MB.
  • With a “external” storage the table+toast+index is 9MB.

Conclusion

For a 50% storage penalty, on a table that has far more large objects than most spatial tables, we achieved a 500% performance improvement. Maybe we shouldn’t apply compression to our large geometry at all?

Using “main” storage was mainly a judgement call back when we decided on it, it wasn’t benchmarked or anything – it’s possible that we were just wrong. Also, only large objects are compressed; since most tables are full of lots of small objects (short lines, points) changing to “external” by default wouldn’t have any effect on storage size at all.

Digital Transformation and Fundamental Change

I’m a huge supporter of modernizing government IT practices, but there’s something apocolyptic about the rhetoric of “digital transformation” folks which I cannot quite wrap my head around. Maybe it’s because I’m on the outside looking in, so I cannot perceive the cultural problems that they see.

To me, the message that “we have to fundamentally change how the public service operates” seems like a recapitulation, at the organizational and leadership level, of the worst mistake of old school IT: the old system is no good, we need a brand new system.

Back in the early aughts, when I was a fluffy young IT boffin, and discovered open source software, I was pretty sure we were on the cusp of a radical and immediate transformation: the open source model was so self-evidently better that a culture change would necessarily flow through IT.

Here we are, coming up on 20 years later and… things are still slowly getting better?

There was no radical change, the overall culture of IT slowly changed, and the things that I once had to argue loudly for – agile development (we called it RAD), GitHub (we called it open development), open source (OK, we called it that too) – are now accepted as somewhat normal practices.

Government culture will change because the rest of the culture is also changing. Sure, government is heirarchical. It used to be a whole lot more heirarchical as were the post-WW2 corporate and military cultures it co-existed with.

Small organizations are getting flat and more agile. Larger ones are following along at their own pace. As the largest of the organizations, government sometimes runs slowest. The challenge is to speed up the pace of change without breaking the beast.

I don’t see how a message like “we have to fundamentally change how the public service operates” isn’t going to give rise to a lot of push-back from people who aren’t necessarily opposed in principal to digital, but who are surely opposed in practice to being talked down to.

At the same time, for people inclined to support digital, it implies a manichean, “year zero” approach to change that is fundamentally unrealistic. With lots of institutional support and political backing and always the best of intentions among civil servants, a more digital government will arrive in 10 years instead of 20.

E&N (T)rail Time

The last week of August, I took three days and rode my bike from Victoria to Courtenay. It was a marvelous trip, and I got to see and stay in some wonderful towns along the way: Cowichan Bay, Duncan, Chemainus, Ladysmith, Nanaimo, Parksville, Qualicum Beach, Fanny Bay, Union Bay and Courtenay.

Active rail line has not seen a train since 2011

I also got to see a good portion of the old E&N railway line, as that line also passes through all the little towns I visited (with the exception of Cowichan Bay). It doesn’t take a trained surveyor to see that most of the railbed is in really poor condition. In many places the ties are rotting out, and you can pull spikes out of them with your bare hands. Running a train on the thing is going to take huge investments to basically rebuild the rail bed (and many of the trestles) from scratch, and the economics don’t work: revenues from freight and passenger service couldn’t even cover the operating costs of the line before it was shut down, let alone support a huge capital re-investment.

Cast bronze totem in Duncan

What to do with this invaluable right-of-way, an unobstructed ribbon of land running from Victoria to Courtenay (and beyond to Port Alberni)?

May I (and others) suggest a rail trail?

My breakfast destination in Ladysmith

Right now this chunk of land is returning nothing to the province economically. It’s actually a net drain, as municipalities spend money maintaining unused level crossings and the Island Corridor Foundation (ICF) spends federal and provincial grants to cut brush and replace the occasional tie on the never-again-to-be-used line.

Nanaimo waterfront promenade

Unlike the current ghost railway, a recreational trail would pay for itself almost immediately.

  • My first point of anecdata is my own 3-day bike excursion. Between accomodations, snacks along the way, and very tasty dinners (Maya Norte in Ladysmith and CView in Qualicum) I injected about $400 into the local economies over just two nights.
  • My second point of anecdata is an economic analysis of the Rum Runner’s Trail in Nova Scotia. The study shows annual expenditures by visitors alone of $3M per year. That doesn’t even count the economic benefit of local commuting and connection between communities.
  • My third point of anecdata is to just multiply $200 per night by three nights (decent speed) to cover the whole trail and 2000 marginal new tourists on the trail to get $1.2M direct new dollars. I find my made-up numbers are very compelling.
  • My fourth point of anecdata is the Mackenzie Interchange, currently under construction for over $70M. There is no direct economic benefit to this infrastructure, it will induce no tourist dollars and generate no long term employment.

If a Vancouver Island Rail Trail can generate even $3M in net new economic benefit for the province, it warrants a at least $50M investment to generate an ongoing 6% return. We spend more money for less return routinely (see the Mackenzie Interchange above).

No traffic on the line in Qualicum

And that’s just the tourism benefit.

Electric bikes are coming, and coming fast. A paved, continuous trail will provide another transportation alternative that is currently unavailable. Take it from me, I rode from Nanaimo to Parksville on the roaring busy highway 19 through Nanoose: it’s a terrible experience, nobody wants to do that. Cruising a paved rail trail on a quietly whirring electic bike though, that would be something else again.

Right now the E&N line is not a transportation alternative. Nor is it a tourist destination. Nor is it a railway. It’s time to put that land back to work.