Proj 6 in PostGIS

Map projection is a core feature of any spatial database, taking coordinates from one coordinate system and converting them to another, and PostGIS has depended on the Proj library for coordinate reprojection support for many years.

Proj 6 in PostGIS

For most of those years, the Proj library has been extremely slow moving. New projection systems might be added from time to time, and some bugs fixed, but in general it was easy to ignore. How slow was development? So slow that the version number migrated into the name, and everyone just called it “Proj4”.

No more.

Starting a couple years ago, new developers started migrating into the project, and the pace of development picked up. Proj 5 in 2018 dramatically improved the plumbing in the difficult area of geodetic transformation, and promised to begin changing the API. Only a year later, here is Proj 6, with yet more huge infrastructural improvements, and the new API.

Some of this new work was funded via the GDALBarn project, so thanks go out to those sponsors who invested in this incredibly foundational library and GDAL maintainer Even Roualt.

For PostGIS that means we have to accomodate ourselves to the new API. Doing so not only makes it easier to track future releases, but gains us access to the fancy new plumbing in Proj.

Proj 6 in PostGIS

For example, Proj 6 provides:

Late-binding coordinate operation capabilities, that takes metadata such as area of use and accuracy into account… This can avoid in a number of situations the past requirement of using WGS84 as a pivot system, which could cause unneeded accuracy loss.

Or, put another way: more accurate results for reprojections that involve datum shifts.

Here’s a simple example, converting from an old NAD27/NGVD29 3D coordinate with height in feet, to a new NAD83/NAVD88 coordinate with height in metres.

SELECT ST_Astext(
         ST_Transform(
           ST_SetSRID(geometry('POINT(-100 40 100)'),7406), 
           5500));

Note that the height in NGVD29 is 100 feet, if converted directly to meters, it would be 30.48 metres. The transformed point is:

POINT Z (-100.0004058 40.000005894 30.748549546)

Hey look! The elevation is slightly higher! That’s because in addition to being run through a horizontal NAD27/NAD83 grid shift, the point has also been run through a vertical shift grid as well. The result is a more correct interpretation of the old height measurement in the new vertical system.

Astute PostGIS users will have long noted that PostGIS contains three sources of truth for coordinate references systems (CRS).

Within the spatial_ref_sys table there are columns:

  • The authname, authsrid that can be used, if you have an authority database, to lookup an authsrid and get a CRS. Well, Proj 6 now ships with such a database. So there’s one source of truth.
  • The srtext, a string representation of a CRS, in a standard ISO format. That’s two sources.
  • The proj4text, the old Proj string for the CRS. Until Proj 6, this was the only form of definition that the Proj library could consume, and hence the only source of truth that mattered to PostGIS. Now, it’s a third source of truth.

Knowing this, when you ask PostGIS to transform to an SRID, what will it do?

  • If there are non-NULL values in authname and authsrid ask Proj to return a CRS based on those entries.
  • If Proj fails, and there is a non-NULL srtext ask Proj to build a CRS using that text.
  • If Proj still fails, and there is a non-NULL proj4text ask Proj to build a CRS using that text.

In general, the best transforms will come by having Proj look-up the CRS in its own database, because then it can apply all the power of “late binding” to ensure the best transformation for each geometry. Hence we bias in favour of Proj lookups, then the quite detailed WKT format, and finally the old Proj format.

Dr. JTS comes to Crunchy

Today’s an exciting day in the Victoria office of Crunchy Data – our local staff count goes from one to two, as Martin Davis joins the company!

This is kind of a big deal, because this year Martin and I will be spending much or our time on the core computational geometry library that powers PostGIS, the GEOS library, and the JTS library from which it derives its structure.

Why is that a big deal? Because GEOS, JTS and other language ports provide the computational geometry algorithms underneath most of the open source geospatial ecosystem – so improvements in our core libraries ripple out to help a huge swathe of other software.

JTS came first, initially as a project of the British Columbia government. GEOS is a C++ port of JTS. There are also Javascript and .Net ports (JSTS and NTS).

Each of those libraries has developed a rich downline of other libraries and projects that depend on them. On the desktop, on the web, in the middleware, JTS and GEOS power all of it.

So we know that work on JTS and GEOS on our side is going to benefit far more than just PostGIS.

I’ve already spent a decent amount of time on bringing the GEOS library up to date with the changes in JTS over the past few months, and trying to fulfill the “maintainer” role, merging pull requests and closing some outstanding tickets.

As Martin starts adding to JTS, I now feel more confident in my ability to bring those changes into the C++ world of GEOS as they land.

Without pre-judging what will get first priority, topics of overlay robustness, predicate performance, and geometry cleaning are near the top of our list.

Our spatial customers at Crunchy process a lot of geometry, so ensuring that PostGIS (GEOS) operations are robust and high performance is a big win for PostgreSQL and for our customers as well.

Esri and Winning

How much winning is enough? Have you been winning so much that you’re tired of winning now?

I ask because last month I gave a talk (PDF Download) to the regional GIS association in Manitoba, about open source and open data. My talk included a few points that warned about the downsides of being beholden to a single software vendor, and it included some information about the software available in the open source geospatial ecosystem.

MGUG Keynote 2018

In a testament to the full-spectrum dominance of Esri, the audience was almost entirely made up of Esri customers. The event was sponsored by Esri. Esri had a table at the back of the room. Before giving my talk, I made a little joke that my talk would in fact include some digs at Esri (though I left the most pointed ones on the cutting room floor).

People seemed to like the talk. They laughed at my jokes. They nodded in the right places.

Among the points I made:

  • Single-vendor dominance in our field is narrowing the understanding of “what is possible” amongst practitioners in that ecosystem.
  • Maintaining a single-vendor policy dramatically reduces negotiating power with that vendor.
  • Maintaining a single-vendor policy progressively de-skills your staff, as they become dependant on a single set of tooling.
  • Practitioners have higher market value when they learn more than just the tools of one vendor, so self-interest dictates learning tools outside the single-vendor ecosystem.
  • Point’n’click GIS tools from Esri have widened access to GIS, which is a good thing, but driven down the market value of practitioners who limit themselves to those tools.

None of these points is unique to Esri – they are true of any situation where a single tool has driven competitors off the field, whether it be Adobe graphics tools or Autodesk CAD tools or Microsoft office automation tools.

Nor are any of these points indicative of any sort of ill will or malign intent on the part of Esri – they are just the systemic effects of market dominance. It is not contingent on Esri to change their behaviour or limit their success; it’s contingent on practitioners and managers to recognize the negative aspects of the situation and react accordingly.

And yet.

Esri and Winning

Despite the fact that almost all the people in the room were already their customers, that no new business would be endangered by my message, that all the students would still be taught their tools, that all the employers would still include them in job requirements, that people would continue to use the very words they choose to describe basic functions of our profession …

Despite all that, the Esri representative still went to the president of the association, complained to her about the content of my talk, and asked her to ensure that nothing I would say in my afternoon technical talk would be objectionable to himself. (In the event, I made some nasty jokes about Oracle, nobody complained.)

For some of these people, no amount of winning is enough, no position of dominance is safe, no amount of market leverage is sufficient.

It’s sad and it’s dangerous.

I was reminded of this last week, meeting an old friend in Australia and learning that he’d been blackballed out of a job for recommending software that wasn’t Esri software. Esri took away his livelihood for insufficient fealty.

This is the danger of dominance.

When the local Esri rep has a better relationship with your boss than you do, do you advocate for using alternative tools? You could be limiting or even potentially jeapardizing your career.

When Esri has locked up the local geospatial software market, do you bid an RFP with an alternative open tool set? You could lose your Esri partnership agreement and with it your ability to bid any other local contracts. Esri will make your situation clear to you

This is the danger of dominance.

A market with only one vendor is not a market. There’s a name for it, and there’s laws against it. And yet, our profession glories in it. We celebrate “GIS day”, a marketing creation of our dominant vendor. Our publicly funded colleges and universities teach whole curricula using only Esri tools.

And we, as a profession, do not protest. We smile and nod. We accept our “free” or “discounted” trainings from Esri (comes with our site license!) and our “free” or “discounted” tickets to the Esri user conference. If we are particularly oblivious, we wonder why those open source folks never come around to market their tools to us.

We have met the enemy, and he is us.

We have met the enemy and he is us

PostGIS Code Sprint 2018 #2

An important topic of conversation this sprint was what kinds of core PostgreSQL features might make PostGIS better in the future?

PostGIS Code Sprint 2018 #2

Wider Typmod

The current attribue typmod column is a 32-bit integer. For geometry, we are already packing that 32 bits to the limit: a couple bits for dimensionality, some more for the type number, and the bit kahune, a bunch for the SRID number. Having even a 64-bit typmod number would allow even more interesting things, like declared coordinate precision, to fit in there. Maybe we are abusing typmod and there’s a better way to do what we want though?

Parallel GIST Scan

The PostGIS spatial index is built using the PostgreSQL GIST index infrastructure, so anything that makes GIST scans faster is a win for us. This would be a big win for folks with large tables (and thus deep trees) and who run scans that return a lot of indexed tuples.

Faster GIST Index Building

B-Tree index builds are accellerated by pre-sorting the inputs; could the same trick be used in building GIST indexes? Again, for large tables, GIST index building is slower than B-Tree and “faster” is the #1 feature all existing users want.

Multi-Threading in Functions

This isn’t a feature request, so much as a request for clarification and assurance: PostGIS calls out to other libraries, like GEOS, and it’s possible we could make some of our algorithms there faster via parallel processing. If we do our parallel processing within a function call, so the PostgreSQL thread of execution isn’t doing anything until we return, is it OK for us to do threading? We use pthreads, or maybe OpenMP.

Compressed Datum Slicing

“Detoast datum slice” doesn’t actually get around to the slicing step until after the datum is decompressed, which can make some queries quite slow. We already try to read boxes from the headers of our objects, and for large objects that means decompressing the whole thing: it would be nice to only decompress the first few bytes. I have an ugly patch I will be testing to try and get committed.

Forced Inlining

A problem we have with PostgreSQL right now is that we cannot effectively cost our functions due to the current inlining behaviour on our wrapper functions like ST_Intersects(). When raised on the list, the hackers came to a tentative conclusion that improving the value caching behaviour would be a good way to avoid having inlining in incorrect places. It sounded like subtle and difficult work, and nobody jumped to it.

We propose leaving the current inlining logic unchanged, but adding an option to SQL language functions to force inlining for those functions. That way there is no ambiguity: we always want our wrapper functions inlined; always, always, always. If we could use an INLINE option on our wrapper function definitions all the current careful logic can remain in place for more dynamic use cases.

Extension Version Dependency

PostGIS is growing a small forest of dependent extensions

  • some inside the project, like postgis_topology and now postgis_raster
  • some outside the project, like PgRouting and pgpointcloud

When a new version of PostGIS is installed, we want all the PostGIS extensions to be updated. When a third party extension is installed, it may require features from a particular recent version of PostGIS.

The extension framework supports dependency, but for safety, as the ecosystem grows, version dependencies are going to be required eventually.

Global Upgrade Paths

Right now extension upgrade paths have to explicitly state the start and end version of the path. So an upgrade file might be named postgis--2.3.4--2.3.5.sql. That’s great if you have four or five versions. We have way more than that. The number of upgrade files we have keeps on growing and growing.

Unlike upgrade files for smaller projects, we drop and recreate all the functions in our upgrade files. That means that actually our current version upgrade file is capable of upgrading any prior version. Nonetheless, we have to make a copy, or a symlink, many many version combinations.

If there was a global “version”, we could use our master upgrade script, and only ship one script for each new version: postgis--ANY--2.3.5.sql

Size Based Costing in the Planner

Right now costing in the planner is based heavily on the “number of rows” a given execution path might generate at each stage. This is fine when the cost of processing each tuple is fairly uniform.

For us, the cost of processing a tuple can vary wildly: calculating the area of a 4 point polygon is far cheaper than calculating the area of a 40000 point polygon. Pulling a large feature out of TOAST tuples is more expensive than pulling it from main storage.

Having function COST taken into more consideration in plans, and having that COST scale with the average size of tuples would make for better plans for PostGIS. It would also make for better plans for PostgreSQL types that can get very large, like text and tsvector.

The analysis hooks might have to be enriched to also ask for stats on average tuple size for a query key, in addition to selectivity, so a query that pulled a moderate number of huge objects might have a higher cost than one that pulled quite a few small objects.

We Are Not Unreasonable People

We just want our due, you know?

Thanks, PostgreSQL team!

PostGIS Code Sprint 2018 #1

When I tell people I am heading to an open source “code sprint”, which I try to do at least once a year, they ask me “what do you do there?”

When I tell them, “talk, mostly”, they are usually disappointed. There’s a picture, which is not unearned, of programmers bent over their laptops, quietly tapping away. And that happens, but the real value, even when there is lots of tapping, is in the high-bandwidth, face-to-face communication.

PostGIS Code Sprint 2018 #1

So, inevitably I will be asked what I coded, this week at the PostGIS Code Sprint and I will answer… “uhhhhh”. I did a branch of PostgreSQL that will do partial decompression of compressed tuples, but didn’t get around to testing it. I tested some work that others had done. But mostly, we talked.

PostGIS 3

Why move to PostGIS 3 for the next release? Not necessarily because we will have any earth-shattering features, but to carry out a number of larger changes. Unlike most PostgreSQL extensions, PostGIS has a lot of legacy from past releases and has added, removed and renamed functions over time. These things are disruptive, and we’d like to do some of the known disruptive things at one time.

Split Vector and Raster

When we brought raster into PostGIS, we included it in the “postgis” extension, so if you CREATE EXTENSION postgis you get both vector and raster features. The rationale was that if we left it optional, packagers wouldn’t build it, and thus most people wouldn’t have access to the functionality, so it wouldn’t get used, so we’d be maintaining unused garbage code.

Even being included in the extension, by and large people haven’t used it much, and the demand from packagers and other users to have a “thin” PostGIS with only vector functionality have finally prevailed: when you ALTER EXTENSION postgis UPDATE TO '3.0.0' the raster functions will be unbundled from the extension. They can then be re-bundled into a “postgis_raster” dependent package and either dropped or kept around depending on user preference.

Remove postgis.so Minor Version

For users in production, working with packaged PostgreSQL, in deb or rpm packages, the packaging system often forces you to have only one version of PostGIS installed at a time. When upgrading PostgreSQL and PostGIS the net effect is to break pg_upgrade, meaning PostGIS users are mandated to do a full dump/restore.

Removing the minor version will allow the pg_upgrade process to run through, and users can then run the sql ALTER EXTENSION postgis UPDATE command to synchronize their SQL functions with the new binary postgis.so library.

This is good for most users. It’s bad for users who expect to be able to run multiple versions of PostGIS on one server: they won’t easily be able to. There will be a switch to make it possible to build with minor versions again, but we expect it will mostly be used by us (developers) for testing and development.

Serialization

As I discussed recently, the compression of geometries in PostgreSQL can have a very large effect on performance.

A new serialization could:

  • use a more effective compression format for our kind of data, arrays of autocorrelated doubles
  • add space for more flag bits for things like
    • encoding a smaller point format
    • flagging empty geometries
    • noting the validity of the object
    • noting the presense of a unique hash code
    • extra version bits
    • optional on-disk internal indexes or summary shapes

It’s not clear that a new serialization is a great idea. The only really pressing problem is that we are starting to use up our flag space.

Validity Flag

Testing geometry validity is computationally expensive, so for workflows that require validity a lot of time is spent checking and rechecking things that have already been confirmed to be valid. Having a flag on the object would allow the state to be marked once, the first time the check is done.

The downside of a validity flag is that every operation that alters coordinates must then carefully make sure to turn the flag off again, as the object may have been rendered invalid by the processing.

Exception Policy

A common annoyance for advanced users of PostGIS is when a long running computation stops suddenly and the database announces “TopologyException”!

It would be nice to provide some ways for users to indicate to the database that they are OK losing some data or changing some data, if that will complete the processing for 99% of the data.

We discussed adding some standard parameters to common processing functions:

  • null_on_exception (true = return null, false = exception)
  • repair_on_exception (true = makevalid() the inputs, false = do the null_on_exception action)

Modern C

C is not a quickly changing langauge, but since the PostgreSQL project has moved to C99, we will also be moving to C99 as our checked standard language.

Named Parameters

A lot of our function definitions were written before the advent of default values and named parameters as PostgreSQL function features. We will modernize our SQL extension file so we’re using named parameters everywhere. For users this will mean that correct parameter order will not be required anymore, it will be optional if you use named parameters.

M Coordinate

PostGIS supports “4 dimensional” features, with X, Y, Z and M, but while everyone knows what X, Y and Z are, only GIS afficionados know what “M” is. We will document M and also try and bring M support into GEOS so that operations in GEOS are “M preserving”.

Project Actions

Web Site

The web site is getting a little crufty around content, and the slick styling of 7 years ago is not quite as slick. We agreed that it would be OK to modernize, with particular emphasis on:

  • thinking about new user onboarding and the process of learning
  • supporting mobile devices with a responsive site style
  • using “standard” static site infrastructure (jekyll probably)

Standard Data

We agreed that having some standard data that was easy to install would make a whole bunch of other tasks much easier:

  • writing standard workshops and tutorials, so all the examples lined up
  • writing a performance harness that tracked the performance of common queries over time
  • having examples in the reference documentation that didn’t always have to generate their inputs on the fly

News File Policy

It’s a tiny nit, but for developers back-porting fixes over our 4-5 stable branches, knowing where to note bugs in the NEWS files, and doing it consistently is important.

  • Bug fixes applied to stable branches always listed in NEWS for each branch applied to
  • Bug fixes applied to stable and trunk should be listed in NEWS for each branch and NEWS in trunk
  • Bug fixes applied to only trunk can be left out of trunk NEWS (these bugs are development artifacts, not bugs that users in production have reported)

Next…

We also discussed features and changes to PostgreSQL that would help PostGIS improve, and I’ll write about those in the next post.