PostGIS Day in STL

Every year, on the second Wednesday of November, Esri (“the Microsoft of GIS”) promotes a day of celebration, “GIS Day”, in which the members of our community unite to tell the world about the wonders of cartography and spatial data, and incidentally use their software a lot in the process.

And every year, for the last several years, on the day after “GIS Day”, a motley crew of open source users and SQL database aficionados observes “PostGIS Day”. Until this fall, I had never had a chance to personally participate in a PostGIS Day event, but this year Crunchy sponsored a day in St Louis, and I got to talk an awful lot about PostGIS.

It was really good, and I feel like there’s lots more to be done, if only on the subject of spatial SQL and analysis in the database. Here are the talks I gave; the rest are on the event page.

PostGIS Introduction

Serving Dynamic Vector Tiles

Geocoding and Text Search in PostGIS

PostGIS 3.0 Overview

Open Source Debate @ North51

In a little change from the usual course of conference structure, I was invited to debate the merits of open source versus proprietary software at the North51 conference in Banff, Alberta last week.

Debate

There isn’t a video, but to give you a flavour of how I approached the task, here’s my opening and closing statements:


Opening

Thank you for having me here, Jon, and choosing me to represent the correct side of this argument.

So, to provide a little context, I’d like to start by reading the founding texts of this particular disagreement.

The first text is Bill Gates’ “Open Letter to Hobbyists”, published in the Homebrew Computer Club newsletter in February of 1976, kicking off the era of proprietary software that is still bedevilling us, 45 years later.

“Almost a year ago, Paul Allen and myself, expecting the hobby market to expand, hired Monte Davidoff and developed Altair BASIC. Though the initial work took only two months, the three of us have spent most of the last year documenting, improving and adding features to BASIC… The value of the computer time we have used exceeds $40,000.

“The feedback we have gotten from the hundreds of people who say they are using BASIC has all been positive. Two surprising things are apparent, however, 1) Most of these “users” never bought BASIC and 2) The amount of royalties we have received from sales to hobbyists makes the time spent on Altair BASIC worth less than $2 an hour.

“Why is this? As the majority of hobbyists must be aware, most of you steal your software. Hardware must be paid for, but software is something to share. Who cares if the people who worked on it get paid?”

The second text is the initial announcement of the Linux source code, sent by Linus Torvalds in August of 1991.

“Hello everybody out there using minix - I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones. …, and I’d like to know what features most people would want. Any suggestions are welcome, but I won’t promise I’ll implement them :-)” – Linus

After sending these messages, both of these technology innovators went on to manage the creation of operating systems that have become dominant industry standards: Microsoft Windows and Linux.

The difference is that, in the process, Gates became a multi-billionaire and for a time the richest man in the world, while Torvalds is just a garden variety millionaire.

A simple interpretation of this set of facts, of Gates berating hobbyists for “stealing” his software, and of Torvalds welcoming folks to provide him with feature suggestions, is that Gates is a miserable, grasping, corporate greedhead, and Torvalds is a far-seeing, generous, socially conscious computer monk.

There’s something to that.

Still…. A more nuanced view looks at the dates the letters were sent.

In 1976, when Gates sent his letter, the way you built a substantial piece of serious software was this: you got 1, 10, 100 programmers together in one building, so you could coordinate their efforts, and you paid them to come in every day and work on it.

And you kept on paying them, until they were done, or done enough to ship, 1, 2 or more months later. And then you made back all that money afterwards, selling many many copies of the software.

This was the shrink-wrap proprietary software model, and it was really easy to understand, because it was exactly the same model used for books, or music, or movies.

Hey… has anyone noticed, a change, in the way we consume books, or music, or movies? Things are different than they were, in 1976. Right?

OK, so what is REALLY different about Torvalds? The difference is manifest in the very way he sent his message. He didn’t mail it to a magazine. He published it electronically, on a Usenet bulletin board, an early kind of mailing list for Unix operating system enthusiasts: comp.os.minix.

And those enthusiasts joined him in building a complete, open source, operating system. They didn’t have to move to Seattle, and he didn’t have to pay them, and they didn’t have to work on it 8 hours a day, 5 days a week.

They worked on it a bit at a time, in the time they had, and it got bigger.

And eventually it was useful enough that companies started using it, for important things. And they started hiring people to work on it, 8 hours a day, 5 days a week.

But they didn’t have to move to Seattle, either. And they didn’t have to sign over their work to Microsoft, to do the work.

And eventually Linux was so useful that the majority of people working on it were being paid, but not by any one company.

Linus Torvalds works for the Linux Foundation, which is funded by the largest companies in tech.

The top contributors to the Linux kernel in 2017 worked for Intel, Red Hat, Linaro, Samsung, SUSE and IBM. And that only accounts for about 1/3 of contributions… the rest come from a long tail of other contributors.

We’re going to end up talking about definitions and arguing over words a little in this session, I imagine, so I’d like to get one common one out of the way early.

The opposite of “open source” is “proprietary”, not “commercial”. Open source licensing does not foreclose all commercial ventures, it only forecloses those predicated on restricting access to the source code.

You wouldn’t describe my previous employer, Carto, as an “open source company”: they sell geospatial intelligence software as a service. The software they write is all open source licensed and available on Github.

You wouldn’t describe my current employer, Crunchy Data, as a “non-commercial” entity. We are a private for-profit company, with a corporate headquarters, major clients in government and industry, and a headcount of almost 100.

It’s worth noting that both those employers of mine are headquartered nowhere near where I live, in British Columbia.

That’s not an accident. That’s really the whole story. It’s the heart of the matter.

The proprietary software model is obsolete.

It is a relic of a bygone age, the pre-internet age.

Before the advent of the internet, the proprietary model was useful to capitalize some forms of large scale software development, but now, in the internet era, it’s mostly an obstacle, a brake on innovation.

Those are some hard words. I know. But if you don’t believe me, maybe ask the number one contributor to open source projects on Github. It’s a little Seattle company called “Microsoft”.

Not only is Microsoft the number one corporate contributor to open source software hosted on Github, they liked Github itself so much they bought the company.

Microsoft knows the future is open source, I know the future is open source, and I hope by the time we’re done you all know the future is open source.


Closing

I know there is proprietary software in the world.

I am not saying there is no place for proprietary software in the world.

Some of my best friends use proprietary software.

I do not harangue them, I do not put them down.

Well, not to their faces, anyways.

What I do know is that the system of proprietary software is a system in which all the incentives align towards passivity, towards stagnation, and away from innovation.

Proprietary vendors’ incentives are to lock customers in, and to drive customers towards further purchases of more product.

This leads to less interoperability.

It leads to less efficient software.

It leads to artificial boundaries in functionality, between “Basic” versions and “Pro” versions and “Enterprise” versions – boundaries predicated on nothing but revenue maximization.

Customers dependent on proprietary software lose their agency in improving their tools. They adopt poor systems designs to fit their software use within the terms of their license limits.

They stop thinking that their tools could be something other than what the vendor says they are.

They lose their freedom.

  • As a society, we should bias toward a software development model that maximizes innovation. Open source is that model.
  • As organizations, we should bias toward a software development model and tools that maximize flexibility and cooperation. Open source is that model.
  • As individuals, we should bias towards tools that maximize our ability to move our skills from employer to employer, regardless of what vendor they happen to use. Open source provides those tools.

Embrace your power, embrace your freedom, embrace open source.

PostgreSQL and SQL and Books

I never tire of telling developers that they should learn SQL.

And I never run out of developers for whom that is good advice.

I think the reason is that so many developers learn basic SQL CRUD operations, and then stop. They can filter with a WHERE clause, they can use Sum() and GROUP BY, they can UPDATE and DELETE.

If they are particularly advanced, they can do a JOIN. But that’s usually where it ends.

And the tragedy is that, because they stop there, they end up re-writing big pieces of data manipulation logic in their applications – logic that they could skip if only they knew what their SQL database engine was capable of.
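A single window function, for example, can replace a whole loop’s worth of application code. Here’s a small sketch (against a hypothetical orders table, purely to illustrate the idea) that computes a per-customer running total directly in the database:

-- Hypothetical example: a running total per customer, computed
-- with a window function instead of application-side looping.
SELECT customer_id,
       order_date,
       amount,
       Sum(amount) OVER (
         PARTITION BY customer_id
         ORDER BY order_date
       ) AS running_total
  FROM orders;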

Since so many developers are using PostgreSQL now, I have taken to recommending a couple of books, written by community members.

For people getting started with PostgreSQL, and SQL, I recommend The Art of PostgreSQL, by Dimitri Fontaine.

Art of PostgreSQL

For people who are wanting to learn PostGIS, and spatial SQL, I recommend PostGIS in Action, by Regina Obe and Leo Hsu.

PostGIS in Action

Both Dimitri and Regina are community members, and both have been big contributors to PostgreSQL and PostGIS. One of the key PostgreSQL features that PostGIS uses is the “extension” system, which Dimitri implemented many years ago now. And of course Regina has been active in the PostGIS development community almost since the first release in the early 2000s.

I often toy with the idea of writing a PostGIS or a PostgreSQL book, but then I stop and think, “wait, there’s already lots of good ones out there!” So I wrote this short post instead.

OGR FDW Spatial Filtering

The OGR FDW now pushes spatial filters down to remote data sources!

Whuuuut?!?!?

The Basics

OK, first, “OGR” is a subcomponent of the GDAL toolkit that allows generic access to dozens of different geospatial file formats. The OGR part handles the “vector” data (points, lines and polygons) and the GDAL part handles the “raster” data (imagery, elevation grids).

Second, “FDW” is a “foreign data wrapper”, an extension API for PostgreSQL that allows developers to connect non-database information to the database and present it in the form of a table.

The simplest FDWs, like the Oracle FDW, just make remote database tables in foreign systems look like local ones. Connecting two databases is “easy” because they share the same data model: tables of typed columns and rows of data.

The OGR data model is pleasantly similar to the database data model. Every OGR “datasource” (database) has “layers” (tables) made of “fields” (columns) with data types like “string” (varchar) and “number” (integer, real).

Now, combine the two ideas of “OGR” and “FDW”!

The “OGR FDW” uses the OGR library to present geospatial data sources as tables inside a PostgreSQL database. The FDW abstraction layer lets us make tables, and the OGR abstraction layer lets those tables be sourced from almost any geospatial file format or server.

It’s an abstraction layer over an abstraction layer… the best kind!

Set Up the FDW

Here’s an example that connects to a “web feature service” (WFS) from Belgium (we all speak Flemish, right?) and makes a table of it.


CREATE EXTENSION postgis;
CREATE EXTENSION ogr_fdw;

CREATE SERVER wfsserver 
  FOREIGN DATA WRAPPER ogr_fdw 
  OPTIONS (
    datasource 'WFS:http://geoservices.informatievlaanderen.be/overdrachtdiensten/Haltes/wfs',
    format 'WFS',
    config_options 'CPL_DEBUG=ON'
  );

CREATE FOREIGN TABLE haltes (
    fid bigint,
    shape Geometry(Point,31370),
    gml_id varchar,
    uidn double precision,
    oidn double precision,
    stopid double precision,
    naamhalte varchar,
    typehalte integer,
    lbltypehal varchar,
    codegem varchar,
    naamgem varchar
  ) 
  SERVER wfsserver 
  OPTIONS (
    layer 'Haltes:Halte'
  );
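Spelling out every column definition by hand gets tedious. The ogr_fdw extension also supports IMPORT FOREIGN SCHEMA, which asks OGR for the layer definition and creates the matching foreign tables for you. A sketch, assuming the server defined above (the special schema name ogr_all imports every layer the datasource exposes):

-- Sketch: let ogr_fdw derive the foreign table definitions itself.
IMPORT FOREIGN SCHEMA ogr_all
  FROM SERVER wfsserver
  INTO public;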

Pushdown from FDW

Let’s run a query on the haltes table, and peek into what the OGR FDW is doing, by setting the debug level to DEBUG1.

SET client_min_messages = DEBUG1;

SELECT gml_id, ST_AsText(shape) AS shape, naamhalte, lbltypehal
  FROM haltes 
  WHERE lbltypehal = 'Niet-belbus'
    AND shape && ST_MakeEnvelope(207950, 186590, 207960, 186600, 31370);

We get back one record, and two debug entries:

DEBUG:  OGR SQL: (LBLTYPEHAL = 'Niet-belbus')
DEBUG:  OGR spatial filter (207950 186590, 207960 186600)
-[ RECORD 1 ]-----------------------
gml_id     | Halte.10328
shape      | POINT(207956 186596)
naamhalte  | Lummen Frederickxstraat
lbltypehal | Niet-belbus

The debug entries are generated by the OGR FDW code, when it recognizes there are parts of the SQL query that can be passed to OGR:

  • OGR understands some limited SQL syntax, and OGR FDW passes those parts of any PostgreSQL query down to OGR.
  • OGR can handle simple bounding box spatial filters, and when OGR FDW sees the use of the && PostGIS operator, it passes the filter constant down to OGR.

So OGR FDW is passing the attribute and spatial filters from the SQL down to the OGR layer. But are they then being passed on to the remote datasource?

Pushdown from OGR

Every OGR “driver” is capable of pushing different amounts of logic down to the source data.

  • A driver that reads a file format cannot push anything down: there is no logic in a file.
  • A driver that reads from a database can push a lot down: databases are rich and powerful execution engines in their own right.

Our example data source, the Belgian “web feature server” actually supports both attribute and spatial filters, and the OGR driver will pass them down.

We can see OGR passing the filters down because when we created the server, we set config_options 'CPL_DEBUG=ON', to expose the GDAL logging information to our PostgreSQL server.

The GDAL debug entries are visible when we set the logging level to DEBUG2:

SET client_min_messages = DEBUG2;

SELECT gml_id, ST_AsText(shape) AS shape, naamhalte, lbltypehal
  FROM haltes 
  WHERE lbltypehal = 'Niet-belbus'
    AND shape && ST_MakeEnvelope(207950, 186590, 207960, 186600, 31370);

Now we get a whole slew of logging, but I’m only going to pull out one line, the line that shows the WFS query that OGR sends to the remote server:

DEBUG:  GDAL None [0] WFS: http://geoservices.informatievlaanderen.be/overdrachtdiensten/Haltes/wfs?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetFeature&TYPENAME=Haltes:Halte&FILTER=%3CFilter%20xmlns%3D%22http:%2F%2Fwww.opengis.net%2Fogc%22%20xmlns:Haltes%3D%22informatievlaanderen.be%2FHaltes%22%20xmlns:gml%3D%22http:%2F%2Fwww.opengis.net%2Fgml%22%3E%3CAnd%3E%3CPropertyIsEqualTo%3E%3CPropertyName%3ELBLTYPEHAL%3C%2FPropertyName%3E%3CLiteral%3ENiet%2Dbelbus%3C%2FLiteral%3E%3C%2FPropertyIsEqualTo%3E%3CBBOX%3E%3CPropertyName%3EHaltes:SHAPE%3C%2FPropertyName%3E%3Cgml:Box%3E%3Cgml:coordinates%3E207950.0000000000000000,186590.0000000000000000%20207960.0000000000000000,186600.0000000000000000%3C%2Fgml:coordinates%3E%3C%2Fgml:Box%3E%3C%2FBBOX%3E%3C%2FAnd%3E%3C%2FFilter%3E

Awesome, right?

That’s pretty much unreadable, but if I copy out the value in the FILTER request variable, and reverse the URL encoding, I get this:

<Filter
  xmlns="http://www.opengis.net/ogc"
  xmlns:Haltes="informatievlaanderen.be/Haltes"
  xmlns:gml="http://www.opengis.net/gml">
  <And>
    <PropertyIsEqualTo>
      <PropertyName>LBLTYPEHAL</PropertyName>
      <Literal>Niet-belbus</Literal>
    </PropertyIsEqualTo>
    <BBOX>
      <PropertyName>Haltes:SHAPE</PropertyName>
      <gml:Box>
        <gml:coordinates>
          207950.0000000000000000,186590.0000000000000000
          207960.0000000000000000,186600.0000000000000000
        </gml:coordinates>
      </gml:Box>
    </BBOX>
  </And>
</Filter>

I know, whoever thought that jamming an XML-encoded version of a SQL filter into an HTTP GET request was a good idea? (Some very very nice people.)

Anyways, as you can see, both the attribute and spatial portions of our original SQL query have been re-encoded as a WFS XML filter, and sent to the remote server.

OGR FDW correctly pushed the attribute and spatial portions of the WHERE clause into OGR, and OGR correctly pushed those filters into the dialect of the driver we were using, in this case the WFS driver.

The End

The really really cool part is that if we had been using, for example, the Oracle driver, OGR would have instead generated Oracle-compatible SQL and pushed that down!

It’s an abstraction layer over an abstraction layer… the best kind!

ST_Subdivide all the Things

This post originally appeared in the CARTO blog.

One of the things that makes managing geospatial data challenging is the huge variety of scales that geospatial data covers: areas as large as a continent or as small as a manhole cover.

The data in the database also covers a wide range, from single points, to polygons described with thousands of vertices. And size matters! A large object takes more time to retrieve from storage, and more time to run calculations on.

The Natural Earth countries file is a good example of that variation. Load the data into PostGIS and inspect the object sizes using SQL:

SELECT admin, ST_NPoints(the_geom), ST_MemSize(the_geom) 
FROM ne_10m_admin_0_countries 
ORDER BY ST_NPoints;
  • Coral Sea Islands are represented with a 4 point polygon, only 112 bytes.
  • Canada is represented with a 68159 point multi-polygon, 1 megabyte in size!

Countries by Size in KB

Over half (149) of the countries in the table are larger than the database page size (8KB), which means they will take extra time to retrieve.

SELECT Count(*) 
FROM ne_10m_admin_0_countries 
WHERE ST_MemSize(the_geom) > 8192;

We can see the overhead involved in working with large data by forcing a large retrieval and computation.

Load the Natural Earth populated places into PostGIS as well, and then run a full spatial join between the two tables:

SELECT Count(*)
FROM ne_10m_admin_0_countries countries 
JOIN ne_10m_populated_places_simple places 
ON ST_Contains(countries.the_geom, places.the_geom);

Even though the places table (7322 rows) and countries table (255 rows) are quite small, the computation still takes a long time (about 30 seconds on my computer).

The large objects cause a number of inefficiencies:

  • Geographically large areas (like Canada or Russia) have large bounding boxes, so the indexes don’t work as efficiently in winnowing out points that don’t fall within the countries.
  • Physically large objects have large vertex lists, which take a long time to pass through the containment calculation. This combines with the poor winnowing to make a bad situation worse.

How can we speed things up? Make the large objects smaller using ST_Subdivide()!

First, generate a new, sub-divided countries table:

CREATE TABLE ne_10m_admin_0_countries_subdivided AS
SELECT ST_Subdivide(the_geom) AS the_geom, admin
FROM ne_10m_admin_0_countries;
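A practical aside: the freshly created table has no spatial index of its own, so before joining against it you will probably want to build one. A sketch (the index name is arbitrary):

-- Sketch: index the subdivided geometries so the planner can use index scans.
CREATE INDEX ne_10m_admin_0_countries_subdivided_gix
  ON ne_10m_admin_0_countries_subdivided
  USING GIST (the_geom);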

Now we have the same data, but no object is more than 255 vertices (about 4KB) in size!

Subdivided Countries by Size in KB
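If you want to double-check that, a quick query over the new table will confirm the maximum vertex count and size:

-- Sketch: largest subdivided geometry, by vertex count and bytes.
SELECT Max(ST_NPoints(the_geom)) AS max_points,
       Max(ST_MemSize(the_geom)) AS max_bytes
  FROM ne_10m_admin_0_countries_subdivided;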

Run the spatial join torture test again, and see the change!

SELECT Count(*)
FROM ne_10m_admin_0_countries_subdivided countries 
JOIN ne_10m_populated_places_simple places 
ON ST_Contains(countries.the_geom, places.the_geom);

On my computer, the return time is about 0.5 seconds, or 60 times faster, even though the countries table is now 8633 rows. The subdivision has accomplished two things:

  • Each polygon now covers a smaller area, so index searches are less likely to pull up points that are not within the polygon.
  • Each polygon is now below the page size, so retrieval from disk will be much faster.

Subdividing big things can make map drawing faster too, but beware: once your polygons are subdivided you’ll have to turn off the polygon outlines to avoid showing the funny square boundaries in your rendered map.

Happy mapping and querying!