Developers Diary 1

I’m not a particularly good developer.

I don’t plan well, I tend to hack first and try and find the structure afterwards. I am easily distracted. It takes me an exceedingly long time to marshal a problem in my head enough to attack it.

That said, the enforced slow-down from pandemic time has given me the opportunity to sit and look at code, knowing nothing else is coming down the pipe. There are no talks to prepare, no big-think keynotes to draft. I enjoy those things, and I really enjoy the ego-boost of giving them, but the preparation of them puts me in a mental state that is not conducive to doing code work.

So the end of travel has been good, for at least one aspect of my professional work.

The Successful Failure

Spatial operations against large objects have always been a performance hot spot.

The first problem is that large objects are … large. So if you have algorithms that scale O(n^2) on the number of vertices large objects will kill you. Guess what? Distance, intersects tests, and so on are all O(n^2) in their basic implementations.

We solved this problem a long time ago in PostGIS by putting in an extra layer of run-time indexing.

INDEXING!

During a query (for those functions where it makes sense) if we see the same object twice in a row, we build an index on the edges of that object and keep the index in memory, for the life of the query. This gives us O(log(n)) performance for intersects, point-in-polygon, and so on. For joins in particular, this pattern of “seeing the same big thing multiple times” is very common.

This one small trick is one reason PostGIS is so much faster than “the leading brands”.

However, in order to “see the same object twice” we have to, for each function call in the query, retrieve the whole object, in order to compare it against the one we are holding in memory, to see if it is the same.

Here we run into an issue with our back-end.

PostgreSQL deals with large objects by (a) compressing them and (b) cutting the compressed object into slices and storing them in a side table. This all happens in the background, and is why you can store 1GB objects transparently in a database that has only an 8KB page size.

It’s quite computationally expensive, though. So much so that I found that simply bypassing the compression part of this feature could provide 5x performance gains on our spatial join workload.

Faster

At a code sprint in 2018, the PostGIS team agreed on the necessary steps to work around this long-standing performance issue.

  • Enhance PostgreSQL to allow partial decompression. This would allow the PostGIS caching system to retrieve just a little bit of large objects and use that part to determine if the object was not already in the cache.
  • Enhance the PostGIS serialization scheme to add a hashcode at the front of each large object. This way “is this a new object” could be answered with just a few bytes of hash, instead of checking the whole object.
  • Actually update the caching code code to use hash code and avoid unneccessary object retrievals.

Since this involved a change in PostgreSQL, which runs on an annual release cycle, and a change to the PostGIS serialization scheme, which is a major release marker, the schedule for this work was… long term.

Long Term

Still, I managed to slowly chip away at it, goal in mind:

That left adding the hash code to the front of the objects, and using that code in the PostGIS statement cache.

And this is where things fall apart.

Things Fall Apart

The old statement cache was focussed on ensuring the in-memory indexes were in place. It didn’t kick in until the object had already been retrieved. So avoiding retrieval overhead was going to involve re-working the cache quite a bit, to handle both object and index caching.

I started on the work, which still lives on in this branch, but the many possible states of the cache (do I have part of an object? a whole object? an indexed object?) and the fact that it was used in multiple places by different indexing methods (geography tree, geometry tree, GEOS tree), made the change worrisomely complex.

And so I asked a question, that I should have asked years ago, to the pgsql-hackers list:

… within the context of a single SQL statement, will the Datum values for a particular object remain constant?

Basically, could I use the datum values as unique object keys without retrieving the whole object? That would neatly remove any need to retrieve full objects in order to determine if the cache needed to be updated. As usual, Tom Lane had the answer:

Jeez, no, not like that.

Oh, “good news”, I guess, my work is not in vain. Except wait, Tom included a codicil:

The case where this would actually be worth doing, probably, is where you are receiving a toasted-out-of-line datum. In that case you could legitimately use the toast pointer ID values (va_valueid + va_toastrelid) as a lookup key for a cache, as long as it had a lifespan of a statement or less.

Hm. So for a subset of objects, it was possible to generate a unique key without retrieving the whole object.

Facepalm

And that subset – “toasted-out-of-line datum” – were in fact the objects causing the hot spot: objects large enough to have been compressed and then stored in a side table in 8KB chunks.

What if, instead of re-writing my whole existing in-memory index cache, I left that in place, and just added a simple new cache that only worried about object retrieval. And only cached objects that it could obtain unique keys for, these “toasted-out-of-line” objects. Would that improve performance?

It did. By 20 times on my favourite spatial join benchmark. In increased it by 5 times on a join where only 10% of the objects were large ones. And on joins where none of the objects were large, the new code did not reduce performance at all.

And here’s the punch line: I’ve known about the large object hot spot for at least 5 years. Probably longer. I put off working on it because I thought the solution involved core changes to PostgreSQL and PostGIS, so first I had to put those changes in, which took a long time.

Once I started working on the “real problem”, I spent a solid week:

  • First on a branch to add hash codes, using the new serialization mechanisms from PostGIS 3.
  • Then on a unified caching system to replace the old in-memory index cache.

And then I threw all that work away, and in about 3 hours, wrote and tested the final patch that gave a 20x performance boost.

So, was this a success or a failure?

Stupid

I’ve become inured to the huge mismatch in “time spent versus code produced”, particularly when debugging. Spending 8 hours stepping through a debugger to generate a one-line patch is pretty routine.

But something about the mismatch between my grandious and complex solution (partial retrieval! hash code!) and the final solution (just ask! try the half-measure! see if it’s better!) has really gotten on my nerves.

I like the win, but the path was a long and windy one, and PostGIS users have had slower queries than necessary for years because I failed to pose my simple question to the people who had an answer.

The Successful Success

Contra to that story of the past couple weeks, this week has been a raging success. I keep pinching myself and waiting for something to go wrong.

A number of years ago, JTS got an improvement to robustness in some operations by doing determinant calculations in higher precision than the default IEEE double precision.

Those changes didn’t make it into GEOS. There was an experimental branch, that Mateusz Loskot put together, and it sat un-merged for years, until I picked it up last fall, rebased it and merged it. I did so thinking that was the fastest way, and probably it was, but it included a dependency on a full-precision math library, ttmath, which I added to our tree.

Stupid

Unfortunately, ttmath is basically unmaintained now.

And ttmath is arbitrary precision, while we really only need “higher precision”. JTS just uses a “double double” implementation, that uses the register space of two doubles for higher precision calculations.

And ttmath doesn’t support big-endian platforms (like Sparc, Power, and other chips), which was the real problem. We couldn’t go on into the future without support for these niche-but-not-uncommon platforms.

And ttmath includes some fancy assembly language that makes the build system more complex.

Fortunately, the JTS DD is really not that large, and it has no endian assumptions in it, so I ported it and tested it out against ttmath.

It’s smaller.

It’s faster. (About 5-10%. Yes, even though it uses no special assembly tricks, probably because it doesn’t have to deal with arbitrary precision.)

And here’s the huge surprise: it caused zero regression failures! It has exactly the same behaviour as the old implementation!

Perfect

So needless to say, once the branch was stable, I merged it in and stood there in wonderment. It seems implausable that something as foundational as the math routines could be swapped out without breaking something.

The whole thing took just a few days, and it was so painless that I’ve also made a patch to the 3.8 stable series to bring the new code back for big endian platform support in the mean time.

The next few days I’ll be doing ports of JTS features and fixes that are net-new to GEOS, contemplative work that isn’t too demanding.

Some days everything is easy.

Some days everything is hard.

Don’t let the hard days hold you back!

Perfect

PostGIS Day in STL

Every year, on the second Wednesday of November, Esri (“the Microsoft of GIS”) promotes a day of celebration, “GIS Day” in which the members of our community unite to tell the world about the wonders of cartography and spatial data and incidentally use their software a lot in the process.

And every year, for the last number of years, on the day after “GIS Day”, a motley crew of open source users and SQL database afficionados observe “PostGIS Day”. Until this fall, I had never had a chance to personally participate in a PostGIS Day event, but this year Crunchy sponsored a day in St Louis, and I got to talk an awful lot about PostGIS.

It was really good, and I feel like there’s lots more to be done, if only on the subject of spatial SQL and analysis in the database. Here’s the talks I gave, the balance are on the event page.

PostGIS Introduction

Serving Dynamic Vector Tiles

Geocoding and Text Search in PostGIS

PostGIS 3.0 Overview

Open Source Debate @ North51

In a little change from the usual course of conference structure, I was invited to debate the merits of open source versus proprietary software at the North51 conference in Banff, Alberta last week.

Debate

There isn’t a video, but to give you a flavour of how I approached the task, here’s my opening and closing statements:


Opening

Thank you for having me here, Jon, and choosing me to represent the correct side of this argument.

So, to provide a little context, I’d like to start by reading the founding texts of this particular disagreement.

The first text is Bill Gates’ “Open Letter to Hobbyists”, published in the Homebrew Computer Club newsletter in February of 1976, kicking off the era of proprietary software that is still be-devilling us, 45 years later.

“Almost a year ago, Paul Allen and myself, expecting the hobby market to expand, hired Monte Davidoff and developed Altair BASIC. Though the initial work took only two months, the three of us have spent most of the last year documenting, improving and adding features to BASIC… The value of the computer time we have used exceeds $40,000.

“The feedback we have gotten from the hundreds of people who say they are using BASIC has all been positive. Two surprising things are apparent, however, 1) Most of these “users” never bought BASIC and 2) The amount of royalties we have received from sales to hobbyists makes the time spent on Altair BASIC worth less than $2 an hour.

“Why is this? As the majority of hobbyists must be aware, most of you steal your software. Hardware must be paid for, but software is something to share. Who cares if the people who worked on it get paid?

The second text is the initial announcement of the Linux source code, sent by Linus Torvalds in August of 1991

“Hello everybody out there using minix - I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones. … , and I’d like to know what features most people would want. Any suggestions are welcome, but I won’t promise I’ll implement them :-) Linus

After sending these messages, both these technology innovators went on to manage the creation of operating systems that have become dominant, industry standards. Microsoft Windows and Linux.

The difference is that, in the process, Gates became a multi-billionaire and for a time the richest man in the world, while Torvalds is just a garden variety millionaire.

A simple interpretation of this set of facts, of Gates berating hobbyists for “stealing” his software, and of Torvalds welcoming folks to provide him with feature suggestions, is that Gates is a miserable, grasping, corporate greedhead, and Torvalds is a far-seeing, generous, socially conscious computer monk.

There’s something to that.

Still…. A more nuanced view looks at the dates the letters were sent.

In 1976, when Gates sent his letter, the way you built a substantial piece of serious software, is you got 1, 10, 100 programmers together in one building, so you could coordinate their efforts, and you paid them to come in every day and work on it.

And you kept on paying them, until they were done, or done enough to ship, 1, 2 or more months later. And then you made back all that money afterwards, selling many many copies of the software.

This was the shrink-wrap proprietary software model, and it was really easy to understand, because it was exactly the same model used for books, or music, or movies.

Hey… has anyone noticed, a change, in the way we consume books, or music, or movies? Things are different than they were, in 1976. Right?

OK, so what is REALLY different about Torvalds? The difference is manifest in the very way he sent his message. He didn’t mail it to a magazine. He published it electronically, on a Usenet bulletin board, an early mailing list, of Unix operating system enthusiasts, comp.os.minix.

And those enthusiasts joined him in building a complete, open source, operating system. They didn’t have to move to Seattle, and he didn’t have to pay them, and they didn’t have to work on it 8 hours a day, 5 days a week.

They worked on it a bit at a time, in the time they had, and it got bigger.

And eventually it was useful enough the companies started using it, for important things. And they started hiring people to work on it, 8 hours a day, 5 days a week.

But they didn’t have to move to Seattle, either. And they didn’t have to sign over their work to Microsoft, to do the work.

And eventually Linux was so useful that the majority of people working on it were being paid, but not by any one company.

Linus Torvalds works for the Linux Foundation, which is funded by the largest companies in tech.

The top contributors to the Linux kernel in 2017 worked for Intel, Red Hat, Linaro, Samsung, SUSE and IBM. And that only accounts for about 1/3 of contributions… the rest come from a long tail of other contributors.

We’re going to end up talking about definitions and arguing over words a little in this session, I imagine, so I’d like to get one common one out of the way early.

The opposite of “open source” is “proprietary”, not “commercial”. Open source licensing does not foreclose all commercial ventures, it only forecloses those predicated on restricting access to the source code.

You wouldn’t describe my previous employer, Carto, as an “open source company”, they sell geospatial intelligence software as a service. The software they write is all open source licensed and available on Github.

You wouldn’t describe my current employer, Crunchy data as a “non-commercial” entity. We are a private for-profit company, with a corporate headquarters, major clients in government and industry, and a headcount of almost 100.

It’s worth noting that both those employers of mine are headquartered nowhere near where I live, in British Columbia.

That’s not an accident. That really the whole story. It’s the heart of the matter.

The proprietary software model is obsolete.

It is a relic of a bygone age, the pre-internet age.

Before the advent of the internet, the proprietary model was useful to capitalize some forms of large scale software development, but now, in the internet era, it’s mostly an obstacle, a brake on innovation.

Those are some hard words. I know. But if you don’t believe me, maybe ask the number one contributor to open source projects on Github. It’s a little Seattle company called “Microsoft”.

Not only is Microsoft the number one corporate contributor to open source software hosted on Github, they liked Github itself so much they bought the company.

Microsoft knows the future is open source, I know the future is open source, and I hope by the time we’re done you all know the future is open source.


Closing

I know there is proprietary software in the world.

I am not saying there is no place for proprietary software in the world.

Some of my best friends use proprietary software.

I do not harangue them, I do not put them down.

Well, not to their faces, anyways.

What I do know is that the system of proprietary software is a system in which all the incentives align towards passivity, towards stagnation, and away from innovation.

Proprietary vendor’s incentives are to lock customers in, and to drive customers towards further purchases of more product.

This leads to less interoperability.

It leads to less efficient software.

It leads to artificial boundaries in functionality, between “Basic” versions and “Pro” versions and “Enterprise” versions – boundaries predicated on nothing but revenue maximization.

Customers dependent on proprietary software lose their agency in improving their tools. They adopt poor systems designs to fit their software use within the terms of their license limits.

They stop thinking that their tools could be something other than what the vendor says they are.

They lose their freedom.

  • As a society, we should bias toward a software development model that maximizes innovation. Open source is that model.
  • As organizations, we should bias toward a software development model and tools that maximize flexibility and cooperation. Open source is that model.
  • As individuals, we should bias towards tools that maximize our ability to move our skills from employer to employer, regardless of what vendor they happen to use. Open source provides those tools.

Embrace your power, embrace your freedom, embrace open source.

PostgreSQL and SQL and Books

I never tire of telling developers that they should learn SQL.

And I never run out of developers for whom that is good advice.

I think the reason is that so many developers learn basic SQL CRUD operations, and then stop. They can filter with a WHERE clause, they can use Sum() and GROUP BY, they can UPDATE and DELETE.

If they are particularly advanced, they can do a JOIN. But that’s usually where it ends.

And the tragedy is that, because they stop there, they end up re-writing big pieces of data manipulation logic in their applications – logic that they could skip if only they knew what their SQL database engine was capable of.

Since so many developers are using PostgreSQL now, I have taken to recommending a couple of books, written by community members.

For people getting started with PostgreSQL, and SQL, the Art of PostgreSQL, by Dmitri Fontaine.

Art of PostgreSQL

For people who are wanting to learn PostGIS, and spatial SQL, I recommend PostGIS in Action, by Regina Obe and Leo Hsu.

PostGIS in Action

Both Dmitri and Regina are community members, and both have been big contributors to PostgreSQL and PostGIS. One of the key PostgreSQL features that PostGIS uses is the “extension” system, which Dmitri implemented many years ago now. And of course Regina has been active in the PostGIS develompent community almost since the first release in the early 2000s.

I often toy with the idea of writing a PostGIS or a PostgreSQL book, but then I stop and think, “wait, there’s already lots of good ones out there!” So I wrote this short post instead.

OGR FDW Spatial Filtering

The OGR FDW now pushes spatial filters down to remote data sources!

Whuuuut?!?!?

The Basics

OK, first, “OGR” is a subcomponent of the GDAL toolkit that allows generic access to dozens of different geospatial file formats. The OGR part handles the “vector” data (points, lines and polygons) and the GDAL part handles the “raster” data (imagery, elevation grids).

Second, “FDW” is a “foreign data wrapper”, an extension API for PostgreSQL that allows developers to connect non-database information to the database and present it in the form of a table.

The simplest FDWs, like the Oracle FDW, just make remote database tables in foreign systems look like local ones. Connecting two databases is “easy” because they share the same data model: tables of typed columns and rows of data.

The OGR data model is pleasantly similar to the database data model. Every OGR “datasource” (database) has “layers” (tables) made of “fields” (columns) with data types like “string” (varchar) and “number” (integer, real).

Now, combine the two ideas of “OGR” and “FDW”!

The “OGR FDW” uses the OGR library to present geospatial data sources as tables inside a PostgreSQL database. The FDW abstraction layer lets us make tables, and OGR abstraction layer lets those tables be sourced from almost any geospatial file format or server.

It’s an abstraction layer over an abstraction layer… the best kind!

Setup the FDW

Here’s an example that connects to a “web feature service” (WFS) from Belgium (we all speak Flemish, right?) and makes a table of it.


CREATE EXTENSION postgis;
CREATE EXTENSION ogr_fdw;

CREATE SERVER wfsserver 
  FOREIGN DATA WRAPPER ogr_fdw 
  OPTIONS (
    datasource 'WFS:http://geoservices.informatievlaanderen.be/overdrachtdiensten/Haltes/wfs',
    format 'WFS',
    config_options 'CPL_DEBUG=ON'
  );

CREATE FOREIGN TABLE haltes (
    fid bigint,
    shape Geometry(Point,31370),
    gml_id varchar,
    uidn double precision,
    oidn double precision,
    stopid double precision,
    naamhalte varchar,
    typehalte integer,
    lbltypehal varchar,
    codegem varchar,
    naamgem varchar
  ) 
  SERVER wfsserver 
  OPTIONS (
    layer 'Haltes:Halte'
  );

Pushdown from FDW

Let’s run a query on the haltes table, and peak into what the OGR FDW is doing, by setting the debug level to DEBUG1.

SET client_min_messages = DEBUG1;

SELECT gml_id, ST_AsText(shape) AS shape, naamhalte, lbltypehal
  FROM haltes 
  WHERE lbltypehal = 'Niet-belbus'
    AND shape && ST_MakeEnvelope(207950, 186590, 207960, 186600, 31370);

We get back one record, and two debug entries:

DEBUG:  OGR SQL: (LBLTYPEHAL = 'Niet-belbus')
DEBUG:  OGR spatial filter (207950 186590, 207960 186600)
-[ RECORD 1 ]-----------------------
gml_id     | Halte.10328
shape      | POINT(207956 186596)
naamhalte  | Lummen Frederickxstraat
lbltypehal | Niet-belbus

The debug entries are generated by the OGR FDW code, when it recognizes there are parts of the SQL query that can be passed to OGR:

  • OGR understands some limited SQL syntax, and OGR FDW passes those parts of any PostgreSQL query down to OGR.
  • OGR can handle simple bounding box spatial filters, and when OGR FDW sees the use of the && PostGIS operator, it passes the filter constant down to OGR.

So OGR FDW is passing the attribute and spatial filters from the SQL down to the OGR layer. But are they then being passed on to the remote datasource?

Pushdown from OGR

Every OGR “driver” is capable of pushing different amounts of logic down to the source data.

  • A driver that reads a file format cannot push anything down: there is no logic in a file.
  • A driver that reads from a database can push a lot down: databases are rich and powerful execution engines in their own right.

Our example data source, the Belgian “web feature server” actually supports both attribute and spatial filters, and the OGR driver will pass them down.

We can see OGR passing the filters down because when we created the server, we set config_options 'CPL_DEBUG=ON', to expose the GDAL logging information to our PostgreSQL server.

The GDAL debug entries are visible when we set the logging level to DEBUG2

SET client_min_messages = DEBUG2;

SELECT gml_id, ST_AsText(shape) AS shape, naamhalte, lbltypehal
  FROM haltes 
  WHERE lbltypehal = 'Niet-belbus'
    AND shape && ST_MakeEnvelope(207950, 186590, 207960, 186600, 31370);

Now we get a whole slew of logging, but I’m only going to pull out one line, the line that shows the WFS query that OGR sends to the remote server:

DEBUG:  GDAL None [0] WFS: http://geoservices.informatievlaanderen.be/overdrachtdiensten/Haltes/wfs?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetFeature&TYPENAME=Haltes:Halte&FILTER=%3CFilter%20xmlns%3D%22http:%2F%2Fwww.opengis.net%2Fogc%22%20xmlns:Haltes%3D%22informatievlaanderen.be%2FHaltes%22%20xmlns:gml%3D%22http:%2F%2Fwww.opengis.net%2Fgml%22%3E%3CAnd%3E%3CPropertyIsEqualTo%3E%3CPropertyName%3ELBLTYPEHAL%3C%2FPropertyName%3E%3CLiteral%3ENiet%2Dbelbus%3C%2FLiteral%3E%3C%2FPropertyIsEqualTo%3E%3CBBOX%3E%3CPropertyName%3EHaltes:SHAPE%3C%2FPropertyName%3E%3Cgml:Box%3E%3Cgml:coordinates%3E207950.0000000000000000,186590.0000000000000000%20207960.0000000000000000,186600.0000000000000000%3C%2Fgml:coordinates%3E%3C%2Fgml:Box%3E%3C%2FBBOX%3E%3C%2FAnd%3E%3C%2FFilter%3E

Awesome, right?

That’s pretty much un-readable, but if I copy out the value in the FILTER request variable, and reverse the URL encoding, I get this:

<Filter
  xmlns="http://www.opengis.net/ogc"
  xmlns:Haltes="informatievlaanderen.be/Haltes"
  xmlns:gml="http://www.opengis.net/gml">
  <And>
    <PropertyIsEqualTo>
      <PropertyName>LBLTYPEHAL</PropertyName>
      <Literal>Niet-belbus</Literal>
    </PropertyIsEqualTo>
    <BBOX>
      <PropertyName>Haltes:SHAPE</PropertyName>
      <gml:Box>
        <gml:coordinates>
          207950.0000000000000000,186590.0000000000000000
          207960.0000000000000000,186600.0000000000000000
        </gml:coordinates>
      </gml:Box>
    </BBOX>
  </And>
</Filter>

I know, who ever thought that jamming an XML encoded version of a SQL filter into an HTTP GET request was a good idea? (Some very very nice people.)

Anyways, as you can see, both the attribute and spatial portions of our original SQL query have been re-encoded as a WFS XML filter, and sent to the remote server.

OGR FDW correctly pushed the attribute and spatial portions of the WHERE clause into OGR, and OGR correctly pushed those filters into the dialect of the driver we were using, in this case the WFS driver.

The End

The really really cool part is that if we had been using, for example, the Oracle driver, OGR would have instead generated Oracle-compatible SQL and pushed that down!

It’s an abstraction layer over an abstraction layer… the best kind!