Parallel PostGIS IIA

One of the core complaints in my review of PostgreSQL parallelism, was that the cost of functions executed on rows returned by queries do not get included in evaluations of the cost of a plan.

So for example, the planner appeared to consider these two queries equivalent:

SELECT *
FROM pd;

SELECT ST_Area(geom)
FROM pd;

They both retrieve the same number of rows and both have no filter on them, but the second one includes a fairly expensive function evaluation. No amount of changing the cost of the ST_Area() function would cause a parallel plan to materialize. Only changing the size of the table (making it bigger) would flip the plan into parallel mode.

Fortunately, when I raised this issue on pgsql-hackers, it turned out to have been reported and discussed last month, and Amit Kapila had already prepared a patch, which he kindly rebased for me.

With the patch in place, I now see rational behavior from the planner. Using the default PostGIS function costs, a simple area calculation on my 60K row polling division table is sequential:

EXPLAIN
SELECT ST_Area(geom)
FROM pd;
Seq Scan on pd  
(cost=0.00..14445.17 rows=69534 width=8)

However, if the ST_Area() function is costed a little more realistically, the plan shifts.

ALTER FUNCTION ST_Area(geometry) COST 100;

EXPLAIN
SELECT ST_Area(geom)
FROM pd;
 Gather  
 (cost=1000.00..27361.20 rows=69534 width=8)
   Workers Planned: 3
   ->  Parallel Seq Scan on pd  
       (cost=0.00..19407.80 rows=22430 width=8)

Perfect!

While not every query receives what I consider a “perfect plan”, it now appears that we at least have some reasonable levers available to get better plans via applying some sensible (higher) costs across the PostGIS code base.

Parallel PostGIS II

A year and a half ago, with the release of PostgreSQL 9.6 on the horizon, I evaluated the parallel query infrastructure and how well PostGIS worked with it.

The results at the time were mixed: parallel query worked, when poked just the right way, with the correct parameters set on the PostGIS functions, and on the PostgreSQL back-end. However, under default settings, parallel queries did not materialize. Not for scans, not for joins, not for aggregates.

With the recent release of PostgreSQL 10, another generation of improvement has been added to parallel query processing, so it’s fair to ask, “how well does PostGIS parallelize now?”

Parallel PostGIS II

TL;DR:

The answer is, better than before:

  • Parallel aggregations now work out-of-the-box and parallelize in reasonable real-world conditions.
  • Parallel scans still require higher function costs to come into action, even in reasonable cases.
  • Parallel joins on spatial conditions still seem to have poor planning, requiring a good deal of manual poking to get parallel plans.

Setup

In order to run these tests yourself, you will need:

  • PostgreSQL 10
  • PostGIS 2.4

You’ll also need a multi-core computer to see actual performance changes. I used a 4-core desktop for my tests, so I could expect 4x improvements at best.

For testing, I used the same ~70K Canadian polling division polygons as last time.

createdb parallel
psql -c 'create extension postgis' parallel
shp2pgsql -s 3347 -I -D -W latin1 PD_A.shp pd | psql parallel

PDs

To support join queries, and on larger tables, I built a set of point tables based on the polling divisions. One point per polygon:

CREATE TABLE pts AS 
SELECT
ST_PointOnSurface(geom)::Geometry(point, 3347) AS geom,
gid, fed_num
FROM pd;

CREATE INDEX pts_gix
ON pts USING GIST (geom);

Ten points per polygon (for about 700K points):

CREATE TABLE pts_10 AS 
SELECT
(ST_Dump(ST_GeneratePoints(geom, 10))).geom::Geometry(point, 3347) AS geom,
gid, fed_num
FROM pd;

CREATE INDEX pts_10_gix
ON pts_10 USING GIST (geom);

One hundred points per polygon (for about 7M points):

CREATE TABLE pts_100 AS 
SELECT
(ST_Dump(ST_GeneratePoints(geom, 100))).geom::Geometry(point, 3347) AS geom,
gid, fed_num
FROM pd;

CREATE INDEX pts_100_gix
ON pts_100 USING GIST (geom);

The configuration parameters for parallel query have changed since the last test, and are (in my opinion) a lot easier to understand.

These parameters are used to fine-tune the planner and execution. Usually you don’t need to change them.

  • parallel_setup_cost sets the planner’s estimate of the cost of launching parallel worker processes. Default 1000.
  • parallel_tuple_cost sets the planner’s estimate of the cost of transferring one tuple from a parallel worker process to another process. Default 0.1.
  • min_parallel_table_scan_size sets the minimum amount of table data that must be scanned in order for a parallel scan to be considered. Default 8MB.
  • min_parallel_index_scan_size sets the minimum amount of index data that must be scanned in order for a parallel scan to be considered. Default 512kB.
  • force_parallel_mode forces the planner to parallelize is wanted. Values: off | on | regress
  • effective_io_concurrency for some platforms and hardware setups allows true concurrent read. Values from 1 (for one spinning disk) to ~100 (for an SSD drive). Default 1.

These parameters control how many parallel processes are launched for a query.

  • max_worker_processes sets the maximum number of background processes that the system can support. Default 8.
  • max_parallel_workers sets the maximum number of workers that the system can support for parallel queries. Default 8.
  • max_parallel_workers_per_gather sets the maximum number of workers that can be started by a single Gather or Gather Merge node. Setting this value to 0 disables parallel query execution. Default 2.

Once you get to the point where #processes == #cores there’s not a lot of advantage in adding more processes. However, each process does exact a cost in terms of memory: a worker process consumes work_mem the same as any other backend, so when planning memory usage take both max_connections and max_worker_processes into consideration.

Before running tests, make sure you have a handle on what your parameters are set to: I frequently found I accidentally tested with max_parallel_workers set to 1.

show max_worker_processes;
show max_parallel_workers;
show max_parallel_workers_per_gather;

Aggregates

First, set max_parallel_workers and max_parallel_workers_per_gather to 8, so that the planner has as much room as it wants to parallelize the workload.

PostGIS only has one true spatial aggregate, the ST_MemUnion function, which is comically inefficient due to lack of input ordering. However, it’s possible to see some aggregate parallelism in action by wrapping a spatial function in a parallelizable aggregate, like Sum():

SET max_parallel_workers = 8;
SET max_parallel_workers_per_gather = 8;

EXPLAIN ANALYZE
SELECT Sum(ST_Area(geom))
FROM pd

Boom! We get a 3-worker parallel plan and execution about 3x faster than the sequential plan.

Finalize Aggregate  
(cost=15417.45..15417.46 rows=1 width=8) 
(actual time=236.925..236.925 rows=1 loops=1)
->  Gather  
(cost=15417.13..15417.44 rows=3 width=8) 
(actual time=236.915..236.921 rows=4 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Partial Aggregate  
   (cost=14417.13..14417.14 rows=1 width=8) 
   (actual time=231.724..231.724 rows=1 loops=4)
       ->  Parallel Seq Scan on pd  
       (cost=0.00..13800.30 rows=22430 width=2308) 
       (actual time=0.049..30.407 rows=17384 loops=4)
Planning time: 0.111 ms
Execution time: 238.785 ms

Just to confirm, re-run it with parallelism turned off:

SET max_parallel_workers_per_gather = 0;

EXPLAIN ANALYZE
SELECT Sum(ST_Area(geom))
FROM pd

Back to one thread and taking about 3 times as long, as expected.

Scans

The simplest spatial parallel scan adds a spatial function to the filter clause.

SET max_parallel_workers = 8;
SET max_parallel_workers_per_gather = 8;

EXPLAIN ANALYZE
SELECT *
FROM pd
WHERE ST_Area(geom) > 10000;

Unfortunately, that does not give us a parallel plan.

The ST_Area() function is defined with a COST of 10. If we move it up, to 100, we can get a parallel plan.

SET max_parallel_workers_per_gather = 8;

ALTER FUNCTION ST_Area(geometry) COST 100;
EXPLAIN ANALYZE
SELECT *
FROM pd
WHERE ST_Area(geom) > 10000;

Boom! Parallel scan with three workers:

Gather  
(cost=1000.00..20544.33 rows=23178 width=2554) 
(actual time=0.253..293.016 rows=62158 loops=1)
Workers Planned: 5
Workers Launched: 5
->  Parallel Seq Scan on pd  
    (cost=0.00..17226.53 rows=4636 width=2554) 
    (actual time=0.091..210.581 rows=10360 loops=6)
     Filter: (st_area(geom) > '10000'::double precision)
     Rows Removed by Filter: 1229
Planning time: 0.128 ms
Execution time: 302.600 ms

It appears our spatial function costs may still be too low in general to get good planning. And as we will see with joins, it’s possible the planner is still discounting function costs too much in deciding whether to go parallel or not.

Joins

Starting with a simple join of all the polygons to the 100 points-per-polygon table, we get:

SET max_parallel_workers_per_gather = 4;

EXPLAIN
SELECT *
FROM pd
JOIN pts_100 pts
ON ST_Intersects(pd.geom, pts.geom);

PDs & Points

In order to give the PostgreSQL planner a fair chance, I started with the largest table, thinking that the planner would recognize that a “70K rows against 7M rows” join could use some parallel love, but no dice:

Nested Loop  
(cost=0.41..13555950.61 rows=1718613817 width=2594)
 ->  Seq Scan on pd  
     (cost=0.00..14271.34 rows=69534 width=2554)
 ->  Index Scan using pts_gix on pts  
     (cost=0.41..192.43 rows=232 width=40)
       Index Cond: (pd.geom && geom)
       Filter: _st_intersects(pd.geom, geom)

There are a number of knobs we can press on. There are two global parameters:

  • parallel_setup_cost defaults to 1000, but no amount of lowering the value, even to zero, causes a parallel plan.
  • parallel_tuple_cost defaults to 0.1. Reducing it by a factor of 100, to 0.001 causes the plan to flip over into a parallel plan.
SET parallel_tuple_cost = 0.001;

As with all parallel plans, it is a nested loop, but that’s fine since all PostGIS joins are nested loops.

Gather  (cost=0.28..4315272.73 rows=1718613817 width=2594)
Workers Planned: 4
->  Nested Loop  
    (cost=0.28..2596658.92 rows=286435636 width=2594)
     ->  Parallel Seq Scan on pts_100 pts  
         (cost=0.00..69534.00 rows=1158900 width=40)
     ->  Index Scan using pd_geom_idx on pd  
         (cost=0.28..2.16 rows=2 width=2554)
           Index Cond: (geom && pts.geom)
           Filter: _st_intersects(geom, pts.geom)

Running the parallel plan to completion on the 700K point table takes 18s with four workers and 53s with a sequential plan. We are not getting an optimal speed up from parallel processing anymore: four workers are completing in 1/3 of the time instead of 1/4.

If we set parallel_setup_cost and parallel_tuple_cost back to their defaults, we can also change the plan by fiddling with the function costs.

First, note that our query can be re-written like this, to expose the components of the spatial join:

SET parallel_tuple_cost=0.1;
SET parallel_setup_cost=1000;
SET max_parallel_workers_per_gather = 4;

EXPLAIN
SELECT *
FROM pd
JOIN pts_100 pts
ON pd.geom && pts.geom
AND _ST_Intersects(pd.geom, pts.geom);

The default cost of _ST_Intersects() is 100. If we adjust it up by a factor of 100, we can get a parallel plan.

ALTER FUNCTION _ST_Intersects(geometry, geometry) COST 10000;

However, what if our query only used a single spatial operator in the join filter? Can we still force a parallel plan on this query?

SET parallel_tuple_cost=0.1;
SET parallel_setup_cost=1000;
SET max_parallel_workers_per_gather = 4;

EXPLAIN
SELECT *
FROM pd
JOIN pts_100 pts
ON pd.geom && pts.geom;

The && operator could activate one of two functions:

  • geometry_overlaps(geom, geom) is bound to the && operator
  • geometry_gist_consistent_2d(internal, geometry, int4) is bound to the 2d spatial index

However, no amount of increasing their COST causes the operator-only query plan to flip into a parallel mode:

ALTER FUNCTION  geometry_overlaps(geometry, geometry) COST 1000000000000;
ALTER FUNCTION geometry_gist_consistent_2d(internal, geometry, int4) COST 10000000000000;

So for operator-only queries, it seems the only way to force a spatial join is to muck with the parallel_tuple_cost parameter.

More Joins

Can we parallelize a common GIS use case: the spatial overlay?

Shifted PDs

Here is a table that simply shifts the polling divisions up and over, so that they can be overlaid to create a new set of smaller polygons.

CREATE TABLE pd_translate AS 
SELECT ST_Translate(geom, 100, 100) AS geom,
fed_num, pd_num
FROM pd;

CREATE INDEX pd_translate_gix
ON pd_translate USING GIST (geom);
CREATE INDEX pd_fed_num_x
ON pd (fed_num);
CREATE INDEX pdt_fed_num_x
ON pd_translate (fed_num);

The overlay operation finds, for each geometry on one side, all the overlapping geometries, and then calculates the shape of those overlaps (the “intersection” of the pair). Calculating intersections is expensive, so it’s something want to happen in parallel, even more than we want the join to happen in parallel.

This query calculates the overlay of all polling divisions (and their translations) in British Columbia (fed_num > 59000):

EXPLAIN 
SELECT ST_Intersection(pd.geom, pdt.geom) AS geom
FROM pd
JOIN pd_translate pdt
ON ST_Intersects(pd.geom, pdt.geom)
WHERE pd.fed_num > 59000
AND pdt.fed_num > 59000;

Unfortunately, the default remains a non-parallel plan. The parallel_tuple_cost has to be adjusted down to 0.01 or the cost of _ST_Intersects() adjusted upwards to get a parallel plan.

Conclusions

  • The costs assigned to PostGIS functions still do not provide the planner a good enough guide to determine when to invoke parallelism. Costs assigned currently vary widely without any coherent reasons.
  • The planner behaviour on spatial joins remains hard to predict: is the deciding factor the join operator cost, the number of rows of resultants, or something else altogether? Counter-intuitively, it was easier to get join behaviour from a relatively small 6K x 6K polygon/polygon overlay join than it was for the 70K x 7M point/polygon overlay.

Adding PgSQL to PHP on OSX

I’m yak shaving this morning, and one of the yaks I need to ensmooth is running a PHP script that connects to a PgSQL database.

No problem, OSX ships with PHP! Oh wait, that PHP does not include PgSQL database support.

Adding PgSQL to PHP on OSX

At this point, you can either run to completely replace your in-build PHP with another PHP (probably good if you’re doing modern PHP development and want something newer than 5.5) or you can add PgSQL to your existing PHP installation. I chose the latter.

The key is to build the extension you want without building the whole thing. This is a nice trick available in PHP, similar to the Apache module system for independent module development.

First, figure out what version of PHP you will be extending:

> php --info | grep "PHP Version"

PHP Version => 5.5.38

For my version of OSX, Apple shipped 5.5.38, so I’ll pull down the code package for that version.

Then, unbundle it and go to the php extension directory:

tar xvfz php-5.5.38.tar.bz2
cd php-5.5.38/ext/pgsql

Now the magic part. In order to build the extension, without building the whole of PHP, we need to tell the extension how the PHP that Apple ships was built and configured. How do we do that? We run phpize in the extension directory.

> /usr/bin/phpize

Configuring for:
PHP Api Version:         20121113
Zend Module Api No:      20121212
Zend Extension Api No:   220121212

The phpize process reads the configuration of the installed PHP and sets up a local build environment just for the extension. All of a sudden we have a ./configure script, and we’re ready to build (assuming you have installed the MacOSX command-line developers tools with XCode).

> ./configure \
    --with-php-config=/usr/bin/php-config \
    --with-pgsql=/opt/pgsql/10

> make

Note that I have my own build of PostgreSQL in /opt/pgsql. You’ll need to supply the path to your own install of PgSQL so that the PHP extension can find the PgSQL libraries and headers to build against.

When the build is complete, you’ll have a new modules/ directory in the extension directory. Now figure out where your system wants extensions copied, and copy the module there.

> php --info | grep extension_dir

extension_dir => /usr/lib/php/extensions/no-debug-non-zts-20121212 => /usr/lib/php/extensions/no-debug-non-zts-20121212

> sudo cp modules/pgsql.so /usr/lib/php/extensions/no-debug-non-zts-20121212

Finally, you need to edit the /etc/php.ini file to enable the new module. If the file doesn’t already exist, you’ll have to copy in the template version and then edit that.

sudo cp /etc/php.ini.default /etc/php.ini
sudo vi /etc/php.ini

Find the line for the PgSQL module and uncomment and edit it appropriately.

;extension=php_pdo_sqlite.dll
extension=pgsql.so
;extension=php_pspell.dll

Now you can check and see if it has picked up the PgSQL module.

> php --info | grep PostgreSQL

PostgreSQL Support => enabled
PostgreSQL(libpq) Version => 10.0
PostgreSQL(libpq)  => PostgreSQL 10.0 on x86_64-apple-darwin15.6.0, compiled by Apple LLVM version 8.0.0 (clang-800.0.42.1)

That’s it!

BC IT Outsourcing 2016/17

Whoops, it snuck by me in the laze of summer, but the BC Public Accounts have come out, so I can do a (partial) update of my IT outsourcing summary. Why “partial”? Because I cannot include Health Region spending until their vendor spending summaries are released late in the year. So this summary is for central government only.

The year-over-year trend is flat, which means that last year’s steep drop-off of spending on IBM dominates the look of the chart.

The chart by vendor gives a better feel for who is up and who is down:

IBM, up a little year-over-year, but still way down after last years’ collapse. ESIT continues to dominate all vendors by a large margin. Note that ESIT is the new name for HP Advanced Solutions (HPAS) which was itself the new name for the BC operations of EDS.

Maximus is up a little, but with the MSP premium program (and thus the associated administration contract) potentially winding down it’s hard to imagine any long term trend for them but down. At a minimum the 50% cut in premium rates effectively doubles the administrative overhead represented by the Maximum contract, which is not a good look.

One thing I’m going to be looking at once the health numbers are in is whether billings by Cerner move up to compensate for the drop-off by IBM. By rights they should: Cerner has taken over the huge EHR project at PHSA/Coastal. On the other hand, I heard a rumour that much of that spending was shifted out to a “non-profit” entity by the BC Liberals, which would make it disappear from my survey of vendor payments reports.

PostGIS Operators in 2.4

TL;DR: If you are using ORDER BY or GROUP BY on geometry columns in your application and you have expectations regarding the order or groups you obtain, beware that PostGIS 2.4 changes the behaviour or ORDER BY and GROUP BY. Review your applications accordingly.


The first operators we learn about in elementary school are =, > and <, but they are the operators that are the hardest to define in the spatial realm.

PostGIS Operators in 2.4

When is = equal?

For example, take “simple” equality. Are geometry A and B equal? Should be easy, right?

But are we talking about:

  1. A and B have exactly the same vertices in the same order and with the same starting points?
  2. A and B have exactly the same vertices in any order? (see ST_OrderingEquals)
  3. A and B have the same vertices in any order but different starting points?
  4. A has some extra vertices that B does not, but they cover exactly the same area in space? (see ST_Equals)
  5. A and B have the same bounds?

Confusingly, for the first 16 years of its existence, PostGIS used definition 5, “A and B have the same bounds” when evaluating the = operator for two geometries.

However, for PostGIS 2.4, PostGIS will use definition 1: “A and B have exactly the same vertices in the same order and with the same starting points”.

Why does this matter? Because the behavour of the SQL GROUP BY operation is bound to the “=” operator: when you group by a column, an output row is generated for all groups where every item is “=” to every other item. With the new definition in 2.4, the semantics of GROUP BY should be more “intuitive” when used against geometries.

What is > and <?

Greater and less than are also tricky in the spatial domain:

  • Is POINT(0 0) less than POINT(1 1)? Sort of looks like it, but…
  • Is POINT(0 0) less than POINT(-1 1) or POINT(1 -1)? Hm, that makes the first one look less obvious…

Greater and less than are concepts that make sense for 1-dimensional values, but not for higher dimensions. The “>” and “<” operators have accordingly been an ugly hack for a long time: they compared the minima of the bounding boxes of the two geometries.

  • If they were sortable using the X coordinate of the minima, that was the sorting returned.
  • If they were equal in X, then the Y coordinate of the minima was used.
  • Etc.

This process returned a sorted order, but not a very satisfying one: a “good” sorting would tend to place objects that are near to each other in space, near to each other in the sorted set.

Here’s what the old sorting looked like, applied to world populated places:

Geometry sorting in PostGIS 2.3

The new sorting system for PostGIS 2.4 calculates a very simple “morton key” using the center of the bounds of a feature, keeping things simple for performance reasons. The result is a sorted order that tends to keep spatially nearby features closer together in the sorted set.

Geometry sorting in PostGIS 2.4

Just as the “=” operator is tied to the SQL GROUP BY operation, the “>” and “<” operators are tied to the SQL ORDER BY operation. The pictures above were created by generating a line string from the populated places points as follows:

CREATE TABLE places_line AS 
SELECT ST_MakeLine(geom ORDER BY geom) AS geom
FROM places;