Parallel PostGIS and PgSQL 12 (2)

In my last post I demonstrated that PostgreSQL 12 with PostGIS 3 will provide, for the first time, automagical parallelization of many common spatial queries.

This is huge news, as it opens up the possibility of extracting more performance from modern server hardware. Commenters on the post immediately began conjuring images of 32-core machines reducing their query times to milliseconds.

So, the next question is: how much more performance can we expect?

To investigate, I acquired a 16-core machine on AWS (m5d.4xlarge) and installed the current development snapshots of PostgreSQL and PostGIS, the code that will become versions 12 and 3, respectively, when they are released in the fall.

How Many Workers?

The number of workers assigned to a query is determined by PostgreSQL: the system looks at a given query, and the size of the relations to be processed, and assigns workers proportional to the log of the relation size.
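
In recent releases the rule works out to roughly one additional worker each time the table size triples beyond min_parallel_table_scan_size, capped by max_parallel_workers_per_gather. Assuming the stock defaults, the thresholds look something like this:

-- Approximate worker counts for a parallel sequential scan, assuming the
-- default settings (min_parallel_table_scan_size = 8MB,
-- max_parallel_workers_per_gather = 2):
--   under 8MB -> 0 workers (serial scan)
--   8MB+      -> 1 worker
--   24MB+     -> 2 workers
--   72MB+     -> 3 workers (only if the per-gather cap is raised)
SHOW min_parallel_table_scan_size;
SHOW max_parallel_workers_per_gather;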

For parallel plans, the “explain” output of PostgreSQL will include a count of the number of workers planned and launched. That count is exclusive of the leader process, and the leader actually does real work beyond coordinating the query, so the number of CPUs actually working is more than num_workers, but slightly less than num_workers+1. For these graphs, we’ll assume the leader fully participates in the work, and that the number of CPUs in play is num_workers+1.
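
In a parallel aggregate plan, for example, the worker counts appear like this (a hypothetical plan shape for illustration, not output captured from the test machine):

Finalize Aggregate
  ->  Gather
        Workers Planned: 3
        Workers Launched: 3
        ->  Partial Aggregate
              ->  Parallel Seq Scan on pd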

Forcing Workers

PostgreSQL’s automatic calculation of the number of workers could be a blocker for analyzing parallel performance, but fortunately there is a workaround.

Tables support a “storage parameter” called parallel_workers. When a relation with parallel_workers set participates in a parallel plan, the value of parallel_workers overrides the automatically calculated number of workers.

ALTER TABLE pd SET (parallel_workers = 8);
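
Once the experiments are done, the override can be removed and the automatic calculation restored:

ALTER TABLE pd RESET (parallel_workers);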

In order to generate my data, I re-ran my queries, upping the number of parallel_workers on my tables for each run.
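
A single run looked roughly like this (a sketch; for the join test the pts_10 table got the same treatment):

ALTER TABLE pd SET (parallel_workers = 2);
ALTER TABLE pts_10 SET (parallel_workers = 2);
-- re-run the test queries, then repeat with the next worker count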

Setup

Before running the tests, I set all the global limits on workers high enough to use all the CPUs on my test server.

-- max_worker_processes can only be changed with a server restart, so it is
-- set in postgresql.conf (or via ALTER SYSTEM) before starting the cluster:
--   max_worker_processes = 16
SET max_parallel_workers = 16;
SET max_parallel_workers_per_gather = 16;

I also loaded my data and created indexes as usual. The tables I used for these tests were:

pd, about 69K polygons
pts_10, about 695K points (10 points generated inside each polygon)
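
The points table can be regenerated from the polygons with something along these lines (a sketch, assuming pd has a geom geometry column; the original generation may have differed):

CREATE TABLE pts_10 AS
  SELECT (ST_Dump(ST_GeneratePoints(geom, 10))).geom AS geom
    FROM pd;

CREATE INDEX pts_10_geom_x ON pts_10 USING GIST (geom);
ANALYZE pts_10;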

Scan Performance

I tested two kinds of queries: a straight scan query, with only one table in play, and a spatial join with two tables. I used the usual queries from my annual parallel tests.

EXPLAIN ANALYZE 
  SELECT Sum(ST_Area(geom)) 
    FROM pd;

Scan performance improved well at first, but started to flatten out noticeably after 8 cores.

CPUs        1     2     4     8    16
Time (ms)   318   167   105   62   47

The default number of CPUs the system wanted to use was 4 (1 leader + 3 workers), which is probably not a bad choice, as the expected gains from additional workers shallow out as the core count grows.

Join Performance

The join query computes the join of 69K polygons against 695K points. The points are actually generated from the polygons, with precisely 10 points inside each polygon, so the resulting relation is about 690K records long.

EXPLAIN ANALYZE
  SELECT *
    FROM pd
    JOIN pts_10 pts
      ON ST_Intersects(pd.geom, pts.geom);

For unknown reasons, it was impossible to force out a join plan with only 1 worker (aka 2 CPUs), so that part of the table is empty.

CPUs        1       2   4      8      16
Time (ms)   26789   -   9371   5169   4043

The default number of CPUs is 4 (1 leader + 3 workers), which, again, isn’t bad. The join performance shallows out faster than the scan performance, and above 10 CPUs it is basically flat.

Conclusions