Moving on to Crunchy Data

Today is my first day with my new employer Crunchy Data. Haven’t heard of them? I’m not surprised: outside of the world of PostgreSQL, they are not particularly well known, yet.


I’m leaving behind a pretty awesome gig at CARTO, and some fabulous co-workers. Why do such a thing?

While CARTO is delivering steady growth and finding good traction with some core spatial intelligence use cases, the path to success is leading them into solving problems of increasing specificity: logistics optimization, siting, market evaluation.

Moving to Crunchy Data means transitioning from being the database guy (boring!) in a geospatial intelligence company, to being the geospatial guy (super cool!) in a database company. Without changing anything about myself, I get to be the most interesting guy in the room! What could be better than that?

Crunchy Data has quietly assembled an exceptionally deep team of PostgreSQL community members: Tom Lane, Stephen Frost, Joe Conway, Peter Geoghegan, Dave Cramer, David Steele, and Jonathan Katz are all names that will be familiar to followers of the PostgreSQL mailing lists.

They’ve also quietly assembled expertise in key areas of interest to large enterprises: security deployment details (STIGs, RLS, Common Criteria); Kubernetes and PaaS deployments; and now (ta da!) geospatial.

Why does this matter? Because the database world is at a technological inflection point.

Core enterprise systems change very infrequently, and only under pressure from multiple sources. The last major inflection point was around the early 2000s, when the fleet of enterprise proprietary UNIX systems came under pressure from multiple sources:

  • The RISC architecture began to fall noticeably behind x86, and particularly x86-64.
  • Pricing on RISC systems began to diverge sharply from x86 systems.
  • A compatible UNIX operating system (Linux) was available on the alternative architecture.
  • A credible support company (Red Hat) was available and speaking the language of the enterprise.

The timeline of the Linux tidal wave was (extremely roughly):

  • 90s - Linux becomes the choice of the tech cognoscenti.
  • 00s - Linux becomes the choice of everyone for greenfield applications.
  • 10s - Linux becomes the choice of everyone for all things.

By my reckoning, PostgreSQL is on the verge of a Linux-like tidal wave that washes away much of the enterprise proprietary database market (aka Oracle DBMS). Bear in mind, these things pan out over 30-year timelines, but:

  • Oracle DBMS offers no important feature differentiation for most workloads.
  • Oracle DBMS price hikes are driving customers to distraction.
  • Server-in-a-cold-room architectures are being replaced with the cloud.
  • PostgreSQL in the cloud, deployed as PaaS or otherwise, is mature.
  • A credible support industry (including Crunchy Data) is at hand to support migrators.

I’d say we’re about halfway through the evolution of PostgreSQL from “that cool database” to “the database”, but the next decade of change is going to be the one people notice. People didn’t notice Linux until it was already running practically everything, from web servers to airplane seatback entertainment systems. The same thing will obtain in database land; people won’t recognize the inevitability of PostgreSQL as the “default database” until the game is long over.

Having a chance to be a part of that change, and to promote geospatial as a key technology while it happens, is exciting to me, so I’m looking forward to my new role at Crunchy Data a great deal!

Meanwhile, I’m going to be staying on as a strategic advisor to CARTO on geospatial and database affairs, so I get to have a front seat on their continued growth too. Thanks to CARTO for three great years; I enjoyed them immensely!

BC IT Outsourcing 2017/18

A year has gone by and the public accounts are out again, so I have updated my running tally of IT outsourcing for the BC government.

While the overall growth is modest, I’m less sure of my numbers this year because of last year’s odd and sudden drop-off of the IBM amount.

I am guessing that the IBM drop should be associated with their loss of desktop support services for the health authorities, which went to Fujitsu, but for some reason the transfer is not reflected in the General Revenue accounts. The likely reason is that it’s now being paid out of some other bucket, like the Provincial Health Services Authority (PHSA), so fixing the trend line will involve finding that spend in other accounts. Unfortunately, Health Authorities won’t release their detailed accounts for another few months.

All this shows the difficulty in tracking IT spend over the whole of government, which is something the Auditor General remarked on when she reviewed IT spending a few years ago. Capital dollars are relatively easy to track, but operating dollars are not, and of course the spend is spread out over multiple authorities as well as in general revenue.

In terms of vendors, the drop-off in HP/ESIT is interesting. For those following along at home, ESIT (formerly known as HP Advanced Solutions) holds the “back-office” contract, which means running the two government data centres, in Calgary and Kamloops (yes, the primary BC data centre is in Alberta), as well as all the servers therein and the networks connecting them up. It’s a pretty critical contract, and the largest in government. Since the contract was first let (originally to Ross Perot’s EDS), one of the constants of this tracking has been that the amount always goes up. So this reduction is an interesting data point: will it hold?

And there’s still a large whale unaccounted for: the Coastal Health Authority electronic health record project, which has more-or-less dropped off the radar since Adrian Dix brought in Wynne Powell as a fixer. The vendors for that project (probably Cerner and others) appear in neither the PHSA nor the Coastal Health public accounts, I think because it is all being run underneath a separate entity. I haven’t had a chance to figure out the name of it (if you know the way this is financed, drop me a line).

PostGIS Polygon Splitting

One of the joys of geospatial processing is the variety of tools in the tool box, and the ways that putting them together can yield surprising results. I have been in the guts of PostGIS for so long that I tend to think in terms of primitives: either there’s a function that does what you want or there isn’t. I’m too quick to ignore the power of combining the parts that we already have.

A community member on the users list asked (paraphrased): “is there a way to split a polygon into sub-polygons of more-or-less equal areas?”

I didn’t see the question, which is lucky, because I would have said: “No, you’re SOL, we don’t have a good way to solve that problem.” (An exact algorithm showed up in the Twitter thread about this solution, and maybe I should implement that.)

PostGIS developer Darafei Praliaskouski did answer, and provided a working solution that is absolutely brilliant in combining the parts of the PostGIS toolkit to solve a pretty tricky problem. He said:

The way I see it, for any kind of polygon:

  • Convert a polygon to a set of points proportional to the area by ST_GeneratePoints (the more points, the more beautiful it will be, guess 1000 is ok);
  • Decide how many parts you’d like to split into, (ST_Area(geom)/max_area), let it be K;
  • Take KMeans of the point cloud with K clusters;
  • For each cluster, take a ST_Centroid(ST_Collect(point));
  • Feed these centroids into ST_VoronoiPolygons, that will get you a mask for each part of polygon;
  • ST_Intersection of original polygon and each cell of Voronoi polygons will get you a good split of your polygon into K parts.

Let’s take it one step at a time to see how it works.

We’ll use Peru as the example polygon; it’s got a nice concavity to it, which makes it a little trickier than an average shape.

CREATE TABLE peru AS
  SELECT *
  FROM countries
  WHERE name = 'Peru';

Original Polygon (Peru)

Now create a point field that fills the polygon. On average, each randomly placed point ends up “occupying” an equal area within the polygon.

CREATE TABLE peru_pts AS
  SELECT (ST_Dump(ST_GeneratePoints(geom, 2000))).geom AS geom
  FROM peru
  WHERE name = 'Peru';

2000 Random Points

Now, cluster the point field, setting the number of clusters to the number of pieces you want the polygon divided into. Visually, you can now see the divisions in the polygon! But, we still need to get actual lines to represent those divisions.

CREATE TABLE peru_pts_clustered AS
  SELECT geom, ST_ClusterKMeans(geom, 10) OVER () AS cluster
  FROM peru_pts;

10 Clusters

Using a point field and K-means clustering to get the split areas was inspired enough. The steps to get actual polygons are equally inspired.

First, calculate the centroid of each point cluster, which will be the center of mass for each cluster.

CREATE TABLE peru_centers AS
  SELECT cluster, ST_Centroid(ST_Collect(geom)) AS geom
  FROM peru_pts_clustered
  GROUP BY cluster;

Centroids of Clusters

Now, use a Voronoi diagram to get actual dividing edges between the cluster centroids, which end up closely matching the places where the clusters divide!

CREATE TABLE peru_voronoi AS
  SELECT (ST_Dump(ST_VoronoiPolygons(ST_Collect(geom)))).geom AS geom
  FROM peru_centers;

Voronoi of Centroids

Finally, intersect the Voronoi areas with the original polygon to get final output polygons that incorporate both the outer edges of the polygon and the Voronoi dividing lines.

CREATE TABLE peru_divided AS
  SELECT ST_Intersection(a.geom, b.geom) AS geom
  FROM peru a
  CROSS JOIN peru_voronoi b;

Intersection with Original Polygon


Clustering a point field to get mostly equal areas, and then using the Voronoi diagram to extract actual dividing lines, are wonderful insights into spatial processing. The final picture of all the components of the calculation is also beautiful.

All the Components Together
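
For reference, here is the whole pipeline rolled into a single query. This is just a sketch, assuming the same peru table as above and reusing the 2,000-point and 10-cluster values from the walkthrough; the CTE names and output table name are illustrative.

CREATE TABLE peru_divided_onepass AS
  WITH
  -- Fill the polygon with a point field proportional to area
  pts AS (
    SELECT (ST_Dump(ST_GeneratePoints(geom, 2000))).geom AS geom
    FROM peru
  ),
  -- Cluster the points into the number of pieces we want
  clustered AS (
    SELECT geom, ST_ClusterKMeans(geom, 10) OVER () AS cluster
    FROM pts
  ),
  -- Centre of mass of each cluster
  centers AS (
    SELECT cluster, ST_Centroid(ST_Collect(geom)) AS geom
    FROM clustered
    GROUP BY cluster
  ),
  -- Voronoi cells around the cluster centres form the dividing mask
  voronoi AS (
    SELECT (ST_Dump(ST_VoronoiPolygons(ST_Collect(geom)))).geom AS geom
    FROM centers
  )
  -- Clip the mask to the original polygon
  SELECT ST_Intersection(p.geom, v.geom) AS geom
  FROM peru p
  CROSS JOIN voronoi v;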

I’m not 100% sure, but it might be possible to use Darafei’s technique for even more interesting subdivisions, like “map of the USA subdivided into areas of equal GDP”, or “map of New York subdivided into areas of equal population” by generating the initial point field using an economic or demographic weighting.
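
How might that weighting work? A minimal sketch, assuming a hypothetical counties table with geom and population columns (the table, the column names, and the 10,000-people-per-point scale are all made up for illustration); once the weighted point field exists, the KMeans and Voronoi steps above run unchanged.

CREATE TABLE weighted_pts AS
  -- Emit points in proportion to population instead of area, so the
  -- downstream clusters enclose roughly equal population
  SELECT (ST_Dump(ST_GeneratePoints(geom, ceil(population / 10000.0)::integer))).geom AS geom
  FROM counties;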

PostGIS Talks @ FOSS4G North America

I presented my “PostGIS for Managers” talk for the last time (at least in this form) today at FOSS4G North America. The reason it’s for the last time is that the central conceit it’s built around, that a core decision is between a proprietary and an open source database, isn’t really operative anymore. The real decisions are now being driven by other considerations, like cloud platforms, and the services available there. So, it’s not really PostgreSQL versus Oracle anymore.

I also presented my “SQL Festival” talk, for the first time! New material is always a little challenging: will it work, and is it at the right level for the audience? It seemed to be well received: a mix of generic SQL tidbits, and some specific interesting queries you can do with PostGIS.

Changing the Direction of Government IT

I love that the UK government review of IT outsourcing deals has been nicknamed the “Ocean Liner report”, in reference to the amount of time it takes to change the direction of a large vessel.

The UK is hardly alone in having dug a deep hole of contractual obligation and learned helplessness, but they were among the first to publicly acknowledge their problem and start work on fixing it. So the rest of us watch with great interest, and try to formulate our own plans.

This afternoon I was invited to talk about technology directions to an audience of technical architects in the BC government. It was a great group, and they seemed raring to go, by and large.


I titled my talk “Let’s Get Small” in homage to my favourite comedian, Steve Martin.

Our collective IT ocean liners will be slow to change course, but step one is turning the wheel.