Geomatiqué 2016 Keynote

21 Oct 2016

This week I had the pleasure of presenting the morning keynote at Geomatiqué 2016 in Montreal. I’ve been thinking a lot recently about symbiosis of technology and culture: how new technology is generating new norms to go along with it. We humans are enormously adaptable, so princples that were sacrosanct to one generation become unknown to the next, and back again.

On the chopping block for our generation: privacy.

Not that privacy is any human absolute. I doubt a hunter gatherer had a lot of personal space and privacy: the smaller the group, the more fevered the gossip-mill. On the other hand, the abolition of privacy within the context of an industrial-sized polity will truly be something new under the sun. It could be fine, in its way; it could also be Nineteen Eighty-Four realized.

Presentation PDF

Anyways, the talk is mostly a survey of technology trends, with some philosophizing at the end. Unfortunately, no video at this event, but if you’d like me to deliver this talk to your organization, drop me a line.

PgSQL Indexes and "LIKE"

25 Aug 2016

Do you write queries like this:

SELECT * FROM users 
WHERE name LIKE 'G%'

Are your queries unexpectedly slow in PostgreSQL? Is the index not doing what you expect? Surprise! You’ve just discovered a PostgreSQL quirk.

TL;DR: If you are running a locale other than “C” (show LC_COLLATE to check) you need to create a special index to support pattern searching with the LIKE operator: CREATE INDEX myindex ON mytable (mytextcolumn text_pattern_ops). Note the specification of the text_pattern_ops operator class after the column name.

As a beginner SQL student, you might have asked “will the index make my ‘like’ query fast” and been answered “as long as the wildcard character is at the end of the string, it will.”

PgSQL Indexes and "LIKE"

That statement is only true in general if your database is initialized using the “C” locale (the North America/English-friendly UNIX default). Running with “C” used to be extremely common, but is less and less so, as modern operating systems automagically choose appropriate regional locales to provide approriate time and formatting for end users.

For example, I run Mac OSX and I live in British Columbia, an English-speaking chunk of North America. I could use “C” just fine, but when I check my database locale (via my collation), I see this:

pramsey=# show LC_COLLATE;
 lc_collate  
-------------
 en_CA.UTF-8
(1 row)

It’s a good choice, it’s where I live, it supports lots of characters via UTF-8. However, it’s not “C”, so there are some quirks.

I have a big table of data linked to postal codes, this is what the table looks like:

              Table "gis.postal_segments"
      Column       |     Type     | Modifiers 
-------------------+--------------+-----------
 postal_code       | text         | not null
 segment           | character(4) | 
Indexes:
    "postal_segments_pkey" PRIMARY KEY, btree (postal_code)

Note the index on the postal code, a standard btree.

I want to search rows based on a postal code prefix string, so I run:

EXPLAIN ANALYZE 
SELECT * FROM postal_segments 
WHERE postal_code LIKE 'V8V1X%';

                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on postal_segments  (cost=0.00..2496.85 rows=10 width=68) (actual time=30.320..34.219 rows=4 loops=1)
   Filter: (postal_code ~~ 'V8V1X%'::text)
   Rows Removed by Filter: 100144
 Planning time: 0.250 ms
 Execution time: 34.263 ms
(5 rows)

Ruh roh!

I have an index on the postal code, so why am I getting a sequence scan?!?! Because my index is no good for doing pattern matching in any collation other than “C”. I need a special index for that, which I create like this.

CREATE INDEX postal_segments_text_x 
  ON postal_segments (postal_code text_pattern_ops);

The magic part is at the end, invoking text_pattern_ops as the opclass for this index. Now my query works as expected:

EXPLAIN ANALYZE 
SELECT * FROM postal_segments 
WHERE postal_code LIKE 'V8V1X%';

                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using postal_segments_text_x on postal_segments  (cost=0.29..8.31 rows=10 width=68) (actual time=0.067..0.073 rows=4 loops=1)
   Index Cond: ((postal_code ~>=~ 'V8V1X'::text) AND (postal_code ~<~ 'V8V1Y'::text))
   Filter: (postal_code ~~ 'V8V1X%'::text)
 Planning time: 0.532 ms
 Execution time: 0.117 ms
(5 rows)

I have gotten so used to PostgreSQL doing exactly the right thing automatically that it took quite a long time to track down this quirk when I ran into it. I hope this page helps others save some time!

Your Broken PostGIS Upgrade

10 Aug 2016

Since the Dawn of Time, people have found PostGIS upgrades difficult and confusing, and this is entirely to be expected, because a PostGIS upgrade consists of a number of interlocking parts. Sometimes, they “upgrade” their version of PostGIS and find out they’ve bricked their system. What gives?

Your Broken PostGIS Upgrade

What Makes PostGIS Work?

Before talking about upgrades, it’s important to understand how PostGIS works at all, because that understanding is key to seeing how upgrade scenarios go bad.

PostGIS is a “run-time loadable library” for PostgreSQL. That means we have a block of C code that is added to a running PostgreSQL database. That C code sits in a “library file” which is named (for the current 2.2 version): postgis-2.2.so.

Just to add to the confusion: for Windows, the name of the library file is postgis-2.2.dll. For every rule, there must be an exception. For users of Apple OSX, yes, there’s a further exception for you: even though most dynamic libraries on OSX are suffixed .dylib, the PostgreSQL modules on OSX are suffixed .so, just like their Linux counterparts.

The location of the postgis-2.2.so file will vary from system to system.

The presence of the postgis-2.2.so alone is not sufficient to “PostGIS enable” a database. PostGIS consists of a large collection of SQL functions in the database.

The SQL functions are created when you run the CREATE EXTENSION postgis command. Until that time your database knows nothing about the existence or definition of the PostGIS functions.

Once the extension is installed, you can see the definitions of the PostGIS functions in the system tables.

The use of dynamic function and type management catalogs is one of the things which makes PostgreSQL so incredibly flexible for extensions like PostGIS

SELECT * 
  FROM pg_proc 
  WHERE proname = 'st_pointonsurface';

-[ RECORD 1 ]---+--------------------
proname         | st_pointonsurface
pronamespace    | 2200
proowner        | 10
prolang         | 13
procost         | 100
prorows         | 0
provariadic     | 0
protransform    | -
proisagg        | f
proiswindow     | f
prosecdef       | f
proleakproof    | f
proisstrict     | t
proretset       | f
provolatile     | i
pronargs        | 1
pronargdefaults | 0
prorettype      | 667466
proargtypes     | 667466
proallargtypes  | 
proargmodes     | 
proargnames     | 
proargdefaults  | 
prosrc          | pointonsurface
probin          | $libdir/postgis-2.2
proconfig       | 
proacl          | 

Lots to see here, but most important bit is the entry for the probin column: $libdir/postgis-2.2. This function (like all the other PostGIS functions) is bound to a particular version of the PostGIS C library.

Those of you thinking forward can now begin to see where upgrades could potentially go wrong.

How Things Go Wrong

Package Managers

The most common way for things to go wrong is to upgrade the library on the system without upgrading the database.

So, in Red Hat Linux terms, perhaps running:

yum upgrade postgresql94-postgis

This seems straight-forward, but think about what a package manager does during an upgrade:

Downloads a new version of the software
Removes the old version
Copies in the new version

So, if we had PostGIS 2.1.3 installed, and the latest version is 2.2.2, what has happend?

The postgis-2.1.so file has been removed
The postgis-2.2.so file has been added
So, the pg_proc entries in every PostGIS-enabled database now point to a library file that does not exist

Fortunately this mismatch between the pg_proc entries and the system state is usually solved during the very next step of the upgrade. But it’s a manual step, and if the DBA and system administrator are different people with different schedules, it might not happen.

Your next step should be to go and update the SQL function definitions by running an extension update on all your databases:

ALTER EXTENSION postgis UPDATE TO '2.2.2';

If you don’t, you’ll find that none of the PostGIS functions work. That, in fact, you cannot even dump your database. The very act of outputting a representation of the geometry data is something that requires the PostGIS C library file, and until you run ALTER EXTENSION the database doesn’t know where the new library file is.

Migrations

Since the use of CREATE EXTENSION postgis (available since PostgreSQL 9.1+ and PostGIS 2.0+) became commonplace, migrations now almost always “just work”, which is excellent news.

When you dump a modern PostGIS-enabled database, that was created using the CREATE EXTENSION postgis command, the dump file just includes a CREATE EXTENSION postgis command of its own at the top.
When you load the dump file into a new version of PostgreSQL even with a new version of PostGIS, the extension is created and the data magically loads.

However, there are still some old databases around that were created before the PostgreSQL extension system was invented, and when you dump them you get not only the data, but all the “custom” function and type definitions, including the defintions for PostGIS. A function definition looks like this:

CREATE OR REPLACE FUNCTION ST_PointOnSurface(geometry)
    RETURNS geometry
    AS '$libdir/postgis-2.2', 'pointonsurface'
    LANGUAGE 'c' IMMUTABLE STRICT;

And look what is hiding inside of it: a reference to a particular version of the PostGIS library! So you cannot simply dump your old PostGIS 1.5 database on PostgreSQL 8.4 and load it into a fresh PostGIS 2.2 database on PostgreSQL 9.5: the function definitions won’t reference the right library file.

The best bet for a really old database that was created without the extension mechanism is to use the “hard upgrade” process. The hard upgrade works by:

Taking a special “custom-format” back-up that includes an object catalog;
Filtering the back-up to clean out all the PostGIS-specific function and object definitions; and then
Loading the “cleaned” back-up into a new database with the desired version of PostGIS already installed (using CREATE EXTENSION postgis this time, so you never have to hard upgrade again).

Hacks

In the case of upgrades that change out the underlying library and other situations that result in a mismatch between the SQL definitions in the database and the state of the system, there are a couple hacks that provide short-term fixes for emergencies:

Symlink the library name the database is looking for to the library name you have. So if your database wants postgis-2.1.so and all you have is postgis-2.2.so, you can ln -s postgis-2.2.so postgis-2.1.so and your database will “work” again.
Update the PostgreSQL catalog definitions for the functions. As a super-user, you can do all kinds of dangerous things, and one of them is to just UPDATE pg_proc SET probin = '$libdir/postgis-2.2' WHERE probin ~ 'postgis-2.1'

Both hacks “work” because the PostGIS project doesn’t change underlying function names often, and inter-version changes mostly involve adding functions to the C library, not removing old ones.

However, there’s no guarantee that an underlying function name hasn’t change between versions, it’s just unlikely. In the worst case, the function name hasn’t changed, but the parameters have, so it’s now possible that calling the function will crash your database.

All this to say: linking and SQL catalogue hacks should be used temporarily only until you can properly upgrade your database using a hard upgrade.

BC IT Outsourcing 2015/16

04 Aug 2016

It’s BC Public Accounts time again (calm down!), and in this year’s analysis of IT outsourcing we have a surprise result: total spending on IT contractors by the BC central government actually fell in the 2015/16 budget year.

The fall-off is almost entirely the result of a collapse of billings to the central government by IBM, though Deloitte had a small fall-off also.

Other IT contractors continued to bill heartily, including massive category leader HP Advanced Systems (HPAS) who billed a very respectable $163,499,787. For comparison, that’s more than was spent on the Ministries of Energy & Mines, Agriculture, Small Business, and International Trade, combined.

Since 2010, IBM’s billings to central government have fallen in four of the last six years, which leads me to a question: what’s the matter with Big Blue?

After further research, the answer is this: nothing at all. Big Blue is doing just fine.

It turns out that the fall off in IBM billing to central government has been more than offset by a massive increase in IBM billing to the Provincial Health Authorities. The Ministry of Health budget is huge, but much of it is spent by the Health Authorities, and the provincial public accounts only record transfers to the Authorities, they don’t keep track of who the Authorities spend with.

Fortunately, the Authorities also have to publish annual financial statements, and I have now input all that data into my summary sheets, through to 2015 (Health Authorities do not publish their detailed payments data until the fall, so I am one year behind until then). The results are, frankly, staggering…

When Health Authority spending is taken into acount, IBM revenue from the BC government has not fallen at all. It has instead been on an almost unbroken tear upwards, taking total government IT outsourcing spending to just under $700 million dollars in 2014/15.

A reasonable chunk of that billing over the last couple years has been on the PHSA/Vancouver Coastal Health electronic medical records project, a $1B+ firehose of cash that IBM has been slurping on heavily, though to little practical effect.

Since IBM was kicked off the project in mid-2015, we can expect their take to fall in the 2015/16 data, but much of the slack should be taken up by medical software vendor Cerner who have been given the prime contractor role on a no-bid, sole-source basis (I’m sure that will work out fine).

While entering the Health Authority data, I had the opportunity to learn a little about the ecosystem of local vendors who support the health sector. Unsurprisingly, there is little overlap between those companies and the ones I already know of who support central government in Victoria: smaller companies tend to be more specialized.

When you lump every local vendor together, central government and health authority, and plot them up, the result is… underwhelming.

While there has been a trending upwards of local IT contracting over time, it is dwarfed 10:1 by the dollars spent on the large international consultancies. HPAS alone takes in over three times what every local IT services firm in the province bills.

The moral of the story, I think, is: if you have money, you will spend money.

As part of my research, I reviewed the school district financials and found very minimal spending on IT consulting: not enough to warrant the effort to enter all the data. School districts are just too poor to waste money on stupid IT consulting.
On the flip side, I reviewed the ICBC 2015 list of vendors and found, in just one year, spending of:
- $5,051,887 on Accenture
- $23,944,563 on IBM
- $3,177,241 on HP Advanced Solutions
- $7,962,866 on Deloitte, and
- $5,962,057 on Quartech Systems.
On administrative operating costs of about $350M, that’s $46M of IT spend, over 18%. If you’ve got money, you’ll find a way to spend it. However, since ICBC is not a direct arm of government, I didn’t include their crazy spend in my totals.

Next year is hard to predict: IBM should have a fall-off; offset partially by Cerner, as they pause the EMR project before ramping back up again. HPAS should continue incremental growth. A wild card is the Natural Resources Permitting Project, which hasn’t truly hit its spending stride yet. If it gets going, I expect Deloitte will increase their billing substantially in 2016/17.

Until next year, happy billing BC!

Politics Amid Plenty

12 Jun 2016

“Political leadership is a subtle art in times of plenty. When there are no great crises, there is no public demand for heroic acts. Politics becomes a parlor game, ignored by all but the most devoted citizens, a game practiced most assiduously by those interests–the business associations, trade unions, and single-issue groups–who have a direct stake in the outcome. The game turns on tactics and gestures, on the ability to placate factions, rather than to inspire the masses.”
- Joe Kein, The Natural

Older Newer

Paul Ramsey

Geomatiqué 2016 Keynote

PgSQL Indexes and "LIKE"

Your Broken PostGIS Upgrade

What Makes PostGIS Work?

How Things Go Wrong

Package Managers

Migrations

Hacks

BC IT Outsourcing 2015/16

Politics Amid Plenty