PgSQL Indexes and "LIKE"

Do you write queries like this:

SELECT * FROM users 
WHERE name LIKE 'G%'

Are your queries unexpectedly slow in PostgreSQL? Is the index not doing what you expect? Surprise! You’ve just discovered a PostgreSQL quirk.

TL;DR: If you are running a locale other than “C” (show LC_COLLATE to check) you need to create a special index to support pattern searching with the LIKE operator: CREATE INDEX myindex ON mytable (mytextcolumn text_pattern_ops). Note the specification of the text_pattern_ops operator class after the column name.

As a beginner SQL student, you might have asked “will the index make my ‘like’ query fast” and been answered “as long as the wildcard character is at the end of the string, it will.”

PgSQL Indexes and "LIKE"

That statement is only true in general if your database is initialized using the “C” locale (the North America/English-friendly UNIX default). Running with “C” used to be extremely common, but is less and less so, as modern operating systems automagically choose appropriate regional locales to provide approriate time and formatting for end users.

For example, I run Mac OSX and I live in British Columbia, an English-speaking chunk of North America. I could use “C” just fine, but when I check my database locale (via my collation), I see this:

pramsey=# show LC_COLLATE;
 lc_collate  
-------------
 en_CA.UTF-8
(1 row)

It’s a good choice, it’s where I live, it supports lots of characters via UTF-8. However, it’s not “C”, so there are some quirks.

I have a big table of data linked to postal codes, this is what the table looks like:

              Table "gis.postal_segments"
      Column       |     Type     | Modifiers 
-------------------+--------------+-----------
 postal_code       | text         | not null
 segment           | character(4) | 
Indexes:
    "postal_segments_pkey" PRIMARY KEY, btree (postal_code)

Note the index on the postal code, a standard btree.

I want to search rows based on a postal code prefix string, so I run:

EXPLAIN ANALYZE 
SELECT * FROM postal_segments 
WHERE postal_code LIKE 'V8V1X%';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on postal_segments  (cost=0.00..2496.85 rows=10 width=68) (actual time=30.320..34.219 rows=4 loops=1)
   Filter: (postal_code ~~ 'V8V1X%'::text)
   Rows Removed by Filter: 100144
 Planning time: 0.250 ms
 Execution time: 34.263 ms
(5 rows)

Ruh roh!

I have an index on the postal code, so why am I getting a sequence scan?!?! Because my index is no good for doing pattern matching in any collation other than “C”. I need a special index for that, which I create like this.

CREATE INDEX postal_segments_text_x 
  ON postal_segments (postal_code text_pattern_ops);

The magic part is at the end, invoking text_pattern_ops as the opclass for this index. Now my query works as expected:

EXPLAIN ANALYZE 
SELECT * FROM postal_segments 
WHERE postal_code LIKE 'V8V1X%';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using postal_segments_text_x on postal_segments  (cost=0.29..8.31 rows=10 width=68) (actual time=0.067..0.073 rows=4 loops=1)
   Index Cond: ((postal_code ~>=~ 'V8V1X'::text) AND (postal_code ~<~ 'V8V1Y'::text))
   Filter: (postal_code ~~ 'V8V1X%'::text)
 Planning time: 0.532 ms
 Execution time: 0.117 ms
(5 rows)

I have gotten so used to PostgreSQL doing exactly the right thing automatically that it took quite a long time to track down this quirk when I ran into it. I hope this page helps others save some time!

Your Broken PostGIS Upgrade

Since the Dawn of Time, people have found PostGIS upgrades difficult and confusing, and this is entirely to be expected, because a PostGIS upgrade consists of a number of interlocking parts. Sometimes, they “upgrade” their version of PostGIS and find out they’ve bricked their system. What gives?

Your Broken PostGIS Upgrade

What Makes PostGIS Work?

Before talking about upgrades, it’s important to understand how PostGIS works at all, because that understanding is key to seeing how upgrade scenarios go bad.

PostGIS is a “run-time loadable library” for PostgreSQL. That means we have a block of C code that is added to a running PostgreSQL database. That C code sits in a “library file” which is named (for the current 2.2 version): postgis-2.2.so.

Just to add to the confusion: for Windows, the name of the library file is postgis-2.2.dll. For every rule, there must be an exception. For users of Apple OSX, yes, there’s a further exception for you: even though most dynamic libraries on OSX are suffixed .dylib, the PostgreSQL modules on OSX are suffixed .so, just like their Linux counterparts.

The location of the postgis-2.2.so file will vary from system to system.

The presence of the postgis-2.2.so alone is not sufficient to “PostGIS enable” a database. PostGIS consists of a large collection of SQL functions in the database.

The SQL functions are created when you run the CREATE EXTENSION postgis command. Until that time your database knows nothing about the existence or definition of the PostGIS functions.

Once the extension is installed, you can see the definitions of the PostGIS functions in the system tables.

The use of dynamic function and type management catalogs is one of the things which makes PostgreSQL so incredibly flexible for extensions like PostGIS

SELECT * 
  FROM pg_proc 
  WHERE proname = 'st_pointonsurface';
-[ RECORD 1 ]---+--------------------
proname         | st_pointonsurface
pronamespace    | 2200
proowner        | 10
prolang         | 13
procost         | 100
prorows         | 0
provariadic     | 0
protransform    | -
proisagg        | f
proiswindow     | f
prosecdef       | f
proleakproof    | f
proisstrict     | t
proretset       | f
provolatile     | i
pronargs        | 1
pronargdefaults | 0
prorettype      | 667466
proargtypes     | 667466
proallargtypes  | 
proargmodes     | 
proargnames     | 
proargdefaults  | 
prosrc          | pointonsurface
probin          | $libdir/postgis-2.2
proconfig       | 
proacl          | 

Lots to see here, but most important bit is the entry for the probin column: $libdir/postgis-2.2. This function (like all the other PostGIS functions) is bound to a particular version of the PostGIS C library.

Those of you thinking forward can now begin to see where upgrades could potentially go wrong.

How Things Go Wrong

Package Managers

The most common way for things to go wrong is to upgrade the library on the system without upgrading the database.

So, in Red Hat Linux terms, perhaps running:

yum upgrade postgresql94-postgis

This seems straight-forward, but think about what a package manager does during an upgrade:

  • Downloads a new version of the software
  • Removes the old version
  • Copies in the new version

So, if we had PostGIS 2.1.3 installed, and the latest version is 2.2.2, what has happend?

  • The postgis-2.1.so file has been removed
  • The postgis-2.2.so file has been added
  • So, the pg_proc entries in every PostGIS-enabled database now point to a library file that does not exist

Fortunately this mismatch between the pg_proc entries and the system state is usually solved during the very next step of the upgrade. But it’s a manual step, and if the DBA and system administrator are different people with different schedules, it might not happen.

Your next step should be to go and update the SQL function definitions by running an extension update on all your databases:

ALTER EXTENSION postgis UPDATE TO '2.2.2';

If you don’t, you’ll find that none of the PostGIS functions work. That, in fact, you cannot even dump your database. The very act of outputting a representation of the geometry data is something that requires the PostGIS C library file, and until you run ALTER EXTENSION the database doesn’t know where the new library file is.

Migrations

Since the use of CREATE EXTENSION postgis (available since PostgreSQL 9.1+ and PostGIS 2.0+) became commonplace, migrations now almost always “just work”, which is excellent news.

  • When you dump a modern PostGIS-enabled database, that was created using the CREATE EXTENSION postgis command, the dump file just includes a CREATE EXTENSION postgis command of its own at the top.
  • When you load the dump file into a new version of PostgreSQL even with a new version of PostGIS, the extension is created and the data magically loads.

However, there are still some old databases around that were created before the PostgreSQL extension system was invented, and when you dump them you get not only the data, but all the “custom” function and type definitions, including the defintions for PostGIS. A function definition looks like this:

CREATE OR REPLACE FUNCTION ST_PointOnSurface(geometry)
    RETURNS geometry
    AS '$libdir/postgis-2.2', 'pointonsurface'
    LANGUAGE 'c' IMMUTABLE STRICT; 

And look what is hiding inside of it: a reference to a particular version of the PostGIS library! So you cannot simply dump your old PostGIS 1.5 database on PostgreSQL 8.4 and load it into a fresh PostGIS 2.2 database on PostgreSQL 9.5: the function definitions won’t reference the right library file.

The best bet for a really old database that was created without the extension mechanism is to use the “hard upgrade” process. The hard upgrade works by:

  • Taking a special “custom-format” back-up that includes an object catalog;
  • Filtering the back-up to clean out all the PostGIS-specific function and object definitions; and then
  • Loading the “cleaned” back-up into a new database with the desired version of PostGIS already installed (using CREATE EXTENSION postgis this time, so you never have to hard upgrade again).

Hacks

In the case of upgrades that change out the underlying library and other situations that result in a mismatch between the SQL definitions in the database and the state of the system, there are a couple hacks that provide short-term fixes for emergencies:

  • Symlink the library name the database is looking for to the library name you have. So if your database wants postgis-2.1.so and all you have is postgis-2.2.so, you can ln -s postgis-2.2.so postgis-2.1.so and your database will “work” again.
  • Update the PostgreSQL catalog definitions for the functions. As a super-user, you can do all kinds of dangerous things, and one of them is to just UPDATE pg_proc SET probin = '$libdir/postgigs-2.2' WHERE probin ~ 'postgis-2.1'

Both hacks “work” because the PostGIS project doesn’t change underlying function names often, and inter-version changes mostly involve adding functions to the C library, not removing old ones.

However, there’s no guarantee that an underlying function name hasn’t change between versions, it’s just unlikely. In the worst case, the function name hasn’t changed, but the parameters have, so it’s now possible that calling the function will crash your database.

All this to say: linking and SQL catalogue hacks should be used temporarily only until you can properly upgrade your database using a hard upgrade.

BC IT Outsourcing 2015/16

It’s BC Public Accounts time again (calm down!), and in this year’s analysis of IT outsourcing we have a surprise result: total spending on IT contractors by the BC central government actually fell in the 2015/16 budget year.

The fall-off is almost entirely the result of a collapse of billings to the central government by IBM, though Deloitte had a small fall-off also.

Other IT contractors continued to bill heartily, including massive category leader HP Advanced Systems (HPAS) who billed a very respectable $163,499,787. For comparison, that’s more than was spent on the Ministries of Energy & Mines, Agriculture, Small Business, and International Trade, combined.

Since 2010, IBM’s billings to central government have fallen in four of the last six years, which leads me to a question: what’s the matter with Big Blue?

After further research, the answer is this: nothing at all. Big Blue is doing just fine.

It turns out that the fall off in IBM billing to central government has been more than offset by a massive increase in IBM billing to the Provincial Health Authorities. The Ministry of Health budget is huge, but much of it is spent by the Health Authorities, and the provincial public accounts only record transfers to the Authorities, they don’t keep track of who the Authorities spend with.

Fortunately, the Authorities also have to publish annual financial statements, and I have now input all that data into my summary sheets, through to 2015 (Health Authorities do not publish their detailed payments data until the fall, so I am one year behind until then). The results are, frankly, staggering…

When Health Authority spending is taken into acount, IBM revenue from the BC government has not fallen at all. It has instead been on an almost unbroken tear upwards, taking total government IT outsourcing spending to just under $700 million dollars in 2014/15.

A reasonable chunk of that billing over the last couple years has been on the PHSA/Vancouver Coastal Health electronic medical records project, a $1B+ firehose of cash that IBM has been slurping on heavily, though to little practical effect.

Since IBM was kicked off the project in mid-2015, we can expect their take to fall in the 2015/16 data, but much of the slack should be taken up by medical software vendor Cerner who have been given the prime contractor role on a no-bid, sole-source basis (I’m sure that will work out fine).

While entering the Health Authority data, I had the opportunity to learn a little about the ecosystem of local vendors who support the health sector. Unsurprisingly, there is little overlap between those companies and the ones I already know of who support central government in Victoria: smaller companies tend to be more specialized.

When you lump every local vendor together, central government and health authority, and plot them up, the result is… underwhelming.

While there has been a trending upwards of local IT contracting over time, it is dwarfed 10:1 by the dollars spent on the large international consultancies. HPAS alone takes in over three times what every local IT services firm in the province bills.

The moral of the story, I think, is: if you have money, you will spend money.

  • As part of my research, I reviewed the school district financials and found very minimal spending on IT consulting: not enough to warrant the effort to enter all the data. School districts are just too poor to waste money on stupid IT consulting.
  • On the flip side, I reviewed the ICBC 2015 list of vendors and found, in just one year, spending of:

    • $5,051,887 on Accenture
    • $23,944,563 on IBM
    • $3,177,241 on HP Advanced Solutions
    • $7,962,866 on Deloitte, and
    • $5,962,057 on Quartech Systems.

    On administrative operating costs of about $350M, that’s $46M of IT spend, over 18%. If you’ve got money, you’ll find a way to spend it. However, since ICBC is not a direct arm of government, I didn’t include their crazy spend in my totals.

Next year is hard to predict: IBM should have a fall-off; offset partially by Cerner, as they pause the EMR project before ramping back up again. HPAS should continue incremental growth. A wild card is the Natural Resources Permitting Project, which hasn’t truly hit its spending stride yet. If it gets going, I expect Deloitte will increase their billing substantially in 2016/17.

Until next year, happy billing BC!

Politics Amid Plenty

“Political leadership is a subtle art in times of plenty. When there are no great crises, there is no public demand for heroic acts. Politics becomes a parlor game, ignored by all but the most devoted citizens, a game practiced most assiduously by those interests–the business associations, trade unions, and single-issue groups–who have a direct stake in the outcome. The game turns on tactics and gestures, on the ability to placate factions, rather than to inspire the masses.”
- Joe Kein, The Natural

“Innovative” Government IT

Jennifer Pahlka from Code for America has published a pair of essays on IT leadership that anyone thinking about “organizational transformation” should read. They are very high level and a bit lengthy, but are full of good ideas:

“Innovative” Government IT

In talking about innovation in the second essay, Pahlka has this wonderful section, about the difference between adopting modern development practices and actually innovating.

The problem is that if you want “government technology as good as what we have at home,” you’re going to have to do things like move to the cloud and test prototypes with actual users.

That’s not innovation. That’s just how tech works today.

The practices and tools that result in good digital services vary from organization to organization, to be sure, but there is a lot in common that the private sector, and increasingly the public sector, pretty much agree on as standard.

When we frame these practices as somehow cutting-edge, risky, or non-standard, we do the mission a great disservice.

Sing it! Adopting open source, agile development and cloud technology are not “innovative” any more, they’re just table stakes, the minimum possible ante upon which to build a responsive technology organization.

The other big take home for me is in the first essay, decomposing the functional roles that are traditionally mushed into a single “CIO” position and pointing out how unlikely they are to match the capabilities of any one person:

  1. Digital services: the services residents use to engage and do business with the City. This can also include APIs and open data programs, though this is often the domain of the other CIO (the innovation officer.)
  2. Back office software: Day-to-day core services like email, human resources management, budgeting, fiscal and accounting that all departments rely on.
  3. Mission IT: The business applications that run the internal processes of departments and agencies. These are often custom, but can now make use of underlying commodity technology.
  4. Infrastructure: Network and connectivity, hosting and device management.

I tend to break the IT roles into just two pieces, but I think Pahlka makes a strong case for all four. My two pieces are:

  • Infrastructural. Email, network, desktop, phones, backup, payroll, finance. All skills that are replicated across every organization in existance, where little domain knowledge is required. With appropriate contracting safe-guards, you can outsource all of this stuff.
  • Strategic. Business systems unique to your organization and all the facets thereof. Back-end, front-end, user experience, API, etc. The IT tools are common, but the domain knowledge of the data and business processes are unique to your organization. You should own both the technology and the people creating it, for maximum flexibility.

Either way, the idea that the folks who are best at handling one category are also good at handing the other is dangerous.

You no more want your ultra-cautious infra manager (“let’s map out a 4 month plan for that…”) running development than you want a cowboy lead developer making decisions (“deploy!”) that might affect network uptime.

Anyways, go read! Time well spent.