PostGIS Apologia

Nathaniel Kelso has provided feedback from an (occasionally disgruntled) user’s point of view about ways to make PostGIS friendlier. I encourage you to read the full post, since it includes explanatory material that I’m going to trim away here, in order to explain the whys and wherefores of how we got to where we are.

TL;DR: philosophical reasons for doing things; historical reasons for doing things; not my problem; just never got around to that.

Request 1a: Core FOSS4G projects should be stable and registered with official, maintained APT Ubuntu package list.

Request 1b: The APT package distribution of core FOSS4G projects should work with the last 2 versions (equivalent to 2 years) of Ubuntu LTS support releases, not just the most recent cutting edge dot release.

Spoken like an Ubuntu user! I would put the list of “platforms that have enough users to require packaging support” at: Windows, OSX, CentOS (RHEL), Fedora, Ubuntu, Debian, SUSE. Multiply by 2 for 32/64-bit support, and add a few variants for things like the multiple OSX package managers (MacPorts, Homebrew, etc.). Reality: the PostGIS team doesn’t have the bandwidth to do this, so people who want support for their favourite platform have to do it themselves.

The only exception to this rule is Windows, which Regina Obe supports, but that’s because she’s actually a dual category person: a PostGIS developer who also really wants her platform supported.

The best Linux support is for Red Hat variants, provided by Devrim Gunduz in the PostgreSQL Yum repositories. I think Devrim’s example is actually the best one, since it takes a PostgreSQL packager to do a really bang up job of packaging a PostgreSQL add-on like PostGIS. Unfortunately the Ubuntu PostgreSQL packager doesn’t do PostGIS as well (yet?).

Request 1c: Backport key bug fixes to the prior release series

This is actually done as a matter of course. If you know of a fix that has not been backported, file a ticket for it. In general, if you put your tickets against the earliest milestone they apply to, the odds of a fix hitting all extant versions go up, since the developer doesn’t have to go back and confirm the bug is historical rather than new to the development version. The only fixes that might not get done are ones that can’t be made without major code restructuring, since that kind of thing tends to introduce as many problems as it solves.

Request 2.1a: Include a default PostGIS spatial database as part of the basic install, called “default_postgis_db” or something similar.

This is a packaging issue, and some packagers (Windows in particular, but also the OpenGeo Suite) include a template_postgis database, since it makes it easier to create new spatial databases (create database foo template template_postgis).
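For concreteness, this is roughly what the template approach looks like in SQL; the sketch assumes a packager (or you) has already built a template_postgis database with the PostGIS types, functions and spatial_ref_sys table loaded into it:

    -- Every new spatial database is just a copy of the pre-built template:
    CREATE DATABASE foo TEMPLATE template_postgis;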

Anyways, since this is a packaging issue, unless the PostGIS team took on all the packaging itself there would be no way to ensure it happened uniformly everywhere, and uniformity is what it would take to actually make things easier (for “oh, just use the __ database” to become general knowledge and global advice).

More on creating spatial databases below.

Request 2.1b: Include a default PostGIS Postgres user as part of the basic install, called “postgis_user” or something similar.

I’m not sure I see the utility of this. From a data management point of view, you already have the PostgreSQL super user, postgres, around as a guaranteed-to-exist default user.

Request 2.1c: If I name a spatially enabled database in shp2pgsql that doesn’t yet exist, make one for me

Unless you have superuser credentials I can’t do this. So, maybe?

Request 2.1d: It’s too hard to manually set up a spatial database, with around a printed page of instructions that vary with install. It mystifies Postgres pros as well as novices.

Indeed it is! I will hide behind the usual defence, of course: “it’s not our fault!” It’s just the way PostgreSQL deals with extensions, including its own (load pgcrypto or fuzzystrmatch, for example). The best hack we have is the packaging hack that pre-creates a template_postgis database, which works pretty well.

Fortunately, as of PostgreSQL 9.1+ and PostGIS 2.0+ we have the “CREATE EXTENSION” feature, so from here on in, spatializing (and unspatializing, and upgrading) a spatial database will be blissfully easy: just CREATE EXTENSION postgis (and DROP EXTENSION postgis, and ALTER EXTENSION postgis UPDATE TO '2.1.0').
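Spelled out as SQL, the whole lifecycle looks something like this (a sketch against PostgreSQL 9.1 / PostGIS 2.0; the target version number is just an example):

    -- Spatialize an ordinary database
    CREATE EXTENSION postgis;

    -- Upgrade the types and functions in place (version string illustrative)
    ALTER EXTENSION postgis UPDATE TO '2.1.0';

    -- Unspatialize the database again
    DROP EXTENSION postgis;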

Request 2.1e: Default destination table names in shp2pgsql.

We have this; I just checked (1.5 and 2.0). The usage line indicates it, and it actually happens. I’m pretty sure it’s worked this way for a long time too; it’s not a new thing.

Request 2.1f: Automatically pipe the output to actually put the raw SQL results into PostGIS.

I’ll plead historical legacy on this one. The first version (c. 2001) of the loader was just a loader, no dumper, so adding in a database connection would have been needless complexity: just pipe it to psql, right?

Then we got a dumper, so now we had database connection logic lying around, but the loader had existing semantics and users. Also the code was crufty and it would have had to be re-written to get a direct load.

Then we got a GUI (1.5), and that required re-writing the core of the loader to actually do a direct database load. But we wanted to keep the commandline version working the same as before, so our existing user base wouldn’t get a surprise change. So at this point adding a direct database load to the commandline tool is actually trivial, but we have deliberately not done it, to avoid tossing a change at our 10 years of legacy users.

So this is very doable, the question is whether we want to make a change like this to a utility that has been unaltered for years.

Incidentally, from an easy-to-use-for-newbies point of view the GUI is obviously way better than the command line. Why not use that? It’s what I use in all my PostGIS courses now.

Request 2.1g: If my shapefile has a PRJ associated with it (as most do), auto populate the -s option.

You have no idea how long I’ve wanted to do this. A very long time. It is, however, very hard to do. PRJ files don’t come (except the ones generated by GeoTools) with EPSG numbers in them. You have to figure out the numbers by (loosely) comparing the values in the file to the values in the full EPSG database. (That’s what the http://prj2epsg.org web site does.)

Now that we’ve added GDAL as a dependency in 2.0 we do at least have access to an existing PRJ WKT parser. However, I don’t think the OGR API provides enough hooks to do something like load up all the WKT definitions in spatial_ref_sys (which is what we’ll have to do regardless) and search through them with sufficient looseness.
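To give a flavour of the problem, here is a deliberately naive sketch (the WKT literal is just a placeholder): an exact-match lookup against spatial_ref_sys almost never succeeds, because real PRJ files rarely match the stored definitions character for character, and that looseness is exactly the hard part.

    -- Naive sketch: look for an SRID whose stored WKT exactly matches the
    -- contents of the .prj file. A real solution needs a loose,
    -- parameter-by-parameter comparison instead of string equality.
    SELECT srid
    FROM spatial_ref_sys
    WHERE srtext = 'GEOGCS["WGS 84", ...contents of the PRJ file...]';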

So this remains an area of active research. Sadly, it’s probably not something that anyone will ever fund, which means that, given the level of effort necessary to make it happen, it probably won’t happen.

Related 2.1h Projection on the fly: If you still can’t reproject data on the fly, something is wrong. If table X is in projection 1 (eg web merc) and table Y is in projection 2 (eg geographic), PostGIS ought to “just work”, without me resorting to a bunch of ST_Transform commands that include those flags. The SRID bits in those functions should be optional, not required.

Theoretically possible, but it has some potentially awful consequences for performance. You can only do index-assisted things with objects that share an SRS (SRID), since the indexes are built in one SRS space. So picking a side of an argument and pushing it into the same SRS as the other argument could cause you to miss out on an index opportunity. It’s perhaps worth thinking more about, though, since people with heterogeneous SRID situations will be stuck in low-performing situations whether we auto-transform or not.
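To make the index issue concrete, here is a hedged sketch using hypothetical roads (SRID 3857) and parcels (SRID 4326) tables; an automatic transform would have to do the same thing under the covers, just silently:

    -- One side of the predicate has to be pushed into the other side's SRS.
    SELECT r.gid, p.gid
    FROM roads r
    JOIN parcels p
      ON ST_Intersects(ST_Transform(r.geom, 4326), p.geom);
    -- The transformed side is now a computed value, so the spatial index on
    -- roads.geom (built in SRID 3857 space) can't be used for this join;
    -- only the index on parcels.geom can help.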

The downside of all such “automagic” is that it leads people into non-optimal set-ups very naturally (and completely silently) so they end up wondering why PostGIS sucks for performance when actually it is their data setup that sucks.

Request 2.1i: Reasonable defaults in shp2pgsql import flags.

Agree 100%. Again, we’re just not changing historical defaults fast enough. The GUI has better defaults, but it wouldn’t hurt for the commandline to have them too.

Request 2.1j: Easier creation of point features from csv or dbf.

A rat-hole of unknowable depth (CSV handling, etc.), but agreed: it would be a really common and useful utility. I just write a new Perl script every time :)
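For reference, here is a hedged sketch of the manual route today (file path, table and column names, and SRID are all hypothetical); it is exactly the kind of boilerplate such a utility would hide:

    -- Stage the CSV in a plain table, then build point geometries from it.
    -- (COPY from a server-side file needs superuser; psql's \copy is the
    -- client-side alternative.)
    CREATE TABLE places_staging (name text, lon float8, lat float8);
    COPY places_staging FROM '/tmp/places.csv' WITH (FORMAT csv, HEADER true);

    CREATE TABLE places AS
    SELECT name, ST_SetSRID(ST_MakePoint(lon, lat), 4326) AS geom
    FROM places_staging;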

Request 2.3a: Forward compatible pgdumps. Dumps from older PostGIS & Postgres combinations should always import into newer combinations of PostGIS and Postgres.

Upgrade has been ugly for a long time, and again it’s “not our fault”, in that until PostgreSQL 9.1, pg_dump always included our functions in the dump files. If you strip out the PostGIS function signature stuff (which is what the utils/postgis_restore.pl script does), it’s easy to get a clean and quiet restore into new versions, since we happily read old dumped PostGIS data and always have.

If you don’t mind a noisy restore, it’s also always been possible to just drop a dump onto a new database, ignore the errors as function signatures collide, and still get a good restore.

With “CREATE EXTENSION” in PostgreSQL 9.1, we will now finally be able to pg_dump clean dumps that don’t include the function information, so this story more or less goes away.

Request 2.3b: Offer an option to skip PostGIS simple feature topology checks when importing a pgdump.

It’s important to note that there are two levels of validity checking in PostGIS. One level is “dumbass validity checking”, which can happen at parse time. Do rings close? Do linestrings have more than one point? That kind of thing. The other level is full OGC simple-features validity (rings that don’t self-intersect or cross each other, and so on), which is only checked when you ask for it (with ST_IsValid), not enforced on input. For a brief period in PostGIS history we had some ugly situations where it was possible to create or ingest dumbass geometry through one code path but impossible to output it or ingest it through others. That was bad and wrong, and it’s hopefully mostly gone. We should now mostly ingest and output dumbass things, because those things do happen. We hope you’ll clean or remove them at a later time, though.
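As an illustration (table and column names hypothetical), this is how you would go hunting for the second kind of problem after a load; the first, parse-time kind never makes it into the table at all:

    -- Find rows that parsed fine but fail OGC simple-features validity,
    -- and report why each one is invalid.
    SELECT gid, ST_IsValidReason(geom)
    FROM parcels
    WHERE NOT ST_IsValid(geom);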

Be thankful we aren’t ArcSDE, which not only doesn’t accept dumbass things, it doesn’t accept anything that fails any rule of their whole validity model.

Request 3a: Topology should only be enforced as an optional add on, even for simple Polygon geoms. OGC’s view of polygon topology for simple polygons is wrong (or at the very least too robust).

Request 3b: Teach PostGIS the same winding rule that allows graphics software to fill complex polygons regarding self-intersections. Use that for simple point in polygon tests, etc. Only force me to clean the geometry for complicated map algebra.

Request 3c: Teach OGC a new trick about “less” simple features.

Request 3d: Beyond the simple polygon gripe, I’d love it if GEOS / PostGIS could become a little more sophisticated. Adobe Illustrator for several versions now allows users to build shapes using their ShapeBuilder tool where there are loops, gaps, overshoots, and other geometry burrs. It just works. Wouldn’t that be amazing? And it would be even better than ArcGIS.

We don’t enforce validity, we just don’t work very well if it’s not present.
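A tiny illustration of what “not enforcing” means in practice: a self-intersecting “bowtie” polygon stores and round-trips without complaint, but functions that assume validity can quietly give odd answers (the literal below is just a made-up example):

    -- Invalid, but perfectly storable:
    SELECT ST_IsValid('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))'::geometry);
    -- returns false (with a notice about the self-intersection)

    -- Area comes back as 0 rather than the 0.5 you might expect, because
    -- the signed areas of the two lobes cancel out:
    SELECT ST_Area('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))'::geometry);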

Most of these complaints stem presumably from working with Natural Earth data which, since it exists, is definitionally “real world” data, but also includes some of the most unbelievably degenerate geometry I have ever experienced.

Rather than build special cases to handle degeneracy into every geometry function in the system, the right approach, IMO, is to build two functions that convert degenerate data into structured data that captures the “intent” of the original.

One function, ST_MakeValid, has already made an appearance in PostGIS 2.0. It isn’t 100% perfect, but it makes a good attempt and fixes many common invalidities that previously we had no answer for beyond “you’re SOL”. ST_MakeValid tries to fix invalidity without changing the input vertices at all.

The second function, ST_MakeClean, does not exist yet. ST_MakeClean would do everything ST_MakeValid does, but would also include a tolerance factor to use in tossing out unimportant structures (little spikes, tiny loops, minor ring crossings) that aren’t part of the “intent” of the feature.
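A minimal repair sketch with what exists today (table and column names hypothetical); since ST_MakeClean is still aspirational, there is no tolerance to pass anywhere yet:

    -- Repair only the invalid rows, leaving valid geometries untouched.
    -- Note: ST_MakeValid can change the geometry type (a polygon may come
    -- back as a multipolygon or a collection), so a type-constrained
    -- geometry column may need extra handling.
    UPDATE parcels
    SET geom = ST_MakeValid(geom)
    WHERE NOT ST_IsValid(geom);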

Summary

I wish we had better packaging or the ability to do all the packaging ourselves so we could create a 100% consistent user experience across every platform, but that’s not possible. Please beg your favorite PostgreSQL packager to package PostGIS too.

The upgrade and install stories are going to get better with EXTENSIONS. So, just hang in there and things will improve.

The geometry validity story will get better with cleaning functions, and with any extra dollars folks can invest in continuing to improve GEOS and fix obscure issues in the overlay code. The “ultimate fix”, if anyone wants to fund it, is to complete the “snap rounding” code in JTS and port it to GEOS, to support a fixed-precision overlay system. That should remove all overlay failures (which actually show up in intersections, unions and buffers, really in all constructive geometry operations) once and for all.

IT Bubble Economics

As an owner of BC real estate, I find one of the more frightening economic stories I’ve heard over the last year to be the one about how our own real estate market is in fact tied tightly to others through the mechanism of offshore investment. The story goes like this:

  • The Chinese real estate market is bubbling. That is, it is appreciating because people now expect it to appreciate.
  • Chinese folks in the market are thus experiencing the heady feelings Americans circa 2006 were feeling. They can purchase a property and a year later through the magic of appreciation be in possession of surplus capital to invest in yet another property.
  • Canny Chinese investors invest some of their surplus capital overseas, which creates artificial demand propping up and sometimes actually inflating those overseas markets, causing smaller dependent bubbles in those markets. Think: Vancouver.
  • When the Chinese bubble bursts, expect the dependent bubbles to also burst, to the surprise of the inhabitants, who attributed their good fortune to the “fundamentals” of their markets.

(Since I live in Victoria, the “fundamentals” of our market fortunately relate a lot to wealthy Canadians getting old and wanting to retire here, and happily for me there is still a lot of that happening.)

Anyhow, IT?

The Victoria IT marketplace is surprisingly tight. We’re a government town, in the midst of a moderate fiscal belt-tightening, and yet: it’s hard to find talent. The answer? Not Victoria’s fundamentals, but a reflection of the fact that other regional IT bubbles are now, thanks to the magic of the internets, highly portable.

Check out the niche market this local consultancy is serving.

Priologic is a software development firm based in Victoria, BC, Canada focused on serving San Francisco area Fortune 100 enterprise companies.

That’s right, the IT market in San Francisco has become so tight that there is now a niche for a company near-shoring talent to the uncool kids in the Fortune 100.

We’re also seeing satellite offices for Microsoft gaming and Zynga coming to town, and Amazon has had a big foothold here for the past several years since their acquisition of AbeBooks.

We’re all swimming in the same pool, now! (Nobody pee.)

Going "All In" on IT Projects

I was thinking about IT project failure this weekend (OK, actually I was thinking about the ICM project, but you can see how I got there) and did a little Googling around on failure rates.

The estimated range of failure rates is surprisingly large, from a high of 68% down to a lower bound around 30%. A lot of the variability is definitional (what is a “failure”, after all?) and some of it seems to be methodological (who are you asking, and what are you asking them about?). However, the recent studies seem to cluster more around the lower end of the range.

The web references are all pretty confident that poor requirements gathering is the root of most of the problems, and I can see some validity in that, but the variable I keep coming back to in my mental models is project size. The Sauer/Gemino/Reich paper takes on project size directly, and finds that the odds of failure (or non-success) go up steeply as project size increases.

Sadly, they don’t find that there is an optimal small-team size that brings down failure rates to what might be considered an acceptable level: even small-team projects fail at a rate of at best 25%.

So this brings me back to my weekend musings: suppose you have a capital budget of, say, $180M. Even assuming all projects of all sizes fail at the same rates (and, as we’ve seen, they don’t), if you bet all your money on a single project there’s at least a 25% chance that you’ll lose 100% of it in a failure.

On the other hand, if you split your money into multiple smaller independent projects (and smaller teams) the impact of any one failure goes way down. Rather than losing your whole capital budget (and, we hope, your job) in a massive IT meltdown, you lose a few of your sub-projects, about 25% of your total budget.

Fundamentally, organizing a project to deal with the 100% certainty of a 25% failure rate is a more defensible approach than gambling with the 25% chance of a 100% failure rate.
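A rough expected-value sketch (assuming, unrealistically, that projects fail independently, all at a 25% rate, and that failure means losing everything spent on the project) makes the asymmetry concrete:

    P(\text{lose everything, one \$180M project}) = 0.25, \qquad E[\text{loss}] = 0.25 \times \$180\text{M} = \$45\text{M}

    P(\text{lose everything, twelve \$15M projects}) = 0.25^{12} \approx 6 \times 10^{-8}, \qquad E[\text{loss}] = 12 \times 0.25 \times \$15\text{M} = \$45\text{M}

The expected loss is the same either way; what changes is the shape of the risk: a portfolio of small projects gives you a near-certain but survivable hit of roughly a quarter of the budget instead of a one-in-four chance of losing all of it.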

Don’t just go “all in”, don’t bet it all on black, know that the unknowns inherent in software development make it a probabilistic process, not a deterministic one, and plan appropriately.

End note: Expecting failure, and planning for it, makes more sense than crossing your fingers and hoping you’ll dodge the bullet. And it surfaces some attractive approaches to achieving success: small independent teams can be evaluated and measured against one another, allowing you to find and better utilize your top performers; because you now expect some teams to fail, you can insulate yourself against some failures by running duplicate teams on critical sub-projects; because small failures are non-fatal, they can be used as learning tools for succeeding rounds. The similarity to agile development is probably no coincidence.

BC IT Outsourcing Update

This week the BC 2011/2012 Public Accounts (finally) came out, so that means it’s time to update our tracking of IT outsourcing spending with another year’s data points!

The headline number for 2012: $321,773,530, a new record! After a brief dip last fiscal year, IT outsourcing continued growing robustly in 2011/2012.

Of course, there are always winners and losers, and the surprise loser this year is perennial outsourcing champ IBM. With only $61M in billings in 2011/2012, IBM notched its lowest take since 2005! Fortunately, HP Advanced Solutions was there to pick up the slack, hauling in $109M this fiscal year for not only a personal best, but an all-time record in single-firm IT billings (besting the old record of $107M, set by IBM in 2010).

Also notable this fiscal year are the rise of Deloitte (to $37M) and local body-shoppers TP Systems ($9M) and Quartech ($13M) on the strength of the ongoing ICM debacle (in related news, the BC Children’s Advocate is the latest to take a dump on ICM).

Thanks to all our contestants, I know you’ll all be back next year!

Obscure end material: One surprise entry in the larger IT spends is “Oracle Microsystems” at $24M. What’s that? Looks like the remains of “Sun Microsystems”, since their entry is gone. It also leads me to wonder, why are we still dropping tens of millions of bucks on Sun hardware at this late date in the enterprise computing era? Guys, it’s called the “cloud”, look it up.

@OSCon 2012

I’m at OSCON this week, taking in the open source gestalt in its most grandiose form: a gathering of over 3000 technophiles in the “open source city”, Portland, Oregon.

The keynotes this morning were a lovely mix of core open source concerns and philosophy and related-yet-different topics.

The core of open source is sharing and cooperation, and David Eaves addressed that in his talk, advocating a new focus on community building. Not from a qualitative “boy, it would be good to pay attention to community” point of view, but from a quantitative one: “hey, we can and should measure engagement and seek to improve it, in our tooling and our processes”. If we do so, we’ll learn some obvious things (being blunt is good for saving keystrokes but bad for incubating new contributors) and maybe we’ll also learn some surprising things, like who our best bug reporters are.

Danny Hillis gave an overview of one of his current projects: creating a “learning map” that allows agents to help guide learning. He used a wonderful analogy to his elementary school librarian, who guided his early learning by not only finding him information on the topics he said he was interested in, but also bringing him information she thought he might be interested in and was ready to comprehend.

Kay Thaney talked about changing the process of research science, to modernize and align the incentives of researchers for faster progress. I feel like she talked over the subject a bit, with not enough concrete examples of how modern science is being held back by traditional promotion and funding structures, but I agree with her overall premise. I recently read a piece on the travails of a computer science PhD candidate which distilled a bunch of the problems with modern academia: the power of the principal investigator and his/her grants; the requirement for publication first, foremost and always; and the power of academic social networks in controlling access to funding and career opportunities.

One of my database idols, Brian Aker, talked about his role at (!!) HP, overseeing their public cloud offering based on OpenStack. Learning about OpenStack and the incredible development velocity around the project has been one of the huge eye-openers for me here at OSCon.

And finally Tim O’Reilly brought us back to the open source philosophy with a talk about how we value open source contribution to society (tl;dr: we don’t). Like all the energy we save through conservation or all the fresh air trees and plants produce, the value of freely available open source is unaccounted for in the formal economy.

My previous experience of OSCon was one of my favourite conference experiences, and it informed how I tried to organize FOSS4G 2007 the following year. This event is the same: quality speakers, good opportunities to network and meet as many people as you could wish, and a range of surprising new things that send you home going “wow”. I’m looking forward to the rest of the day and tomorrow!