ICM and ExaData

I went to an Oracle Users Group meeting yesterday afternoon, to see a presentation by Marcin Zaranski on the Integrated Case Management system’s use of Oracle ExaData hardware.

Disclosure: I went expecting to be shocked, shocked at yet another criminal waste of money on the part of ICM.

ExaData is Oracle’s fairly successful attempt to turn the hardware engineering skills they acquired with Sun Microsystems into something with unique marketability. I’d say they’ve succeeded, which is good news for the excellent engineers at Sun. By combining the kinds of hardware database optimizations pioneered at Netezza and Teradata with Sun’s overall server engineering prowess and their industry-leading sales and marketing team, Oracle has a winner. It might not be the best, but they are going to sell more of them than anyone else.

ExaData (I’m going to speak in the singular, about the ExaData Database Machine even though the “Exa” prefix has now been splashed across a wide range of Oracle “engineered systems”) is basically an enterprise appliance. A database in a box, where the “box” is a server rack. It ships with the database pre-installed and configured for the underlying hardware. The underlying hardware not only includes the kind of monitoring, reliability and redundancy that Sun fanboys like myself have come to expect, but also includes custom storage modules that can push portions of SQL queries down to just above the disk heads, dramatically improving query performance, particularly for OLAP workloads.

It’s really clever technology, but the true cleverness is the sales pitch, which Zaranski touched on and Oracle rep Dev Dhindsa hammered us over the head with during his talk: because ExaData is basically an appliance, all the coordination costs of getting systems and network administrators to interface with database and application administrators goes away. It’s a technology fix for the organizational problem of IT silos, and the pitch to the database and apps departments is simple: “buy this product so you don’t have to talk to those f***ers in system admin and networking anymore.”

And any pitch to IT that involves talking to people less is a stone cold winner.

So, back to ICM. The justification for ICM buying ExaData was to alleviate performance problems experienced in their phase 1 rollout to about 1800 users before they rolled out to 8000 users in phase 2. The result: success! They didn’t have any complaints in phase 2… about performance.

After his presentation, I asked Zaranski how much ExaData cost the ICM project, and he would not provide a number, presumably thanks to the magic of “secret contracts” with Oracle (pricing is a “trade secret” and thus shielded from FOI disclosure, one of the many counterproductive consequences of the “third party confidential” exception in the BC FOI law).

However, later the Oracle hardware rep was nice enough to tell me the list price for ExaData: $250K for a “quarter rack”. ICM purchased two of those (one for production, one for fail-over), presumably for less than list. It’s a lot of money for servers, but within the context of a $200M project I find it hard to get worked up. It makes their system run faster, which makes it less awful for the users. And the cost of the ExaData hardware will look small next to the cost of the Oracle and Siebel software licenses that are going to run on it.

Way before the project was forced to buy top-end hardware to coax reasonable performance out of their application, the clusterf*** that is ICM was already baked in: by the decision to simultaneously integrate so many systems; by the decision to use the “COTS” Siebel solution; and by the decision to outsource to expensive international consultancies.

So, enjoy your cool hardware, ICM; it’s pretty boss.

Best moment: In his presentation, Zaranski repeated the ICM mantra: that one of the big wins is replacing the 30-year-old “legacy” systems previously doing the social services record keeping. “Legacy” is a favourite put-down of all IT presenters. “Legacy” software is crufty old stuff, in the process of being phased out, unlike the cool software you work with. So I got a perverse kick when Dev Dhindsa, in praising Oracle’s new “cloud enabled” Fusion Middleware, contrasted it favourably with their suite of “legacy” applications, such as PeopleSoft, JD Edwards, and … Siebel, the software ICM is using to replace the “legacy” social services systems.

Spatial IT vs GIS

So, Stephen Mather has taken a crack at analyzing my Spatial IT meme.

And why then the artificial distinction between GIS and Planning? If GIS is Planning technically embodied, should they not be conflated? Two reasons why not. One: The efficacy of GIS can be hindered by slavishly tying it to Planning in large part because there is wider and deeper applicability to GIS than to Planning’s typical functions. Lemma: Paul is partially right.

Stephen’s found a weak seam in my argument, and it’s around the planning aspects of GIS. There’s a place where GIS provides the interface between raw data and planning decisions, which remains:

  • high touch and interpersonal;
  • qualitative and presentational;
  • ad hoc and unpredictable.

This is the GIS that is taught in schools, because it’s the “interesting” GIS, the place where decision meets data.

However, as we know, GIS courses are just the bait in the trap, to suck naïve students into a career where 90% of the activity is actually in data creation (digitization monkey!) and publication (map monkey!), not in analysis. The trap that “GIS” has fallen into is to assume that these low-skill, repetitive tasks are (a) worth defending and (b) should be done with specific “GIS technology”. They aren’t, and they shouldn’t; they should (and, pace Brian Timoney, will) be folded into generic IT workflows, automated, and systematized.

That will leave the old core of “real GIS” behind, and that’s probably a good thing, because training people for analysis and then turning them into map monkeys and digitization monkeys (and image color-balancing monkeys, and change detection monkeys) is a cruel bait-and-switch.

Open Source for IT Managers

I’ve given this talk a number of times over the past 18 months (Minnesota, Illinois, Wisconsin, DC), but this was the first time there was a decent recording set-up at the venue. Got a spare 48 minutes?

PostGIS Apologia

Nathaniel Kelso has provided feedback from an (occasionally disgruntled) user’s point of view about ways to make PostGIS friendlier. I encourage you to read the full post, since it includes explanatory material that I’m going to trim away here as I explain the whys and wherefores of how we got to where we are.

TL;DR: philosophical reasons for doing things; historical reasons for doing things; not my problem; just never got around to that.

Request 1a: Core FOSS4G projects should be stable and registered with official, maintained APT Ubuntu package list.

Request 1b: The APT package distribution of core FOSS4G projects should work with the last 2 versions (equivalent to 2 years) of Ubuntu LTS support releases, not just the most recent cutting edge dot release.

Spoken like an Ubuntu user! I would put the list of “platforms that have enough users to require packaging support” at: Windows, OSX, CentOS (RHEL), Fedora, Ubuntu, Debian, SUSE. Multiply by 2 for 32/64-bit support, and add a few variants for things like multiple OSX package platforms (MacPorts, Homebrew, etc). Reality: the PostGIS team doesn’t have the bandwidth to do this; people who want support for their favourite platform have to do it themselves.

The only exception to this rule is Windows, which Regina Obe supports, but that’s because she’s actually a dual category person: a PostGIS developer who also really wants her platform supported.

The best Linux support is for Red Hat variants, provided by Devrim Gunduz in the PostgreSQL Yum repositories. I think Devrim’s example is actually the best one, since it takes a PostgreSQL packager to do a really bang up job of packaging a PostgreSQL add-on like PostGIS. Unfortunately the Ubuntu PostgreSQL packager doesn’t do PostGIS as well (yet?).

Request 1c: Backport key bug fixes to the prior release series

This is actually done as a matter of course. If you know of a fix that is not backported, ticket it. In general, if you put your tickets against the earliest milestone they apply to, the odds of a fix hitting all extant versions go up, since the developer doesn’t have to go back and confirm the bug is historical rather than new to the development version. The only fixes that might not get done are ones that can’t be done without major code restructuring, since that kind of thing tends to introduce as many problems as it solves.

Request 2.1a: Include a default PostGIS spatial database as part of the basic install, called “default_postgis_db” or something similar.

This is a packaging issue, and some packagers (Windows in particular, but also the OpenGeo Suite) include a template_postgis database, since it makes it easier to create new spatial databases (create database foo template template_postgis).
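For anyone who hasn’t seen the template trick, here is a minimal sketch of what those packages set up, assuming a pre-extension (PostGIS 1.x style) install; the script paths are illustrative only and vary by platform and version:

    -- Build a template database and spatially enable it by loading the
    -- SQL scripts that ship with PostGIS (paths below are examples only).
    CREATE DATABASE template_postgis;
    \c template_postgis
    \i /usr/share/postgresql/contrib/postgis-1.5/postgis.sql
    \i /usr/share/postgresql/contrib/postgis-1.5/spatial_ref_sys.sql

    -- From then on, a new spatial database is one statement away.
    CREATE DATABASE foo TEMPLATE template_postgis;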

Anyway, since this is a packaging issue, unless the PostGIS team took on all packaging there would be no way to ensure it happened uniformly everywhere, which is what would be needed for it to actually make things easier (for it to become general knowledge, so that “oh, just use the __ database” became global advice).

More on creating spatial databases below.

Request 2.1b: Include a default PostGIS Postgres user as part of the basic install, called “postgis_user” or something similar.

I’m not sure I see the utility of this. From a data management point of view, you already have the PostgreSQL super user, postgres, around as a guaranteed-to-exist default user.

Request 2.1c: If I name a spatially enabled database in shp2pgsql that doesn’t yet exist, make one for me

Unless you have superuser credentials I can’t do this. So, maybe?

Request 2.1d: It’s too hard to manually set up a spatial database, with around a printed page of instructions that vary with install. It mystifies Postgres pros as well as novices.

Indeed it is! I will hide behind the usual defence, of course: “it’s not our fault!” It’s just the way PostgreSQL deals with extensions, including its own (load pgcrypto, for example, or fuzzystrmatch). The best hack we have is the packaging hack that pre-creates a template_postgis, which works pretty well.

Fortunately, as of PostgreSQL 9.1+ and PostGIS 2.0+ we have the “CREATE EXTENSION” feature, so from here on in spatializing (and unspatializing (and upgrading)) a spatial database will be blissfully easy, just CREATE EXTENSION postgis (and DROP EXTENSION postgis (and ALTER EXTENSION postgis UPDATE TO 2.1.0)).
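Concretely, on a new enough stack the whole lifecycle looks like this (the upgrade target version below is just an example):

    -- Spatially enable the current database
    CREATE EXTENSION postgis;

    -- Upgrade the PostGIS objects in place (example target version)
    ALTER EXTENSION postgis UPDATE TO '2.1.0';

    -- Remove the PostGIS objects again
    DROP EXTENSION postgis;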

Request 2.1e: Default destination table names in shp2pgsql.

We have this, I just checked (1.5 and 2.0). The usage line indicates it, and it actually happens. I’m pretty sure it’s worked this way for a long time too; it’s not a new thing.

Request 2.1f: Automatically pipe the output to actually put the raw SQL results into PostGIS.

I’ll plead historical legacy on this one. The first version (c. 2001) of the loader was just a loader, no dumper, so adding in a database connection would have been needless complexity: just pipe it to psql, right?

Then we got a dumper, so now we had database connection logic lying around, but the loader had existing semantics and users. Also the code was crufty and it would have had to be re-written to get a direct load.

Then we got a GUI (1.5), and that required re-writing the core of the loader to actually do a direct database load. But we wanted to keep the commandline version working the same as before so our existing user base wouldn’t get a surprise change. So at this point doing a direct database loader is actually trivial, but we deliberately did not, to avoid tossing a change at our 10 years of legacy users.

So this is very doable, the question is whether we want to make a change like this to a utility that has been unaltered for years.

Incidentally, from an easy-to-use-for-newbies point of view the GUI is obviously way better than the command line. Why not use that? It’s what I use in all my PostGIS courses now.

Request 2.1g: If my shapefile has a PRJ associated with it (as most do), auto populate the -s option.

You have no idea how long I’ve wanted to do this. A very long time. It is, however, very hard to do. PRJ files don’t come (except the ones generated by GeoTools) with EPSG numbers in them. You have to figure out the numbers by (loosely) comparing the values in the file to the values in the full EPSG database. (That’s what the http://prj2epsg.org web site does.)

Now that we’ve added GDAL as a dependency in 2.0 we do at least have access to an existing PRJ WKT parser. However, I don’t think the OGR API provides enough hooks to do something like load up all the WKT definitions in spatial_ref_sys (which is what we’ll have to do regardless) and search through them with sufficient looseness.
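To give a flavour of the “loose comparison” step, here is a crude, hypothetical sketch in SQL that matches a couple of keywords pulled out of a PRJ against the catalogue in spatial_ref_sys; a real implementation would have to tokenize and score the WKT far more carefully:

    -- Hypothetical loose match: find candidate SRIDs whose WKT definitions
    -- contain keywords extracted from the incoming PRJ file.
    SELECT srid, srtext
    FROM spatial_ref_sys
    WHERE srtext ILIKE '%NAD83%'
      AND srtext ILIKE '%UTM zone 10%';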

So this remains an area of active research. Sadly, it’s probably not something that anyone will ever fund, which means that, given the level of effort necessary to make it happen, it probably won’t happen.

Related 2.1h Projection on the fly: If you still can’t reproject data on the fly, something is wrong. If table X is in projection 1 (eg web merc) and table Y is in projection 2 (eg geographic), PostGIS ought to “just work”, without me resorting to a bunch of ST_Transform commands that include those flags. The SRID bits in those functions should be optional, not required.

Theoretically possible, but it has some potentially awful consequences for performance. You can only do index-assisted things with objects that share an SRS (SRID), since the indexes are built in one SRS space. So picking a side of an argument and pushing it into the same SRS as the other argument could cause you to miss out on an index opportunity. It’s perhaps worth thinking more about, though, since people with heterogeneous SRID situations will be stuck in low-performing situations whether we auto-transform or not.

The downside of all such “automagic” is that it leads people into non-optimal set-ups very naturally (and completely silently) so they end up wondering why PostGIS sucks for performance when actually it is their data setup that sucks.
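For reference, the explicit approach people use today looks roughly like this (the table and column names are made up); note how the transformed side of the comparison can no longer use an ordinary spatial index, which is exactly the trap an automatic version would hide:

    -- Compare a web mercator table against a geographic table by explicitly
    -- transforming one side into the other's SRS (3857 here). The transformed
    -- side (g.geom) cannot use its plain spatial index for this comparison.
    SELECT w.*
    FROM webmerc_roads w
    JOIN geographic_parks g
      ON ST_Intersects(w.geom, ST_Transform(g.geom, 3857));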

Request 2.1i: Reasonable defaults in shp2pgsql import flags.

Agree 100%. Again, we’re just not changing historical defaults fast enough. The GUI has better defaults, but it wouldn’t hurt for the commandline to have them too.

Request 2.1j: Easier creation of point features from csv or dbf.

A rat-hole of unknowable depth (CSV handling, etc.), but agreed, it would be a really common and useful utility. I just write a new Perl script every time :)

Request 2.3a: Forward compatible pgdumps. Dumps from older PostGIS & Postgres combinations should always import into newer combinations of PostGIS and Postgres.

Upgrade has been ugly for a long time, and again it’s “not our fault”, in that until PostgreSQL 9.1, pg_dump always included our functions in the dump files. If you strip out the PostGIS function signature stuff (which is what the utils/postgis_restore.pl script does), it’s easy to get a clean and quiet restore into new versions, since we happily read old dumped PostGIS data and always have.

If you don’t mind a noisy restore it’s also always been possible to just drop a dump onto a new database and ignore the errors as function signatures collide and get a good restore.

With “CREATE EXTENSION” in PostgreSQL 9.1, we will now finally be able to pg_dump clean dumps that don’t include the function information, so this story more or less goes away.

Request 2.3b: Offer an option to skip PostGIS simple feature topology checks when importing a pgdump.

It’s important to note that there are two levels of validity checking in PostGIS. One level is “dumbass validity checking”, which can happen at parse time. Do rings close? Do linestrings have more than one point? That kind of thing. The other level is full OGC simple-features validity (self-intersections and the like), which is only checked when you ask for it, with functions like ST_IsValid. For a brief period in PostGIS history we had some ugly situations where it was possible to create or ingest dumbass geometry through one code path and impossible to output it or ingest it through others. This was bad and wrong. It’s hopefully mostly gone. We should now mostly ingest and output dumbass things, because those things do happen. We hope you’ll clean or remove them at a later time, though.

Be thankful we aren’t ArcSDE, which not only rejects dumbass things, but also rejects anything that fails any rule of its whole validity model.

Request 3a: Topology should only be enforced as an optional add on, even for simple Polygon geoms. OGC’s view of polygon topology for simple polygons is wrong (or at the very least too robust).

Request 3b: Teach PostGIS the same winding rule that allows graphics software to fill complex polygons regarding self-intersections. Use that for simple point in polygon tests, etc. Only force me to clean the geometry for complicated map algebra.

Request 3c: Teach OGC a new trick about “less” simple features.

Request 3d: Beyond the simple polygon gripe, I’d love it if GEOS / PostGIS could become a little more sophisticated. Adobe Illustrator for several versions now allows users to build shapes using their ShapeBuilder tool where there are loops, gaps, overshoots, and other geometry burrs. It just works. Wouldn’t that be amazing? And it would be even better than ArcGIS.

We don’t enforce validity, we just don’t work very well if it’s not present.

Most of these complaints presumably stem from working with Natural Earth data, which, since it exists, is definitionally “real world” data, but which also includes some of the most unbelievably degenerate geometry I have ever experienced.

Rather than build special cases to handle degeneracy into every geometry function in the system, the right approach, IMO, is to build two functions that convert degenerate data into structured data that captures the “intent” of the original.

One function, ST_MakeValid, has already made an appearance in PostGIS 2.0. It isn’t 100% perfect, but it makes a good attempt and fixes many common invalidities that previously we had no answer for beyond “you’re SOL”. ST_MakeValid tries to fix invalidity without changing the input vertices at all.

The second function, ST_MakeClean, does not exist yet. ST_MakeClean would do everything ST_MakeValid does, but would also include a tolerance factor to use in tossing out unimportant structures (little spikes, tiny loops, minor ring crossings) that aren’t part of the “intent” of the feature.
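As a usage sketch (the table and column names are made up), repairing just the broken rows with the function that does exist today looks like this:

    -- Repair only the rows that fail validity checking, leaving the rest alone.
    UPDATE parcels
       SET geom = ST_MakeValid(geom)
     WHERE NOT ST_IsValid(geom);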

Summary

I wish we had better packaging or the ability to do all the packaging ourselves so we could create a 100% consistent user experience across every platform, but that’s not possible. Please beg your favorite PostgreSQL packager to package PostGIS too.

The upgrade and install stories are going to get better with EXTENSIONS. So, just hang in there and things will improve.

The geometry validity story will get better with cleaning functions, and with any extra dollars folks can invest in continuing to improve GEOS and fixing obscure issues in the overlay code. The “ultimate fix”, if anyone wants to fund it, is to complete the “snap rounding” code in JTS and port it to GEOS, to support a fixed-precision overlay system. That should remove all overlay failures (which actually show up in intersections, unions and buffers, really in all constructive geometry operations) once and for all.

IT Bubble Economics

As an owner of BC real estate, one of the more frightening economic stories I’ve heard over the last year relates to how our own real estate market is in fact tied tightly to others through the mechanism of offshore investment. The story goes like this:

  • The Chinese real estate market is bubbling. That is, it is appreciating because people now expect it to appreciate.
  • Chinese folks in the market are thus experiencing the heady feelings Americans circa 2006 were feeling. They can purchase a property and a year later through the magic of appreciation be in possession of surplus capital to invest in yet another property.
  • Canny Chinese investors invest some of their surplus capital overseas, which creates artificial demand propping up and sometimes actually inflating those overseas markets, causing smaller dependent bubbles in those markets. Think: Vancouver.
  • When the Chinese bubble bursts, expect the dependent bubbles to also burst, to the surprise of the inhabitants, who attributed their good fortune to the “fundamentals” of their markets.

(Since I live in Victoria, the “fundamentals” of our market fortunately relate a lot to wealthy Canadians getting old and wanting to retire here, and happily for me there is still a lot of that happening.)

Anyhow, IT?

The Victoria IT marketplace is surprisingly tight. We’re a government town, in the midst of a moderate fiscal belt-tightening, and yet: it’s hard to find talent. The answer? Not Victoria’s fundamentals, but a reflection of the fact that other regional IT bubbles are now, thanks to the magic of the internets, highly portable.

Check out the niche market this local consultancy is serving.

Priologic is a software development firm based in Victoria, BC, Canada focused on serving San Francisco area Fortune 100 enterprise companies.

That’s right, the IT market in San Francisco has become so tight that there is now a niche for a company near-shoring talent to the uncool kids in the Fortune 100.

We’re also seeing satellite offices for Microsoft gaming and Zynga coming to town, and Amazon has had a big foothold here for the past several years since their acquisition of AbeBooks.

We’re all swimming in the same pool, now! (Nobody pee.)