What's so hard about that download?

Twitter is a real accountability tool. This post is penance for moaning about things in public.

Soooo…. last Friday, while cooling my heels in Denver on the way home, I took another stab at Chad Dickerson’s electoral district clipping problem, and came up with this version.

I ended up taking the old 1:50K “3rd order watershed” layer, using ST_Union to generate a maximal provincial outline, using ST_Dump and ST_ExteriorRing to get out just the land boundaries (no lakes or wide rivers), using ST_Buffer and ST_Simplify to get a reduced-yet-still-attractive version, differencing this land polygon from an outline polygon to get an “ocean” polygon, then (as I did previously) differencing that ocean from the electoral districts to get the clipped results. Phew.
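The steps above, sketched in SQL. The table and column names (watersheds_3rd, outline, districts, and so on) are my invention for illustration, not the real layer names, and the buffer/simplify tolerances are placeholders:

```sql
-- 1. Merge the watershed polygons into one provincial outline.
CREATE TABLE province AS
  SELECT ST_Union(geom) AS geom FROM watersheds_3rd;

-- 2. Keep only the exterior rings (dropping lakes and wide rivers),
--    then buffer and simplify for a reduced-yet-still-attractive shape.
CREATE TABLE land AS
  SELECT ST_Simplify(
           ST_Buffer(
             ST_MakePolygon(ST_ExteriorRing((ST_Dump(geom)).geom)),
             0.0),         -- buffer(0) cleans up small artifacts
           100.0) AS geom  -- simplify tolerance, in map units
  FROM province;

-- 3. Difference the land from a bounding outline to get the "ocean"...
CREATE TABLE ocean AS
  SELECT ST_Difference(
           (SELECT geom FROM outline),
           (SELECT ST_Union(geom) FROM land)) AS geom;

-- 4. ...and difference the ocean from the districts to clip them.
CREATE TABLE districts_clipped AS
  SELECT d.id, ST_Difference(d.geom, o.geom) AS geom
  FROM districts d, ocean o;
```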

And then I complained on the Twitter about the webstacle that now exists for anyone like me who wants to access those old 1:50K GIS files.

And the OpenData folks in BC, to their credit, wondered what I was on about.

So, first of all, caveats:

  • The obstacles to access to this data were constructed years before open data existed as an explicit government initiative in BC. This is not a problem with the work the open data folks have done.
  • It could certainly be a whole lot harder to access: it is still theoretically available for download, and I don’t need to file an FOI request or go to court or anything like that to get this data.

This is a story of contrasts and “progress”.

Back when I actually downloaded these GIS files, in the early 2000s, I was able to access the whole dataset like this (the fact that I can still type out the process from memory should be indicative of how useful I found the old regime):

ftp ftp.env.gov.bc.ca
cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz

Here’s how it works now.

I don’t know where this data is anymore, so I go to data.gov.bc.ca. This is an improvement: I don’t have to (a) troll through Ministry sites first, trying to figure out which one holds the data, or (b) not troll through anything at all because I have no idea the data exists.

Due to the magic of inflexible design standards, the data.gov.bc.ca site has two search boxes: one that does what I want (the smaller one, below), and one that just does a google search of all the gov.bc.ca sites (the larger one, at the top). Ask me how I figured that out.

So, I type “watersheds” into the search box and get 10 results. Here I have to lean on my domain knowledge and go to #10, which is the old 3rd order watersheds layer.

The dataset record is pretty good. My only complaint is that, unlike the old FTP filesystem, there’s no obvious indication that there are other related data sets that together form a collection of related data, the watershed atlas. The keywords field gets towards that intent, but a breadcrumb trail or something else might be clearer. I think the idea of a data collection made of parts is common to a lot of data domains, and might help people organically discover things more easily.

Anyhow, here’s where things get “fun”, because here we leave the domain of open data and enter the domain of the “GIS data warehouse”. I click on the “SHP” download link:

The difference between hosting data on an FTP site and hosting it in a big ArcSDE warehouse is that the former has very few moving parts, is really simple, and practically never goes down, while the latter is the opposite of that.

Let’s just skip the convenient direct open data link, and try to download the data directly from the warehouse. Go to the warehouse distribution service entry page:

I like the ad for Internet Explorer, that’s super stuff. It’s almost like these pages are put up and never looked at again. We’ll enter as a guest.

Two search boxes again, but at least this time the one we’re supposed to use is the big one. Thanks to our trip through the data.gov.bc.ca system, we know that typing “WSA” is the thing most likely to get us the “watershed atlas”.

Boohyah, it’s even the top set of entries. Let’s compare the metadata for fun (click on the “Info” (i)).

Pretty voluminous, and there’s a tempting purple download button up there… hey, this one works!

Hm, it wants my email address and for me to assent to a license… I wonder what the license is?

Why make people explicitly assent to a license that is only implicitly defined? Fun. Ok, fine, have my email address, and I assent to something or other. I Submit (in both senses)!

And now I wait for my e-mail to arrive…

Hey presto, it’s alive! (Sunday 11:27AM) But no data yet.

W00t! Data is ready! (Sunday 11:30AM)

Uh, oh, something is wrong here. My browser perhaps? Let’s try wget.

wget ftp://slkftp.env.gov.bc.ca/outgoing/apps/lrdw/dwds/LRDW-1235441-Public.zip
--2012-12-02 11:33:10--  ftp://slkftp.env.gov.bc.ca/outgoing/apps/lrdw/dwds/LRDW-1235441-Public.zip
           => ‘LRDW-1235441-Public.zip’
Resolving slkftp.env.gov.bc.ca (slkftp.env.gov.bc.ca)... 142.36.245.171
Connecting to slkftp.env.gov.bc.ca (slkftp.env.gov.bc.ca)|142.36.245.171|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /outgoing/apps/lrdw/dwds ... 
No such directory ‘outgoing/apps/lrdw/dwds’.

This is awesome. OK, back to the FTP client!

Connected to slkftp.env.gov.bc.ca.
220-Microsoft FTP Service
220 This server slkftp.env.gov.bc.ca is for British Columbia Government business use only.
500 'AUTH GSSAPI': command not understood
Name (slkftp.env.gov.bc.ca:pramsey): ftp
331 Anonymous access allowed, send identity (e-mail name) as password.
Password:
230 Anonymous user logged in.
Remote system type is Windows_NT.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x   1 owner    group               0 Jul 23  2010 MidaFTP
dr-xr-xr-x   1 owner    group               0 Aug 25  2010 outgoing
226 Transfer complete.
ftp> cd outgoing
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x   1 owner    group               0 Jul  2  2010 apps
226 Transfer complete.
ftp> cd apps
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
d---------   1 owner    group               0 Jul  2  2010 lrdw
d---------   1 owner    group               0 Jul  2  2010 mtec
226 Transfer complete.
ftp> cd lrdw
550 lrdw: Access is denied.

So, the anonymous FTP directory where the jobs are landing is not readable (by anyone). Oh, and serious demerits for running an FTP server on Windows (NT!).

The whole data warehouse/data distribution thing substantially pre-dates open data. One of the reasons it (a) exists and (b) is so f***ing terrible is that, at the time it was conceived and designed, BC was still trying to sell GIS data, so the distribution system has crazy layers of security and differentiation between free and non-free data (even though it still forces you to go through “checkout” for free data (which all data now is)).

My request was for only 50Mb of data, and the system is (theoretically) willing to give it to me in one chunk. If I had wanted to access all of TRIM (the 1:20K BC planimetric base map product) I would be, as the French say, “up sh** creek”.

The current process is also, clearly, not amenable to automation. If I wanted to regularly download a volatile data set, I would also be, as in the German proverb, FUBAR.

So, there you go, open data folks. I am fully cognizant that the problem is 100% Not Of Your Design or Doing; I watched it happen in real time (and even won a contract to maintain the system after it was built! gah!). But it is also, still, now many years on, a Problem.

Remember, I originally got the data like this.

ftp ftp.env.gov.bc.ca
cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz

It ain’t Rocket Science, we’ve just made it seem like it is.

ICM and ExaData

I went to an Oracle Users Group meeting yesterday afternoon, to see a presentation by Marcin Zaranski on the Integrated Case Management system’s use of Oracle ExaData hardware.

Disclosure: I went expecting to be shocked, shocked at yet another criminal waste of money on the part of ICM.

ExaData is Oracle’s fairly successful attempt to turn the hardware engineering skills they acquired with Sun Microsystems into something with unique marketability. I’d say they’ve succeeded, which is good news for the excellent engineers at Sun. By combining the kinds of hardware database optimizations pioneered at Netezza and Teradata with Sun’s overall server engineering prowess and their industry-leading sales and marketing team, Oracle has a winner. It might not be the best, but they are going to sell more of them than anyone else.

ExaData (I’m going to speak in the singular, about the ExaData Database Machine even though the “Exa” prefix has now been splashed across a wide range of Oracle “engineered systems”) is basically an enterprise appliance. A database in a box, where the “box” is a server rack. It ships with the database pre-installed and configured for the underlying hardware. The underlying hardware not only includes the kind of monitoring, reliability and redundancy that Sun fanboys like myself have come to expect, but also includes custom storage modules that can push portions of SQL queries down to just above the disk heads, dramatically improving query performance, particularly for OLAP workloads.

It’s really clever technology, but the true cleverness is the sales pitch, which Zaranski touched on and Oracle rep Dev Dhindsa hammered us over the head with during his talk: because ExaData is basically an appliance, all the coordination costs of getting systems and network administrators to interface with database and application administrators goes away. It’s a technology fix for the organizational problem of IT silos, and the pitch to the database and apps departments is simple: “buy this product so you don’t have to talk to those f***ers in system admin and networking anymore.”

And any pitch to IT that involves talking to people less is a stone cold winner.

So, back to ICM. The justification for ICM buying ExaData was to alleviate performance problems experienced in their phase 1 rollout to about 1800 users before they rolled out to 8000 users in phase 2. The result: success! They didn’t have any complaints in phase 2… about performance.

After his presentation, I asked Zaranski how much ExaData cost the ICM project, and he would not provide a number, presumably thanks to the magic of “secret contracts” with Oracle (pricing is a “trade secret” and thus shielded from FOI, one of the many counterproductive consequences of the “third party confidential” exception in the BC FOI law).

However, later the Oracle hardware rep was nice enough to tell me the list price for ExaData: $250K for a “quarter rack”. ICM purchased two of those (one for production, one for fail-over), presumably for less than list. It’s a lot of money for servers, but within the context of a $200M project I find it hard to get worked up. It makes their system run faster, which makes it less awful for the users. And the cost of the ExaData hardware will look small next to the cost of the Oracle and Siebel software licenses that are going to run on it.

Way before the project was forced to buy top-end hardware to coax reasonable performance out of their application, the clusterf*** that is ICM was already baked in: by the decision to simultaneously integrate so many systems; by the decision to use the “COTS” Siebel solution; and by the decision to outsource to expensive international consultancies.

So, enjoy your cool hardware ICM, it’s pretty boss.

Best moment: In his presentation, Zaranski repeated the ICM mantra: that one of the big wins is replacing the 30-year-old “legacy” systems previously doing the social services records keeping. “Legacy” is a favourite put-down of all IT presenters. “Legacy” software is crufty old stuff, in the process of being phased out, unlike the cool software you work with. So I got a perverse kick when Dev Dhindsa, in praising Oracle’s new “cloud enabled” Fusion Middleware contrasted it favorably with their suite of “legacy” applications, such as PeopleSoft, JD Edwards, and … Siebel, the software ICM is using to replace the “legacy” social services systems.

Spatial IT vs GIS

So, Stephen Mather has taken a crack at analyzing my Spatial IT meme.

And why then the artificial distinction between GIS and Planning? If GIS is Planning technically embodied, should they not be conflated? Two reasons why not. One: The efficacy of GIS can be hindered by slavishly tying it to Planning in large part because there is wider and deeper applicability to GIS than to Planning’s typical functions. Lemma: Paul is partially right.

Stephen’s found a weak seam in my argument, and it’s around the planning aspects of GIS. There’s a place where GIS provides the interface between raw data and planning decisions, which remains:

  • high touch and interpersonal;
  • qualitative and presentational;
  • ad hoc and unpredictable.

This is the GIS that is taught in schools, because it’s the “interesting” GIS, the place where decision meets data.

However, as we know, GIS courses are just the bait in the trap, to suck naïve students into a career where 90% of the activity is actually in data creation (digitization monkey!) and publication (map monkey!), not in analysis. The trap that “GIS” has fallen into is to assume that these low-skill, repetitive tasks are (a) worth defending and (b) should be done with specific “GIS technology”. They aren’t, and they shouldn’t; they should (and, pace Brian Timoney, will) be folded into generic IT workflows, automated, and systematized.

That will leave the old core of “real GIS” behind, and that’s probably a good thing, because training people for analysis and then turning them into map monkeys and digitization monkeys (and image color-balancing monkeys, and change detection monkeys) is a cruel bait-and-switch.

Open Source for IT Managers

I’ve given this talk a number of times over the past 18 months (Minnesota, Illinois, Wisconsin, DC), but this was the first time there was a decent recording set-up at the venue. Got a spare 48 minutes?

PostGIS Apologia

Nathaniel Kelso has provided feedback from an (occasionally disgruntled) users point-of-view about ways to make PostGIS friendlier. I encourage you to read the full post, since it includes explanatory material that I’m going to trim away here to explain the whys and wherefores of how we got to where we are.

TL;DR: philosophical reasons for doing things; historical reasons for doing things; not my problem; just never got around to that.

Request 1a: Core FOSS4G projects should be stable and registered with official, maintained APT Ubuntu package list.

Request 1b: The APT package distribution of core FOSS4G projects should work with the last 2 versions (equivalent to 2 years) of Ubuntu LTS support releases, not just the most recent cutting edge dot release.

Spoken like an Ubuntu user! I would put the list of “platforms that have enough users to require packaging support” at: Windows, OSX, Centos (RHEL), Fedora, Ubuntu, Debian, SUSE. Multiply by 2 for 32/64 bit support, and add a few variants for things like multiple OSX package platforms (MacPorts, HomeBrew, etc). Reality: the PostGIS team doesn’t have the bandwidth to do this, people who want support for their favourite platform have to do it themselves.

The only exception to this rule is Windows, which Regina Obe supports, but that’s because she’s actually a dual category person: a PostGIS developer who also really wants her platform supported.

The best Linux support is for Red Hat variants, provided by Devrim Gunduz in the PostgreSQL Yum repositories. I think Devrim’s example is actually the best one, since it takes a PostgreSQL packager to do a really bang up job of packaging a PostgreSQL add-on like PostGIS. Unfortunately the Ubuntu PostgreSQL packager doesn’t do PostGIS as well (yet?).

Request 1c: Backport key bug fixes to the prior release series

This is actually done as a matter of course. If you know of a fix that is not backported, ticket it. In general, if you put your tickets against the earliest milestone they apply to, the odds of a fix hitting all extant versions go up, since the developer doesn’t have to go back and confirm the bug is historical rather than new to the development version. The only fixes that might not get done are ones that can’t be done without major code re-structuring, since that kind of thing tends to introduce as many problems as it solves.

Request 2.1a: Include a default PostGIS spatial database as part of the basic install, called “default_postgis_db” or something similar.

This is a packaging issue, and some packagers (Windows in particular, but also the OpenGeo Suite) include a template_postgis database, since it makes it easier to create new spatial databases (create database foo template template_postgis).
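The template approach is worth spelling out, since it is the best pre-extension workaround. A sketch of what those packagers do, as a psql session (the script paths and version are illustrative and vary by install):

```sql
-- Build the template once, as a superuser:
CREATE DATABASE template_postgis;
\c template_postgis
\i /usr/share/postgresql/contrib/postgis-1.5/postgis.sql
\i /usr/share/postgresql/contrib/postgis-1.5/spatial_ref_sys.sql

-- Mark it as a template so it can be copied:
UPDATE pg_database SET datistemplate = true
  WHERE datname = 'template_postgis';

-- From then on, a new spatial database is one line:
CREATE DATABASE foo TEMPLATE template_postgis;
```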

Anyways, since this is a packaging issue, unless the PostGIS team took on all packaging there would be no way to ensure it happened in a uniform way everywhere, which is what one would need to do to have it make things easier (for it to become general knowledge so that “oh, just use the __ database” became global advice).

More on creating spatial databases below.

Request 2.1b: Include a default PostGIS Postgres user as part of the basic install, called “postgis_user” or something similar.

I’m not sure I see the utility of this. From a data management point of view, you already have the PostgreSQL super user, postgres, around as a guaranteed-to-exist default user.

Request 2.1c: If I name a spatially enabled database in shp2pgsql that doesn’t yet exist, make one for me

Unless you have superuser credentials, we can’t do this. So, maybe?

Request 2.1d: It’s too hard to manually setup a spatial database, with around a printed page of instructions that vary with install. It mystifies Postgres pros as well as novices.

Indeed it is! I will hide behind the usual defence, of course: “it’s not our fault!” It’s just the way PostgreSQL deals with extensions, including their own (load pgcrypto, for example, or fuzzystrmatch). The best hack we have is the packaging hack that pre-creates a template_postgis, which works pretty well.

Fortunately, as of PostgreSQL 9.1+ and PostGIS 2.0+ we have the “CREATE EXTENSION” feature, so from here on in spatializing (and unspatializing (and upgrading)) a spatial database will be blissfully easy, just CREATE EXTENSION postgis (and DROP EXTENSION postgis (and ALTER EXTENSION postgis UPDATE TO 2.1.0)).

Request 2.1e: Default destination table names in shp2pgsql.

We have this, I just checked (1.5 and 2.0). The usage line indicates it, and it actually happens. I’m pretty sure it’s worked this way for a long time too, it’s not a new thing.

Request 2.1f: Automatically pipe the output to actually put the raw SQL results into PostGIS.

I’ll plead historical legacy on this one. The first version (c. 2001) of the loader was just a loader, no dumper, so adding in a database connection would have been needless complexity: just pipe it to psql, right?
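The pipe in question, for anyone who hasn't seen it (file, table, and database names here are illustrative):

```shell
# shp2pgsql writes SQL to stdout; psql feeds it to the database.
shp2pgsql -s 4326 watersheds.shp watersheds | psql -d mydb
```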

Then we got a dumper, so now we had database connection logic lying around, but the loader had existing semantics and users. Also the code was crufty and it would have had to be re-written to get a direct load.

Then we got a GUI (1.5), and that required re-writing the core of the loader to actually do a direct database load. But we wanted to keep the commandline version working the same as before so our existing user base wouldn’t get a surprise change. So at this point doing a direct database loader is actually trivial, but we deliberately did not do it, to avoid tossing a change at our 10 years of legacy users.

So this is very doable, the question is whether we want to make a change like this to a utility that has been unaltered for years.

Incidentally, from an easy-to-use-for-newbies point of view the GUI is obviously way better than the command line. Why not use that? It’s what I use in all my PostGIS courses now.

Request 2.1g: If my shapefile has a PRJ associated with it (as most do), auto populate the -s option.

You have no idea how long I’ve wanted to do this. A very long time. It is, however, very hard to do. PRJ files don’t come (except the ones generated by GeoTools) with EPSG numbers in them. You have to figure out the numbers by (loosely) comparing the values in the file to the values in the full EPSG database. (That’s what the http://prj2epsg.org web site does.)

Now that we’ve added GDAL as a dependency in 2.0 we do at least have access to an existing PRJ WKT parser. However, I don’t think the OGR API provides enough hooks, though, to do something like load up all the WKT definitions in spatial_ref_sys (which is what we’ll have to do regardless) and search through them with sufficient looseness.

So this remains an area of active research. Sadly, it’s probably not something that anyone will ever fund, which means, given the level of effort necessary to make it happen, it probably won’t happen.

Related 2.1h Projection on the fly: If you still can’t reproject data on the fly, something is wrong. If table X is in projection 1 (eg web merc) and table Y is in projection 2 (eg geographic), PostGIS ought to “just work”, without me resorting to a bunch of ST_Transform commands that include those flags. The SRID bits in those functions should be optional, not required.

Theoretically possible, but it has some potentially awful consequences for performance. You can only do index-assisted things with objects that share an SRS (SRID), since the indexes are built in one SRS space. So picking a side of an argument and pushing it into the same SRS as the other argument could cause you to miss out on an index opportunity. It’s worth perhaps thinking more about, though, since people with heterogeneous SRID situations will be stuck in low-performing situations whether we auto-transform or not.
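For reference, the explicit version people write today looks like this; note that the choice of which side to transform decides which index stays usable (table and column names are illustrative):

```sql
-- Transform the geographic side into web mercator so the spatial
-- index on the (large) web-mercator table can still be used:
SELECT x.id, y.id
FROM webmerc_table x
JOIN geographic_table y
  ON ST_Intersects(x.geom, ST_Transform(y.geom, 3857));
```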

The downside of all such “automagic” is that it leads people into non-optimal set-ups very naturally (and completely silently) so they end up wondering why PostGIS sucks for performance when actually it is their data setup that sucks.

Request 2.1i: Reasonable defaults in shp2pgsql import flags.

Agree 100%. Again, we’re just not changing historical defaults fast enough. The GUI has better defaults, but it wouldn’t hurt for the commandline to have them too.

Request 2.1j: Easier creation of point features from csv or dbf.

A rat-hole of unknowable depth (csv handling, etc.), but agreed, it would be a really common and useful utility. I just write a new perl script every time :)
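Absent such a utility, the common recipe today stays inside the database. The file path and column names here are illustrative:

```sql
-- Load the raw CSV into an ordinary table, then build points:
CREATE TABLE sites (id integer, name text, lon float8, lat float8);
COPY sites FROM '/tmp/sites.csv' WITH (FORMAT csv, HEADER true);

ALTER TABLE sites ADD COLUMN geom geometry(Point, 4326);
UPDATE sites SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
CREATE INDEX sites_geom_idx ON sites USING gist (geom);
```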

Request 2.3a: Forward compatible pgdumps. Dumps from older PostGIS & Postgres combinations should always import into newer combinations of PostGIS and Postgres.

Upgrade has been ugly for a long time, and again it’s “not our fault”, in that until PostgreSQL 9.1, pg_dump always included our functions in the dump files. If you strip out the PostGIS function signature stuff (which is what the utils/postgis_restore.pl script does), it’s easy to get a clean and quiet restore into new versions, since we happily read old dumped PostGIS data and always have.

If you don’t mind a noisy restore it’s also always been possible to just drop a dump onto a new database and ignore the errors as function signatures collide and get a good restore.
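The clean-restore workflow has this general shape; exact invocation varies by version, and the file names here are illustrative:

```shell
# Dump the old database in custom format:
pg_dump -Fc -f olddb.dump olddb

# Make a fresh database with the new PostGIS installed:
createdb newdb
psql -d newdb -f postgis.sql

# Strip old function signatures from the dump and restore the rest:
perl utils/postgis_restore.pl olddb.dump | psql -d newdb
```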

With “CREATE EXTENSION” in PostgreSQL 9.1, we will now finally be able to pg_dump clean dumps that don’t include the function information, so this story more or less goes away.

Request 2.3b: Offer an option to skip PostGIS simple feature topology checks when importing a pgdump.

It’s important to note that there are two levels of validity checking in PostGIS. One level is “dumbass validity checking”, which can happen at parse time. Do rings close? Do linestrings have more than one point? That kind of thing. For a brief period in PostGIS history we had some ugly situations where it was possible to create or ingest dumbass geometry through one code path and impossible to output it or ingest it through others. This was bad and wrong. It’s hopefully mostly gone. We should now mostly ingest and output dumbass things, because those things do happen. We hope you’ll clean or remove them though at a later time.

Be thankful we aren’t ArcSDE, which not only doesn’t accept dumbass things, it doesn’t accept anything that fails any rule of their whole validity model.

Request 3a: Topology should only be enforced as an optional add on, even for simple Polygon geoms. OGC’s view of polygon topology for simple polygons is wrong (or at the very least too robust).

Request 3b: Teach PostGIS the same winding rule that allows graphics software to fill complex polygons regarding self-intersections. Use that for simple point in polygon tests, etc. Only force me to clean the geometry for complicated map algebra.

Request 3c: Teach OGC a new trick about “less” simple features.

Request 3d: Beyond the simple polygon gripe, I’d love it if GEOS / PostGIS could become a little more sophisticated. Adobe Illustrator for several versions now allows users to build shapes using their ShapeBuilder tool where there are loops, gaps, overshoots, and other geometry burrs. It just works. Wouldn’t that be amazing? And it would be even better than ArcGIS.

We don’t enforce validity, we just don’t work very well if it’s not present.

Most of these complaints stem presumably from working with Natural Earth data which, since it exists, is definitionally “real world” data, but also includes some of the most unbelievably degenerate geometry I have ever experienced.

Rather than build special cases to handle degeneracy into every geometry function in the system, the right approach, IMO, is to build two functions that convert degenerate data into structured data that captures the “intent” of the original.

One function, ST_MakeValid, has already made an appearance in PostGIS 2.0. It isn’t 100% perfect, but it makes a good attempt and fixes many common invalidities that previously we had no answer for beyond “you’re SOL”. ST_MakeValid tries to fix invalidity without changing the input vertices at all.
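The classic demonstration is the self-intersecting “bowtie” polygon, which ST_MakeValid repairs without moving any input vertices:

```sql
SELECT ST_AsText(
  ST_MakeValid('POLYGON((0 0, 2 2, 2 0, 0 2, 0 0))'::geometry)
);
-- yields a MULTIPOLYGON of the two triangles on either
-- side of the crossing point at (1 1)
```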

The second function, ST_MakeClean, does not exist yet. ST_MakeClean would do everything ST_MakeValid does, but would also include a tolerance factor to use in tossing out unimportant structures (little spikes, tiny loops, minor ring crossings) that aren’t part of the “intent” of the feature.

Summary

I wish we had better packaging or the ability to do all the packaging ourselves so we could create a 100% consistent user experience across every platform, but that’s not possible. Please beg your favorite PostgreSQL packager to package PostGIS too.

The upgrade and install stories are going to get better with EXTENSIONS. So, just hang in there and things will improve.

The geometry validity story will get better with cleaning functions, and any extra dollars folks can invest in continuing to improve GEOS and fix obscure issues in the overlay code. The “ultimate fix”, if anyone wants to fund that, is to complete “snap rounding” code in JTS, and port that to GEOS, to support a fixed-precision overlay system. That should remove all overlay failures (which actually show up in intersections, unions and buffers, really all constructive geometry operations) once and for all.