Sunday, December 02, 2012

What's so hard about that download?

Twitter is a real accountability tool. This post is penance for moaning about things in public.

Soooo.... last Friday, while cooling my heels in Denver on the way home, I took another stab at Chad Dickerson's electoral district clipping problem, and came up with this version.

I ended up taking the old 1:50K "3rd order watershed" layer, using ST_Union to generate a maximal provincial outline, using ST_Dump and ST_ExteriorRing to extract just the land boundaries (no lakes or wide rivers), using ST_Buffer and ST_Simplify to produce a reduced-yet-still-attractive version, differencing that land polygon from an outline polygon to get an "ocean" polygon, and then (as I did previously) differencing the ocean from the electoral districts to get the clipped results. Phew.
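The steps above, sketched as PostGIS SQL. The table names, column names, tolerances, and envelope coordinates here are all invented for illustration; only the shape of the pipeline matches what I actually ran:

```sql
-- Merge the 3rd-order watershed polygons into one provincial outline.
CREATE TABLE province AS
  SELECT ST_Union(geom) AS geom FROM wsd3;

-- Explode the multipolygon, keep only the exterior rings (so lakes and
-- wide rivers disappear), and rebuild polygons from them.
CREATE TABLE land AS
  SELECT ST_MakePolygon(ST_ExteriorRing((ST_Dump(geom)).geom)) AS geom
  FROM province;

-- Buffer and simplify into a reduced-yet-still-attractive generalization.
CREATE TABLE land_simple AS
  SELECT ST_Simplify(ST_Buffer(geom, 500), 500) AS geom FROM land;

-- Difference the land from an enclosing rectangle to get the "ocean"
-- (coordinates are a made-up BC Albers bounding box).
CREATE TABLE ocean AS
  SELECT ST_Difference(
           ST_MakeEnvelope(200000, 300000, 1900000, 1800000, 3005),
           ST_Union(geom)) AS geom
  FROM land_simple;

-- Finally, difference the ocean from the districts to clip them to the coast.
CREATE TABLE districts_clipped AS
  SELECT d.id, ST_Difference(d.geom, o.geom) AS geom
  FROM districts d, ocean o;
```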

And then I complained on the Twitter about the webstacle that now exists for anyone like me who wants to access those old 1:50K GIS files.

And the OpenData folks in BC, to their credit, wondered what I was on about.

So, first of all, caveats:

  1. The obstacles to access to this data were constructed years before open data existed as an explicit government initiative in BC. This is not a problem with the work the open data folks have done.
  2. It could certainly be a whole lot harder to access: the data is still theoretically available for download, and I don't need to file an FOI request or go to court to get it.

This is a story of contrasts and "progress".

Back when I actually downloaded these GIS files, in the early 2000s, I was able to access the whole dataset like this (the fact that I can still type out the process from memory should be indicative of how useful I found the old regime):

cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz

Here's how it works now.

I don't know where this data is anymore, so I go to the open data catalogue. This is an improvement: I don't have to (a) troll through Ministry sites trying to figure out which one holds the data, or (b) not troll through anything because I have no idea the data exists.

Due to the magic of inflexible design standards, the site has two search boxes: one that does what I want (the smaller one, below) and one that just does a Google search of all the sites (the larger one, at the top). Ask me how I figured that out.

So, I type "watersheds" into the search box and get 10 results. Here I have to lean on my domain knowledge and go to #10, which is the old 3rd order watersheds layer.

The dataset record is pretty good. My only complaint is that, unlike the old FTP filesystem, there's no obvious indication that there are other related data sets that together form a collection, the watershed atlas. The keywords field gets towards that intent, but a breadcrumb trail or something else might be clearer. I think the idea of a data collection made of parts is common to a lot of data domains, and might help people organically discover things more easily.

Anyhow, here's where things get "fun", because here we leave the domain of open data and enter the domain of the "GIS data warehouse". I click on the "SHP" download link:

The difference between hosting data on an FTP site and hosting it in a big ArcSDE warehouse is that the former has very few moving parts, is really simple, and practically never goes down, while the latter is the opposite of that.

Let's just skip the convenient direct open data link, and try to download the data directly from the warehouse. Go to the warehouse distribution service entry page:

I like the ad for Internet Explorer; that's super stuff. It's almost like these pages are put up and never looked at again. We'll enter as a guest.

Two search boxes again, but at least this time the one we're supposed to use is the big one. Thanks to our trip through the system, we know that typing "WSA" is the thing most likely to get us the "watershed atlas".

Boohyah, it's even the top set of entries. Let's compare the metadata for fun (click on the "Info" (i)).

Pretty voluminous, and there's a tempting purple download button up there... hey, this one works!

Hm, it wants my email address and for me to assent to a license... I wonder what the license is?

Why make people explicitly assent to a license that is only implicitly defined? Fun. Ok, fine, have my email address, and I assent to something or other. I Submit (in both senses)!

And now I wait for my e-mail to arrive...

Hey presto, it's alive! (Sunday 11:27AM) But no data yet.

W00t! Data is ready! (Sunday 11:30AM)

Uh, oh, something is wrong here. My browser perhaps? Let's try wget.

--2012-12-02 11:33:10--
           => ‘’
Resolving (
Connecting to (||:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /outgoing/apps/lrdw/dwds ... 
No such directory ‘outgoing/apps/lrdw/dwds’.

This is awesome. OK, back to the FTP client!

Connected to
220-Microsoft FTP Service
220 This server is for British Columbia Government business use only.
500 'AUTH GSSAPI': command not understood
Name ( ftp
331 Anonymous access allowed, send identity (e-mail name) as password.
230 Anonymous user logged in.
Remote system type is Windows_NT.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x   1 owner    group               0 Jul 23  2010 MidaFTP
dr-xr-xr-x   1 owner    group               0 Aug 25  2010 outgoing
226 Transfer complete.
ftp> cd outgoing
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x   1 owner    group               0 Jul  2  2010 apps
226 Transfer complete.
ftp> cd apps
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
d---------   1 owner    group               0 Jul  2  2010 lrdw
d---------   1 owner    group               0 Jul  2  2010 mtec
226 Transfer complete.
ftp> cd lrdw
550 lrdw: Access is denied.

So, the anonymous FTP directory where the jobs are landing is not readable (by anyone). Oh, and serious demerits for running an FTP server on Windows (NT!).

The whole data warehouse/data distribution thing substantially pre-dates open data, and one of the reasons it (a) exists and (b) is so f***ing terrible is that, at the time it was conceived and designed, BC was still trying to sell GIS data. So the distribution system has crazy layers of security and differentiation between free and non-free data, even though it still forces you to go through "checkout" for free data (which all data now is).

My request was for only 50Mb of data, and the system is (theoretically) willing to give it to me in one chunk. If I had wanted to access all of TRIM (the 1:20K BC planimetric base map product) I would be, as the French say, "up sh** creek".

The current process is also, clearly, not amenable to automation. If I wanted to regularly download a volatile data set, I would also be, as in the German proverb, FUBAR.
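For contrast, here is the kind of logic a scripted refresh of a volatile data set needs, and that a plain FTP directory listing trivially supports: compare remote modification times against local copies and fetch only what changed. A minimal Python sketch (the filenames and dates are invented for illustration):

```python
def files_to_fetch(remote, local):
    """Return remote files that are new, or newer than our local copies.

    remote/local: dicts mapping filename -> modification time
    (anything comparable: epoch seconds, YYYYMMDD ints, etc.)
    """
    return sorted(
        name for name, mtime in remote.items()
        if name not in local or mtime > local[name]
    )

# Invented filenames and dates, for illustration only.
remote = {"wsd3a.gz": 20121130, "wsd3b.gz": 20121102}
local = {"wsd3a.gz": 20121001, "wsd3b.gz": 20121102}
print(files_to_fetch(remote, local))  # -> ['wsd3a.gz']
```

Against the old FTP layout this reduces to one cron'd mget; against a web form with an email round-trip and a one-shot download link, there is nothing to point this logic at.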

So, there you go, open data folks. I am fully cognizant that the problem is 100% Not Of Your Design or Doing; I watched it happen in real time (and even won a contract to maintain the system after it was built! gah!). But it is also, still, now many years on, a Problem.

Remember, I originally got the data like this:

cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz

It ain't Rocket Science, we've just made it seem like it is.


Regina said...

The US Census TIGER data is so much nicer.
You have your web interface for new-age stupid kids
and FTP for experienced old fogies like me :)

and if you really want documentation you've got 700 pages of it to entertain you.

It must suck to be a Canadian. You should defect (You are a dual citizen aren't you?)

JMH said...

I recall you once asking me why PCIC had taken on the responsibility of writing a web application to distribute BC's meteorological data.
I think that you may have now answered your own question.
(Not an invitation to compare our web service to FTP!)

Seven said...

Thanks Paul, great post, yet again and I can absolutely relate.

To be honest, I initially had fun jumping through all the hoops. In my case it was DEMs from the USGS: not that bad at all, but still painful. And OpenData from the Ordnance Survey: also quite good, but still not there yet at all.

Why was it fun? Because it gave me the satisfaction of actually getting it done in the end. I won the fight. But this is a one-off satisfaction. Next time it is going to be a nuisance, and after the third time a real PITA.

So what to do? As you suggest, it is simply a matter of stripping away all the software and giving access to the data. The Web is about access, not restriction. Maybe Web 2.0 got in the way. Or it is not the Web at all, and we must go back to the Internet to get the data?

Paul Ramsey said...

@Seven, every resource must have a URL.

Darcy Buskermolen said...

I agree. I have an MOU with the gov which allows me access to about 1200 map sheets of TRIM data, across about a dozen NTS sheets. So far I have about 200 sheets downloaded, and it's taken me the better part of a week to obtain them: each sheet needs to be individually added, taking about 20 seconds per click. All in all, there has to be a better way. I understand that the gov still wants to keep some of their data private and chargeable, but at least change up the warehouse to allow me to click a "Select all" button. Also, there is no way with the wonderful search to say "show me the data updated in the last 30 days" so I can overcome the overhead of volatile data; right now the only way to do that is to go through the whole process again, retrieve the data, and compare it to the previously downloaded data. Overall I give them a mark of "C": not a complete failure, because I can make the system spit out data, but plenty of room for improvement.
