What's so hard about that download?

Twitter is a real accountability tool. This post is penance for moaning about things in public.

Soooo…. last Friday, while cooling my heels in Denver on the way home, I took another stab at Chad Dickerson’s electoral district clipping problem, and came up with this version.

I ended up taking the old 1:50K “3rd order watershed” layer, using ST_Union to generate maximal provincial outline, using ST_Dump and ST_ExteriorRing to get out just the land boundaries (no lakes or wide rivers), used ST_Buffer to and ST_Simplify to get a reduced-yet-still-attractive version, differenced this land polygon from an outline polygon to get an “ocean” polygon, then (as I did previously) differenced that ocean from the electoral districts to get the clipped results. Phew.

And then I complained on the Twitter about the webstacle that now exists for anyone like me who wants to access those old 1:50K GIS files.

And the OpenData folks in BC, to their credit, wonder what I’m on about.

So, first of all, caveats:

This is a story of contrasts and “progress”.

Back when I actually downloaded these GIS files, in the early 2000s, I was able to access the whole dataset like this (the fact that I can still type out the process from memory should be indicative of how useful I found the old regime):

ftp ftp.env.gov.bc.ca
cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz

Here’s how it works now.

I don’t know where this data is anymore, so I go to data.gov.bc.ca. This is an improvement, I don’t have to (a) troll through Ministry sites first trying to figure out which one holds the data or (b) not troll though anything because I have no idea the data exists.

Due to the magic of inflexible design standards, the data.gov.bc.ca site has two search boxes, one that does what I want (the smaller one, below), and one that just does a google search of all the gov.bc.ca sites (that larger one, at the top). Ask me how I figured that out.

So, I type “watersheds” into the search box and get 10 results. Here I have to lean on my domain knowledge and go to #10, which is the old 3rd order watersheds layer.

The dataset record is pretty good, my only complaint would be that unlike the old FTP filesystem there’s no obvious indication that there are other related data sets that together form a collection of related data, the watershed atlas. The keywords field gets towards that intent, but a breadcrumb trail or something else might be clearer. I think the idea of a data collection made of parts is common to a lot of data domains, and might help people organically discover things more easily.

Anyhow, here’s where things get “fun”, because here we leave the domain of open data and enter the domain of the “GIS data warehouse”. I click on the “SHP” download link:

The difference between hosting data on an FTP site and hosting it in a big ArcSDE warehouse is that the former has very few moving parts, is really simple, and practically never does down, while the latter is the opposite of that.

Let’s just skip the convenient direct open data link, and try to download the data directly from the warehouse. Go to the warehouse distribution service entry page:

I like ad for Internet Explorer, that’s super stuff. It’s almost like these pages are put up and never looked at again. We’ll enter as a guest.

Two search boxes again, but at least this time the one we’re supposed to use is the big one. Thanks to our trip through the data.gov.bc.ca system, we know that typing “WSA” is the thing most likely to get us the “watershed atlas”.

Boohyah, it’s even the top set of entries. Let’s compare the metadata for fun (click on the “Info” (i)).

Pretty voluminous, and there’s a tempting purple download button up there… hey, this one works!

Hm, it wants my email address and for me to assent to a license… I wonder what the license is?

Why make people explicitly assent to a license that is only implicitly defined? Fun. Ok, fine, have my email address, and I assent to something or other. I Submit (in both senses)!

And now I wait for my e-mail to arrive…

Hey presto, it’s alive! (Sunday 11:27AM) But no data yet.

W00t! Data is ready! (Sunday 11:30AM)

Uh, oh, something is wrong here. My browser perhaps? Let’s try wget.

wget ftp://slkftp.env.gov.bc.ca/outgoing/apps/lrdw/dwds/LRDW-1235441-Public.zip
--2012-12-02 11:33:10--  ftp://slkftp.env.gov.bc.ca/outgoing/apps/lrdw/dwds/LRDW-1235441-Public.zip
           => ‘LRDW-1235441-Public.zip’
Resolving slkftp.env.gov.bc.ca (slkftp.env.gov.bc.ca)... 142.36.245.171
Connecting to slkftp.env.gov.bc.ca (slkftp.env.gov.bc.ca)|142.36.245.171|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /outgoing/apps/lrdw/dwds ... 
No such directory ‘outgoing/apps/lrdw/dwds’.

This is awesome. OK, back to the FTP client!

Connected to slkftp.env.gov.bc.ca.
220-Microsoft FTP Service
220 This server slkftp.env.gov.bc.ca is for British Columbia Government business use only.
500 'AUTH GSSAPI': command not understood
Name (slkftp.env.gov.bc.ca:pramsey): ftp
331 Anonymous access allowed, send identity (e-mail name) as password.
Password:
230 Anonymous user logged in.
Remote system type is Windows_NT.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x   1 owner    group               0 Jul 23  2010 MidaFTP
dr-xr-xr-x   1 owner    group               0 Aug 25  2010 outgoing
226 Transfer complete.
ftp> cd outgoing
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x   1 owner    group               0 Jul  2  2010 apps
226 Transfer complete.
ftp> cd apps
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
d---------   1 owner    group               0 Jul  2  2010 lrdw
d---------   1 owner    group               0 Jul  2  2010 mtec
226 Transfer complete.
ftp> cd lrdw
550 lrdw: Access is denied.

So, the anonymous FTP directory where the jobs are landing is not readable (by anyone). Oh, and serious demerits for running an FTP server on Windows (NT!).

The whole data warehouse/data distribution thing substantially pre-dates open data, and actually one of the reasons it (a) exists and (b) is so f***ing terrible is because at the time it was conceived and designed BC was still trying to sell GIS data, so the distribution system has crazy layers of security and differentiation between free and non-free data (even though it still forces you to go through “checkout” for free data (which all data now is)).

My request was for only 50Mb of data, and the system is (theoretically) willing to give it to me in one chunk. If I had wanted to access all of TRIM (the 1:20 BC planimetric base map product) I would be, as the French say, “up sh** creek”.

The current process is also, clearly, not amenable to automation. If I wanted to regularly download a volatile data set, I would also be, as in the German proverb, FUBAR.

So, there you go, open data folks. I am fully cognizant that the problem is 100% Not Of Your Design or Doing, I watched it happen in real time (and even won a contract to maintain the system after it was built! gah!) But it is also, still, now many years on, a Problem.

Remember, I originally got the data like this.

ftp ftp.env.gov.bc.ca
cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz

It ain’t Rocket Science, we’ve just made it seem like it is.