What's so hard about that download?
02 Dec 2012
Twitter is a real accountability tool. This post is penance for moaning about things in public.
Soooo…. last Friday, while cooling my heels in Denver on the way home, I took another stab at Chad Dickerson’s electoral district clipping problem, and came up with this version.
I ended up taking the old 1:50K “3rd order watershed” layer, using ST_Union to generate a maximal provincial outline, using ST_ExteriorRing to get out just the land boundaries (no lakes or wide rivers), and using ST_Buffer and ST_Simplify to get a reduced-yet-still-attractive version. I then differenced this land polygon from an outline polygon to get an “ocean” polygon, and (as I did previously) differenced that ocean from the electoral districts to get the clipped results. Phew.
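For anyone who wants to follow along, here is a rough sketch of that chain as PostGIS SQL driven from Python. It is a reconstruction under assumptions, not the exact queries used: the table names (watersheds, outline, districts and the intermediate tables), the id column, and the buffer and simplify distances are all placeholders, and the distances assume a projection in metres.

```python
# Sketch only: every table name, the "id" column, and the buffer/simplify
# distances below are hypothetical placeholders.
STEPS = [
    # 1. Dissolve the watersheds into one outline, split it into single
    #    polygons, and rebuild each from its exterior ring only. That
    #    drops the interior holes (lakes, wide rivers).
    """
    CREATE TABLE land AS
    SELECT ST_Union(ST_MakePolygon(ST_ExteriorRing(parts.geom))) AS geom
    FROM (
        SELECT (ST_Dump(ST_Union(geom))).geom AS geom
        FROM watersheds
    ) AS parts
    """,
    # 2. Buffer, then simplify, to get a lighter but still attractive shape.
    """
    CREATE TABLE land_simple AS
    SELECT ST_Simplify(ST_Buffer(geom, 500), 250) AS geom
    FROM land
    """,
    # 3. Ocean = big outline polygon minus the land.
    """
    CREATE TABLE ocean AS
    SELECT ST_Difference(o.geom, l.geom) AS geom
    FROM outline AS o CROSS JOIN land_simple AS l
    """,
    # 4. Clipped districts = electoral districts minus the ocean.
    """
    CREATE TABLE districts_clipped AS
    SELECT d.id, ST_Difference(d.geom, o.geom) AS geom
    FROM districts AS d CROSS JOIN ocean AS o
    """,
]


def clip_districts(conn):
    """Run the four steps in order on a DB-API connection, then commit."""
    cur = conn.cursor()
    try:
        for statement in STEPS:
            cur.execute(statement)
        conn.commit()
    finally:
        cur.close()
```

Any DB-API connection to a PostGIS-enabled database should work here (one opened with psycopg2, for example). The intermediate tables are kept so each step can be eyeballed in a viewer before moving on.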
And then I complained on the Twitter about the webstacle that now exists for anyone like me who wants to access those old 1:50K GIS files.
And the OpenData folks in BC, to their credit, wonder what I’m on about.
So, first of all, caveats:
- The obstacles to access to this data were constructed years before open data existed as an explicit government initiative in BC. This is not a problem with the work the open data folks have done.
- It could certainly be a whole lot harder to access, it is still theoretically available for download, I don’t need to file an FOI or go to court or anything like that to get this data.
This is a story of contrasts and “progress”.
Back when I actually downloaded these GIS files, in the early 2000s, I was able to access the whole dataset like this (the fact that I can still type out the process from memory should be indicative of how useful I found the old regime):
ftp ftp.env.gov.bc.ca
cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz
Here’s how it works now.
I don’t know where this data is anymore, so I go to data.gov.bc.ca. This is an improvement, I don’t have to (a) troll through Ministry sites first trying to figure out which one holds the data or (b) not troll through anything because I have no idea the data exists.
Due to the magic of inflexible design standards, the data.gov.bc.ca site has two search boxes, one that does what I want (the smaller one, below), and one that just does a Google search of all the gov.bc.ca sites (the larger one, at the top). Ask me how I figured that out.
So, I type “watersheds” into the search box and get 10 results. Here I have to lean on my domain knowledge and go to #10, which is the old 3rd order watersheds layer.
The dataset record is pretty good, my only complaint would be that unlike the old FTP filesystem there’s no obvious indication that there are other related data sets that together form a collection of related data, the watershed atlas. The keywords field gets towards that intent, but a breadcrumb trail or something else might be clearer. I think the idea of a data collection made of parts is common to a lot of data domains, and might help people organically discover things more easily.
Anyhow, here’s where things get “fun”, because here we leave the domain of open data and enter the domain of the “GIS data warehouse”. I click on the “SHP” download link:
The difference between hosting data on an FTP site and hosting it in a big ArcSDE warehouse is that the former has very few moving parts, is really simple, and practically never goes down, while the latter is the opposite of that.
Let’s just skip the convenient direct open data link, and try to download the data directly from the warehouse. Go to the warehouse distribution service entry page:
I like the ad for Internet Explorer, that’s super stuff. It’s almost like these pages are put up and never looked at again. We’ll enter as a guest.
Two search boxes again, but at least this time the one we’re supposed to use is the big one. Thanks to our trip through the data.gov.bc.ca system, we know that typing “WSA” is the thing most likely to get us the “watershed atlas”.
Boohyah, it’s even the top set of entries. Let’s compare the metadata for fun (click on the “Info” (i)).
Pretty voluminous, and there’s a tempting purple download button up there… hey, this one works!
Hm, it wants my email address and for me to assent to a license… I wonder what the license is?
Why make people explicitly assent to a license that is only implicitly defined? Fun. Ok, fine, have my email address, and I assent to something or other. I Submit (in both senses)!
And now I wait for my e-mail to arrive…
Hey presto, it’s alive! (Sunday 11:27AM) But no data yet.
W00t! Data is ready! (Sunday 11:30AM)
Uh, oh, something is wrong here. My browser perhaps? Let’s try wget.
wget ftp://slkftp.env.gov.bc.ca/outgoing/apps/lrdw/dwds/LRDW-1235441-Public.zip
--2012-12-02 11:33:10-- ftp://slkftp.env.gov.bc.ca/outgoing/apps/lrdw/dwds/LRDW-1235441-Public.zip
=> ‘LRDW-1235441-Public.zip’
Resolving slkftp.env.gov.bc.ca (slkftp.env.gov.bc.ca)... 22.214.171.124
Connecting to slkftp.env.gov.bc.ca (slkftp.env.gov.bc.ca)|126.96.36.199|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /outgoing/apps/lrdw/dwds ...
No such directory ‘outgoing/apps/lrdw/dwds’.
This is awesome. OK, back to the FTP client!
Connected to slkftp.env.gov.bc.ca.
220-Microsoft FTP Service
220 This server slkftp.env.gov.bc.ca is for British Columbia Government business use only.
500 'AUTH GSSAPI': command not understood
Name (slkftp.env.gov.bc.ca:pramsey): ftp
331 Anonymous access allowed, send identity (e-mail name) as password.
Password:
230 Anonymous user logged in.
Remote system type is Windows_NT.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x 1 owner group 0 Jul 23 2010 MidaFTP
dr-xr-xr-x 1 owner group 0 Aug 25 2010 outgoing
226 Transfer complete.
ftp> cd outgoing
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
dr-xr-xr-x 1 owner group 0 Jul 2 2010 apps
226 Transfer complete.
ftp> cd apps
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
d--------- 1 owner group 0 Jul 2 2010 lrdw
d--------- 1 owner group 0 Jul 2 2010 mtec
226 Transfer complete.
ftp> cd lrdw
550 lrdw: Access is denied.
So, the anonymous FTP directory where the jobs are landing is not readable (by anyone). Oh, and serious demerits for running an FTP server on Windows (NT!).
The whole data warehouse/data distribution thing substantially pre-dates open data, and actually one of the reasons it (a) exists and (b) is so f***ing terrible is because at the time it was conceived and designed BC was still trying to sell GIS data, so the distribution system has crazy layers of security and differentiation between free and non-free data (even though it still forces you to go through “checkout” for free data (which all data now is)).
My request was for only 50MB of data, and the system is (theoretically) willing to give it to me in one chunk. If I had wanted to access all of TRIM (the 1:20K BC planimetric base map product) I would be, as the French say, “up sh** creek”.
The current process is also, clearly, not amenable to automation. If I wanted to regularly download a volatile data set, I would also be, as in the German proverb, FUBAR.
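For contrast, the old anonymous FTP layout was trivially scriptable. Here is a minimal sketch of the four-command session from earlier in this post, using Python’s standard ftplib. It assumes the old host and directory were still online (they may well not be), and the function names and the local destination folder are made up for illustration.

```python
from fnmatch import fnmatchcase
from ftplib import FTP
from pathlib import Path


def select_files(names, pattern="*.gz"):
    """Return the names matching the pattern, sorted (the 'mget *.gz' part)."""
    return sorted(n for n in names if fnmatchcase(n, pattern))


def mirror(host, remote_dir, dest, pattern="*.gz"):
    """Download every matching file in remote_dir into the local dest folder."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()          # no arguments means anonymous login
        ftp.cwd(remote_dir)
        for name in select_files(ftp.nlst(), pattern):
            with open(dest / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)


if __name__ == "__main__":
    # Host and path are the ones from the old session; they are assumed,
    # not verified, to still exist. "wsd3" is a hypothetical local folder.
    mirror("ftp.env.gov.bc.ca", "/dist/arcwhse/watersheds/wsd3", "wsd3")
```

Drop that in a cron job and a volatile data set stays fresh without anyone clicking through a checkout page. Nothing comparable is possible when each download needs a web form, an email, and a one-off job directory.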
So, there you go, open data folks. I am fully cognizant that the problem is 100% Not Of Your Design or Doing, I watched it happen in real time (and even won a contract to maintain the system after it was built! gah!). But it is also, still, now many years on, a Problem.
Remember, I originally got the data like this.
ftp ftp.env.gov.bc.ca
cd /dist/arcwhse/watersheds/
cd wsd3
mget *.gz
It ain’t Rocket Science, we’ve just made it seem like it is.