Counting Squares

One of the last projects I had a substantial hand in formulating and designing while at Refractions was a system for providing provincial-level environmental summaries, using the best available high-resolution data. The goal was to be able to answer questions like:

  • What is the volume of old-growth pine in BC? By timber supply area? In caribou habitat?
  • What young forest areas are on south-facing slopes of less than 10%, within 200 meters of water?
  • What is the volume of fir in areas affected by mountain pine beetle?
  • How much bear habitat is more than 5km from a road but not in an existing protected area? Where is it?

This is all standard GIS stuff, but we wanted to make answering these questions a matter of a few mouse gestures, with no data preparation required, so that a suitably motivated environmental scientist or forester could figure out how to do the analysis with almost no training.

Getting there meant solving two problems: what kind of engine can generate fairly complicated summaries over arbitrary summary areas, and how do you make that engine maximally usable with minimal interface complexity?

The solution to the engine was double-barreled.

First, to enable arbitrary summary areas, move from vector analysis units to a province-wide raster grid. For simplicity, we chose one hectare (100m x 100m), which means about 90M cells for all the non-ocean area in the jurisdiction. Second, to enable a roll-up engine on those cells, put all the information into a database, in our case PostgreSQL. Data entering the system is pre-processed, rasterized onto the hectare grid, and then saved in a master table that has one row for each hectare. At this point, each hectare in the province has over 100 variables associated with it in the system.

An example of the 1-hectare grid
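
Once everything lives in one big table, the roll-up engine is essentially SQL aggregation. Here is a minimal sketch of the kind of query such an engine would run; the table and column names are illustrative only, not the actual Hectares BC schema:

-- Hypothetical one-row-per-hectare table; all names here are illustrative.
-- "Volume of old-growth pine, by timber supply area" becomes a GROUP BY:
SELECT timber_supply_area,
       SUM(pine_volume_m3) AS old_growth_pine_volume
  FROM hectares
 WHERE stand_age >= 140            -- old-growth threshold, for illustration
   AND leading_species = 'PINE'
 GROUP BY timber_supply_area;

Arbitrary summary areas then reduce to more conditions (or joins) against the same table, which is what makes the single flat table so convenient.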

To provide a usable interface on the engine, we took best-of-breed components everywhere we could: Google Web Toolkit as the overall UI framework; OpenLayers as the mapping component; server-side Java and Tomcat for all the application logic. The summary concept was very similar to OLAP query building, so we stole the ideas for the summary tab from the SQL Server OLAP query interface.

The final result is Hectares BC, which is one of the cooler things I have been involved in, going from a coffee shop “wouldn’t this be useful” discussion to a prototype, to a funding proposal, to the completed pilot in about 24 months.

From 0 to 65 Million in 2 Hours

I’m doing some performance benchmarking for a client this week, so getting a big, real test database is a priority. The US Census TIGER data is one of the largest uniform data sets around, so I’ve started with that.

I just loaded all the edges, 64,830,691 of them, and it took just under 2 hours! Fortunately, the 2007 data comes in shape files, and the schemas are identical for each file, so the load script is as simple as this controller:

find . -name "fe*_edges.zip" -exec ./append_edges.sh {} ';'

And this runner (append_edges.sh):

unzip $1
shp2pgsql -W WINDOWS-1252 -D -a -s 4269 `basename $1 .zip`.shp fe_edges | psql tiger
rm fe*edges.*

Note the use of the -W parameter, to ensure that the high-bit “charàctérs” are handled correctly, and the -a parameter, to append the file contents to the table.
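
After a bulk append like this, a quick sanity check and a spatial index are the natural next steps. A sketch, assuming the geometry column is the shp2pgsql default of the day, the_geom:

-- Confirm the row count, then index and analyze (the index name is my own)
SELECT count(*) FROM fe_edges;   -- should report 64,830,691
CREATE INDEX fe_edges_the_geom_idx ON fe_edges USING GIST (the_geom);
VACUUM ANALYZE fe_edges;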

GeoSilk

Who is building the GeoWeb? Must be the GeoSpider! Spinning it with GeoSilk!

I’ll be talking at GeoWeb about Mapserver, an important piece of GeoSilk, in a no-holds-barred panel with representatives from GeoServer (Justin Deoliveira) and MapGuide (Bob Bray). Yes, we’re in a bit of an open source ghetto – it would have been better to get all the functionally similar products in one room, and only let the victor out alive.

Mapserver! Strong like ox! Fast like cheetah!

Standards for Geospatial REST?

One of the things that has been powerful about the open source geospatial community has been the care with which open standards have been implemented. Frequently the best early implementations of the OGC standards have been open source ones. It would be nice if the open source community could start providing REST APIs to the great core rendering and data access services that underlie products like Mapserver, Geoserver, and so on.

However, it’s bad enough that Google, Microsoft, and ESRI all invent their own addressing schemes for data and services. Do we want the open source community to create another forest of different schemes, one for each project? Also, inventing a scheme takes time and lots of email wanking in any community with more than a handful of members. Replicating that effort for each community will waste a lot of time.

No matter how much we (me) might bitch about OGC standards, they have a huge advantage over DIY, in that they provide a source of truth. You want an answer to how that works? Check the standard.

There has been enough test-bedding of REST, and there are some nice working implementations; perhaps it is time to document some of that knowledge and take it into the playing field of the OGC, to create a source of truth that can guide the large community of implementors.

A lot of the work is done, in that most of the potential representations have been documented: GML, KML, GeoRSS, GeoJSON, WKT. Perhaps there is little more to be done than writing up how to apply APP to feature management.

For Sean

In talking about mass-scaling geospatial web services, Sean Gillies expressed that he wasn’t sure Mapserver (for example) was up to the task, or, alternatively, that it really provided anything that couldn’t be provided with ArcServer (he’s not a fanboi, but he does like playing devil’s advocate).

For mass-scaling a system on Mapserver, I’d have Mapserver generating maps out to a tile cache, so really, the number of Mapservers I would need would be limited by the number of fresh tiles I was generating at any given time. Sean correctly pointed out that in that architecture, ArcServer wouldn’t be nearly as punitive, from a licensing point of view, as I was painting it, since it too could be deployed in more limited numbers while the tile cache handled most of the load.

I just thought of an interesting use case which is totally incompatible with a proprietary per-CPU licensing scheme: suppose I want to generate a complete tile cache for the USA. Do a TIGER street map of all of the USA, for example? No problem, I fire up 1000 EC2 instances and let ‘em rip! With Mapserver, that’s not a problem. With ArcServer, or any other per-CPU licensed product, it’s just not on.

The whole area of virtual rentable infrastructure is a place where old-fashioned proprietary licensing has a problem.