PostGIS and the Public Sector

The EU’s Open Source Observatory and Repository reports on the ESLAP conference in Portugal. Looks like some big public agencies are moving to PostGIS as their spatial data store. We already knew about France’s Institut Géographique National, but to that add Portugal’s Instituto Geográphico Português.

Data is a Long Lived Investment

Technology is not.

Double-plus super-terrific hands-on-the-ceiling support to Sean Gorman’s take on the various self-serving NSDI proposals circulating the internet.

Someone posted this link to the story of the rescue of the Canada Land Inventory recently, and I really enjoyed it. Note that, a decade after the program wound down, the technology value had declined to zero. Less than zero, actually, since it was now an impediment to accessing the data. Meanwhile, the data, some of it 25 years old, had retained its value.

Here in my neighborhood, British Columbia has no unified parcel inventory / ownership inventory. The ownership database is separate from the parcel information (which is incomplete and out of date). Bits and pieces, hither and yon. Modernization talk always tends to focus on the technology, but the real project is a big, expensive, labor intensive slog through the data, to apply consistency rules to the ownership database, and complete the parcel data. Make sure all the work is done here in BC, and it’s a 100% stimulus that, like building a bridge, will create an asset with a multi-generational life-span.

(Speaking of multi-generational, the Patullo Bridge in Vancouver recently caught fire and a section burned to the ground. It’s a four-lane free-way bridge, how did this happen? Turns out, it’s really old, older than most people think, and one section was built of wood! In 1936. In the Depression. Call your representatives, tell them you don’t want tax cuts, you want bridges, you want basic scientific knowledge, you want railways, you want power lines, you want dams, things that will still be around when your grandchildren have grandchildren.)

(Much) Faster Unions in PostGIS 1.4

Originally posted at GeoSpeil.

I have had a very geeky week, working on bringing the “cascaded union” functionality to PostGIS.

By way of background, about a year ago, a PostGIS user brought a question up. He had about 30K polygons he was unioning and the process was taking hours. ArcMap could to it in 20 minutes, what was up? First of all, he had an unbelievably degenerate data set, which really did a great job of exposing the inefficiency of the PostGIS union aggregate. Second, the union aggregate really was inefficient.

This is what his data looked like, before and after union.

The old PostGIS ST_Union() aggregate just naively built the final result from the input table: union rows 1 and 2 together, then add row 3, then row 4, etc. As a result, each new row generally made the interim polygon more complex — more vertices, more parts. In contrast, the “cascaded union” approach first structures the data set into an STR-Tree, then unions the tree from the bottom up. As a result, adjacent bits are merged together progressively, so each stage of the union does the minimum amount of work, and creates an interim result simpler than the input components.

Implementing this new functionality in PostGIS required a few steps: first, the algorithm had to be ported from JTS in Java to the GEOS C++ computational geometry library; second, the C++ algorithm in GEOS had to be exposed in the public GEOS C API; third, PostGIS functions to call the new GEOS function had to be added.

The difference on the test data set from our user was stark. My first cut brought the execution time in PostGIS from 3.5 hours to 4.5 minutes for the sample data set. That was excellent! But, we knew that the JTS implementation could carry out the same union on the same data in a matter of seconds. Where was the extra 4 minutes going in PostGIS? Some profiling turned up the answer.

Before you can run the cascaded union process, you need to aggregate all the data in memory, so that a tree can be built on it. The PostGIS aggregation was being done using ST_Accum() to build an array of geometry[], then handing that to the union operation. But the ST_Accum() aggregation was incredibly inefficient! Four minutes of overhead isn’t a bit deal when your union is taking hours, but now that it was taking seconds, the overhead was swamping the processing.

Running a profiler found the problem immediately. The ST_Accum() aggregate built the geometry[] array in memory, repeatedly memcpy()‘ing each interim array. So the array was being copied thousands of times. Fortunately, the upcoming version of PostgreSQL (8.4) had a new array_agg() function, which used a much more efficient approach to array building. I took that code and ported it into PostGIS, for use in all versions of PostgreSQL. That reduced the aggregation overhead to a few seconds.

Final result, the sample union now takes 26 seconds! A big improvement on the original 3.5 hour time.

Here’s a less contrived result, the 3141 counties in the United States. Using the old ST_Union(), the union takes 42 seconds. Using the new ST_Union() (coming in PostGIS 1.4.0) the union takes 3.7 seconds.

C Code Sprint, Toronto, March 7-10

As I mentioned earlier there is going to be a code sprint in Toronto from March 7-10, with a particular focus on C/C++ based open source geospatial software. The attendance list is now a veritable who’s who of Mapserver, PostGIS, GEOS, proj4 and GDAL developers. Of particular excitement to me, two members of the PostGIS team who I have not met – Mark Cave-Ayland and Regina Obe – are planning to attend.

Thanks to our event sponsors: Rich Greenwood, OSGIS.nl, Coordinate Solutions, LizardTech, and SJ Geophysics. If you or your organization would like to be an event sponsor too, please contact me directly!

"Responsive" Applications

One of the first development projects I’m doing with OpenGeo is putting together a GUI for data loading/dumping in PostGIS. The command-line tools have served the project well for a long time, but with such good GUI tools like PgAdmin available for all the other interaction with the database, it is a shame to have to head to the command-line to load a shape-file.

[PostGIS Loader GUI](http://www.flickr.com/photos/92995391@N00/3176884653/" title="PostGIS Loader GUI by pwramsey3, on Flickr)
Shape File Loader GUI

I’ve had a grand time learning GTK+ to build the GUI window, got it all laid out, and then spent some time re-working the output portions of the existing loader code so I could write into a database handle rather than to STDOUT.

I got everything hooked up, ran small small test cases (victory!) then got a big file, and hit the “Import” button and … everything froze for 20 seconds, then whoooosh all the logging information updated. What gives? Oh, right, I took control away from the GUI thread for 20 seconds while I was loading the data, then handed it back.

Mono uses GTK for their Linux widgets, and they have some good advice on how to build responsive applications – applications that stay usable and informative, even while doing computationally demanding tasks.

I thought I already knew the answer – I was going to have to put my loading process into its own thread and figure out all the joys of thread programming. I still may do that, but in the meantime I’ve implemented one of the other options the Mono project suggested. There is an “idle” event in GTK, which is thrown whenever the application has “time on its hands”. By breaking a large process into small chunks, and executing them during idle moments, the application remains responsive, and the work still gets done.

Is the application now “responsive”? I would argue, not really, it just seems responsive. But in this modern world, the appearance of a thing is usually just as good as the thing itself.

The code now looks a bit uglier though… instead of simply running the translation, the GUI moves through little stages: create the tables, load the first 100 features, the second, etc, finish the process. However, it’s now possible to stop an import mid-stream, and the logging information hits the window in real time.