PgPatch!

It’s not much, but it’s not nothing, my first ever patch to PostgreSQL proper is part of yesterday’s 8.4.2 release:

Fix incorrect logic for GiST index page splits, when the split depends on a non-first column of the index (Paul Ramsey)

It was just a minor syntax error I found because I was reading through that section of the code very, very, very closely (it’s true, I copy the work of smarter people than I) while implementing the indexes for GEOGRAPHY in PostGIS 1.5.

Extra, extra, PostGIS Really Fast!

Because, frankly, I love nothing more than approbation, I am going to quote this comment on “Much Faster Unions in PostGIS” in full:

This is a truly spectacular piece of work. We have often been asked by clients to buffer and merge point datasets with several million points. We attempted this using ArcWhatever (could barely open the points, let along buffer them) and FME, which ran for a week and then gave an out of memory error. So, I do the whole configure, make, make install thing, 4 times, for postgres, goes, proj4 and postgis. After a lot of swearing and running ldconfig a few million times I eventually get postgis to accept that geos really is installed – MySQL might have more limited spatial functionality, but it sure is a lot easier to build from source. Anyway, I digress. I run a few random queries using the excellent generate series capability in postgres, and manage to create, buffer and merge 100,000 points in a few seconds. Finally, I try this on a real world dataset, namely all of the postal addresses in Wales, 1.4 million or so. With a 200m buffer, this ran on a reasonably pokey 64-bit linux box in 19 minutes. Truly astonishing. Well done. Much as I love MySQL, this was a bit of St. Paul on the road to Damascus moment.

Full credit to Martin Davis, who implemented this technique in JTS. We just borrowed it for database land.

Code Sprint 2010: New York City

It’s official! Following the success of the 2009 Toronto Code Sprint a second event will be taking place in New York City, this winter.

This code sprint will bring together the developers for the MapServer, PostGIS, GDAL, GRASS, OGR, LibLAS, QGIS, Proj4 and other projects that are part of the “C Tribe” of open source software.

Timing: Saturday, February 20 to Tuesday, February 23, 9am to 4pm.

Location: OpenGeo Event Room, 148 Lafayette St, New York, New York

More Information: Wiki Page, Mailing List

Sponsor this event! Last year, sponsorship dollars paid for venue and dinners for the sprinters, who gave up their weekends and more to invest a collective 736 man hours in making their open source project stronger. Thanks again to our 2009 sponsors: Rich Greenwood, OSGIS.nl, Coordinate Solutions, LizardTech, SJ Geophysics, qPublic.net, GDAL! If you want to sponsor this event, and raise your profile with the 2 dozen people who make your software sing, please contact me.

Update: Thanks to our first 2010 sponsors, Coordinate Solutions, LizardTech, and qPublic.net! We are still looking for sponsors for 2010!

Perfect

The New York Times political blog has instant commentary on Obama’s Afghanistan speech tonight, and the one-sentence teaser in the RSS feed speaks volumes:

In an address aimed at rank-and-file Americans, Mr. Obama did not call for sacrifices.

Is "Good Enough" Good Enough?

The gift that keeps on giving, for me, is performance work. Users love it, and it tickles my cerebral cortex in a way other work just cannot touch. Some of my first substantive patches to MapServer were performance patches, and I’ve also taken that tack into PostGIS.

One of the things that surprised me while investigating the MapServer QIX index structure was how little a badly built tree impacted real world performance. The QIX file, when built on point files, tends to build in far more depth than it needs, strictly speaking. The extra depth adds theoretical time to traverse the tree for searches and yet… even when I tested quite huge shape files the performance was fine.

Recently I’ve been looking again at internally indexing geometries for high performance testing of things like distance and intersections. We already have some of this for PostGIS, using the GEOS “PreparedGeometry” construction, but I’d like a native implementation, and something that doesn’t necessarily depend on a cached implementation.

There is already a form of internally indexed testing in PostGIS, in the point-in-polygon case, and while reading it over I was struck by the elegance of the underlying assumption: rather than pre-sort the geometry edges, and then build the tree, the implementation simply scans the point array from start to finish and builds parent nodes from each successive pair of nodes. The result is not a perfect tree, by any stretch of the imagination, but… it’s good enough. Because the edges are spatially auto-correlated, the parent nodes end up having good locality.

The beauty of this approach to index building is that since there is no sorting step and no re-balancing of the tree or any of that fancy stuff, you can actually build an “index” in O(n) time. A brute force intersection test is O(nm) time. But if you can build your indexes in O(n) time the cost of doing an intersection starts to get closer to O(n+m) time! (Note, I am not a computer scientist and the O() term for the actual tree-on-tree intersection test is beyond my powers.)

Now, since no sorting is being applied in the building of these “indexes” they could theoretically be terrible indexes. But since we’re indexing GIS data, and the edges have this wonderful autocorrelation, they are actually “good enough”, and obtainable in very little time.

The same tricks apply to building indexes for intersection and distance in geodetic space, which I predict will be in hot demand once people experience just how computationally expensive operations on the new PostGIS 1.5 geography type are!