Lies, Damn Lies...

Green shoots…” ah, for the good old days of only two weeks ago, when green shoots were in our future…

Job Losses

I never really understood why decreases in the rate of change of unemployment were considered such great news. “Good news, the second derivative has gone positive! we’re plunging into the abyss slightly less quickly!” Only in a world of rampant, congenital optimism – or statistics-induced myopia – could four months in which 18,300 Americans lost their jobs every day be described as a period of “improving conditions”.

Working in the Cathedral

In February, at the Toronto Code Sprint, the PostGIS team looked each other in the eye (for the first time) and committed to get the 1.4 release out by late April.

Well, it’s late June now. It seems very likely that I will get to cut 1.4.0RC1 tomorrow morning.

My personal preference has always been to release early and often. In the hacker ethic, this sounds like a good thing, it’s the “bazaar” model that Eric Raymond promoted over the “cathedral” model of development. In the bazaar, you dump out regular releases, and let the community dictate whether they are of quality (“don’t use 2.31.2a, it’s garbage!”). I still remember being told by a more knowledgeable Linux user that I could upgrade to 1.1.53 (?), but not any further than that, because the succeeding releases were unstable. In the cathedral, you release no wine before its time, aiming for a polished diamond of a release.

So, 1.4.0 has taken much longer than expected, the confluence of a development team that is now unwilling to accept the existence of any “crasher” bugs at all (no matter how unlikely they are to be exercised) and a growing comprehensiveness in the test suite, which is now covering all the functions, in most every combination of inputs. Because of the enhanced testing, we discovered crashers we didn’t know we had – and then we had to fix them.

Despite chafing to release! release! release! I have come to appreciate our new conservatism. Among my favorite feedbacks on PostGIS is the users who say “it just works, install it and forget about it, rock solid”. That feels good, and to keep things that way, our new austerity is only going to help.

The maturation of PostGIS into a product you can just “install and forget” has been multi-stage.

Prior to the 1.0 release, Sandro Santilli added the first regression tests. These tests have been growing ever since and have been invaluable in ensuring that old bugs don’t re-enter the code base, and that new features don’t break old features.

For the 1.4 release, the documentation was upgraded substantially, by adding a great deal of extra structuring to the reference section. Regina Obe discovered that a side effect of the extra structure was that she could automatically generate a test for most every documented function using XSLT on the docbook XML. This new “garden test” found a number of previously undetected bugs, that have since been removed.

For the 1.4 release, I added the start of a CUnit test suite that exercises the PostGIS functions without requiring a database back-end. Even in it’s early state, it has saved me from a couple booboos already. For future releases, this extra regression suite is going to help keep things stable.

For the 1.4 release, Mark Cave-Ayland re-worked the logging and debugging infrastructure, to make the coding cleaner and easier to maintain during debugging cycles. He also split out the underlying geometry implementations, which are now used in the loader/dumper utilities, for a more consistent approach to geometry handling.

These are all under-the-covers improvements that end-users never see. But they all contribute to that “it just works, it just runs” end-user experience that I have come to treasure even more than the sensation of slamming out a point release at 2am. I hope everyone tries out RC1 so that we can slay any remaining bugs before the 1.4.0 release!

MySQL vs PostGIS

Did I say I would publish my performance results? I did. Here they are.

MySQL Snark #2

I am doing a little benchmarking as a learning experience with JMeter and I will publish the throughput numbers in a few days, after I run the full suite I have developed on the various combinations of concurrency and insert/select ratios.

Because MySQL has so few functions that actually do anything (see the note here) there’s not a great deal to test beyond raw performance. The early throughput results seem to indicate it’s comparable for simple CRUD on one table, but for anything non-trivial it falls down.

Here’s a basic spatial join: pull 23 roads from a 3.4M row line table and spatially join to a 66K row tract polygons table, calculating the sum of the areas of tract polygons found. There are spatial indexes on both tables.

mysql> select sum(area(t.geom)) 
from tiger_roads_texas r, tiger_tracts t 
where 
  mbrintersects(r.geom, GeomFromText('LINESTRING(453084 -1650742,452384 -1650442)')) 
and 
  mbrintersects(r.geom,t.geom);

+-------------------+
| sum(area(t.geom)) |
+-------------------+
|  1260394420.00453 | 
+-------------------+
1 row in set (9.43 sec)

And in PostGIS:

tiger=# select sum(area(t.geom)) 
from tiger_roads_texas r, tiger_tracts t 
where r.geom && GeomFromText('LINESTRING(453084 -1650742,452384 -1650442)',2163) 
and r.geom && t.geom;

       sum        
------------------
 1260394420.00684
(1 row)

Time: 5.574 ms</pre>

Those are both “hot cache” results, after running them a couple times each.

MySQL Snark

OK, this one I have to share. Here’s two queries, the first with a syntax error in the WKT (oops!) and the second one correct.

First, as processed by MySQL:

mysql> select count(*) from tiger_roads_texas 
  where mbrintersects(geom, 
    GeomFromText('LINESTRING(452284 -1651542, 452484 -1651342'));
+----------+
| count(*) |
+----------+
|        0 | 
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from tiger_roads_texas 
  where mbrintersects(geom, 
    GeomFromText('LINESTRING(452284 -1651542, 452484 -1651342)'));
+----------+
| count(*) |
+----------+
|        1 | 
+----------+
1 row in set (0.06 sec)

Now as processed by PostGIS:

tiger=# select count(*) from tiger_roads_texas 
where geom && 
  GeomFromText('LINESTRING(452284 -1651542, 452484 -1651342',2163);
ERROR:  parse error - invalid geometry
HINT:  "...RING(452284 -1651542, 452484 -1651342" <-- parse error at position 43 within geometry
CONTEXT:  SQL function "geomfromtext" statement 1

tiger=# select count(*) from tiger_roads_texas 
  where geom && 
    GeomFromText('LINESTRING(452284 -1651542, 452484 -1651342)',2163);
 count 
-------
     1
(1 row)

Can you spot the difference? Snark! Another one for the list.