How to get Your bug fixed

Mike Leahy is providing a textbook demonstration of bug-dogging on the PostGIS users list this week, and anyone interested in learning how to interact with a development community to get something done would do well to study it.

Some key points:

He does as much of the work in diagnosing the problem as possible, including combing through Google for references, cutting down the test data as small as possible, trying to find smaller cases that exercise the issue.

He responds very quickly (you’ll have to read the timestamps) to questions and suggestions for gathering more information. Since the problem appears initially to manifest only on his machine, any delay on his part risks disengaging the folks helping him.

He prepares a sample database and query to allow the development team to easily replicate the situation on their machines.

And he also gets lucky. The problem is replicable, and the discussion catches the attention of Greg Stark, who recalls and digs up some changes Tom Lane made to PostgreSQL which in turn leads me to find the one-argument-change that can remove the problem. Very lucky, really, it’s unlikely I would have been able to debug it by brute force.


I’m really looking forward to seeing the “NoSQL” buzzword head over the top of the hype-cycle and start heading downwards, since I think it’s really doing damage in what should be a fairly straightforward discussion of matching customer use cases to appropriate technology.

NoNoSQLBecause the term is framed (“No!”) in opposition to the almost the entire family of existing data persistence technology, anyone who comes to the discussion fresh assumes there’s a replacement process going on, wherein NoSQL stands in opposition to SQL.

That’s a shame, because the term is only one letter away from a (slightly) less polarizing buzzword: NotSQL. Even then, though, the core discussion would be lost, because the big difference isn’t programmatic API versus 4GL. The big difference is use case matching, in particular the high-volume, high-availability use case which has emerged in the age of consumer web services.

People with public-facing web applications face a potentially unconstrained read/write load (in their happiest dreams) and the techniques necessary to scale a traditional RDBMS to match that load proceed from the straightforward at the low end to the increasingly byzantine at the high end.

The scaling story for traditional RDBMS technology just is not great: start by adding servers and extra technology to hook them together; get increasingly smart people to handle your increasingly complex infrastructure; finally, start hacking at your data model to allow even further partitioning and duplication.

The new breed of databases have a great scaling story: once you get set up, scaling requires plugging in new nodes and turning them on. That’s it. No model changes, no extra replication and high availability technology.

There’s no free lunch though. In exchange for the high-throughput/high-availability you lose the expressiveness and power of SQL. Henceforth you will write your joins and summaries yourself, at the application level. Henceforth performing an ad hoc query may require you to build and populate a whole new “table” (the terminology is highly variable at this point) in your model. And of course this technology is all pretty fresh meat, so just learning enough to get started can be a bit of a slog – kids aren’t exactly coming out of school with a course in this stuff under their belt.

So, it’s a whole new world, and if you are planning on serving an application where the numbers (hits, pages, requests, whatever) are heading into the 7-digits, it might be good idea to start with this technology (can you tell I can’t stand to use the “NoSQL” term? It is just too awful, there really needs to be a non-pejorative term for this application category).

Of course, there’s nothing new under the sun. Getting the best performance for a specialized use case requires (a) modifying your data model and (b) using technology that can leverage your specialized data model. This is exactly what OLAP databases have been doing for a generation to provide data analysis on multi-billion record historical databases (special data model, special technology).

Database guru Michael Stonebraker wrote a nice article about the brave new world of databases, called “One Size Fits All: An Idea Whose Time has Come and Gone” in 2005, and the conclusion is that we are going to see increasing fragmentation of database technology based on use case. “NoSQL” (shudder) is just the latest iteration in this process.

Meanwhile, I’ll put my oar in for the general purpose database: it’s easy to run OLAP queries on a general purpose database, you just can’t do it once your table size gets over a billion; it’s easy to run a public web site on a general purpose database, you just can’t do it once your load gets over a million. On the other hand it’s well nigh impossible to run even a small web site on an OLAP database and pretty darn hard to build even a small OLAP system on a NoSQL foundation.

Horses for courses folks, horses for courses.

For more of this kind of geekery, see the several articles linked off of “The Case for the Bit Bucket” at the Oracle Nerd blog.

Where 2.0 Drinking Game

It’s that time of year again, and I’ll be sitting in the audience with my flask (I hope you will too!) playing the Where 2.0 drinking game. Here’s some of my phrases, what are yours?

  • … find a Starbucks…
  • … we’re releasing an API …
  • … friends list …

Also, take a big slug if someone talks about working with geospatial data more complex than a lat/lon point!

Send us Jeremy and Keyur!

ESRI (pretty new web page, by the way) has put their open source position on-line and also produced a short podcast with Victoria Kouyoumjian on the same topic.

One thing that struck me in the podcast was when Victoria noted that ESRI has sponsored open source events in the past (most notably FOSS4G 2007 directly, but also 2008 and 2009 to a lesser extent through 50°North). She says,

These events allow us the opportunity to engage in conversations and dialogs with various technologists because we want to gain feedback about the needs of open source developers and users. The objective of course is to channel this information back to development so we can reflect this in future products and business decisions in order to best support our customers.

So far ESRI attendance has been at the managerial level, and while I love those guys (hugs to Satish and Victoria!) some real sparks could fly and serious interoperability improvements be made if we started seeing the developers, the project leads and software designers, at the events. We can do better than “channeling” information back to development, let’s immerse development in it!

Update: We promise to send them back. Really.

Nothing, Nada, Zip, Bupkus

There is nothing new under the sun, and I have been wrestling this week with writing out ISO-standard well-known binary from PostGIS.

The most obvious difference is that the type numbers for encoding the presence of Z- and M-dimensions are not the ones described in the old OGC extension document [OGC members only, cited by Martin Daly in 2004, and extended further for PostGIS by Sandro Santilli that year] for WKB. Instead of setting high-bits to indicate the presence of Z and M, as OGC did, the ISO spec simply adds 1000.

So, the ISO geometry number for a PolygonZ is 3 (Polygon) + 1000 = 1003.

The, old OGC geometry number for a PolygonZ is 3 (Polygon) | 0x80000000 = 2147483651.

OGC seems more complex until you note that the function WKB_HASZ(num) can be written (num & 0x80000000). While the ISO test is (num >= 1000 && num < 2000). Setting flags for binary values (has-z, has-m, has-a-piece-of-pie) is nice.

Anyhow, that change was well-known and expected. What I didn’t expect was the amount of ambiguity surrounding the definition of an empty geometry in WKB.

To review, the spatial SQL definition includes the concept of an “empty geometry”, which is an empty set of a particular geometry type. The empty geometry has more information than a simple database NULL, which is a typeless emptiness. A ‘POLYGON ZM EMPTY’ has an implied dimensionality. It makes some sense that ST_Intersection() of two disjoint polygons would return a ‘POLYGON EMPTY’.

The ISO SQL/MM well-known text specification has clear directions for writing empty geometries of all types. In fact, I’ve just written two of them above: the type name plus the ‘EMPTY’ keyword.

For well-known binary, ISO SQL/MM includes the following useless guidance:

i) Case:
i) If <point binary representation> immediately contains a <wkbpoint binary>, then <point binary representation> is the well-known binary representation for an ST_Point value that is produced by <wkbpoint binary>.
ii) Otherwise, <point binary representation> produces an empty set of type ST_Point

Representing an empty point in WKB is hard because there’s nowhere obvious to indicate the lack of ordinates. But the ISO specification makes no attempt to solve the problem, they instead provide explicit guidance that is impossible to implement. Basically, if you are reading a WKB POINT and there are doubles after the TYPE number, you have a POINT(x y). If not, you have a POINT EMPTY. All well and good, but how do you distinguish, in a collection of WKB geometries, between the presence of doubles in the byte stream and the presence of another geometry in the stream? You don’t.

The ISO guidance for empty Linestrings is even worse!

q) Case:
i) If <linestring binary representation> immediately contains <num>, then <linestring binary representation> is the well-known binary representation for an ST_LineString value. Let APA be an ST_Point ARRAY value with cardinality of <num> that contains the ST_Point values specified by the immediately contained <wkbpoint binary>s. <linestring binary representation> produces an ST_LineString value as the result of the value expression: NEW ST_LineString(APA).
ii) Otherwise, <linestring binary representation> produces an empty set of type ST_LineString.

As with the POINT case, the WKB reader is supposed to magically distinguish between an element of the current geometry (the <num>) in the byte-stream and an element of the next geometry in the byte-stream. And worse, the “clarifying” comment implicitly adds a whole new kind of empty geometry! What if the <num> is present, but the value is zero!?!

This is where the snake starts eating its tail. The way that implementations of OGC WKB have been encoding EMPTY geometries has been to provide the type number and an element count of zero. Back when PostGIS was first getting WKB support, Dave Blasby wrestled with the fact that the specification did not describe how to encode EMPTY. Mateusz Loskot recently published some information showing the WKB EMPTY implementation that Microsoft used for SQLServer. Their implementation is one of the options Dave described five years ago – there’s only so many ways to solve this problem.

If ISO didn’t like the use of a zero-valued <num> count as a way of indicating EMPTY, they had another option available, which was to follow the original OGC WKB standard and use bitmask flags on their type numbers. There could have been a bitmask for Z, a bitmask for M, and a bitmask for EMPTY. There could even have been a bitmask for SRID, fixing up a huge drawback in WKB, namely that WKB does not include a slot for the SRID, which is an important element in the geometry model.

Sidenote: As a result of WKB not having SRID support, it’s not possible to round-trip a geometry through WKB without losing the SRID value. Try this standard SQL and see what happens:

   ST_GeomFromText('POINT(0 0)', 4326) 

Then try the bastardized PostGIS EWKB format instead:

   ST_GeomFromText( 'POINT(0 0)', 4326)

As it stands now, the specification is out of synch with the implementations on the ground, which is bad news for the relevance of the specification. I will be implementing EMPTY using the same semantics as SQLServer, which will make the kinds of EMPTY PostGIS can represent slightly richer, but remain backwards compatible to the old schemes.