libpostal for PostgreSQL

Dealing with addresses is a common problem in information systems: people live and work in buildings which are addressed using “standard” postal systems. The trouble is, the postal address systems and the way people use them aren’t really all that standard.

Postal addressing systems are just standard enough that your average programmer can whip up a script to handle 80% of the cases correctly. Or a good programmer can handle 90%. Which just leaves all the rest of the cases. And also all the cases from countries where the programmer doesn’t live.

The classic resource on postal addressing is called Falsehoods programmers believe about addresses and includes such gems as:

An address will start with, or at least include, a building number.

Counterexample: Royal Opera House, Covent Garden, London, WC2E 9DD, United Kingdom.

No buildings are numbered zero

Counterexample: 0 Egmont Road, Middlesbrough, TS4 2HT

A street name won’t include a number

Counterexample: 8 Seven Gardens Burgh, WOODBRIDGE, IP13 6SU (pointed out by Raphael Mankin)

Most solutions to address parsing and normalization have used rules, hand-coded by programmers. These solutions can take years to write, as special cases are handled as they are uncovered, and are generally restricted in the language/country domains they cover.

There’s now an open source, empirical approach to address parsing and normalization: libpostal.

Libpostal is built using machine learning techniques on top of OpenStreetMap input data to produce parsed and normalized addresses from arbitrary input strings. It has bindings for lots of languages: Perl, PHP, Python, Ruby and more.

And now, it also has a binding for PostgreSQL: pgsql-postal.

You can do the same things with the PostgreSQL binding as you can with the other languages: convert raw strings into normalized or parsed addresses. The normalization function returns an array of possible normalized forms:

SELECT unnest(
  postal_normalize('412 first ave, victoria, bc')
  );
                  unnest                  
------------------------------------------
 412 1st avenue victoria british columbia
 412 1st avenue victoria bc
(2 rows)

The parsing function returns a jsonb object holding the various parse components:

SELECT postal_parse('412 first ave, victoria, bc');
                                  postal_parse
----------------------------------------------------------------------------------
 {"city": "victoria", "road": "first ave", "state": "bc", "house_number": "412"}
(1 row)

The core library is very fast once it has been initialized, and the binding has been shown to be acceptably fast, despite some unfortunate implementation tradeoffs.
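For bulk work, the functions drop straight into ordinary SQL. Here is a minimal sketch, assuming the extension installs under the name postal, and assuming a hypothetical addresses table with a raw_addr text column (both names are made up for illustration):

-- Enable the binding (once per database).
CREATE EXTENSION postal;

-- Parse a whole table of raw address strings and pull individual
-- components out of the jsonb result with the ->> operator.
SELECT raw_addr,
       postal_parse(raw_addr) ->> 'house_number' AS house_number,
       postal_parse(raw_addr) ->> 'road'         AS road,
       postal_parse(raw_addr) ->> 'city'         AS city
  FROM addresses;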

Thanks to Darrell Fuhriman for motivating this work!

Enter the Panopticon

In 1790, Jeremy Bentham published his plans for the “panopticon”, his design for a new kind of prison that would leverage surveillance to ensure well-behaved prisoners. The panopticon was laid out so that all cells were visible from a central monitoring station, from which a hidden guard could easily watch any inmate, without being observed himself.

Bentham expected the panopticon approach to improve inmate behaviour at low cost: inmates would obey the rules because they could never be certain whether the observer was watching them or not.

The developed world is rapidly turning into a digital panopticon.

Your trail through digital media is 100% tracked.

  • Every web site you visit is traceable, via ad cookies or “like” badges, or Google Analytics.
  • Every web search is tracked via cookies or even your own Google login info.

In some respects this tracking is still “opt in” since it is possible, if undesirable, to opt out of digital culture. Drop your email, eschew the web, leave behind your smart phone.

But your trail in the physical world is increasingly being tracked too.

  • If you carry a cell phone, your location is known to within one mobile “cell”, as long as the device is powered.
  • If you use a credit or debit card, or an ATM, you are localised to a particular point of sale when you make a purchase.
  • If you use a car “safety” system like OnStar, your location is known while you drive.

Again, these are active signals, and you could opt out. No cell phone, cash only, no vehicles after 1995.

But as I discussed this year in my talk about the future of geo, we are rapidly moving beyond the point of “opt out”.

  • Within our lifetimes, most urban areas will be under continuous video surveillance, and more importantly,
  • within our lifetimes, the computational power and algorithms to make sense of all those video feeds in real time will be available.

We take for granted that we have some moments of privacy. When I leave the house and walk to pick up my son at school, for 15 minutes, nobody knows where I am. Not for much longer. All too soon, it will be possible for someone in a data center to say “show me Paul” and get a live picture of me, wherever I may be. A camera will see me, and a computer will identify me in the video stream: there he is.

Speculative fiction is a wonderful thing, and there are a couple of books I read in the last year that are worth picking up for anyone interested in what life in the panopticon might be like.

  • Rainbows End by Vernor Vinge (2006) is an exploration of the upcoming world of augmented reality and connectivity. In many ways Vinge leaps right over the period of privacy loss: his characters have already come to terms with a world of continuous visibility.
  • The Circle by Dave Eggers (2013) jumps into a world right on the cusp of the transition from our current “opt-in” world of partial privacy to one of total transparency, of life in the panopticon.

Both books are good reads, and tightly written, though in the tradition of science fiction the characterizations tend to be a little flat.

The Guardian also has a recent (2015) take on the digital panopticon:

In many ways, the watchtower at the heart of the panopticon is a precursor to the cameras fastened to our buildings – purposely visible machines with human eyes hidden from view.

Once you come to terms with the idea that, at any time, you could be surveilled, the next question is: does that knowledge alter your behaviour? Are you ready for the panopticon? I’m not sure I am.

Paris Code Sprint, PostGIS Recap

At the best of times, I find it hard to generate a lot of sympathy for my work-from-home lifestyle as an international coder-of-mystery. However, the last few weeks have been especially difficult, as I try to explain my week-long business trip to Paris, France to participate in an annual OSGeo Code Sprint.

Yes, really, I “had” to go to Paris for my work. Please, stop sobbing. Oh, that was light jealous retching? Sorry about that.

Anyhow, my (lovely, wonderful, superterrific) employer, CartoDB, was an event sponsor, and sent me and my co-worker Paul Norman to the event, which we attended with about 40 other hackers on PDAL, GDAL, PostGIS, MapServer, QGIS, Proj4, PgPointCloud etc.

Paul Norman got set up to do PostGIS development and crunched through a number of feature enhancements. The feature enhancement ideas were courtesy of Remi Cura, who brought in some great power-user ideas for making the functions more useful. As developers, it is frequently hard to distinguish between features that are interesting to us and features that are useful to others, so having feedback from folks like Remi is invaluable.

The Oslandia team was there in force, naturally, as they were the organizers. Because they work a lot in the 3D/CGAL space, they were interested in making CGAL faster, which meant they were interested in some “expanded object header” experiments I did last month. Basically the EOH code allows you to return an unserialized reference to a geometry on return from a function, instead of a flat serialization, so that calls that look like ST_Function(ST_Function(ST_Function())) don’t end up with a chain of three serialize/deserialize steps in them. When the deserialize step is expensive (as it is for their 3D objects) the benefit of this approach is actually measurable. For most other cases it’s not.
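To make the nested-call pattern concrete, here is the kind of expression the optimization targets (table and column names are purely illustrative). With the current flat representation, each inner result gets serialized on return and immediately deserialized again by the next call out:

SELECT ST_Buffer(
         ST_Transform(
           ST_Centroid(geom),  -- centroid serialized, then deserialized by ST_Transform
           3857),              -- transformed result serialized, then deserialized by ST_Buffer
         100)                  -- final buffer serialized once for output
  FROM parcels;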

(The exception is in things like mutators, called from within PL/PgSQL, for example doing array appends or insertions in a tight loop. Tom Lane wrote up this enhancement of PgSQL with examples for array manipulation and did find big improvements for that narrow use case. So we could make things like ST_SetPoint() called within PL/PgSQL much faster with this approach, but for most other operations the overhead of allocating our objects probably isn’t high enough to make it worthwhile.)
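A sketch of that narrow case, not code from the sprint, just an illustration: a PL/PgSQL function that mutates the same geometry over and over with ST_SetPoint(), which under the flat representation re-serializes the whole line on every loop iteration:

-- Nudge every vertex of a line by a small random amount.
CREATE OR REPLACE FUNCTION jitter_line(line geometry, amount float8)
RETURNS geometry AS $$
DECLARE
  i integer;
  npoints integer := ST_NPoints(line);
BEGIN
  FOR i IN 0 .. npoints - 1 LOOP
    -- ST_PointN is 1-based, ST_SetPoint is 0-based
    line := ST_SetPoint(line, i,
              ST_Translate(ST_PointN(line, i + 1),
                           (random() - 0.5) * amount,
                           (random() - 0.5) * amount));
  END LOOP;
  RETURN line;
END;
$$ LANGUAGE plpgsql;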

There was also a team from Dalibo and 2nd Quadrant. They worked on a binding for geometry to the BRIN indexes (9.5+). I was pretty sceptical, since BRIN indexes require useful ordering, and spatial data is not necessarily well ordered, unlike something like time, for example. However, they got a prototype working, and showed the usual good BRIN properties: indexes were extremely small and extremely cheap to build. For narrow range queries, they were about 5x slower than GiST r-tree; however, the differences were on the order of 25ms vs 5ms, so not completely unusable. They managed this result with presorted data, and with some data in its “natural” order, which worked because the “natural” order of GIS data is often fairly spatially autocorrelated.
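The user-facing side is pleasantly boring. A hedged sketch, assuming the CREATE INDEX syntax that eventually shipped with the geometry BRIN support (the table name is illustrative):

-- Conventional GiST r-tree index on a geometry column.
CREATE INDEX parcels_geom_gist ON parcels USING GIST (geom);

-- BRIN equivalent: far smaller and far cheaper to build, coarser at query time.
CREATE INDEX parcels_geom_brin ON parcels USING BRIN (geom);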

I personally thought I would work on merging the backlog of GitHub pull requests that have built up on the PostGIS git mirror, and did manage to merge several, both new ones from Remi’s group and some old ones. I merged in my ST_ClusterKMeans() clustering function, and Dan Baston merged in his ST_ClusterDBSCAN() one, so PostGIS 2.3 will have a couple of new clustering implementations.
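Both clustering functions are window functions, so usage looks roughly like this sketch (table and column names are illustrative):

-- Assign each point to one of 5 k-means clusters, and to a DBSCAN cluster
-- built from neighbours within 100 units, minimum 5 points per cluster.
SELECT id,
       ST_ClusterKMeans(geom, 5) OVER () AS kmeans_cluster,
       ST_ClusterDBSCAN(geom, 100, 5) OVER () AS dbscan_cluster
  FROM stations;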

However, in the end I spent probably 70% of my time on a blocker in 2.2, which was related to upgrade. Because the bug manifests during upgrade, when there are two copies of the postgis.so library floating in memory, and because it only showed up on particular Ubuntu platforms, it was hard to debug: but in the end we found the problem and put in a fix, so we are once again able to do upgrades on all platforms.

The other projects also got lots done, and there are more write-ups at the event feedback page. Thanks to Olivier Courtin from Oslandia for taking on the heavy weight of organizing such an amazing event!

The Future and All That

I gave this talk in December, at the CartoDB 2015 partners conference, at the galactic headquarters in glamorous Bushwick, Brooklyn. A bit of a late posting, but hopefully I can still sneak under the “new year predictions bar”.

What's up with Mr. Loukidelis?

Ever feel like people are talking about you behind your back? Usually it’s just perfectly normal paranoia. But sometimes, they actually are. Maybe.

Backgrounder for those from abroad: Our provincial government was recently caught destroying public records by an Officer of the Legislature, who produced a detailed report with a dozen recommendations on how to stop breaking the law so much. But rather than simply implementing the recommendations, the Premier instead appointed her own smart important guy, David Loukidelis, to go over those recommendations and produce yet another set of This Time It’s For Real recommendations for her to take Very, Very Seriously. Mr. Loukidelis produced his recommendations on Wednesday, and the government said it would “accept them all” (for certain definitions of the words “all” and “accept”).

Anyways, I wasn’t even through reading the introduction to the Loukidelis report on the Denham report on government information access policy when I hit this line:

“Nonetheless, some observers have suggested in the wake of the investigation report that all emails should be kept.”

As far as I know, I’ve been the only “observer” to suggest that government emails should be archived and retained more-or-less in their entirety, as we expect Canadian financial institutions to do, and as the US government expects all public corporations to do. So I took this as a little bit of a throw down.

David Loukidelis wants to get it on! Is it on? Oh yes, it’s on, baby!

(This would be a good moment to go do something a lot more engaging, like picking lint out of your toes, or feeling that sensitive place at the back of your second left molar. I’m about to take apart Recommendation #2 of a 70 page report that, despite costing $50,000, is about as interesting as the last 70 pages of the phone book.)

Chapter 1: It’s too big!

After calling out us “observers”, Loukidelis then proceeds to lay out his Luddite credentials in full, first by calculating the number of pages represented by the 43 terabytes of annual government emails:

“Using the above averages of emails received and sent, each year there would be roughly 426,000,000 pages of received emails and some 129,000,000 pages of sent emails, for a total of roughly 555,000,000 pages of emails. No one would suggest that all emails should be printed, but this gives a sense of the order-of-magnitude implications of the suggestions that, contrary to prudent information management principles, all emails should be kept, or should be vetted by others for retention. The same would be true even if these estimates were reduced by one or even two orders of magnitude, to 55,000,000 pages or 5,500,000 pages.”

Staggering! Shocking! Half a billion! I’m surprised he didn’t express it in terms of football fields to help the folks at home grasp the staggering immensity. (Because you need to know: 500M pages stack to about 700 football fields high.)

Let’s recast this problem in more computer-centric terms:

  • The government produces/receives 43TB of email per year.
  • A 4TB hard-drive can be purchased for between $200 and $400.
  • So depending on the amount of redundancy you want, and the quality of hard-drive you purchase, it’s possible to store the entire year’s worth of government email data on between $8,600 and $50,000 worth of hardware. Or, to put it in terms Mr. Loukidelis might understand, for about the cost of one overly wordy report.

Now I’m not suggesting the OCIO buy a dozen 4TB drives and stick a server in the closet, but the numbers above should reassure us that storing 43TB of email per year is not exactly at the far reaches of today’s computing capabilities. There are companies that provide cloud-based email archiving services, particularly for organizations with privacy issues and sensitive data (financial companies). In fact, one of the leaders in the field is headquartered right here in BC. I asked them if they could handle the government’s data volume.

So, we have the technology, we just lack the will.

Chapter 2: It’s not searchable!

Unfortunately, Mr. Loukidelis doesn’t stop trying to explain technology to the unwashed with his “pages of paper” analogy. He’s got yet more reasoning by analogy to share.

“At all costs, the provincial government should not entertain any notion that all electronic records must, regardless of their value, be retained. … To suggest, as some have, that all information should be kept is akin to suggesting it is good household management for homeowners to never throw away rotten food, grocery lists, old newspapers, broken toys or worn-out clothes. No one keeps their garbage. Hoarding is not healthy.”

Except of course, we aren’t talking about rotten food, grocery lists, old newspapers, and broken toys here. We’re talking about digital data, which can be sifted, filtered and analyzed in microseconds, without human effort of any kind. These are not differences in degree, these are differences in kind.

Mr. Loukidelis might be too young to remember this, but when Google introduced GMail in 2004, they did two remarkable things: they gave every user an unprecedented 1GB of free storage (that number is now 15GB); and, they hid the “delete” button in favor of an “archive” button. The archive button does not delete mails, it just removes them from the Inbox. Google served notice a decade ago: you don’t have to delete your mail, and you shouldn’t bother to delete your mail, because it’s too valuable as a record, and so very easy to search and find what you want.

I’m surprised Mr. Loukidelis, as a lawyer, isn’t following the progress of e-discovery technology, rapidly moving from keyword based searching to applying natural language and AI (well, statistical pattern recognition) tools to finding relevant documents in huge corpuses of electronic data.

Suffice it to say, it’s early days. Present technology is more than satisfactory to do a much better job than the poor old FOI clerks are doing searching mailboxes. And in the future, we can expect AI tools to easily sort through as much “garbage” as we care to throw at them.

The time to start archiving everything, and letting the computers sort out the mess, is now.

Chapter 3: It’s not relevant!

There’s one more vignette Mr. Loukidelis shares, a folksy thing, which is also worth looking at:

“This is true even if an individual engages in a transaction that generates records. Take the example of an individual who shops at an online store and arranges to pick up the television they buy at a bricks-and-mortar location. The order confirmation is emailed to them and they print it for pickup purposes. They cannot pick the television up within the allotted window, so they email the retailer to extend the time. The retailer responds. They then email the retailer about whether the television comes with an HDMI cable. The retailer responds. Once the television is picked up, the purchaser keeps the receipt for warranty purposes. This is surely the only documentation that truly matters. It would make no sense to keep all of the emails back and forth, or the printed pickup notice.”

Valueless! Cluttering up the important documentary record of government! If we had to store all this back-and-forth nonsense, we’d never be able to find the “good stuff” amongst the trash. Right?

What if the individual were picked up for a murder he didn’t commit, and his only alibi was that he sent an email from his desk to the television store, right when the act was committed? What if, after delivery, the individual opens the box and finds no HDMI cable! The store insists there isn’t supposed to be one. How can the individual prove otherwise? On and on it goes.

The most trivial pieces of information can have value, in the right circumstances. And since they cost practically nothing to store, why not keep them, particularly in light of the alternative Mr. Loukidelis proposes?

Chapter 4: What’s the alternative?

It’s important to weigh Mr. Loukidelis’ strong rejection of email archiving against the alternative, which is basically the current system.

  • Most policy discussion and decisions are handled in email.
  • That email may be discarded very easily by any staff member.
  • Only if printed and filed will a permanent record be kept.
  • If deleted, a copy in the trash folder may find its way to a backup file.
  • Once deleted, FOI searches for the record will start to come up empty, as individual searches on staff computers don’t necessarily hit the trash folder.
  • Also, FOI searches can only find the record if run on the right staff member’s computer (unlike with a government-wide archive).
  • The copy in the backups will only be retrievable if HP Advanced Solutions restores the backup file (and if you think storing 43TB of data a year is expensive, compare it to having HPAS do really anything at all for you).
  • The backups themselves will be purged after 13 months. At that point, the record is gone, forever.

On top of this system, Mr. Loukidelis proposes some sensible tweaks and improvements, but let’s be crystal clear: the current system sucks, it’s really colossally bad, and there’s no excuse for that in 2015.

Mr. Loukidelis should have proposed a real improvement, but instead he whiffed, and he whiffed hard.

Appendix A: Optional Conspiracy Theory Section

Mr. Loukidelis’ recommendation #2 is really striking; here it is:

“It is recommended in the strongest possible terms that government resist any notion that all emails should be kept”

Emphasis mine. Not just recommended, but “in the strongest possible terms”. None of the other recommendations is remotely so strong. And here’s an odd thing: the other recommendations are all addressed to Commissioner Denham’s original report, but Denham has nothing at all to say about archiving email. It’s like the topic was dropped into the Loukidelis report from out of the blue sky, and greeted by a phalanx of flame-throwers.

Why? What’s going on? Why spend so much ink, and such strong language, to kill an idea that Denham didn’t even recommend?

I find it hard to believe that Loukidelis really cared that much about “observers” like me and my blog. Yet he cared enough to not only put in a section about email archiving, but also to beat the topic to death with a shovel.

I think there must have been some internal debate in government about permanently ending the controversy over bad email management by adopting an email archive. And Loukidelis was instructed by political staff on one side of that debate to ensure that the idea was terminated with dispatch.

Maybe Finance Minister Mike “Mr Transparency” de Jong made an email archive a personal hobby-horse and started talking it up in cabinet. If so, having the Loukidelis report kill the idea dead would be a quick and dirty way for the Premier to make sure the discussion went no further.

Regardless, I think there’s probably an interesting story behind recommendation #2, and I hope someday I get to hear what it was.