Friday, January 23, 2009

(Much) Faster Unions in PostGIS 1.4

Originally posted at GeoSpeil.

I have had a very geeky week, working on bringing the "cascaded union" functionality to PostGIS.

By way of background, about a year ago, a PostGIS user brought a question up. He had about 30K polygons he was unioning and the process was taking hours. ArcMap could to it in 20 minutes, what was up? First of all, he had an unbelievably degenerate data set, which really did a great job of exposing the inefficiency of the PostGIS union aggregate. Second, the union aggregate really was inefficient.

This is what his data looked like, before and after union.

      

The old PostGIS ST_Union() aggregate just naively built the final result from the input table: union rows 1 and 2 together, then add row 3, then row 4, etc. As a result, each new row generally made the interim polygon more complex – more vertices, more parts. In contrast, the "cascaded union" approach first structures the data set into an STR-Tree, then unions the tree from the bottom up. As a result, adjacent bits are merged together progressively, so each stage of the union does the minimum amount of work, and creates an interim result simpler than the input components.

Implementing this new functionality in PostGIS required a few steps: first, the algorithm had to be ported from JTS in Java to the GEOS C++ computational geometry library; second, the C++ algorithm in GEOS had to be exposed in the public GEOS C API; third, PostGIS functions to call the new GEOS function had to be added.

The difference on the test data set from our user was stark. My first cut brought the execution time in PostGIS from 3.5 hours to 4.5 minutes for the sample data set. That was excellent! But, we knew that the JTS implementation could carry out the same union on the same data in a matter of seconds. Where was the extra 4 minutes going in PostGIS? Some profiling turned up the answer.

Before you can run the cascaded union process, you need to aggregate all the data in memory, so that a tree can be built on it. The PostGIS aggregation was being done using ST_Accum() to build an array of geometry[], then handing that to the union operation. But the ST_Accum() aggregation was incredibly inefficient! Four minutes of overhead isn't a bit deal when your union is taking hours, but now that it was taking seconds, the overhead was swamping the processing.

Running a profiler found the problem immediately. The ST_Accum() aggregate built the geometry[] array in memory, repeatedly memcpy()'ing each interim array. So the array was being copied thousands of times. Fortunately, the upcoming version of PostgreSQL (8.4) had a new array_agg() function, which used a much more efficient approach to array building. I took that code and ported it into PostGIS, for use in all versions of PostgreSQL. That reduced the aggregation overhead to a few seconds.

Final result, the sample union now takes 26 seconds! A big improvement on the original 3.5 hour time.

Here's a less contrived result, the 3141 counties in the United States. Using the old ST_Union(), the union takes 42 seconds. Using the new ST_Union() (coming in PostGIS 1.4.0) the union takes 3.7 seconds.




 

13 comments:

Mangocrate said...

Tangential, but in PostGIS 1.4 is there a way to come up with the number of overlaps?

If two polys overlap, the output would be 3 polygons where the left side would be coded with a 1, the right side with a 1, and the middle/overlapping area coded with a 2. I'm not concerned about the original attributes - just the number of layering.

Tyler Mitchell said...

Great to hear Paul! This was previously preventing me from advocating its use in earlier projects. I'll be sure to spread the word. Of course everyone will ask me if it's already in SVN or not - can we get at it now?

Bravo!

Paul Ramsey said...

@Mangocrate, there's nothing particular in 1.4 which supports this use case, though you might be able to finangle it with what is there already. A question for postgis-users.

@Tyler, combine the current SVN PostGIS with SVN GEOS to achieve this functionality.

Sean said...

Could you give us the specs on the hardware you used for benchmarking this so we have some context if we get different results?

Paul Ramsey said...

@Sean, the tests were all run on a 2.6GHz Intel iMac.

binkid said...

Awesome post!

Which profiling tools do you use, Paul?

Paul Ramsey said...

@binkid, thanks! For me, profiling begins and ends with the shark.

Roger Andre said...

Awesome! This should fix the problem I was having as well, http://tinyurl.com/dmge47.

Thanks very much.

Bugblatter said...

This is a truly spectacular piece of work. We have often been asked by clients to buffer and merge point datasets with several million points. We attempted this using ArcWhatever (could barely open the points, let along buffer them) and FME, which ran for a week and then gave an out of memory error. So, I do the whole configure, make, make install thing, 4 times, for postgres, goes, proj4 and postgis. After a lot of swearing and running ldconfig a few million times I eventually get postgis to accept that geos really is installed -- MySQL might have more limited spatial functionality, but it sure is a lot easier to build from source. Anyway, I digress. I run a few random queries using the excellent generate series capability in postgres, and manage to create, buffer and merge 100,000 points in a few seconds. Finally, I try this on a real world dataset, namely all of the postal addresses in Wales, 1.4 million or so. With a 200m buffer, this ran on a reasonably pokey 64-bit linux box in 19 minutes. Truly astonishing. Well done. Much as I love MySQL, this was a bit of St. Paul on the road to Damascus moment.

ben said...

Could you give the st_mem_size range of the US counties that you used? I have a similar situation, unioning a few thousand geometries averaging around 50kb, and my runtime is several hours.

G.A. said...

Nice to see it works fine on your tests... Our dateset fails with various TopologyExceptions. I've tried to snaptogrid the geoms, I've only used valid ones, but still the operation is unstable.

giohappy

Sergios said...

Hi Paul,

This looks great, I am trying to do the same thing for the entire road network of Italy, and the query runs forever. I am running Postgres 9.0 and PostGIS 1.5. Where can I find the ported for PostGIS "array agg()" function.
Thanx in advance
Sergios

P.S. My Postgres Server is Running on Windows :S

lordnn said...

"(Much) Faster Unions" works only on areas union. Although CascadeUnion can combine lines, it is not used by PostGIS in this case.

About Me

My Photo
Victoria, British Columbia, Canada

Followers

Blog Archive

Labels

bc (37) it (29) postgis (19) icm (11) video (11) enterprise IT (10) sprint (9) open source (8) osgeo (8) cio (6) foippa (6) gis (6) management (6) spatial it (6) enterprise (5) foi (5) foss4g (5) mapserver (4) outsourcing (4) politics (4) bcesis (3) oracle (3) COTS (2) architecture (2) boundless (2) esri (2) idm (2) natural resources (2) ogc (2) open data (2) opengeo (2) openstudent (2) postgresql (2) rant (2) technology (2) vendor (2) web (2) 1.4.0 (1) HR (1) access to information (1) accounting (1) agile (1) aspen (1) benchmark (1) buffer (1) build vs buy (1) business (1) business process (1) cathedral (1) cloud (1) code (1) common sense (1) consulting (1) contracting (1) core review (1) crm (1) crockofshit (1) custom (1) data warehouse (1) deloitte (1) design (1) digital (1) email (1) essentials (1) evil (1) exadata (1) fcuk (1) fgdb (1) fme (1) foocamp (1) foss4g2007 (1) ftp (1) gds (1) geocortex (1) geometry (1) geoserver (1) google (1) google earth (1) government (1) grass (1) hp (1) iaas (1) icio (1) industry (1) innovation (1) integrated case management (1) introversion (1) iso (1) isss (1) isvalid (1) javascript (1) jts (1) lawyers (1) mapping (1) mcfd (1) microsoft (1) mysql (1) new it (1) nosql (1) opengis (1) openlayers (1) oss (1) paas (1) pirates (1) policy (1) portal (1) proprietary software (1) qgis (1) rdbms (1) recursion (1) redistribution (1) regression (1) rfc (1) right to information (1) saas (1) salesforce (1) sardonic (1) seibel (1) sermon (1) siebel (1) snark (1) spatial (1) standards (1) svr (1) taxi (1) tempest (1) texas (1) tired (1) transit (1) twitter (1) uber (1) udig (1) uk (1) uk gds (1) verbal culture (1) victoria (1) waterfall (1) wfs (1) where (1) with recursive (1) wkb (1)