Open Source for/by Government

Update: Barcelona is going all-open. Sounds extreme, but some times you’ve got to…

“You’ve got to spend money to make money”, I once confidently told a business associate, on the occasion of paying him a thousand dollars to manually clean some terrible data for me. In the event, I was right: that cleaned data paid for itself 10 times over in the following years.

I’m still the only person with a GIS file for 1996 BC elections results by voting area, and the jealousy is killing you.

Governments can play the game too, but it seems like they all end up tilling the same landscape. There’s no shortage of governments trying to create their own Silicon Valley clusters, usually through the mechanisms of subsidizing venture capital funding (via tax breaks or directly) and increased spending on R&D grants to academia. Spending money to “hopefully” make money.

There’s an under-recognized niche available, for a government willing to go after it.

Venture capitalists are (understandably) interested in having their investments create “intellectual property”, that can be patented and monopolized for outsized profits. By following the VC model of focussing on IP formation, governments are missing out on another investment avenue: the formation of “intellectual capital” in their jurisdictions.

VCs don’t like intellectual capital because it’s too mobile. It lives between the ears of employees, who can change employers too easily, and require expensive golden handcuffs to lock into place. They can monetize intellectual property in an acquisition or public offering, but they cannot monetize intellectual capital.

Governments, on the other hand, understand that by investing in universities and colleges, they are creating intellectual capital that will tend to stick around in their jurisdictions (for all the public wailing about “brain drain”, the fact is that people don’t move around all that much).

Open Source for/by Government

Investment in open source technology is a potential gold mine for creating intellectual capital, but governments have been steadfastly ignoring it for years. There is also a big first mover advantage waiting for the first governments to get into the game:

  • Instead of “buying off-the-shelf” for government information systems, build on existing OSS, or start OSS from scratch, using local talent (in-house or private sector).
  • Deliberately build with enough generality to allow use in other jurisdictions.
  • Become the first reference customer for the project. Send your local talent out to evangelize it. Encourage them to commercialize their support and services.
  • Wash, rinse, repeat.

Is this risky? Yes. Will it result in some failed projects? Yes. Will it be more expensive than the “safe” alternative? Sometimes yes, sometimes no. Will it result in increased revenues flowing into your jurisdiction? Almost certainly, if committed to and carried out across a number of projects.

When the first library in BC adopted the Evergreen open source library software, they probably weren’t envisioning a Canada-wide open source cooperative, bringing support and consulting dollars into the province, but that’s what they did, by accident. When the Atlanta Public Library started the project, they probably weren’t thinking a local company would end up selling support and expertise on the software around the country.

There is no IP moat around open source projects, but there is a first mover advantage to having a critical mass of developers and professionals who have amassed intellectual and social capital around the project.

Intellectual capital isn’t just built in universities, and the private sector shouldn’t only be looked to for intellectual property. Let’s mix it up a little.

The BC government spends $9M/year on Oracle “maintenance”, basically the right to access bug fixes and updates from Oracle for the software we’re running. It’s not a lot of money, but it’s money being shipped straight over the border. Affilias, the “.org” top level DNS provider built their infrastructure on PostgreSQL – they spend a couple hundred thousand a year having some PostgreSQL core developers on staff. Same effect, different path.

PostGIS Scaling

Earlier this month I got to speak at the Spatial Data Science Conference hosted by my employer Carto at our funky warehouse offices in Bushwick, Brooklyn. The topic: PostGIS scaling.

PostGIS Scaling

Now.

“Make it go faster” is a hard request to respond to in the generic: what is “it”, what are you doing with “it”, are you sure that your performance isn’t already excellent but you’re just too damned demanding?

So, the talk covers a number of routes to better performance: changing up query patterns, adding special PostgreSQL extensions, leaning on new features of PostgreSQL, and just plain old waiting for PostgreSQL to get better. Which it does, every release.

PostGIS "Fund Me" Milestone

On the twitter this morning, there was a good question:

TL;DR: If you find a feature in “Fund Me” and want to fund it, join the postgis-devel mailing list and make yourself known.

If you go to the PostGIS ticket report and scroll through the pages you’ll first see some milestones tied to released versions. These are usually bug reports, both big and small, valid and invalid, and will eventually be closed.

We unfortunately carry a lot of tickets in the current development milestone (2.5 right now) which are, at best, speculative. They should probably be closed (we really will never do them and don’t much care) or moved to the “Fund Me” category (they are valid, but we have no personal/professional impetus to address them).

The “Fund Me” category used to be called “Future”. This was a bad name, as it implied that sometime in the “Future” the ticket might actually be addressed, and all you needed was sufficient patience to wait. The reality is that the way a ticket got into the “Future” category was that it was ignored for long enough that we couldn’t stand to see it in the current milestone anymore.

The PostGIS development community includes all kinds of developers, who make livings in all kinds of ways, and there are folks who will work on tasks for money. The “Fund Me” milestone is a way of pointing up that there are tasks that can be done, if only someone is willing to pay a developer to do them.

That’s the good news!

The bad news is that the tickets all look the same, but they are wildly variable in terms of level of effort and even feasibility.

  • #220 “Implement ST_Numcurves and ST_CurveN” would probably take a couple hours at the outside, and almost any C developer could do it, even oen with zero experience in the PostGIS/PostgreSQL/GEOS ecosystem.
  • #2597 “[raster] St_Grayscale” would require some knowledge of the PostGIS raster implementation and image processing routines or at least the GDAL library.
  • #2910 “Implement function to output Mapbox Vector Tiles” actually happened in 2.4, but the (duplicate) ticket remained open, as a reminder that we’re terrible at ticket management.

And then there’s the “big kahunas”, tasks that live quietly in one ticket but actually encompass massive research and development projects spanning months or years.

  • #1629 “Tolerance and Precision strategy” is a super idea, that would allow functions like ST_Intersects() or ST_Equals() to return true if a condition was met within a tolerance. However, it would require substantial enhancement to GEOS, to allow predicate evaluation within a tolerance context, as well as a changes to non-GEOS backed distance functions, and new signatures for every geometry relationship function. Given the depth of the GEOS problem, I’d estimate multiple months of effort, and a potential for zero deliverables at all if things went pear-shaped.
  • #472 “Missing ST_IsValid for Geography Types” is even worse than the tolerance problem, since it should really be implemented as a complete rewrite of GEOS to understand non-linear edge types, either through a cheater’s strategy to turn do local projections of geographic edges, or as a full understanding of geographic edges. On the upside, doing that would allow many of the other GEOS functions to support geography which would vastly expand geography functionality in one stroke. On the downside, it is again in the category of a year-long effort with a potential failure at the end of it if for unexpected reasons it turns out to be impossible within that timeframe.

These kind of core features basically never get funded, because the marginal benefit they provide is generally much lower than the development cost for any one organization. This is a common open source weakness: aggregating funding is something everyone agrees is a great idea in principle but rarely happens in practice.

Occasionally, lightning does strike and a major funded feature happens. PostGIS topology was funded by a handful of European governments, and my work on the geography type was funded entirely by Palantir. However, usually funders show up with a few thousand dollars in hand and are dismayed when they learn of the distance between their funds and their desires.

Nested Loop Join with FDW

Update: See below, but I didn’t test the full pushdown case, and the result is pretty awesome.

I have been wondering for a while if Postgres would correctly plan a spatial join over FDW, in which one table was local and one was remote. The specific use case would be “keeping a large pile of data on one side of the link, and joining to it”.

Because spatial joins always plan out to a “nested loop” execution, where one table is chosen to drive the loop, and the other to be filtered on the rows from the driver, there’s nothing to prevent the kind of remote execution I was looking for.

I set up my favourite spatial join test: BC voting areas against BC electoral districts, with local and remote versions of both tables.

CREATE EXTENSION postgres_fdw;

-- Loopback foreign server connects back to
-- this same database
CREATE SERVER test
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
host '127.0.0.1',
dbname 'test',
extensions 'postgis'
);

CREATE USER MAPPING FOR pramsey
SERVER test
OPTIONS (user 'pramsey', password '');

-- Foreign versions of the local tables
CREATE FOREIGN TABLE ed_2013_fdw
(
gid integer,
edname text,
edabbr text,
geom geometry(MultiPolygon,4326)
) SERVER test
OPTIONS (
table_name 'ed_2013',
use_remote_estimate 'true');

CREATE FOREIGN TABLE va_2013_fdw
(
gid integer OPTIONS (column_name 'gid'),
id text OPTIONS (column_name 'id'),
vaabbr text OPTIONS (column_name 'vaabbr'),
edabbr text OPTIONS (column_name 'edabbr'),
geom geometry(MultiPolygon,4326) OPTIONS (column_name 'geom')
) SERVER test
OPTIONS (
table_name 'va_2013',
use_remote_estimate 'true');

The key option here is use_remote_estimate set to true. This tells postgres_fdw to query the remote server for an estimate of the remote table selectivity, which is then fed into the planner. Without use_remote_estimate, PostgreSQL will generate a terrible plan that pulls the contents of the `va_2013_fdw table local before joining.

With use_remote_estimate in place, the plan is just right:

SELECT count(*), e.edabbr
FROM ed_2013 e
JOIN va_2013_fdw v
ON ST_Intersects(e.geom, v.geom)
WHERE e.edabbr in ('VTB', 'VTS')
GROUP BY e.edabbr;
GroupAggregate  (cost=241.14..241.21 rows=2 width=12)
 Output: count(*), e.edabbr
 Group Key: e.edabbr
 ->  Sort  (cost=241.14..241.16 rows=6 width=4)
     Output: e.edabbr
     Sort Key: e.edabbr
     ->  Nested Loop  (cost=100.17..241.06 rows=6 width=4)
         Output: e.edabbr
         ->  Seq Scan on public.ed_2013 e  (cost=0.00..22.06 rows=2 width=158496)
             Output: e.gid, e.edname, e.edabbr, e.geom
             Filter: ((e.edabbr)::text = ANY ('{VTB,VTS}'::text[]))
         ->  Foreign Scan on public.va_2013_fdw v  (cost=100.17..109.49 rows=1 width=4236)
             Output: v.gid, v.id, v.vaabbr, v.edabbr, v.geom
             Remote SQL: SELECT geom FROM public.va_2013 WHERE (($1::public.geometry(MultiPolygon,4326) OPERATOR(public.&&) geom)) AND (public._st_intersects($1::public.geometry(MultiPolygon,4326), geom))

For FDW drivers other than postgres_fdw this means there’s a benefit to going to the trouble to support the FDW estimation callbacks, though the lack of exposed estimation functions in a lot of back-ends may mean the support will be ugly hacks and hard-coded nonsense. PostgreSQL is pretty unique in exposing fine-grained information about table statistics.

Update

One “bad” thing about the join pushdown plan above is that it still pulls all the resultant records back to the source before aggregating them, so there’s a missed opportunity there. However, if both the tables in the join condition are remote, the system will correctly plan the query as a remote join and aggregation.

SELECT count(*), e.edabbr
FROM ed_2013_fdw e
JOIN va_2013_fdw v
ON ST_Intersects(e.geom, v.geom)
WHERE e.edabbr in ('VTB', 'VTS')
GROUP BY e.edabbr;
 Foreign Scan  
   (cost=157.20..157.26 rows=1 width=40) 
   (actual time=32.750..32.752 rows=2 loops=1)
   Output: (count(*)), e.edabbr
   Relations: Aggregate on ((public.ed_2013_fdw e) INNER JOIN (public.va_2013_fdw v))
   Remote SQL: SELECT count(*), r1.edabbr FROM (public.ed_2013 r1 INNER JOIN public.va_2013 r2 ON (((r1.geom OPERATOR(public.&&) r2.geom)) AND (public._st_intersects(r1.geom, r2.geom)) AND ((r1.edabbr = ANY ('{VTB,VTS}'::text[]))))) GROUP BY r1.edabbr
 Planning time: 12.752 ms
 Execution time: 33.145 ms

Parallel PostGIS IIA

One of the core complaints in my review of PostgreSQL parallelism, was that the cost of functions executed on rows returned by queries do not get included in evaluations of the cost of a plan.

So for example, the planner appeared to consider these two queries equivalent:

SELECT *
FROM pd;

SELECT ST_Area(geom)
FROM pd;

They both retrieve the same number of rows and both have no filter on them, but the second one includes a fairly expensive function evaluation. No amount of changing the cost of the ST_Area() function would cause a parallel plan to materialize. Only changing the size of the table (making it bigger) would flip the plan into parallel mode.

Fortunately, when I raised this issue on pgsql-hackers, it turned out to have been reported and discussed last month, and Amit Kapila had already prepared a patch, which he kindly rebased for me.

With the patch in place, I now see rational behavior from the planner. Using the default PostGIS function costs, a simple area calculation on my 60K row polling division table is sequential:

EXPLAIN
SELECT ST_Area(geom)
FROM pd;
Seq Scan on pd  
(cost=0.00..14445.17 rows=69534 width=8)

However, if the ST_Area() function is costed a little more realistically, the plan shifts.

ALTER FUNCTION ST_Area(geometry) COST 100;

EXPLAIN
SELECT ST_Area(geom)
FROM pd;
 Gather  
 (cost=1000.00..27361.20 rows=69534 width=8)
   Workers Planned: 3
   ->  Parallel Seq Scan on pd  
       (cost=0.00..19407.80 rows=22430 width=8)

Perfect!

While not every query receives what I consider a “perfect plan”, it now appears that we at least have some reasonable levers available to get better plans via applying some sensible (higher) costs across the PostGIS code base.