23 Jun 2022
I have a blog post up today at Crunchy Data on some of the mechanisms that underlie the PostgreSQL query planner, it’s pretty good if I do say so myself.
I was motivated to write it by a conversation over coffee with my colleague Martin Davis. We were talking about a customer with an odd query plan case and I was explaining how the spatial statistics system worked and he said “you should do that up as a blog post”. And, yeah, I should.
One of the things that is striking as you follow the PostgreSQL development community is the extent to which a fairly mature piece of technology like PostgreSQL is stacks of optimizations on top of optimizations on top of optimizations. Building and executing query plans involves so many different paths of execution, that there’s always a new, niche use case to address and improve.
I worked a political campaign a few years ago as a “data science” staffer, and our main problem was stitching together data from multiple systems to get a holistic view of our data.
That meant doing cross-system joins.
The first cut is always easy: pull a few records out of System A with a filter condition and then go to System B and pull the associated records. But then inevitably a new filter condition shows up and applied to A it generates so many records that the association step on B gets overloaded. But it turns out if I start from B and then associate in A it’s fast again.
And thus suddenly I found myself writing a query planner and executor.
It’s only when dumped into the soup of having to solve these problems yourself that you really appreciate the magic that is a mature relational database system. The idea that PostgreSQL can take a query that involves multiple tables of different sizes, with different join cardinalities, and different indexes and figure out an optimal plan in a few milliseconds, and then execute that plan in a streaming, memory efficient way…?
Magic is really the best word I’ve found.
21 Jun 2022
The question of why organizations are shy about their use of open source is an interesting one, and not completely obvious.
Open source luminary Even Roualt asks:
is there some explanation why most institutions can’t communicate about their PostGIS use ?
just because it is a major hurdle for technical people to get their public relationship department approve a communication ?
people afraid about being billed about unpaid license fees 🤣 ?
There’s really very little upside to publicizing open source use. There’s no open source marketing department to trumpet the brilliance of your decision, or invite you to a conference to give you an award. On the other hand, if you have made the mistake of choosing an open source solution over a well-known proprietary alternative, there is surely a local sales rep who will call your boss to tell them that you have made a big mistake. (You do have a good relationship with your boss, I hope.)
These reverse incentives can get pretty strong. Evendiagram reports:
Our small group inside a large agency uses postgis. We don’t talk about it, even internally, to avoid the C-suite forcing everyone back to oracle. RHEL repos allow us a lot of software that would otherwise be denied.
This reminds me of my years consulting for the British Columbia government, when technical staff would run data processing or even full-on public web sites from PostgreSQL/PostGIS machines under their desktops.
They would tell their management it was “just a test system” or “a caching layer”, really anything other than “it’s a database”, because if they uttered the magic word “database”, the system would be slated for migration into the blessed realm of enterprise Oracle systems, never to be heard from again.
Logos
Meanwhile, Daryl Herzmann reminds us that the Iowa Mesonet has been on Team PostGIS since 2003.
Iowa Environmental Mesonet, Iowa State University
- Data being managed in the database
Meteorological Data, “Common” GIS datasets (roads, counties), Current and
Archived NWS Tornado/Flash Flood/Thunderstorm Warnings, Historical Storm
Reports, Current and Archived precipitation reports. Climate data
- How the data is being accessed / manipulated
From mapserver! Manipulated via Python and PHP.
- Why you chose to use PostGIS for the application
Open-Source. Uses my favorite DB, Postgres. Easy integration with
mapserver. The support community is fantastic!
Further afield, the GIS portals of governments throughout Ukraine are running on software built on PostGIS.
Jørgen Larsen de Martino notes that:
The Danish Agency for Data Supply and Infrastructure uses PostGIS extensively - and have been using it for the last 10 years - we would not have had the success we have was it not for @PostGIS.
The Utah Geospatial Resource Center uses PostGIS to provide access to multiple spatial layers for direct access in a cloud-hosted PostGIS database called the “Open SGID”. (I can hear DBA heads exploding around the world.)
Counterpoint
While self-reporting is nice, sometimes just a little bit of dedicated searching will do. Interested in PostGIS use in the military? Run a search for “postgis site:mil” and see what pops up!
The 108th wing of the Air Force! Staff Sgt. Steve De Leon is hard at it!
“I’m taking all the data sources that AMC and A2 compile and indexing them into the PostgreSQL/PostGIS data and then from there trying to script Python code so the website can recognize all the indexed data in the PostgreSQL/PostGIS database,” said the De Leon.
The Canadian Department of National Defense is building Maritime Situational Awareness Research Infrastructure with a PostgreSQL/PostGIS standard database component.
PostgreSQL with its PostGIS extension is the selected DBMS for MSARI. To ease mainte-
nance and access, if more than one database are used, PostgreSQL will be selected for all
databases.
The Coast Guards “Environmental Response Management Application (ERMA)” is also running PostGIS.
The application is based on open source software (PostgreSQL/PostGIS, MapServer, and OpenLayers), that meet Open Geospatial Consortium (OGC) specifications and standards used across federal and international geospatial standards communities. This ensures ERMA is compatible with other commercial and open-source GIS applications that can readily incorporate data from online data projects and avoids licensing costs. Open-source compatibility supports data sharing, leverages existing data projects, reduces ERMA’s maintenance costs, and ensures system flexibility as the technology advances. Because ERMA is open source, it can easily be customized to meet specific user requirements.
More logos?
Want to appear in this space? Email me!
20 Jun 2022
Last week, I wrote that getting large organizations to cop to using PostGIS was a hard lift, despite that fact that, anecdotally, I know that there is massive use of PostGIS in every sector, at every scale of institution.
Simple Clues
Here’s a huge tell that PostGIS is highly in demand: despite the fact that PostGIS is a relatively complex extension to build (it has numerous dependencies) and deploy (the upgrade path between versions can be complex) every single cloud offering of PostgreSQL includes PostGIS.
AWS, Google Cloud, Azure, Crunchy Bridge, Heroku, etc, etc. Also forked not-quite-Postgres things like Aurora and AlloyDB. Also not-Postgres-but-trying things like Cockroach and Yugabyte.
If PostGIS was a niche hobbyist project…? Complete the sentence any way you like.
Logos
True to form, I received a number of private messages from people working in or with major institutions you have heard of, confirming their PostGIS use, and the fact that the institution would not publicly validate it.
However, I also heard from a couple medium sized companies, which seem to be the only institutions willing to talk about how useful they find open source in growing their businesses.
Hailey Eckstrand of Foundry Spatial writes to say:
Foundry Spatial uses PostGIS in development and production. In development we use it as our GIS processing engine and warehouse. We integrate spatial data (often including rasters that have been loaded into PostGIS) into a watershed fabric and process summaries for millions of watersheds across North America. We often use it in production with open source web tooling to return results through an API based on user input. One of our more complex usages is to return raster results within polygons and along networks within a user supplied distance from a click location. We find the ease and power of summarizing and analyzing many spatial datasets with a single SQL query to be flexible, performant, efficient, and… FUN!
Dian Fay of Understory writes in:
We use PostGIS at Understory to track and record storms, manage fleets of weather stations, and optimize geographic risk concentration for insurance. PostGIS lets us do all this with the database tools we already know & love, and without severing the connections between geographic and other categories of information.
More logos?
Want to appear in this space? Email me!
13 Jun 2022
The question of “who uses PostGIS” or “how big is PostGIS” or “how real is PostGIS” is one that we have been wrestling with literally since the first public release back in 2001.
There is no doubt that institutional acceptance is the currency of … more institutional acceptance.
So naturally, we would love to have a page of logos of our major users, but unfortunately those users do not self-identify.
As an open source project PostGIS has a very tenuous grasp at best on who the institutional users are, and things have actually gotten worse over time.
Originally, we were a source-only project and the source was hosted on one web server we controlled, so we could literally read the logs and see institutional users. At the time mailing lists were the only source of project communication, so we could look at the list participants, and get a feel from that.
All that’s gone now. Most users get their PostGIS pre-installed by their cloud provider, or pre-built from a package repository.
So what do we know?
IGN
In the early days, I collected use cases from users I identified on the mailing list. My favourite was our first major institutional adopter, the Institut Géographique National, the national mapping agency of France.
In 2005, they decided to move from a desktop GIS paradigm for their nation-wide basemap (of 150M features), to a database-centric architecture. They ran a bake-off of Oracle, DB2 and PostgreSQL (I wonder who got PostgreSQL into the list) and determined that all the options were similar in performance and functionality for their uses. So they chose the open source one. To my knowledge IGN is to this day a major user of PostgreSQL / PostGIS.
GlobeXplorer
Though long-gone as a brand, it’s possible the image management system that was built by GlobeXplorer in the early 2000’s is still spinning away in the bowels of Maxar.
GlobeXplorer was both one of the first major throughput use cases we learned about, and also the first one where we knew we’d displaced a proprietary incumbant. GlobeXplorer was one of the earliest companies explicitly serving satellite imagery to the web and via web APIs. They used a spatial database to manage their catalogue of images and prepared product. Initially it was built around DB2, but DB2 was a poor scaling choice. PostGIS was both physically faster and (more importantly) massively cheaper as scale went up.
RedFin
RedFin was a rarity, a use case found in the wild that we didn’t have to track down ourselves.
They described in some detail their path from MySQL to PostgreSQL, including the advantages of having PostGIS.
Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats.
Google
Google is not that big on promoting any technology they haven’t built in house, but we have heard individual Google developers confirm that they use core open source geospatial libraries in their work, and that PostGIS is included in the mix.
The biggest validation Google ever gave PostGIS was in a press release that recognized that the set of “users of spatial SQL” was basically the same as the set of “PostGIS users”.
Our new functions and data types follow the SQL/MM Spatial standard and will be familiar to PostGIS users and anyone already doing geospatial analysis in SQL. This makes workload migrations to BigQuery easier. We also support WKT and GeoJSON, so getting data in and out to your other GIS tools will be easy.
They didn’t address their new release to “Esri users” or “Oracle users” or “MySQL users”, they addressed it to the relevant population: PostGIS users.
More!
Getting permission to post logos is hard. Really hard. I’ve watched marketing staff slave over it. I’ve slaved over it myself.
Major automaker? Check. Major agricultural company? Check. Major defence contractor? Check, check, check. National government? Check. State, local, regional? Check, check, check. Financial services? Check. Management consulting? Check.
Yes, PostGIS is real.
At some point, for a project with a $0 price point, you just stop. If a user can’t be bothered to do the due diligence on the software themselves, to reap all the advantages we offer, for free, I’m not going to buy them a steak dinner, or spoon feed them references.
That said! If you work for a major government or corporate institution and you are allowed to publicize your use of PostGIS, I would love to write up a short description of your use, for the web site and our presentation materials.
Email me!
20 Apr 2022
The Dream
Over a year ago Chris Holmes, the driving force behind the “cloud optimized GeoTIFF” (COG) and “spatio-temporal asset catalog” (STAC) standards that are sweeping the “cloud optimized raster” data management world, asked me what I thought the story of a similar “cloud optimized vector” format might look like.
And I thought about COG, which is really just a very very old format (GeoTIFF) with its bytes rearranged so that the order of bytes in the file matches the likely order in which they will be accessed (blocks of bands, and within the bands, squares of pixels), and I thought I had the perfect, maximally snarky answer:
Dude, shape file is already a cloud-native format.
Now, this might seem counter-intuitive, but hear me out:
- Shape format hails from the early 90’s, when hard-disks spun very slowly, and the limiting factor for data access was usually I/O. Which is much like “cloud optimized” range access over HTTP: seeks are expensive, but block reads are cheap.
- Shape format already divies up the attributes and shapes into separate files, so you can render one without reading the other.
- Shape format is already “network safe”, with endianness defined in the format.
- Shape format is already universally supported. The specification is 24 years old, and it has been the de facto interchange format for so long that people make a joke out of it.
In short, shapefile already looks a lot like GeoTIFF, the founding format of the “cloud optimized geospatial” movement.
Let’s Get Real
So, what is missing to make “cloud optimized shapefile” a reality?
Well, in order to spatially search a shapefile you need a spatial index. There is no index format in the formal Esri specification, and the Esri sbx
index format is proprietary (though largely reverse engineered at this point) but the open source world has had a shape file index format for 20 years: the “QIX file”.
You generate a QIX file using the shptree utility in Mapserver. You can also generate one in GDAL. You can also get GeoTools to generate one.
With a QIX file, the “shape file” now consists of four files:
- shp, the binary shapes
- dbf, the table of attributes
- shx, a file index of the byte offsets of each shape in the shp file
- qix, a spatial index of the shapes
The next trick, just as in COG, is to put the main data files (shp
, dbf
and shx
) into the same order they are likely to be searched in: spatial order.
Since we already have a spatial index (qix
), we can get the files in spatial order by re-writing them in the same order they appear in the index.
Initially I told Chris that this could be done with the Mapserver sortshp utility, however I was mistaken: sortshp
sorts the file in attribute order.
To make “cloud optimized shape file” a reality, first we need a new utility program that sorts the shp
, shx
and dbf
files into qix
order.
We need: coshp!
coshp
is just a re-working of the sortshp
utility, but instead of sorting the output file by attribute it sorts it using the index tree as the driver. This results in shp
and dbf
files where shapes that are “near” in space are also “near” in the byte-stream of the file. This will reduce the number of random reads necessary to access portions of the file using a bounding box search.
Not Even Remotely Done
One of the quiet secrets of the “cloud optimized” geospatial world is that, while all the attention is placed on the formats, the actual really really hard part is writing the clients that can efficiently make use of the carefully organized bytes.
For example, the fancy demonstrations of “pure client-side” COG browsers require a complete GeoTIFF reader in Javascript, along with some extra “cloud” smarts to know what pieces of data to cache and what to treat as transient info during rendering.
So, spatially sorting a shape file is a necessary, but not at all sufficient condition to create an actual “cloud optimized shapefile”, because for it to be practically useful, there needs to be at a minimum a client-side javascript reader.
That means javascript that can:
To be truly useful, the javascript should probably include enough cloud smarts to read and opportunistically cache pieces of the files: headers at a minimum, but also reading in branches of the qix
and shx
indexes on-demand.
To make things marginally easier, I have “documented” the QIX format. It’s an ugly beast, but it is possible to traverse it without deserializing the whole thing.
It’s a challenging problem, but I hope there is someone with enough nostalgia for old formats and thirst for glory to make it happen.