What You Need to Know

Matt Asay’s piece today about the skills shortage in cloud got me to thinking about what constitutes the required knowledge to work on various projects, and how much implicit knowledge is bound up with any given “single” technology.

Like, what should you know to “do cloud”?

Doing Cloud

Well, “containers” I guess would be a core building block, but what does that imply? Some knowledge of Linux system administration, so you can understand things like the shell and environment. Lots of network administration as well, to understand routing rules and name resolution. Perhaps some package management so you can install some dependencies and build out others. Really, sounds like you need to understand all the things a late 90’s era sysop did. And we haven’t even started in on the actual cloud part yet!

For that you need to understand the abstraction over the actual machines that the container management environment (pick one, pick ten) provides, the configuration system for that environment (declarative? programmatic?), and whatever security/authentication/roles system your cloud uses. Multiply by N for “multi-cloud”.

And that’s just for the container part! Want to get into object stores (S3/R2) and production deployments? Add in a pretty solid understanding of HTTP, CDN options, DNS and TLS.

Do you have to manage queues? Databases? ML? Each cloud has its own subtly different and variegated offering, and each area has its own unique set of skills you need to master, whether it be DBA concepts or domain languages or security needs.

I can rattle off all these things because I have learned them all to some extent or other over the course of my career, and that’s great but I’m 51 years old. I am constantly amazed that any new graduate can find enough initial purchase to start a career in technology.

Doing PostGIS

Like, want to get involved in PostGIS? Easy, “it’s in C”! Except, well, actually we’ve got some C++ bits, but frankly if you know C you probably know C++ too. C and C++ both embed a macro language (CPP), so you need to know that to understand the code.

But wait, you want to build it? Well, the build system is autotools. That’s an unholy mixture of M4 and bash shell scripting. Fortunately you can mostly ignore the M4, but bash skills are definitely required.

Naturally you’ll also have to understand Makefile syntax.

Of course PostGIS is a PostgreSQL extension, so to even understand what we’re doing you need to understand SQL. You’ll end up learning the PostgreSQL extension and function APIs by osmosis via the PostGIS code base, but it’s still another thing.
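
For a sense of what that API looks like from the SQL side, here is a minimal sketch in the style of the PostGIS function declarations (the function and its signature are hypothetical, not actual PostGIS code): the C implementation lives in the extension’s shared library, and a declaration like this binds it to a SQL name.

```sql
-- Hypothetical example of the SQL half of a C extension function.
-- 'MODULE_PATHNAME' is substituted with the shared library path at
-- install time; 'my_distance' is the C symbol inside that library.
CREATE OR REPLACE FUNCTION my_distance(geom1 geometry, geom2 geometry)
    RETURNS float8
    AS 'MODULE_PATHNAME', 'my_distance'
    LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
```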

Inevitably you will make a mistake, so you’ll need to understand a debugger, probably gdb or lldb, which are glorious command-line tools with their own terse syntax.

Oops! Forgot the UNIX command line! That’s table stakes, really, but there you go.

Also forgot git, which is one of the few tools that is common across most branches of practical modern IT.

The documentation is all in DocBook XML, but you can probably figure that out by inspection.

You probably won’t change the WKT lexer/parser, but that’s in flex/bison. There’s also a protobuf binding that has its own little domain language.
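
For the curious, the sort of thing that lexer/parser handles is the WKT text you feed to functions like ST_GeomFromText; a quick sketch of the round trip:

```sql
-- The flex/bison-generated parser turns this WKT string into an
-- internal geometry; ST_AsText serializes it back out again.
SELECT ST_AsText(ST_GeomFromText('LINESTRING(0 0, 1 1, 2 1)'));
-- Result: LINESTRING(0 0,1 1,2 1)
```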

Fixes? Nah.

One of the reasons I don’t do “real javascript” is because doing this exercise with any modern Javascript project involves a similar-but-much-longer litany of completely different tooling from the tooling I already know.

I learned Go relatively recently, and I have to say that even though I did have to, yes, learn all about build and test and debug and modularity for Go, at least there wasn’t a huge pile of options to wade through – the Go community mostly “picked one” for the core tools. (Not true for database connectivity though! You can choose from among multiple PostgreSQL client libraries!)

I don’t know how to end this except to say it feels like my whole career has been spent watching geological layers of technology pack down on each other, layer upon layer upon layer, and that the result seems, in aggregate, completely unapproachable.

I wait for some kind of simplifying moment.

The repeated theme year after year after year has been that “encapsulation” and “abstraction” will at least allow us to ignore the lowest layers. To some extent that is true – I have made it through a whole career without having to learn assembler, and I probably will never need to.

But so many of the other promises have failed to play out.

Object orientation did not result in a new world of stringing together opaque components. The insides still matter intensely and people still have to understand the parts.

Containers have flattened out the need to worry about dependency chains, but the chains are still there inside the containers, and managing the production deployment of containers is now a career defining field in its own right.

An Expanding Universe

It’s possible that most of this apparent exponential complexity explosion is just due to the ever-widening scope of what “working with computers” means.

The number of things you can do with computers is just orders of magnitude larger than when I started working with them. So the number of tools is similarly orders of magnitude larger. Dog bites man.

However, I have had the privilege of seeing most of the layers laid down, so I didn’t have to learn them all at once to become productive. Each piece came along in its time, and added a little to the stack.

So, I stand amazed at newly minted technologists who can start, and get productive, in this intellectual garbage tip we call a “profession”. We couldn’t have built a less enjoyable or consistent collection of tools if we had tried, and yet you persevere and exceed all of us. Hats off to all of you!

Learning PostgreSQL Internals

I had coffee with an IT colleague here in Victoria last week, and he was interested in getting into core PostgreSQL programming. “What resources would you recommend I look at?”


That’s… a hard question!

PostgreSQL is a huge code base with a multi-decade history. I’ve been poking around the edges for almost 10 years and feel comfortable with the extension APIs, foreign data wrappers, access methods APIs, some system catalogue stuff… maybe 5% of the surface area of the beast?
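
The catalogue corner, at least, is easy to start poking at from plain SQL, long before you open a C file. A trivial sketch:

```sql
-- The system catalogues are ordinary tables you can query; this
-- lists the extensions installed in the current database.
SELECT extname, extversion
  FROM pg_extension
 ORDER BY extname;
```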


So, what advice for someone who wants to dive much much deeper than that?

First, start with the vision. Read “The Design of Postgres” (Stonebraker & Rowe, 1985) to get a sense of what distinguished Postgres from its predecessors: complex objects, user extensibility, and active database facilities, all while retaining relational concepts.

Second, take a run through Bruce Momjian’s “internals” presentations. These tend to be a little older (Bruce hasn’t been doing deep core work for a while), but he’s an expert teacher and explainer, so they are useful for getting a feel for the shape of things. In a similar (and more recent) vein, my colleague Stephen Frost walks through the code base in this 2018 talk about adding a new feature to PostgreSQL.

Third, consider spending some time with “The Internals of PostgreSQL”. This is a very detailed look at PostgreSQL subsystems, including header structures and data flow. As with any book, it may have already drifted a bit from the particulars of current PostgreSQL, but there is no other resource I know that even attempts to explain internals at this granularity.

Fourth, the source code itself is an amazing resource, and the commentary in header files and function descriptions is very good. The incredibly detailed and stringent source code review process of the PostgreSQL community not only expects good code, but also good documentation of what the code does. I’m not sure how much this can be traced back to the influence of Tom Lane (whose comments are frequently complete technical manuals), but regardless, the culture of good internal documentation is in place, and it makes the code as “approachable” as a system of this complexity could hope to be.

Now things get harder.


Conference talks are a moving target, because they tend to focus on the leading edge of things, but some community members regularly give talks about their work on core topics, talks that must necessarily explain how things work in order to provide context.

Unfortunately, PostgreSQL conferences have a habit of … not making recordings! So there’s relatively little online. I have seen some amazing talks (the multi-session query planner master class Tom Lane gave at PgCon 2011 sticks out) but most of them are lost to the ages.

The best conference to go to for core technical content is undoubtedly PgCon. (I will see you there this spring!)

COVID robbed us of many things, but it did cause PgCon to record and publish a number of standout technical talks that might otherwise have never been available.

Here are the presenters I always mark down in my program and rush to get a seat for:

I would also go to any talk Tom Lane felt like giving. And Tomas Vondra, and Thomas Munro, and Oleg Bartunov.


Learning the PostgreSQL code base is a journey of a million steps, that’s for sure. One thing that all my effective personal learning has had in common is a practical need. My learning has been in support of some actual work that needed to be done for my employer of the moment. It had motivation, a start point and an end point. That was really helpful.

Best of luck in your PostgreSQL journeys!

Additional Notes

Because of course I left out some stuff in the first draft:

  • Once you are at the level of perusing source code, you should also be reading the contents of the pgsql-hackers mailing list, and the current commitfest patches. This will help you understand the culture of contribution, and also get a feel for where work is ongoing.
  • One gateway to contribution is patch review. Start with basic things like “does it compile against main” and “does it do what the author says”. Read the other reviews, see what other reviewers look for; wash, rinse, repeat. I still think that having your own small patches put through the process is invaluable learning. Doing some review of other people’s patches is a way of “paying it forward” for the reviews your own patches receive.
  • The amazing Oleg Bartunov wrote in to tell me there is now a second book on PostgreSQL internals, “PostgreSQL 14 Internals: A deep dive into the server mechanics”. It is currently in Russian, but an English translation is underway right now.

Technology, Magic & PostgreSQL

I have a blog post up today at Crunchy Data on some of the mechanisms that underlie the PostgreSQL query planner. It’s pretty good, if I do say so myself.

I was motivated to write it by a conversation over coffee with my colleague Martin Davis. We were talking about a customer with an odd query plan case and I was explaining how the spatial statistics system worked and he said “you should do that up as a blog post”. And, yeah, I should.

One of the things that is striking as you follow the PostgreSQL development community is the extent to which a fairly mature piece of technology like PostgreSQL is stacks of optimizations on top of optimizations on top of optimizations. Building and executing query plans involves so many different paths of execution that there’s always a new, niche use case to address and improve.
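
You can watch the planner choose among those paths on any query you like. A minimal sketch (the tables here are hypothetical):

```sql
-- EXPLAIN ANALYZE shows which execution path the planner picked
-- (join order, join algorithm, index vs. sequential scan) and how
-- its row-count estimates compared with reality.
EXPLAIN ANALYZE
SELECT c.name, count(*)
  FROM customers c
  JOIN orders o ON o.customer_id = c.id
 WHERE o.created_at > now() - interval '30 days'
 GROUP BY c.name;
```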

I worked a political campaign a few years ago as a “data science” staffer, and our main problem was stitching together data from multiple systems to get a holistic view.

That meant doing cross-system joins.

The first cut is always easy: pull a few records out of System A with a filter condition, then go to System B and pull the associated records. But then, inevitably, a new filter condition shows up that, applied to A, generates so many records that the association step on B gets overloaded. Yet it turns out that if I start from B and then associate into A, it’s fast again.

And thus suddenly I found myself writing a query planner and executor.

It’s only when dumped into the soup of having to solve these problems yourself that you really appreciate the magic that is a mature relational database system. The idea that PostgreSQL can take a query that involves multiple tables of different sizes, with different join cardinalities, and different indexes, figure out an optimal plan in a few milliseconds, and then execute that plan in a streaming, memory-efficient way…?

Magic is really the best word I’ve found.
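
Incidentally, the cross-system version of the campaign problem can these days be handed to that same planner via postgres_fdw. A sketch, with all server, table, and column names hypothetical:

```sql
-- Attach "System B" as a foreign server so the local planner can
-- decide which side of the cross-system join to drive from.
CREATE EXTENSION postgres_fdw;

CREATE SERVER system_b
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'b.example.com', dbname 'crm');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER system_b
    OPTIONS (user 'reporting', password 'secret');

-- A remote table, made locally queryable.
CREATE FOREIGN TABLE b_contacts (
    voter_id integer,
    email    text
) SERVER system_b OPTIONS (table_name 'contacts');

-- The planner costs this join for us, instead of us hand-picking
-- which system to start from.
SELECT v.name, c.email
  FROM voters v
  JOIN b_contacts c USING (voter_id);
```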

Some More PostGIS Users

The question of why organizations are shy about their use of open source is an interesting one, and not completely obvious.

Open source luminary Even Rouault asks:

is there some explanation why most institutions can’t communicate about their PostGIS use ? just because it is a major hurdle for technical people to get their public relationship department approve a communication ? people afraid about being billed about unpaid license fees 🤣 ?

There’s really very little upside to publicizing open source use. There’s no open source marketing department to trumpet the brilliance of your decision, or invite you to a conference to give you an award. On the other hand, if you have made the mistake of choosing an open source solution over a well-known proprietary alternative, there is surely a local sales rep who will call your boss to tell them that you have made a big mistake. (You do have a good relationship with your boss, I hope.)

These reverse incentives can get pretty strong. Evendiagram reports:

Our small group inside a large agency uses postgis. We don’t talk about it, even internally, to avoid the C-suite forcing everyone back to oracle. RHEL repos allow us a lot of software that would otherwise be denied.

This reminds me of my years consulting for the British Columbia government, when technical staff would run data processing or even full-on public web sites from PostgreSQL/PostGIS machines under their desktops.

They would tell their management it was “just a test system” or “a caching layer”, really anything other than “it’s a database”, because if they uttered the magic word “database”, the system would be slated for migration into the blessed realm of enterprise Oracle systems, never to be heard from again.

Logos

Meanwhile, Daryl Herzmann reminds us that the Iowa Mesonet has been on Team PostGIS since 2003.

Iowa Environmental Mesonet, Iowa State University

  • Data being managed in the database
    Meteorological Data, “Common” GIS datasets (roads, counties), Current and Archived NWS Tornado/Flash Flood/Thunderstorm Warnings, Historical Storm Reports, Current and Archived precipitation reports, Climate data
  • How the data is being accessed / manipulated
    From mapserver! Manipulated via Python and PHP.
  • Why you chose to use PostGIS for the application
    Open-Source. Uses my favorite DB, Postgres. Easy integration with mapserver. The support community is fantastic!

Further afield, the GIS portals of governments throughout Ukraine are running on software built on PostGIS.

Jørgen Larsen de Martino notes that:

The Danish Agency for Data Supply and Infrastructure uses PostGIS extensively - and have been using it for the last 10 years - we would not have had the success we have was it not for @PostGIS.

The Utah Geospatial Resource Center uses PostGIS to provide access to multiple spatial layers for direct access in a cloud-hosted PostGIS database called the “Open SGID”. (I can hear DBA heads exploding around the world.)

Counterpoint

While self-reporting is nice, sometimes just a little bit of dedicated searching will do. Interested in PostGIS use in the military? Run a search for “postgis site:mil” and see what pops up!

The 108th Wing of the Air Force! Staff Sgt. Steve De Leon is hard at it!

“I’m taking all the data sources that AMC and A2 compile and indexing them into the PostgreSQL/PostGIS data and then from there trying to script Python code so the website can recognize all the indexed data in the PostgreSQL/PostGIS database,” said De Leon.

The Canadian Department of National Defence is building Maritime Situational Awareness Research Infrastructure (MSARI) with a PostgreSQL/PostGIS standard database component.

PostgreSQL with its PostGIS extension is the selected DBMS for MSARI. To ease maintenance and access, if more than one database are used, PostgreSQL will be selected for all databases.

The Coast Guard’s “Environmental Response Management Application (ERMA)” is also running PostGIS.

The application is based on open source software (PostgreSQL/PostGIS, MapServer, and OpenLayers), that meet Open Geospatial Consortium (OGC) specifications and standards used across federal and international geospatial standards communities. This ensures ERMA is compatible with other commercial and open-source GIS applications that can readily incorporate data from online data projects and avoids licensing costs. Open-source compatibility supports data sharing, leverages existing data projects, reduces ERMA’s maintenance costs, and ensures system flexibility as the technology advances. Because ERMA is open source, it can easily be customized to meet specific user requirements.

More logos?

Want to appear in this space? Email me!

Some PostGIS Users

Last week, I wrote that getting large organizations to cop to using PostGIS was a hard lift, despite the fact that, anecdotally, I know there is massive use of PostGIS in every sector, at every scale of institution.

Simple Clues

Here’s a huge tell that PostGIS is in high demand: despite the fact that PostGIS is a relatively complex extension to build (it has numerous dependencies) and deploy (the upgrade path between versions can be complex), every single cloud offering of PostgreSQL includes PostGIS.

AWS, Google Cloud, Azure, Crunchy Bridge, Heroku, etc., etc. Also forked not-quite-Postgres things like Aurora and AlloyDB. Also not-Postgres-but-trying things like Cockroach and Yugabyte.
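
The quickest way to confirm it on any of these platforms is the same two lines (assuming your account has the privilege to create extensions):

```sql
-- On hosted PostgreSQL offerings, PostGIS ships pre-staged and only
-- needs to be enabled in your database.
CREATE EXTENSION postgis;
SELECT postgis_full_version();
```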

If PostGIS was a niche hobbyist project…? Complete the sentence any way you like.

Logos

True to form, I received a number of private messages from people working in or with major institutions you have heard of, confirming their PostGIS use, and the fact that the institution would not publicly validate it.

However, I also heard from a couple of medium-sized companies, which seem to be the only institutions willing to talk about how useful they find open source in growing their businesses.

Hailey Eckstrand of Foundry Spatial writes to say:

Foundry Spatial uses PostGIS in development and production. In development we use it as our GIS processing engine and warehouse. We integrate spatial data (often including rasters that have been loaded into PostGIS) into a watershed fabric and process summaries for millions of watersheds across North America. We often use it in production with open source web tooling to return results through an API based on user input. One of our more complex usages is to return raster results within polygons and along networks within a user supplied distance from a click location. We find the ease and power of summarizing and analyzing many spatial datasets with a single SQL query to be flexible, performant, efficient, and… FUN!
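
For flavour, the raster-inside-polygon analysis Hailey describes looks roughly like this in PostGIS (a sketch only; the table and column names are hypothetical):

```sql
-- Mean raster value (say, precipitation) inside each watershed:
-- clip the intersecting raster tiles to the polygon, mosaic them
-- back together, then compute summary statistics over the result.
SELECT w.name,
       (ST_SummaryStats(ST_Union(ST_Clip(r.rast, w.geom)))).mean
  FROM precip_rasters r
  JOIN watersheds w ON ST_Intersects(r.rast, w.geom)
 GROUP BY w.name, w.geom;
```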

Dian Fay of Understory writes in:

We use PostGIS at Understory to track and record storms, manage fleets of weather stations, and optimize geographic risk concentration for insurance. PostGIS lets us do all this with the database tools we already know & love, and without severing the connections between geographic and other categories of information.

More logos?

Want to appear in this space? Email me!