Learning PostgreSQL Internals
01 Oct 2022
I had coffee with an IT colleague here in Victoria last week, and he was interested in getting into core PostgreSQL programming. “What resources would you recommend I look at?”
That’s… a hard question!
PostgreSQL is a huge code base with a multi-decade history. I’ve been poking around the edges for almost 10 years and feel comfortable with the extension APIs, foreign data wrappers, access methods APIs, some system catalogue stuff… maybe 5% of the surface area of the beast?
So, what advice for someone who wants to dive much much deeper than that?
First, start with the vision, and read “The Design of Postgres” (Stonebraker & Rowe, 1985) to get a sense of what distinguished Postgres from its predecessors: complex objects; user extensibility; and active database facilities; all while retaining relational concepts.
Second, take a run through Bruce Momjian’s “internals” presentations. These tend to be a little older, since Bruce hasn’t been doing deep core work for a while, but he’s an expert teacher and explainer, so they are useful to get a feel for the shape of things. In a similar (and more recent) vein, my colleague Stephen Frost walks through the code base in this 2018 talk about adding a new feature to PostgreSQL.
Third, consider spending some time with “The Internals of PostgreSQL”. This is a very detailed look at PostgreSQL subsystems, including header structures and data flow. As with any book, it may have already drifted a bit from the particulars of current PostgreSQL, but there is no other resource I know that even attempts to explain internals at this granularity.
Fourth, the source code itself is an amazing resource, and the commentary in header files and function descriptions is very good. The incredibly detailed and stringent source code review process of the PostgreSQL community not only expects good code, but also good documentation of what the code does. I’m not sure how much this can be traced back to the influence of Tom Lane (whose comments are frequently complete technical manuals) but regardless, the culture of good internal documentation is in place, and it makes the code as “approachable” as a system of this complexity could hope to be.
Now things get harder.
Conference talks are a moving target, because they tend to focus on the leading edge of things, but some community members regularly give talks about their work on core topics, and those talks necessarily explain how things work in order to put the new work in context.
Unfortunately, PostgreSQL conferences have a habit of … not making recordings! So there’s relatively little online. I have seen some amazing talks (the multi-session query planner master class Tom Lane gave at PgCon 2011 sticks out) but most of them are lost to the ages.
The best conference to go to for core technical content is undoubtedly PgCon. (I will see you there this spring!)
COVID robbed us of many things, but it did cause PgCon to record and publish a number of standout technical talks that might otherwise have never been available.
Here are the presenters I always mark down in my program and rush to get a seat for:
- Andres Freund, who while hammering out incredibly hard technical work still makes time to explain what he is up to.
- Robert Haas, who keeps pushing really important things from his perch as EDB CTO.
- Amit Kapila, who is quietly banging out impressive work every release.
- Melanie Plageman, who creates jaw-droppingly good explanations of really hard topics (the query planner talk blew my mind).
- Peter Geoghegan, who goes right to the foundations and builds things up (unfortunately the btree talk he gave, which was a tour de force, is not online; perhaps this nbtree talk is an acceptable substitute).
I would also go to any talk Tom Lane felt like giving. And Tomas Vondra, and Thomas Munro, and Oleg Bartunov.
Learning the PostgreSQL code base is a journey of a million steps, that’s for sure. One thing that all my effective personal learning has had in common is a practical need. My learning has been in support of some actual work that needed to be done for my employer of the moment. It had motivation, a start point and an end point. That was really helpful.
Best of luck in your PostgreSQL journeys!
Additional Notes
Because of course I left out some stuff in the first draft:
- Once you are at the level of perusing source code, you should also be reading the contents of the pgsql-hackers mailing list, and the current commitfest patches. This will help you understand the culture of contribution, and also get a feel for where work is ongoing.
- One gateway to contribution is patch review, starting with basic things like “does it compile against main” and “does it do what the author says”. Read the other reviews, see what other reviewers look for, wash, rinse, repeat. I still think that having your own small patches put through the process is invaluable learning. Doing some review of other people’s patches is a way of “paying it forward” for the reviews your own patches receive.
- The amazing Oleg Bartunov wrote in to tell me there is now a second book on PostgreSQL internals, “PostgreSQL 14 Internals. A deep dive into the server mechanics”. It is currently in Russian, but an English translation is going on right now.