Some Great Things about PostgreSQL
17 May 2017I spent the last few months using PostgreSQL for real work, with real data, and I’ve been really loving some of the more esoteric features. If you use PostgreSQL on a regular basis, learning these tools can make your code a lot more readable and possibly faster too.
Distinct On
A number of the tables I had to work with included multiple historical records for each individual, but I was only interested in the most recent value. That meant that every query had to start with some kind of filter to pull off the latest value for joining to other tables.
It turns out that the PostgreSQL DISTINCT ON
syntax can spit out the right answer very easily:
No self-joining or complexity here, the tuple set is sorted into id/time order, and then the distinct on clause pulls the first entry (which is the most recent, thanks to the sorting) off of each id grouping.
Filtered Aggregates
I was doing a lot of reporting, so I built a BI-style denormalized reporting table, with a row for every entity of interest and a column for every variable of interest. Then all that was left was the reporting, which rolled up results across multiple groupings. The trouble was, the roll-ups were oftenly highly conditional: all entities with this condition A but not B, compared with those with B but not A, compared with all entities in aggregate.
Ordinarily this might involve embedding a big case statement for each conditional but with filtered aggregates we get a nice terse layout that also evaluates faster.
I would routinely end up with 20-line versions of this query, which spat out spreadsheets that analysts were extremely happy to take and turn into charts and graphs and even decisions.
Window Functions
My mind aches slightly when trying to formulate window functions, but I was still able to put them to use in a couple places.
First, even with a window wide enough to cover a whole table, window functions can be handy! Add a percentile column to a whole table:
Second, using ordinary aggregates in a window context can create some really groovy results. Want cumulated sales over store territories? (This might be better delegated to front-end BI display software, but…)
Alert readers will note the above example won’t provide a perfect output table if there are days without any sales at all, which brings me to a side note cool feature: PostgreSQL’s generate_series function (Regina Obe’s favourite function) supports generating time-based series!
Normally you’ll probably generate boring 1-day, or 1-week, or 1-month series, but the ability to generate arbitrarily stepped time series is pretty cool and useful. To solve the cumulation problem, you can just generate a full series of days of interest, and left join the calculated daily amounts to that, prior to cumulation in order to get a clean one-value-per-day cumulated result.
Left Join and Coalesce
This is not really an advanced technique, but it’s still handy. Suppose you have partial data on a bunch of sales from different sources and in different tables. You want a single table output that includes your best guess about the value, what’s the easiest way to get it? Left join and coalesce.
Start with a base table that includes all the sales you care about, left join all the potential sources of data, then coalesce the value you care about into a single output column.
The coalesce function takes the first non-NULL value it encounters in its parameters and returns that as the value. The practical effect is that, in the case where the first two tables have no rows for a particular base record, and the third does, the coalesce will skip past the first two and return the non-NULL value from the third. This is a great technique for compressing sparse multiple input sources into a terse usable single output.