The Open Source Support Company Trap

Matt Asay asks a question about big data vendors, which, from a business-model point of view, are mostly “open source support” companies: HortonWorks as a pure play, and others following open-core and enhanced-OSS models.

My experience has been that open source support companies fall into a market trap. They support good projects that are popular and widely deployed. It seems like a good place to be, but the market keeps screwing them.

TL;DR

Potential open source support customers are either:

  • Small enough to need your expert support, but too small and cheap to supply company-supporting levels of revenue; or,
  • Large enough to figure it out for themselves, so either take a quick hit of consulting up front or just ignore you and self-support from the start.

Small Customers Suck

Small customers lack the technical wherewithal to use the product unassisted, so they need the support, but they are highly price sensitive while also being a large drain on support hours (because they really do need a lot of help).

You can drop your prices to try and load up a lot of these (“how do we do it? volume!”) but there are rarely enough fish in the pool to actually run the size of business you need to achieve sustainability.

How large? Large enough that the recurring revenue is enough to support sales, marketing, cost of sales, and a handful of core developers who aren’t tied to support contracts. Not huge, but well beyond the scale of a dev-and-friend-in-basement.

If you do drop your prices to try and bring in the small customers, you immediately run into a problem with big customers, who will want big-customer questions answered at your small-customer price point.

Big Customers Suck, Then Leave

Big customers are frequently drawn to the scaling aspects of open source: deploy 1 instance, deploy 100, the capital cost remains the same.

In theory, there should be room in there for a “one throat to choke” support opportunity for big customers with big deployments, and with competent sales work up front, a good support deal will monetize at least some of the complexity inherent in a large deployment.

In practice, instead of being long term recurring revenue, the big customers end up being short term consulting gigs. A deal is signed, the customer’s team learns the ropes, with lots of support hours from top level devs on your team, and the deployment goes live. Then things settle down and there is a quick scaling back of support payments: year one is great, year two is OK, year three they’re backing away, year four they’re gone.

While there is room to grow a business on this terrain, particularly if the customers are Really Really Big, the constant fade-out of recurring revenue means that the business model is that of a high-end consultancy, not a recurring-revenue support company.

CURL Options and Availability Version

I’ve been adding support for arbitrary CURL options to http-pgsql, and have bumped up against the classic problem with linking external libraries: different functionality in different versions.

CURL has a huge collection of options, and different versions support different subsets of them, but for any given option, which versions support it? That turns out to be a fiddly question to answer in general. Each option’s documentation page lists its availability, but finding the availability of every option is a pain.

So, I’ve scraped the documentation and put the answers in a CSV file, along with the hex-encoded version number that CURL exposes in #define LIBCURL_VERSION_NUM.
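
As an aside, LIBCURL_VERSION_NUM packs the major, minor and patch numbers into one byte each (0xXXYYZZ), so a release like 7.54.0 shows up as 0x073600. Here’s a quick SQL sketch of unpacking one of those values (the literal is just an example value):

SELECT (x'073600'::int >> 16) & 255 AS major,  -- 7
       (x'073600'::int >> 8)  & 255 AS minor,  -- 54
       (x'073600'::int)       & 255 AS patch;  -- 0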

If you need to re-run the table as new versions come out, the Python script is also in the Gist.

Project or Product?

There are several nuggets in this webinar by David Eaves, but the one that really tickled my brain was the distinction between two almost identically named roles: project manager and product manager.

The lexical difference is, of course, very small:

# SELECT levenshtein('project manager', 'product manager'); -- from the fuzzystrmatch extension

 levenshtein 
-------------
           2
(1 row)

The functional difference is extremely large:

  • The project manager is optimizing for budget and schedule. Is the project on time? Are you delivering according to your agreed schedule? Then your project manager is doing a good job.
  • The product manager is optimizing for user satisfaction. Is the product fast and easy to use? Are the users happy to adopt it? Then your product manager is doing a good job.

The BC government has a project management office and in some respects it shows: big projects like ICM, MyEdBC and the iHealth systems have been delivered within their (very long) schedules and (incredibly huge) budget envelopes (plus or minus a bit).

On the other hand, the projects above have also been catastrophically bad for users, rolling out with big performance failures and lots of user push-back. In the case of the Nanaimo iHealth project, doctors have actually been suspended for refusing to use the system. Now there’s a system that needs some user acceptance testing!

The “product manager” role is one that’s very common in the private-sector IT world, certainly at the big Silicon Valley firms and the last two start-ups I’ve worked with. It’s not one I’ve seen much in the government space, with the exception of “digital transformation” carve-outs like GDS or USDS.

Project or Product?

Delivering junk on time and on budget isn’t success, and neither is delivering a great system 2 years late and 100% over budget. Some kind of co-equal arrangement seems like a good idea:

Product and project managers see the same work through different lenses. And that’s a good thing when you are trying to achieve something special like bringing a new product to market as I was. But they both work for the same team. And when they join forces to collaborate, everyone benefits and the company wins.
- Ron Yang

Comox Valley 2013 Absentee Ballots

For all the electrons spilled speculating on what trends might apply to the Courtenay-Comox absentee ballots being counted next week, I feel like I haven’t seen the actual numbers from 2013 in print anywhere, so here they are from the 2013 Statement of Votes:

Section                       GP     NDP    CP    LIB    Total   %
s. 98 Special                 20     83     8     57     168     4.8%
s. 99 Absentee - in ED        219    607    86    560    1472    42.0%
s. 100 Absentee - out of ED   42     132    6     111    291     8.3%
s. 101 Absentee - advance     8      41     3     41     92      2.7%
s. 104 Voting in DEO office   119    519    74    601    1313    37.5%
s. 106 Voting by mail         18     74     15    61     168     4.8%
Total                         426    1456   192   1431   3505    100%
%                             12.2%  41.5%  5.5%  40.8%  100%    -

Some caveats:

  • Redistribution made the 2017 riding somewhat weaker for the NDP than it was in 2013. (Advantage: Liberals)
  • In 2017 the NDP candidate did somewhat better than in 2013. (Advantage: NDP)
  • In 2013 the NDP candidate lost the riding but (barely) won the absentee tally. (Advantage: NDP)

With those caveats in mind, the final conclusion: anyone who tells you that there’s a predictable direction the absentee ballot will go based on past results is blowing smoke up your ***.

Some Great Things about PostgreSQL

I spent the last few months using PostgreSQL for real work, with real data, and I’ve been really loving some of the more esoteric features. If you use PostgreSQL on a regular basis, learning these tools can make your code a lot more readable and possibly faster too.

Distinct On

A number of the tables I had to work with included multiple historical records for each individual, but I was only interested in the most recent value. That meant that every query had to start with some kind of filter to pull off the latest value for joining to other tables.

It turns out that the PostgreSQL DISTINCT ON syntax can spit out the right answer very easily:

SELECT DISTINCT ON (order_id) orders.*
FROM orders
ORDER BY orders.order_id, orders.timestamp DESC

No self-joining or complexity here: the tuple set is sorted into id/time order, and then the DISTINCT ON clause pulls the first entry (which is the most recent, thanks to the sorting) off of each id grouping.
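
In practice that filter usually ended up at the top of a query as a CTE feeding a join, something like this sketch (the customer_status history table and customers table here are hypothetical stand-ins):

WITH latest_status AS (
    -- One row per customer: the most recent status record
    SELECT DISTINCT ON (customer_id)
        customer_id, status, timestamp
    FROM customer_status
    ORDER BY customer_id, timestamp DESC
)
SELECT customers.name, latest_status.status
FROM customers
JOIN latest_status USING (customer_id);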

Filtered Aggregates

I was doing a lot of reporting, so I built a BI-style denormalized reporting table, with a row for every entity of interest and a column for every variable of interest. Then all that was left was the reporting, which rolled up results across multiple groupings. The trouble was, the roll-ups were often highly conditional: all entities with condition A but not B, compared with those with B but not A, compared with all entities in aggregate.

Ordinarily this might involve embedding a big CASE expression for each conditional, but with filtered aggregates we get a nice terse layout that also evaluates faster (a sketch of the CASE version is below, for comparison).

SELECT
    store_territory,
    Count(*) FILTER (WHERE amount < 5.0) 
        AS cheap_sales_count,
    Sum(amount) FILTER (WHERE amount < 5.0) 
        AS cheap_sales_amount,
    Count(*) FILTER (WHERE amount < 5.0 AND customer_mood = 'good') 
        AS cheap_sales_count_happy,
    Sum(amount) FILTER (WHERE amount < 5.0 AND customer_mood = 'good')
        AS cheap_sales_amount_happy
FROM bi_table
GROUP BY store_territory

I would routinely end up with 20-line versions of this query, which spat out spreadsheets that analysts were extremely happy to take and turn into charts and graphs and even decisions.
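
For comparison, here’s a rough sketch of the first two columns written with embedded CASE expressions instead of FILTER clauses (same hypothetical bi_table columns as above):

SELECT
    store_territory,
    Count(CASE WHEN amount < 5.0 THEN 1 END) 
        AS cheap_sales_count,
    Sum(CASE WHEN amount < 5.0 THEN amount END) 
        AS cheap_sales_amount
FROM bi_table
GROUP BY store_territory;

The aggregates ignore the NULLs that the ELSE-less CASE produces, so the results match, but every extra condition buries the intent a little deeper.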

Window Functions

My mind aches slightly when trying to formulate window functions, but I was still able to put them to use in a couple places.

First, even with a window wide enough to cover a whole table, window functions can be handy! Add a percentile column to a whole table:

SELECT bi_table.*, 
    ntile(100) OVER (ORDER BY amount) 
        AS amount_percentile
FROM bi_table

Second, using ordinary aggregates in a window context can create some really groovy results. Want cumulated sales over store territories? (This might be better delegated to front-end BI display software, but…)

WITH daily_amounts AS (
    SELECT 
        sum(amount) AS amount,
        store_territory,
        date(timestamp) AS date
    FROM bi_table
    GROUP BY store_territory, date
)
SELECT 
    sum(amount) OVER (PARTITION BY store_territory ORDER BY date) 
        AS amount_cumulate,
    store_territory, date
FROM daily_amounts

Alert readers will note the above example won’t provide a perfect output table if there are days without any sales at all, which brings me to a cool side-note feature: PostgreSQL’s generate_series function (Regina Obe’s favourite function) supports generating time-based series!

SELECT generate_series(
    '2017-01-01'::date, 
    '2017-01-10'::date, 
    '18 hours'::interval);

Normally you’ll probably generate boring 1-day, 1-week, or 1-month series, but the ability to generate arbitrarily stepped time series is pretty cool and useful. To solve the cumulation problem, you can generate a full series of the days of interest and left join the calculated daily amounts to it before cumulating, to get a clean one-value-per-day cumulated result.
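
Here’s a sketch of that gap-filling step, reusing the hypothetical bi_table and the daily_amounts rollup from above (the date range is made up):

WITH daily_amounts AS (
    SELECT 
        sum(amount) AS amount,
        store_territory,
        date(timestamp) AS date
    FROM bi_table
    GROUP BY store_territory, date
),
all_days AS (
    -- One row per day in the reporting window, even days with no sales
    SELECT generate_series(
        '2017-01-01'::date, 
        '2017-01-31'::date, 
        '1 day'::interval)::date AS date
),
territory_days AS (
    -- Every territory paired with every day
    SELECT store_territory, date
    FROM (SELECT DISTINCT store_territory FROM bi_table) t
    CROSS JOIN all_days
)
SELECT 
    store_territory, date,
    sum(coalesce(amount, 0)) 
        OVER (PARTITION BY store_territory ORDER BY date) 
        AS amount_cumulate
FROM territory_days
LEFT JOIN daily_amounts USING (store_territory, date)
ORDER BY store_territory, date;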

Left Join and Coalesce

This is not really an advanced technique, but it’s still handy. Suppose you have partial data on a bunch of sales from different sources and in different tables. You want a single table output that includes your best guess about the value. What’s the easiest way to get it? Left join and coalesce.

Start with a base table that includes all the sales you care about, left join all the potential sources of data, then coalesce the value you care about into a single output column.

SELECT
    base.order_id,
    Coalesce(oi1.order_name, oi2.order_name, oi3.order_name) 
        AS order_name
FROM base
LEFT JOIN order_info_1 oi1 USING (order_id)
LEFT JOIN order_info_2 oi2 USING (order_id)
LEFT JOIN order_info_3 oi3 USING (order_id)

The coalesce function returns the first non-NULL value it encounters among its parameters. The practical effect is that, in the case where the first two tables have no rows for a particular base record and the third does, the coalesce skips past the first two and returns the non-NULL value from the third. This is a great technique for compressing sparse multiple input sources into a terse, usable single output.
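
A trivial illustration of that skipping behaviour (the values are made up):

-- The first two sources are NULL, so the third one wins
SELECT Coalesce(NULL, NULL, 'Order name from source 3') AS order_name;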