Monday, August 10, 2015

Big Data and Data Science Piss Me Off

Get off my lawn!

I don't talk about this much, but I actually trained in statistics, not in computer science, and I've been getting slowly but progressively weirded out by the whole "big data" / "data science" thing. Because so much of it is bogus, or boys-with-toys or something.

Basically, my objections to the big data thing are the usual: probably your data is not big. It really isn't, and there are some great blog posts all about that.

So that's point number one: most people blabbing on about big data can fit their problem onto a big vertical machine and analyze it to their heart's content in R or something.

Point number two is less frequently touched upon: sure, you have 2 trillion records, but why do you need to look at all of them? The whole point of an education in statistics is to learn how to reason about a population using a random sample. So why are all these alleged "data scientists" firing up massive compute clusters to summarize every single record in their collections?

I'm guessing it's the usual reason: because they can. And because the current meme is that they should. They should stand up a 100 node cluster on AWS and bloody well count all 2 trillion of them. Because: CPUs.

But honestly, if you want to know the age distribution of people buying red socks, draw a sample of a couple hundred thousand records, and find out to within a fraction of a percentage point 19-times-out-of-20. After all, you're a freaking "data scientist", right?
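The arithmetic behind that claim is easy to check. Here's a sketch in SQL (the `purchases` table, its columns, and the 1-in-10,000 sampling rate are all hypothetical):

```sql
-- Worst-case (p = 0.5) margin of error at 95% confidence
-- ("19 times out of 20") for a 200,000-record sample:
SELECT 1.96 * sqrt(0.25 / 200000) AS margin_of_error;
-- about 0.0022: roughly a fifth of a percentage point

-- Drawing the sample from a hypothetical purchases table;
-- still one scan, but no 100-node cluster required:
SELECT avg(buyer_age) AS mean_age, stddev(buyer_age) AS sd_age
FROM purchases
WHERE product = 'red socks'
  AND random() < 0.0001;  -- keep roughly 1 row in 10,000
```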

Wednesday, July 15, 2015

BC IT Outsourcing 2014/15

If what goes up must come down, nobody told BC's IT outsourcers, because they continue to gobble up a larger chunk of the government pie every year.

The BC Public Accounts came out today, and I'm happy to say that the People Who Are Smarter Than You Are managed to book another record year of billings: a $468,549,154 spend, up 8% over last year.

It's not a victory unless you beat someone else, so good news:

  • Overall government revenue, up 5.4%
  • Overall government spending, up 2.4%
  • Health spending, up 2.8%
  • Education spending, up 0%
  • IT services spending up 8%!!!!

Don't be sad, kids and sick people, IT services folks are Adding Value and Finding Synergies in ways that you just can't. In the long run, workshopping the new Management Strategy Realignment Plan is just a better investment than fixing your gimpy hip, or hiring a teaching assistant to help Angry Jimmy focus on his work.

HP Advanced Solutions continues to dominate the category, adding $20M in billings this year alone (How many teachers could that hire? At least 200. Or even more teaching assistants.) In fact, two thirds of the billing growth this year was just HP.

There's also a new kid in the enterprise software vendor list to keep an eye on: Salesforce (SFDC) showed up with a wee $463,053 in billings this year. I expect that to increase mightily in coming years. However, the big money in SFDC work will not be earned by SFDC (even after locking up the entire BC government enterprise back-office, Oracle bills less than $10M a year in software maintenance), but by the consultants providing SFDC "implementation services" (Deloitte, CGI, HP). Watch for an SFDC goldrush as the government starts replacing expensive Oracle systems with... expensive SFDC systems in the cloud.

The best part about hiring big public enterprise IT companies like HP, Oracle, Maximus, and CGI to create lots of important Technology Process (and occasionally a bit of Product) for us isn't the soothingly glacial pace of progress or the fantastic billing rates. It's knowing that at least 20% of every public dollar spent goes straight to the bottom line of those companies, ensuring that shareholders and institutional investors survive through another year without undue financial hardship.

Until next year, keep on spending, British Columbia!

Monday, April 27, 2015

More Speech for Money

The BC Liberal government is changing the Elections Act to allow unlimited party and candidate spending up until one month before election day, and meanwhile, as usual, the media are transfixed by the shiny object in the corner.

The political pundits are making a great deal of noise (see V. Palmer's inside baseball assessment if you care) about an amendment to the Elections Act that says that:

"the chief electoral officer must provide … to a registered political party, in respect of a general election … a list of voters that indicates which voters on the list voted in the general election"

At the same time, they are ignoring the BC Liberals fundamentally changing the money dynamic of the fixed election date by eliminating the 60-day "pre-campaign" period.

"Section 198 is amended (a) by repealing subsections (1) and (2) and substituting the following: (1) In respect of a general election, the total value of election expenses incurred by a registered political party during the campaign period must not exceed $4.4 million."

The Elections Act currently divides up the election period before a fixed election into two "halves": the 60 days before the official campaign, and the campaign period itself (about 28 days if I recall correctly). In the first 60 days, candidates can spend a maximum of $70,000 and parties a maximum of $1.1 million. In the campaign period, candidates can spend another $70,000 and parties as much as $4.4 million.

The intent of the "pre-campaign" period is clearly to focus campaigning on the campaign period itself, by limiting the amount of early spending by parties. The "money density" of the pre-campaign period is about $18,000 / day in party spending; in the campaign period, it is almost $160,000 / day.

This is all very public-spirited, and contributes to a nice focussed election period. But (BUT!) the BC Liberals currently have more money than they know what to do with, so it is in their interest to be able to focus all that money as close to the event as possible. And rather than simply raising the pre-campaign spending limit, they went one better: they removed it altogether. They can spend unlimited amounts of money as close as 28 days before election day, 21 days before the opening of advance polls.

Let me repeat that: they can spend unlimited amounts of money.

So in British Columbia now, it is legal both to raise unlimited amounts of money from corporations, unions and individuals (and some individuals and corporations have, individually, donated over $100,000 a year to the BC Liberals), and to spend unlimited amounts of that money right up until 28 days before election day.

See any problems with that?

GIS "Data Models"

Most IT professionals have some expectation, having received a basic education on relational data modelling, that a model for a medium sized problem might look like this:

Why is it, then, that production GIS data flows so consistently produce models that look like this:

What is wrong with us?!?? I bring up this rant only because I was just told that some users find the PostgreSQL 1600 column limit constraining since it makes it hard to import the Esri census data, which are "modelled" into tables that are presumably wider than they are long.

Saturday, March 21, 2015

Magical PostGIS

I did a new PostGIS talk for FOSS4G North America 2015, an exploration of some of the tidbits I've learned over the past six months about using PostgreSQL and PostGIS together to make "magic" (any sufficiently advanced technology...)


Friday, March 20, 2015

Making Lines from Points

Somehow I've gotten through 10 years of SQL without ever learning this construction, which I found while proofreading a colleague's blog post, and which looked so unlikely that I had to test it before I believed it actually worked. Just goes to show, there's always something new to learn.

Suppose you have a GPS location table:

  • gps_id: integer
  • geom: geometry
  • gps_time: timestamp
  • gps_track_id: integer

You can get a correct set of lines from this collection of points with just this SQL:

SELECT gps_track_id,
  ST_MakeLine(geom ORDER BY gps_time ASC) AS geom
FROM gps_points
GROUP BY gps_track_id;

Those of you who already knew about placing ORDER BY within an aggregate function are going "duh", and the rest of you are, like me, going "whaaaaaa?"

Prior to this, I would solve this problem by ordering all the groups in a CTE or sub-query first, and only then pass them to the aggregate make-line function. This, is, so, much, nicer.
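For contrast, here is a sketch of that older style against the same hypothetical gps_points table: sort everything in a subquery first, then aggregate (relying on the aggregate consuming rows in subquery order):

```sql
-- The pre-ORDER-BY-in-aggregate approach: order rows in a
-- subquery, then hope/know the aggregate sees them in order.
SELECT gps_track_id,
  ST_MakeLine(geom) AS geom
FROM (
  SELECT gps_track_id, geom
  FROM gps_points
  ORDER BY gps_track_id, gps_time
) AS ordered_points
GROUP BY gps_track_id;
```

It works, but the ORDER BY inside the aggregate says what you mean in a single line.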

Wednesday, March 18, 2015

Deloitte's Second Act

Hot off their success transforming the BC social services sector with "integrated case management", Deloitte is now heavily staffing the upcoming transformation of the IT systems that underpin our natural resource management ministries.

Interlude: I should briefly note here that Deloitte's work in social services involved building a $180,000,000 case management system that the people who use it generally do not like, built on software that nobody else uses for social services, that went offline for several consecutive days last year, and that basically entered end-of-life almost five years ago. I'm sure that's not Deloitte's fault; they are only the international experts hired to advise on the best ways to build the system and then actually build it.

[Project diagram: so many shiny arrows! Smells like management consultants...]


The brain trust has now decided that the thing we need on the land base is "integrated decision making", presumably because everything tastes better "integrated". A UVic MPA student has done a complete write-up of the scheme—and I challenge you to find the hard centre inside this chewy mess of an idea—but here's a representative sample:

The IDM initiative is an example of horizontal management because it is an initiative among non-hierarchical ministries focused on gaining efficiencies by harmonizing regulations, IT systems and business processes for the betterment of the NRS as a whole. Horizontal management is premised on joint or consensual decision making rather than a more traditional vertical hierarchy. Horizontal collaborations create links and share information, goodwill, resources, and power or capabilities by organizations in two or more sectors to achieve jointly what they cannot achieve individually.

Sounds great, right!?! Just the sort of thing I'd choose to manage billions of dollars in natural resources! (I jest.)

Of course, the brain trust really isn't all that interested in "horizontal management", what has them hot and bothered about "integrated decision making" is that it's an opportunity to spend money on "IT systems and business processes". Yay!

To that end, they carefully prepared a business case for Treasury Board, asking for well north of $100M to rewrite every land management system in government. Forests, lands, oil and gas, heritage, the whole kit and caboodle. The business case says:

IDM will improve the ability of the six ministries and many agencies in the NRS to work together to provide seamless, high-quality service to proponents and the public, to provide effective resource stewardship across the province, to effectively consult with First Nations in natural resource decisions, and to contribute to cross-government priorities.

Sounds ambitious! I wonder how they're going to accomplish this feat of re-engineering? Well, I'm going to keep on wondering, because they redacted everything in the business case except the glowing hyperbole.

However, even though we don't know how, or really why, they are embarking on this grand adventure, we can rest assured that they are now spending money at a rate of about $10M / year making it happen, much of it on our good friends Deloitte.

  • There are currently 80 consultants billing on what has been christened the "Natural Resource Sector Transformation Secretariat".


  • Of those consultants 34 are (so far) from Deloitte.
  • Coincidentally, 34 is also the number of government staff working at the Secretariat.
  • So, 114 staff, of which 34 are government employees and the rest are contractors. How many government employees does it take to change a lightbulb? Let me take that to procurement and I'll get back to you.

The FOI system charged me $120 (and only after I bargained my request down to a much less informative one) to find out the above, because they felt the information did not meet the test of being "of public interest". If you feel it actually is in the public interest to learn where our $100M in natural resource IT spending is going, and you live in BC, please leave me a comment on this post.

Interlude: The test for whether fees should be waived is double barrelled, but is (hilariously) decided by the public body itself (soooo unbiased). Here are the tests I think I pass (but they don't):

  1. Do the records show how the public body is allocating financial or other resources?
  2. Is your primary purpose to disseminate information in a way that could reasonably be expected to benefit the public, or to serve a private interest?

I'm still digging for more information (like, how is it that Deloitte can bill out 34 staff on this project when there hasn't been a major RFP for it yet?) so stay tuned and send me any hints if you have them.

Thursday, February 19, 2015

GeoTiff Compression for Dummies

"What's the best image format for map serving?" they ask me, shortly after I tell them not to serve their images from inside a database.

"Is it MrSid? Or ECW? Those are nice and small." Which indeed they are. Unfortunately, outside of proprietary image server software I've never seen them be fast and nice and small at the same time. Generally the decode step is incredibly CPU intensive, presumably because of the fancy wavelet math that makes them so small in the first place.

"So, what's the best image format for map serving?".

In my experience, the best format for image serving, using open source rendering engines (MapServer, GeoServer, Mapnik) is: GeoTIFF, with JPEG compression, internally tiled, in the YCBCR color space, with internal overviews. Unfortunately, GeoTiffs are almost never delivered this way, as I was reminded today while downloading a sample image from the City of Kamloops. (But nonetheless, thanks for the great free imagery, Kamloops!)

It came in a 593Mb ZIP file. "Hm, that's pretty big," I thought. I unzipped it.

5255C.tif [515M]

Unzipped it was a 515Mb TIF file. That's right, it was smaller "uncompressed". Why? Because internally it was already compressed, and applying the ZIP compression algorithm to already compressed data generally fluffs it up a little. Whoops.

The default TIFF compression is, unfortunately, "deflate", the same as that used for ZIP. This is a lossless encoding, but not very good for imagery. We can make the image a whole lot smaller just by using a more appropriate compression, like JPEG. We'll also tile it internally while we're at it. Internal tiling allows renderers to quickly pick out and decompress just a small portion of the image, which is important once you've applied a more serious compression algorithm like JPEG.

gdal_translate \
  -co COMPRESS=JPEG \
  -co TILED=YES \
  5255C.tif 5255C_JPEG.tif

This is much better, now we have a vastly smaller file.

5255C_JPEG.tif [67M]

But we can still do better! For reasons well past my understanding, the JPEG algorithm is more effective against images that are stored in the YCBCR color space. Mine is not to reason why, though.

gdal_translate \
  -co COMPRESS=JPEG \
  -co PHOTOMETRIC=YCBCR \
  -co TILED=YES \
  5255C.tif 5255C_JPEG_YCBCR.tif

Wow, now we're down to 1/20 the size of the original.

5255C_JPEG_YCBCR.tif [24M]

But, we've applied a "lossy" algorithm, JPEG, maybe we've ruined the data! Let's have a look.

[Image comparison: original vs. after JPEG/YCBCR compression]

Can you see the difference? Me neither. Using a JPEG "quality" level of 75%, there are no visible artefacts. In general, JPEG is very good at compressing things so humans "can't see" the lost information. I'd never use it for compressing a DEM or a data raster, but for a visual image, I use JPEG with impunity, and with much lower quality settings too (for more space saved).

Finally, for high speed serving at more zoomed out scales, we need to add overviews to the image. We'll make sure the overviews use the same, high compression options as the base data.

gdaladdo \
  --config COMPRESS_OVERVIEW JPEG \
  --config PHOTOMETRIC_OVERVIEW YCBCR \
  --config INTERLEAVE_OVERVIEW PIXEL \
  -r average \
  5255C_JPEG_YCBCR.tif \
  2 4 8 16

For reasons passing understanding, gdaladdo uses a different set of command-line switches to pass the configuration info to the compressor than gdal_translate does, but as before, mine is not to reason why.

The final size, now with overviews as well as the original data, is still less than 1/10 the size of the original.

5255C_JPEG_YCBCR.tif [37M]

So, to sum up, your best format for image serving is:

  • GeoTiff, so you can avoid proprietary image formats and nonsense, with
  • JPEG compression, for visually fine results with much space savings, and
  • YCBCR color, for even smaller size, and
  • internal tiling, for fast access of random squares of data, and
  • overviews, for fast access of zoomed out views of the data.

Go forth and compress!

Friday, February 06, 2015

Breaking a Linestring into Segments

Like doing a sudoku, solving a "simple yet tricky" problem in spatial SQL can grab one's mind and hold it for a while. Someone on the PostGIS IRC channel was trying to "convert a linestring into a set of two-point segments", using an external C++ program, and I thought: "hm, I'm sure that's doable in SQL".

And sure enough, it is, though the syntax for referencing out the parts of the dump objects makes it look a little ugly.
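One way to sketch it (against a hypothetical `lines(id, geom)` table): dump each linestring to its vertices with ST_DumpPoints, then join each vertex to the next one using the dump's path ordinal. The `(dp).path[1]` syntax for reaching into the dump object is the ugly bit.

```sql
WITH pts AS (
  SELECT id,
         (dp).path[1] AS seq,   -- vertex ordinal within the line
         (dp).geom    AS pt
  FROM (
    SELECT id, ST_DumpPoints(geom) AS dp
    FROM lines
  ) AS dumped
)
-- Pair each vertex with its successor to form a two-point segment
SELECT a.id,
       ST_MakeLine(a.pt, b.pt) AS segment
FROM pts a
JOIN pts b
  ON b.id = a.id
 AND b.seq = a.seq + 1;
```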

Monday, February 02, 2015

The New Gig

I haven't had many jobs in my career, so changing jobs feels pretty momentous: two weeks ago I had my last day at Boundless, and today will be my first at CartoDB.

I started with Boundless back in 2009 when it was OpenGeo and still a part of the Open Planning Project, a weird non-profit arm of a New York hedge fund millionaire's corporate archipelago. (The hedgie, Mark Gorton, is still going strong, despite the brief set-back he endured when LimeWire was sued by RIAA.) For that six year run, I was fortunate to have a lead role in articulating what it meant to "do open source" in the geospatial world, and to help to build OpenGeo into a self-supporting open source enterprise. We grew, spun out of the non-profit, gained lots of institutional customers, and I got to meet and work with lots of quality folks. After six years though, I feel like I need a change, an opportunity to learn some new things and meet some new people: To move from the enterprise space, to the consumer space.

So I was very lucky when a new opportunity came along: to work for a company that is reimagining what it means to be a spatial database in a software-as-a-service world. Under the covers, CartoDB uses my favorite open source spatial database, PostGIS, to run their platform, and working for CartoDB gives me a chance to talk about and to work on something I like almost as much as (more than?) open source: spatial SQL! The team at CartoDB have done a great job with their platform, providing a simple entry-point into map making, while still leaving the power of SQL exposed and available, so that users can transition from beginner, to explorer, to power user. As someone who currently only knows a portion of their technology (the SQL bit), I'm looking forward to experiencing the rest of their platform as a beginner. I also know the platform folks will have lots of good questions for me on PostGIS internals, and we'll have many interesting conversations about how to keep pushing PostgreSQL and PostGIS to the limits.

My two-week between-jobs break was refreshing, but sometimes a change is as good as a rest too. I enjoyed my last six years with Boundless and I'm looking forward to the future with CartoDB.
