GeoTiff Compression for Dummies

“What’s the best image format for map serving?” they ask me, shortly after I tell them not to serve their images from inside a database.

“Is it MrSID? Or ECW? Those are nice and small.” Which indeed they are. Unfortunately, outside of proprietary image server software I’ve never seen them be fast and nice and small at the same time. Generally the decode step is incredibly CPU intensive, presumably because of the fancy wavelet math that makes them so small in the first place.

“So, what’s the best image format for map serving?”

In my experience, the best format for image serving, using open source rendering engines (MapServer, GeoServer, Mapnik) is: GeoTIFF, with JPEG compression, internally tiled, in the YCBCR color space, with internal overviews. Unfortunately, GeoTiffs are almost never delivered this way, as I was reminded today while downloading a sample image from the City of Kamloops (But nonetheless, thanks for the great free imagery, Kamloops!) [593M]

It came in a 593MB ZIP file. “Hm, that’s pretty big,” I thought. I unzipped it.

5255C.tif [515M]

Unzipped, it was a 515MB TIF file. That’s right, it was smaller “uncompressed”. Why? Because internally it was already compressed, and applying the ZIP compression algorithm to already compressed data generally fluffs it up a little. Whoops.
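If you want to see the fluffing-up for yourself, here is a minimal Python sketch. It is my illustration, not anything derived from the Kamloops file: the word list, seed and sizes are made up. It deflates some compressible text once, then deflates the result a second time, and the second pass should come out slightly larger than the first.

```python
import random
import zlib

# Hypothetical stand-in for compressible raster data: text built from a
# small vocabulary, with a fixed seed so the run is repeatable.
rng = random.Random(42)
vocabulary = ["road", "river", "forest", "roof", "field", "lake", "rail", "park"]
raw = " ".join(rng.choice(vocabulary) for _ in range(100_000)).encode("ascii")

once = zlib.compress(raw, 9)    # first pass: plenty of redundancy to squeeze out
twice = zlib.compress(once, 9)  # second pass: the input already looks like noise

print(f"raw:            {len(raw):>8} bytes")
print(f"deflated once:  {len(once):>8} bytes")
print(f"deflated twice: {len(twice):>8} bytes")
```

The second pass has nothing left to remove, so all it contributes is its own header and block overhead.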

The compression used inside this TIFF is, unfortunately, “deflate”, the same as that used for ZIP. This is a lossless encoding, but not very good for imagery. We can make the image a whole lot smaller just by using a more appropriate compression, like JPEG. We’ll also tile it internally while we’re at it. Internal tiling allows renderers to quickly pick out and decompress just a small portion of the image, which is important once you’ve applied a more serious compression algorithm like JPEG.

gdal_translate \
  -co COMPRESS=JPEG \
  -co TILED=YES \
  5255C.tif 5255C_JPEG.tif

This is much better: now we have a vastly smaller file.

5255C_JPEG.tif [67M]

But we can still do better! For reasons that well pass my understanding, the JPEG algorithm is more effective against images that are stored in the YCBCR color space. Mine is not to reason why, though.

gdal_translate \
  -co COMPRESS=JPEG \
  -co PHOTOMETRIC=YCBCR \
  -co TILED=YES \
  5255C.tif 5255C_JPEG_YCBCR.tif

Wow, now we’re down to 1/20 the size of the original.

5255C_JPEG_YCBCR.tif [24M]
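For the curious, the usual explanation is that YCbCr separates brightness (Y) from color (Cb, Cr), and JPEG encoders typically store the two color channels at reduced resolution, since the eye is less sensitive to color detail. The sketch below only illustrates that idea. It assumes the JFIF full-range conversion and 2x2 chroma subsampling, the function names are hypothetical, and it is not what GDAL literally runs.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (JFIF full-range formulas)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round and clamp each channel back into the 0..255 range
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))


def stored_samples(width, height, chroma_subsample=2):
    """Count samples kept for RGB versus YCbCr with subsampled chroma.

    Assumes width and height are multiples of chroma_subsample.
    """
    rgb = 3 * width * height
    chroma_plane = (width // chroma_subsample) * (height // chroma_subsample)
    ycbcr = width * height + 2 * chroma_plane
    return rgb, ycbcr


print(rgb_to_ycbcr(128, 128, 128))  # neutral grey: chroma sits at the midpoint
print(stored_samples(256, 256))     # samples to encode, RGB vs subsampled YCbCr
```

Under those assumptions the encoder starts with half as many samples before any JPEG math happens, which goes some way to explaining the smaller file.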

But we’ve applied a “lossy” algorithm, JPEG. Maybe we’ve ruined the data! Let’s have a look.

[Image comparison: Original | After JPEG/YCBCR]

Can you see the difference? Me neither. Using a JPEG “quality” level of 75%, there are no visible artefacts. In general, JPEG is very good at compressing things so humans “can’t see” the lost information. I’d never use it for compressing a DEM or a data raster, but for a visual image, I use JPEG with impunity, and with much lower quality settings too (for more space saved).

Finally, for high speed serving at more zoomed out scales, we need to add overviews to the image. We’ll make sure the overviews use the same, high compression options as the base data.

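gdaladdo \
  --config COMPRESS_OVERVIEW JPEG \
  --config PHOTOMETRIC_OVERVIEW YCBCR \
  --config INTERLEAVE_OVERVIEW PIXEL \
  -r average \
  5255C_JPEG_YCBCR.tif \
  2 4 8 16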

For reasons passing understanding, gdaladdo uses a different set of command-line switches to pass the configuration info to the compressor than gdal_translate does, but as before, mine is not to reason why.

The final size, now with overviews as well as the original data, is still less than 1/10 the size of the original.

5255C_JPEG_YCBCR.tif [37M]
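As a quick sanity check on those fractions, here is a small Python sketch using only the file sizes reported above. The labels are mine, and the overview calculation assumes each level N simply holds 1/N² as many pixels as the base image.

```python
# File sizes in megabytes, as reported in the post
original_mb = 515
reported_mb = {
    "JPEG, tiled": 67,
    "JPEG + YCBCR, tiled": 24,
    "JPEG + YCBCR + overviews": 37,
}

# Size of each result as a fraction of the original file
ratios = {label: size / original_mb for label, size in reported_mb.items()}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.3f} of the original (about 1/{round(1 / ratio)})")

# Pixel overhead of the 2/4/8/16 overview levels relative to the base image
overview_factors = (2, 4, 8, 16)
extra_pixels = sum(1 / (f * f) for f in overview_factors)
print(f"overview pixels: {extra_pixels:.1%} on top of the base image")
```

The overviews add about a third more pixels, while the file grew from 24M to 37M. Byte growth need not track pixel growth, since each level compresses differently.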

So, to sum up, your best format for image serving is:

  • GeoTiff, so you can avoid proprietary image formats and nonsense, with
  • JPEG compression, for visually fine results with much space savings, and
  • YCBCR color, for even smaller size, and
  • internal tiling, for fast access of random squares of data, and
  • overviews, for fast access of zoomed out views of the data.
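If you have a whole directory of images to convert, the recipe is easy to script. Here is a minimal Python sketch that just assembles the two command lines. The function name is hypothetical, the options are the ones discussed in this post, and actually running them assumes the GDAL utilities are installed and on your PATH.

```python
import shlex


def build_commands(src, dst, levels=(2, 4, 8, 16)):
    """Return the gdal_translate and gdaladdo argument lists for one image."""
    translate = [
        "gdal_translate",
        "-co", "COMPRESS=JPEG",      # JPEG compression
        "-co", "PHOTOMETRIC=YCBCR",  # YCBCR color space
        "-co", "TILED=YES",          # internal tiling
        src, dst,
    ]
    overviews = [
        "gdaladdo",
        "--config", "COMPRESS_OVERVIEW", "JPEG",
        "--config", "PHOTOMETRIC_OVERVIEW", "YCBCR",
        "--config", "INTERLEAVE_OVERVIEW", "PIXEL",
        "-r", "average",
        dst,
        *[str(level) for level in levels],
    ]
    return [translate, overviews]


commands = build_commands("5255C.tif", "5255C_JPEG_YCBCR.tif")
for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it
    print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than one big string avoids quoting trouble if a filename has spaces in it.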

Go forth and compress!

Breaking a Linestring into Segments

Like doing a sudoku, solving a “simple yet tricky” problem in spatial SQL can grab one’s mind and hold it for a period. Someone on the PostGIS IRC channel was trying to “convert a linestring into a set of two-point segments”, using an external C++ program, and I thought: “hm, I’m sure that’s doable in SQL”.

And sure enough, it is, though the syntax for referencing out the parts of the dump objects makes it look a little ugly.
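The core of the trick, in any language, is pairing each vertex with the one that follows it. The sketch below shows that logic in plain Python rather than spatial SQL, so it is not the query from the IRC discussion. The function names and the sample coordinates are made up for illustration.

```python
def linestring_to_segments(coords):
    """Split a list of (x, y) vertices into consecutive two-point segments."""
    # Pair vertex i with vertex i+1; an n-point line gives n-1 segments
    return list(zip(coords, coords[1:]))


def segment_to_wkt(segment):
    """Format one two-point segment as a WKT LINESTRING."""
    (x1, y1), (x2, y2) = segment
    return f"LINESTRING({x1:g} {y1:g}, {x2:g} {y2:g})"


line = [(0, 0), (1, 1), (2, 2), (3, 1)]
segments = linestring_to_segments(line)
for seg in segments:
    print(segment_to_wkt(seg))
```

A line with fewer than two vertices yields no segments at all, which seems like the right answer.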

The New Gig

I haven’t had many jobs in my career, so changing jobs feels pretty momentous: two weeks ago I had my last day at Boundless, and today will be my first at CartoDB.

I started with Boundless back in 2009 when it was OpenGeo and still a part of the Open Planning Project, a weird non-profit arm of a New York hedge fund millionaire’s corporate archipelago. (The hedgie, Mark Gorton, is still going strong, despite the brief set-back he endured when LimeWire was sued by RIAA.)

For that six year run, I was fortunate to have a lead role in articulating what it meant to “do open source” in the geospatial world, and to help to build OpenGeo into a self-supporting open source enterprise. We grew, spun out of the non-profit, gained lots of institutional customers, and I got to meet and work with lots of quality folks.

After six years though, I feel like I need a change, an opportunity to learn some new things and meet some new people: To move from the enterprise space, to the consumer space.

So I was very lucky when a new opportunity came along: to work for a company that is reimagining what it means to be a spatial database in a software-as-a-service world. Under the covers, CartoDB uses my favorite open source spatial database, PostGIS, to run their platform, and working for CartoDB gives me a chance to talk about and to work on something I like almost as much as (more than?) open source: spatial SQL!

The team at CartoDB have done a great job with their platform, providing a simple entry-point into map making, while still leaving the power of SQL exposed and available, so that users can transition from beginner, to explorer, to power user.

As someone who currently only knows a portion of their technology (the SQL bit), I’m looking forward to experiencing the rest of their platform as a beginner. I also know the platform folks will have lots of good questions for me on PostGIS internals, and we’ll have many interesting conversations about how to keep pushing PostgreSQL and PostGIS to the limits.

My two week between-jobs break was refreshing, but sometimes a change is as good as a rest too. I enjoyed my last six years with Boundless and I’m looking forward to the future with CartoDB.

Can Procurement do Agile?

Some very interesting news out of the USA today, as the GSA (the biggest, baddest procurement agency in the world) has released a Request For Information (RFI) on agile technology delivery.

To shift the software procurement paradigm, GSA’s 18F Team and the Office of Integrated Technology Services (ITS) is collaborating on the establishment of a BPA that will feature vendors who specialize in Agile Delivery Services. The goal of the proposed BPA is to decrease software acquisition cycles to less than four weeks (from solicitation to contract) and expedite the delivery of a minimum viable product (MVP) within three months or less.

In a wonderful “eat your own dogfood” move, the team working on building this new procurement vehicle are themselves adopting agile practices in their own process. Starting small with a pilot, working directly with the vendors who will be trying the new vehicle, etc. If the hidebound old GSA can develop a workable framework for agile procurement, then nobody else has an excuse.

(The reason procurement agencies have found it hard to specify “agile” is that agile by design does not define precise deliverables in advance, so it is damnably hard to fit into a “fixed cost bid” structure. In places where time-and-materials vehicles are already in place, lots of government organizations are already working with vendors in an agile way, but for the kinds of big, boondoggle-prone capital investment projects I write about, the waterfall model still predominates.)

Give in, to the power of the Client Side...

Brian Timoney called out this Tom MacWright quote, and he’s right, it deserves a little more discussion:

…the client side will eat more of the server side stack.

To understand what “more” there is left to eat, it’s worth enumerating what’s already been eaten (or, which is being consumed right now, as we watch):

  • Interaction: OK, so this was always on the client side, but it’s worth noting that the impulse towards using a heavy-weight plug-in for interaction is now pretty much dead. The detritus of plug-in based solutions will be around for a long while, inching towards end-of-life, but not many new ones are being built. (I bet some are, though, in the bowels of organizations where IE remains an unbreakable corporate standard.)
  • Single-Layer Rendering: Go back almost 10 years and you’ll find OpenLayers doing client-side rendering, though using some pretty gnarly hacks at the time. Given the restrictions in rendering performance, a common way to break down an app was a static, tiled base map with a single vector layer of interest on top. (Or, for the truly performance oriented, a single raster layer on top, only switching to vector for editing purposes.) With modern browser technology, and good implementations, rendering very large numbers of features on the client has become commonplace, to the extent that the new bottleneck is no longer the CPU, it’s the network.
  • All-the-layers Rendering: Already shown in-principle by Google Maps, tiled vector rendering is moving over the last 12 months rapidly from wow-wizzy-demo to oh-no-not-that-again status. Rather than rendering to raster on the server side, send a simplified version to the client for rendering there. For base maps there’s not a lot of benefit over pre-rendered raster, but there’s some: dynamic labelling means orientation is completely flexible, and also allows for multiple options for labelling; also, simplified vector tiles can serve a wider range of zoom levels while remaining attractively rendered, so the all-important network bandwidth issues can be addressed for mobile devices.
  • “GIS” operations: While large scale analysis is not going to happen on a web page, a lot of visual effects that were otherwise hard to achieve can now be pushed to the client. Some of the GIS operations are actually in support of getting attractive client-side rendering: voronoi diagrams can be a great aid to label placement; buffers come in handy for cartography all the time.
  • Persistence: Not really designed for long-term storage, but since any mobile application on a modern platform now has access to a storage area of pretty large size, there’s nothing stopping these new “client” applications from wandering far and completely untethered from the server/cloud for long periods of time.

Uh, what’s left?

  • Long term storage and coordination remain. If people are going to work together on data, they need a common place to store and access their information from.
  • Large scale data handling and analysis remain, thanks to those pesky narrow network pipes, for now.
  • Coordination between devices requires a central authority still. Although, not for long, with web sockets I’m sure some JavaScript wizard has already cooked up a browser-to-browser peer-to-peer scheme, so the age of fully distributed open street map will catch up to us eventually.

Have I missed any?

Once all applications are written in 100% JavaScript we will have finally achieved the vision promised to me back in 1995, a write-once, run-anywhere application development language, where applications are not installed but are downloaded as needed over the network (because “the network is the computer”). Just turns out it took 20 years longer and the language and virtual machine are different (and there’s this strange “document” cruft floating around, a coccyx-like evolutionary remnant people will be wondering about for years).