Cloud Optimized Shape File

The Dream

Over a year ago Chris Holmes, the driving force behind the “cloud optimized GeoTIFF” (COG) and “spatio-temporal asset catalog” (STAC) standards that are sweeping the “cloud optimized raster” data management world, asked me what I thought the story of a similar “cloud optimized vector” format might look like.

And I thought about COG, which is really just a very very old format (GeoTIFF) with its bytes rearranged so that the order of bytes in the file matches the likely order in which they will be accessed (blocks of bands, and within the bands, squares of pixels), and I thought I had the perfect, maximally snarky answer:

Dude, shape file is already a cloud-native format.

Now, this might seem counter-intuitive, but hear me out:

In short, shapefile already looks a lot like GeoTIFF, the founding format of the “cloud optimized geospatial” movement.


Let’s Get Real

So, what is missing to make “cloud optimized shapefile” a reality?

Well, in order to spatially search a shapefile you need a spatial index. There is no index format in the formal Esri specification, and the Esri sbx index format is proprietary (though largely reverse engineered at this point) but the open source world has had a shape file index format for 20 years: the “QIX file”.

You generate a QIX file using the shptree utility in Mapserver. You can also generate one in GDAL. You can also get GeoTools to generate one.

With a QIX file, the “shape file” now consists of four files:

The next trick, just as in COG, is to put the main data files (shp, dbf and shx) into the same order they are likely to be searched in: spatial order.

Since we already have a spatial index (qix), we can get the files in spatial order by re-writing them in the same order they appear in the index.

Initially I told Chris that this could be done with the Mapserver sortshp utility, however I was mistaken: sortshp sorts the file in attribute order.

To make “cloud optimized shape file” a reality, first we need a new utility program that sorts the shp, shx and dbf files into qix order.

We need: coshp!

coshp is just a re-working of the sortshp utility, but instead of sorting the output file by attribute it sorts it using the index tree as the driver. This results in shp and dbf files where shapes that are “near” in space are also “near” in the byte-stream of the file. This will reduce the number of random reads necessary to access portions of the file using a bounding box search.

Not Even Remotely Done

One of the quiet secrets of the “cloud optimized” geospatial world is that, while all the attention is placed on the formats, the actual really really hard part is writing the clients that can efficiently make use of the carefully organized bytes.

For example, the fancy demonstrations of “pure client-side” COG browsers require a complete GeoTIFF reader in Javascript, along with some extra “cloud” smarts to know what pieces of data to cache and what to treat as transient info during rendering.

So, spatially sorting a shape file is a necessary, but not at all sufficient condition to create an actual “cloud optimized shapefile”, because for it to be practically useful, there needs to be at a minimum a client-side javascript reader.

That means javascript that can:

To be truly useful, the javascript should probably include enough cloud smarts to read and opportunistically cache pieces of the files: headers at a minimum, but also reading in branches of the qix and shx indexes on-demand.

To make things marginally easier, I have “documented” the QIX format. It’s an ugly beast, but it is possible to traverse it without deserializing the whole thing.

It’s a challenging problem, but I hope there is someone with enough nostalgia for old formats and thirst for glory to make it happen.