Cloud Optimized Shape File
20 Apr 2022The Dream
Over a year ago Chris Holmes, the driving force behind the “cloud optimized GeoTIFF” (COG) and “spatio-temporal asset catalog” (STAC) standards that are sweeping the “cloud optimized raster” data management world, asked me what I thought the story of a similar “cloud optimized vector” format might look like.
And I thought about COG, which is really just a very very old format (GeoTIFF) with its bytes rearranged so that the order of bytes in the file matches the likely order in which they will be accessed (blocks of bands, and within the bands, squares of pixels), and I thought I had the perfect, maximally snarky answer:
Dude, shape file is already a cloud-native format.
Now, this might seem counter-intuitive, but hear me out:
- Shape format hails from the early 90’s, when hard-disks spun very slowly, and the limiting factor for data access was usually I/O. Which is much like “cloud optimized” range access over HTTP: seeks are expensive, but block reads are cheap.
- Shape format already divies up the attributes and shapes into separate files, so you can render one without reading the other.
- Shape format is already “network safe”, with endianness defined in the format.
- Shape format is already universally supported. The specification is 24 years old, and it has been the de facto interchange format for so long that people make a joke out of it.
In short, shapefile already looks a lot like GeoTIFF, the founding format of the “cloud optimized geospatial” movement.
Let’s Get Real
So, what is missing to make “cloud optimized shapefile” a reality?
Well, in order to spatially search a shapefile you need a spatial index. There is no index format in the formal Esri specification, and the Esri sbx
index format is proprietary (though largely reverse engineered at this point) but the open source world has had a shape file index format for 20 years: the “QIX file”.
You generate a QIX file using the shptree utility in Mapserver. You can also generate one in GDAL. You can also get GeoTools to generate one.
With a QIX file, the “shape file” now consists of four files:
- shp, the binary shapes
- dbf, the table of attributes
- shx, a file index of the byte offsets of each shape in the shp file
- qix, a spatial index of the shapes
The next trick, just as in COG, is to put the main data files (shp
, dbf
and shx
) into the same order they are likely to be searched in: spatial order.
Since we already have a spatial index (qix
), we can get the files in spatial order by re-writing them in the same order they appear in the index.
Initially I told Chris that this could be done with the Mapserver sortshp utility, however I was mistaken: sortshp
sorts the file in attribute order.
To make “cloud optimized shape file” a reality, first we need a new utility program that sorts the shp
, shx
and dbf
files into qix
order.
We need: coshp!
coshp
is just a re-working of the sortshp
utility, but instead of sorting the output file by attribute it sorts it using the index tree as the driver. This results in shp
and dbf
files where shapes that are “near” in space are also “near” in the byte-stream of the file. This will reduce the number of random reads necessary to access portions of the file using a bounding box search.
Not Even Remotely Done
One of the quiet secrets of the “cloud optimized” geospatial world is that, while all the attention is placed on the formats, the actual really really hard part is writing the clients that can efficiently make use of the carefully organized bytes.
For example, the fancy demonstrations of “pure client-side” COG browsers require a complete GeoTIFF reader in Javascript, along with some extra “cloud” smarts to know what pieces of data to cache and what to treat as transient info during rendering.
So, spatially sorting a shape file is a necessary, but not at all sufficient condition to create an actual “cloud optimized shapefile”, because for it to be practically useful, there needs to be at a minimum a client-side javascript reader.
That means javascript that can:
- Read the shp file format.
- Read the dbf file format.
- Read the shx file format.
- Read the qix file format.
To be truly useful, the javascript should probably include enough cloud smarts to read and opportunistically cache pieces of the files: headers at a minimum, but also reading in branches of the qix
and shx
indexes on-demand.
To make things marginally easier, I have “documented” the QIX format. It’s an ugly beast, but it is possible to traverse it without deserializing the whole thing.
It’s a challenging problem, but I hope there is someone with enough nostalgia for old formats and thirst for glory to make it happen.