From 0 to 65 Million in 2 Hours

I’m doing some performance benchmarking for a client this week, so getting a big, real test database is a priority. The US Census TIGER data is one of the largest uniformly formatted data sets around, so I’ve started with that.

I just loaded all the edges, 64,830,691 of them, and it took just under 2 hours! Fortunately, the 2007 data comes in shape files, and the schemas are identical for each file, so the load script is as simple as this controller:

find . -name "fe*_edges.zip" -exec ./append_edges.sh {} ';'

And this runner (append_edges.sh):

#!/bin/sh
# Unpack one edges zip and append its contents to the fe_edges table.
unzip "$1"
shp2pgsql -W WINDOWS-1252 -D -a -s 4269 `basename "$1" .zip`.shp fe_edges | psql tiger
rm fe*edges.*

Note the use of the -W parameter, to ensure that the high-bit “charàctérs” are handled correctly, and the -a parameter, to append the file contents to the table.
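One thing the runner assumes is that the fe_edges table already exists, since -a only appends. The post doesn’t show the bootstrap step, but a minimal sketch, assuming any one edges shapefile is representative of the shared schema (the file name below is just an example), is to create the empty table with shp2pgsql’s prepare-only mode (-p):

# Hypothetical bootstrap: create the empty fe_edges table from the schema
# of one representative shapefile; -p emits the CREATE TABLE only, no rows.
unzip fe_2007_06075_edges.zip
shp2pgsql -p -s 4269 fe_2007_06075_edges.shp fe_edges | psql tiger

From there, the find loop above appends every file into that one table.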

GeoSilk

Who is building the GeoWeb? Must be the GeoSpider! Spinning it with GeoSilk!

I’ll be talking at GeoWeb about Mapserver, an important piece of GeoSilk, in a no-holds-barred panel with representatives from GeoServer (Justin Deoliveira) and MapGuide (Bob Bray). Yes, we’re in a bit of an open source ghetto – it would have been better to get all the functionally similar products in one room, and only let the victor out alive.

Mapserver! Strong like ox! Fast like cheetah!

Standards for Geospatial REST?

One of the things that has been powerful about the open source geospatial community has been the care with which open standards have been implemented. Frequently the best early implementations of the OGC standards have been open source ones. It would be nice if the open source community could start providing REST APIs to the great core rendering and data access services that underlie products like Mapserver, Geoserver, and so on.

However, it’s bad enough that Google, Microsoft, and ESRI all invent their own addressing schemes for data and services. Do we want the open source community to create another forest of different schemes, one for each project? Also, inventing a scheme takes time and lots of email wanking in any community with more than a handful of members. Replicating that effort for each community will waste a lot of time.

No matter how much we (me) might bitch about OGC standards, they have a huge advantage over DIY, in that they provide a source of truth. You want an answer to how that works? Check the standard.

There’s been enough test-bedding of REST, and there are some nice working implementations; perhaps it is time to document some of that knowledge and get onto the playing field of the OGC, to create a source of truth to guide the large community of implementors.

A lot of the work is done, in that most of the potential representations have been documented: GML, KML, GeoRSS, GeoJSON, WKT. Perhaps there is little more to be done than writing up how to apply APP (the Atom Publishing Protocol) to feature management.
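To make that concrete, here is a rough sketch of what APP-style feature management might look like on the wire. None of the URLs, IDs, or payloads below come from an existing spec or product; they are invented purely to illustrate the create/read/delete pattern, with GeoJSON as the representation:

# Hypothetical feature collection at /features/roads (URLs and IDs invented).
# List the collection, the way an APP client GETs a feed:
curl http://example.org/features/roads

# Create a feature by POSTing a GeoJSON representation to the collection:
curl -X POST -H "Content-Type: application/json" \
     -d '{"type":"Feature","geometry":{"type":"Point","coordinates":[-123.1,49.28]},"properties":{"name":"Water St"}}' \
     http://example.org/features/roads

# Read or delete an individual feature at its own URL:
curl http://example.org/features/roads/42
curl -X DELETE http://example.org/features/roads/42

The point is not these particular URLs; it’s that one agreed-upon convention for them would save every project from inventing its own.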

For Sean

In talking about mass-scaling geospatial web services, Sean Gillies said he wasn’t sure that Mapserver (for example) was up to the task, or, alternatively, that it really provided anything that couldn’t be provided by something like ArcServer (he’s not a fanboi, but he does like playing devil’s advocate).

For mass-scaling a system on Mapserver, I’d have Mapserver generating maps out to a tile cache, so really, the number of Mapservers I would need would be limited by the number of fresh tiles I was generating at any given time. Sean correctly pointed out that in that architecture, ArcServer wouldn’t be nearly as punitive, from a licensing point of view, as I was painting it, since it too could be deployed in more limited numbers while the tile cache handled most of the load.

I just thought of an interesting use case which is totally incompatible with a proprietary per-CPU licensing scheme: suppose I want to generate a complete tile cache for the USA. Do a TIGER street map of all of the USA, for example? No problem, I fire up 1000 EC2 instances and let ‘em rip! With Mapserver, that’s not a problem. With ArcServer, or any other per-CPU licensed product, it’s just not on.
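As a sketch of what “let ‘em rip” might look like in practice, the job parallelizes by handing each instance its own chunk of the country. Everything here is hypothetical: the usa_regions.txt file of bounding boxes, the seed_tiles.sh renderer, and the instance host names are all invented, and provisioning Mapserver plus the tile cache on each instance is left out entirely:

#!/bin/sh
# Hypothetical fan-out: usa_regions.txt holds one "name minx miny maxx maxy"
# line per chunk of the country, and seed_tiles.sh (not a real tool) renders
# all tiles for one bounding box against a local Mapserver + tile cache.
i=0
while read name minx miny maxx maxy; do
    host="ec2-instance-$i"   # assumed naming for the rented instances
    ssh "$host" "./seed_tiles.sh $minx $miny $maxx $maxy" &
    i=$((i + 1))
done < usa_regions.txt
wait   # block until every instance has finished its chunk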

The whole area of virtual, rentable infrastructure is a place where old-fashioned proprietary licensing has a problem.

Train Dreams

The web is littered with half-baked train “projects” that need little more than some love, public support, and a couple billion dollars to get off the ground. Vector1 gives us a particularly fluffy example today, the solar train. But why stop there? Here’s a gallery of not-gonna-happen/didn’t-happen rail “proposals” culled in just a few minutes from the Goog:

Some of these attempts actually spent millions (like the Texas TGV), but they all met/are meeting the same fate. Millions won’t do it, try billions, guys. That’s what it took to build the continental railway system in the first place, in a financial bubble worthy of the dot-com era – and don’t forget, generous government subsidies via land grants.

Meanwhile, freight rail in North America is going strong, stronger than in Europe, even! But before passenger rail takes off, North Americans need to be willing to park their cars. So far, the gas price surge has only generated a surge of news stories on more efficient cars, electric cars, and small cars, and very few stories about not using your car at all.