Cancer 2

Before I joined the population of fellow cancer travellers, I had the same simple linear understanding of the “process” that most people do.

You get diagnosed, you get treatment, it works or it doesn’t.

What I didn’t appreciate (and this will vary from cancer to cancer, but my experience is with colorectal) is how little certainty there is, and how wide the grey areas are.

Like, in my previous post, I said I was “diagnosed” with cancer. Which maybe made you think I have it. But that’s not how it works. I had a colonoscopy, and a large polyp was removed, and that polyp was cancerous, and a very small part of it could not be excised. So it’s still in me.

Do I have cancer? Maybe! I have a probability of having live cancer cells in me that is significantly higher than zero. But not as high as one.

How bad is what I (might) have? This is also a game of probability. Modern technology can shave off the edges of the distribution, but it can’t quite nail it down.

A computed tomography (CT) scan didn’t show any other tumors in my body, so that means I probably don’t have “stage 4” (modulo the resolution of the scan), which is mostly incurable (though it can be manageable), where the cancer has managed to spread outside the colon.

An MRI didn’t show any swollen lymph nodes, which means I maybe do not have “stage 3”, which requires chemotherapy, because the cancer has partially escaped the colon. But MRI results are better at proving rather than disproving nodal involvement and people report having surgical results that run counter to the MRI all the time.

That leaves me (theoretically) at “stage 2”, looking at a surgical “cure” that involves removing the majority of my rectum and a bunch of lymph nodes. At that point (after the major life-altering surgery!) the excised bits are sent to a pathologist, and the probability tree narrows a little more. Either the pathologist finds cancer in the nodes (MRI was wrong), and I am “upstaged” to stage 3 and sent to chemotherapy, or she doesn’t and I remain a stage 2 and move to a program of monitoring.

In an exciting third possibility, the pathologist finds no cancer in the lymph nodes or the rectum, which means I will have had major life-altering surgery to remove… nothing dangerous. My surgeon says I should find this a happy result (no cancer!) which is probably because he’s seen so many unhappy results, but it’s a major surgery with life-long side effects and I would do almost anything to not have to have it.

Amazingly, despite our modern technology there’s just no way to know for sure if there are still live cancer cells in me short of taking the affected bits out and doing the pathology. Or waiting to see if something grows back, which is to flirt with a much worse prognosis.

Monitoring will be regular blood tests, annual scans and colonoscopies for several years, as the probability of recurrence slowly and asymptotically moves toward (but never quite arrives at) zero. And all those tests and procedures have their own error rates and blind spots.

There are no certainties. All the measuring and cutting and chemicals, and I will still not have driven the cancer entirely out; it will stubbornly remain as a probability, a non-zero ghost haunting me every year of the rest of my hopefully long life.

And of course worth mentioning, I am getting the snack-sized, easy-mode version of this experience! People in stage three or stage four face a probability tree with a lot more “and then you probably die in a few years” branches, and the same continuous reevaluation of that tree, with each new procedure and scan, each new discovery of progression or remission.

Talk to you again soon, inshalla.

Cancer 1

A little over a month ago, three days after my 53rd birthday, I received a diagnosis of rectal cancer. Happy birthday to me.

Since then, I have been wrestling with how public to be about it. I have a sense that writing is good for me. But it also keeps like milk. I wrote most of this a couple weeks ago and my head space has already evolved.

So writing like this is mostly a work of self-absorption (I’m sure you can forgive me) but hopefully it also helps to raise awareness amongst the cohort of people who might know me or read this.

Colorectal cancer rates are going up, and the expected age of occurrence is going down. Please get screened. No matter your age, ask your clinician for a “FIT test”. If you’re over 45, just ask for a colonoscopy; the FIT test isn’t perfect.

I have a pretty good prognosis, mostly because my case was caught by screening, not by experiencing symptoms bad enough to warrant a trip to the doctor. Most of the people who get diagnosed after showing symptoms have it worse than I, and will have a longer, harder road to recovery. Get screened.

Our language of cancer borrows a bit from the language of contagion. I “got” cancer. It’s not quite a neutral description, there’s a hint of agency in there, maybe I did something wrong? This article drives me crazy, the author “went vegan and became a distance runner” after his father died of colorectal cancer.

Sorry friend, cancer is not something you “get”, and it’s not something you can opt out of with clean living. It’s something that happens to you. Take it from this running, cycling, ocean rowing, rock climbing, healthy eater – driving down the marginal probability of cancer (and heart disease (and depression (and more))) with exercise and diet is its own reward, but you are not in control. When cancer wants you, it will come for you.

This is why you should get screened (right?). It’s the one way to proactively protect yourself. The amazing thing about a colonoscopy is, not only can it detect cancer, but it can also prevent it, by removing pre-cancerous polyps. It’s possible that screening could have prevented my case, if I had been screened a few years earlier.

I am now a denizen of numerous Facebook fora for fellow travellers along this life path, and one of the posts last week asked “what do you think cancer taught you”? I am a little too early on the path to write an answer myself, but one woman’s answer struck me.

She said it taught her that control is an illusion.

Before, I had plans. I could tell you I was going to go places, and do things, and when I was going to do them, next month, next season, next year. I was in control. Now, I can tell you what I will be doing next week. Perhaps. The rest is in other hands than mine.

Talk to you again soon, inshalla.

Building the PgConf.Dev Programme

Update: The programme is now public.

The programme for PgConf.Dev in Vancouver (May 28-31) has been selected, the speakers have been notified, and the whole thing should be posted on the web site relatively soon.

I have been on programme committees a number of times, but for regional and international FOSS4G events, never for a PostgreSQL event, and the parameters were notably different.

The parameter that was most important for selecting a programme this year was the over 180 submissions, versus the 33 available speaking slots. For FOSS4G conferences, it has been normal to have between two and three times as many submissions as slots. To have almost six times as many made the process very difficult indeed.

Why only 33 speaking slots? Well, that’s a result of two things:

  • Assuming no more than modest growth over the last iteration of PgCon puts attendance at around 200, which is the size of our plenary room. 200 attendees implies no more than 3 tracks of content.
  • Historically, PostgreSQL events use talks of about 50 minutes in length, within a one hour slot. Over three tracks and two days, that gives us around 33 talks (with slight variations depending on how much time is in plenary, keynotes or lightning talks).
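The slot arithmetic above can be sketched in a few lines. The track count, day count, submission count and the total of 33 come from the text; the six one-hour slots per track per day and the three slots given over to plenary time are illustrative assumptions, not figures from the committee.

```python
# Back-of-envelope slot arithmetic for the programme.
TRACKS = 3          # from the text: 200 attendees implies at most 3 tracks
DAYS = 2            # from the text: two days of talks
SLOTS_PER_DAY = 6   # assumption: six one-hour slots per track per day
PLENARY_SLOTS = 3   # assumption: slots lost to plenary, keynotes, lightning talks
SUBMISSIONS = 180   # from the text: "over 180 submissions"


def talk_slots(tracks, days, slots_per_day, plenary_slots):
    """Number of regular talks that fit once plenary time is removed."""
    return tracks * days * slots_per_day - plenary_slots


slots = talk_slots(TRACKS, DAYS, SLOTS_PER_DAY, PLENARY_SLOTS)
ratio = SUBMISSIONS / slots

print(f"{slots} talk slots, {ratio:.1f} submissions per slot")
```

Changing the plenary assumption by a slot or two is what produces the "slight variations" in the final count.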

The content of those 33 talks falls out from PgConf.Dev being the successor to PgCon. PgCon has historically been the event attended by all major contributors. There is an invitation-only contributors round-table on the pre-event day, specifically for the valuable face-to-face synch-up.

Given only 33 slots, and a unique audience that contains so many contributors, the question of what PgConf.Dev should “be” ends up focussed around making the best use of that audience. PgConf.Dev should be a place where users, developers, and community organizers come together to focus on Postgres development and community growth.

That’s why in addition to talks about future development directions there are talks about PostgreSQL coding concepts, and patch review, and extensions. High throughput memory algorithms are good, but so is the best way to write a technical blog entry.

Getting from 180+ submissions to 33 selections (plus some stand-by talks in case of cancellations) was a process that consumed three calls of over 2 hours each and several hours of reading every submitted abstract.

The process was shepherded by the inimitable Jonathan Katz.

  • A first phase of just coding talks as either “acceptable” or “not relevant”. Any talks that all the committee members agreed were “not relevant” were dropped from contention.
  • A second phase where each member picked 40 talks from the remaining set into a kind of “personal program”. The talks with just one program member selecting them were then reviewed one at a time, and that member would make the case for them being retained, or let them drop.
  • A winnow looking for duplicate topic talks and selecting the strongest, or encouraging speakers to collaborate.
  • A third “personal program” phase, but this time narrowing the list to 33 talks each.
  • A winnow of the most highly ranked talks, to make sure they really fit the goal of the programme and weren’t just a topic we all happened to find “cool”.
  • A talk by talk review of all the remaining talks, ensuring we were comfortable with all choices, and with the aggregate make up of the programme.
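The "personal program" phases amount to a simple vote tally. The sketch below is only an illustration of that idea, not the committee's actual tooling (which the text does not describe); the member names, talk IDs, and function names are all hypothetical.

```python
from collections import Counter


def tally_picks(personal_programs):
    """Count how many committee members picked each talk."""
    votes = Counter()
    for picks in personal_programs.values():
        votes.update(set(picks))  # each member counts once per talk
    return votes


def split_by_support(votes):
    """Separate talks picked by several members from those picked by one."""
    multiple = sorted(t for t, n in votes.items() if n > 1)
    single = sorted(t for t, n in votes.items() if n == 1)
    return multiple, single


# Hypothetical members and talk IDs, for illustration only.
personal_programs = {
    "member_a": ["talk-01", "talk-02", "talk-03"],
    "member_b": ["talk-02", "talk-03", "talk-04"],
    "member_c": ["talk-03", "talk-05"],
}

votes = tally_picks(personal_programs)
retained, needs_advocate = split_by_support(votes)
print("retained:", retained)
print("reviewed one at a time:", needs_advocate)
```

The talks in the second list are the ones where a single member would have to make the case for keeping them; the later winnowing and talk-by-talk review are human judgment calls that a tally like this does not capture.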

The programme committee was great to work with, willing to speak up about their opinions, disagree amicably, and come to a consensus.


Since we had to leave 150 talks behind, there’s no doubt lots of speakers who are sad they weren’t selected, and there’s lots of talks that we would have taken if we had more slots.

If you read all the way to here, you must be serious about coming, so you need to register and book your hotel right away. Spaces are, really, no kidding, very limited.

PgConf.Dev @ Vancouver, May 28-31

This year, the global gathering of PostgreSQL developers has a new name, and a new location (but more-or-less the same dates) … PgCon is now PgConf.Dev!

Some important points right up front:

  • The call for papers is closing in one week! If you are planning to submit, now is the time!
  • The hotel scene in Vancouver is competitive, so if you put off booking accommodations… don’t do that! Book a room right away.
  • The venue capacity is 200. That’s it, so once we have 200 registrants, we are full for this year. Register now.
  • There are also limited sponsorship slots. Is PostgreSQL important to your business? Sponsor!

I first attended PgCon in 2011, when I was invited to keynote on the topic of PostGIS. Speaking in front of an audience of PostgreSQL luminaries was really intimidating, but also gratifying and empowering. Notwithstanding my imposter syndrome, all those super clever developers thought our little geospatial extension was… kind of clever.

I kept going to PgCon as regularly as I was able over the years, and was never disappointed. The annual gathering of the core developers of PostgreSQL necessarily includes content and insights that you simply cannot come across elsewhere, all compactly in one smallish conference, and the hallway track is amazing.

PostgreSQL may be a global development community, but the power of personal connection is not to be denied. Getting to meet and talk with core developers helped me understand where the project was going, and gave me the confidence to push ahead with my (very tiny) contributions.

This year, the event is in Vancouver! Still in Canada, but a little more directly connected to international air hubs than Ottawa was.

Also, this year I am honored to get a chance to serve on the program committee! We are looking for technical talks from across the PostgreSQL ecosystem, as well as about happenings in core. PostgreSQL is so much larger than just the core, and spreading the word about how you are building on PostgreSQL is important (and I am not just saying that as an extension author).

I hope to see you all there!

Data Science is Getting Ducky

For a long time, a big constituency of users of PostGIS has been people with large data analytics problems that crush their desktop GIS systems. Or people who similarly find that their geospatial problems are too large to run in R. Or Python.

These are data scientists or adjacent people. And when they ran into those problems, the first course of action would be to move the data and parts of the workload to a “real database server”.

This all made sense to me.

But recently, something transformative happened – Crunchy Data upgraded my work laptop to a MacBook Pro.

Suddenly a GEOS compile that previously took 20 minutes, took 45 seconds.

I now have processing power on my local laptop that previously was only available on a server. The MacBook Pro may be a leading indicator of this amount of power, but the trend is clear.

What does that mean for default architectures and tooling?

Well, for data science, it means that a program like DuckDB goes from being a bit of a curiosity, to being the default tool for handling large data processing workloads.

What is DuckDB? According to the web site, it is “an in-process SQL OLAP database management system”. That doesn’t sound like a revolution in data science (it sounds really confusing).

But consider what DuckDB rolls together:

  • A column-oriented processing engine that makes the most efficient possible use of the processors in modern computers. Parallelism to ensure all CPUs are made use of, and low-level optimizations to ensure each tick of those processors pushes as much data through the pipe as possible.
  • Wide ranging support for different data formats, so that integration can take place on-the-fly without requiring translation or sometimes even data download steps.

Having those things together makes it a data science power tool, and removes a lot of the prior incentive that data scientists had to move their data into “real” databases.

When they run into the limits of in-memory analysis in R or Python, they will instead serialize their data to local disk and use DuckDB to slam through the joins and filters that were blowing out their RAM before.

They will also take advantage of DuckDB’s ability to stream remote data from data lake object stores.

What, stream multi-gigabyte JSON files? Well, yes that’s possible, but it’s not where the action is.

The CPU is not the only laptop component that has been getting ridiculously powerful over the past few years. The network pipe that connects that laptop to the internet has also been getting both wider and lower latency with every passing year.

As the prospect of streaming data for analysis has come into view, the formats for remote data have also evolved. Instead of JSON, which is relatively fluffy, and hard to efficiently filter, the Parquet format is becoming a new standard for data lakes.

Parquet is a binary format that organizes the data into blocks for efficient subsetting and processing. A DuckDB query to a properly organized Parquet time series file might easily pull only records for 2 of 20 columns, and 1 day of 365, reducing a multi-gigabyte download to a handful of megabytes.
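To put rough numbers on that reduction, here is a back-of-envelope estimate. It assumes a 5 GiB file (standing in for "multi-gigabyte"), and that every column and every day takes an equal share of the file; both are simplifications for illustration, and the function name is hypothetical.

```python
def pruned_download_bytes(file_bytes, cols_needed, cols_total,
                          days_needed, days_total):
    """Estimate bytes fetched when only some columns and days are read.

    Assumes columns and days are all the same size, and ignores Parquet
    metadata overhead and compression effects.
    """
    column_fraction = cols_needed / cols_total
    row_fraction = days_needed / days_total
    return file_bytes * column_fraction * row_fraction


GIB = 1024 ** 3
MIB = 1024 ** 2

file_size = 5 * GIB  # assumption: example size, not a figure from the text
fetched = pruned_download_bytes(file_size, cols_needed=2, cols_total=20,
                                days_needed=1, days_total=365)
print(f"about {fetched / MIB:.1f} MiB of {file_size / GIB:.0f} GiB")
```

In practice a reader fetches whole blocks rather than exact rows, so the real transfer would be somewhat larger than this idealized figure, but the order of magnitude is the point.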

The huge rise in available local computation and network connectivity is going to spawn some new standard architectures.

Imagine a “two tier” architecture where tier one is an HTTP object store and tier two is a JavaScript single page app. The COG Explorer has already been around for a few years, and it’s just such a two tier application.

(For fun, recognize that an architecture where the data are stored in an access-optimized format, and access is via primitive file-system requests, while all the smarts are in the client-side visualization software is… the old workstation GIS model. Everything old is new again.)

The technology is fresh, but the trendline is pretty clear. See Kyle Barron’s talk about GeoParquet and DeckGL for a taste of where we are going.

Meanwhile, I expect that a lot of the growth in PostGIS / PostgreSQL we have seen in the data science field will level out for a while, as the convenience of DuckDB takes over a lot of workloads.

The limitations of Parquet (efficient remote access limited to a handful of filter variables being the primary one, with conjoint spatial/non-spatial filters and joins being another) will still leave use cases that require a “real” database, but a lot of people who used to reach for PostGIS will be reaching for Duck, and that is going to change a lot of architectures, some for the better, and some for the worse.