Tokenization and Your Private Data (3)

To recap:

  • (Day 1) The government is interested in using a cloud CRM and other USA cloud applications, but BC’s FOIPPA does not allow it,
  • (Day 2) So the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.

BUT, and it’s a big BUT, storing securely tokenized data makes cloud applications mostly useless.

As we saw yesterday, secure tokenization replaces every input word with a completely random token. This is done in practice with a tokenization server that translates words to tokens and vice versa.

The tokenization server also has to translate user queries into tokenized equivalents. So if the user asked:

“Show me the record for ‘paul’ ‘ramsey’”

The filter would translate it into this query for the server:

“Show me the record for ‘rtah’ ‘hgat’”

Hm, the magic still seems to be working. But what about a search that returns more than one record?

“Show me all the records of people named ‘Paul’”

This is harder. In a secure tokenization system, there’s a unique token for every occurrence of every word ever stored, even repeats of the same word. So the tokenizer now has to ask:

“Show me all the records that have firstname ‘rtah’ or ‘fasp’”

Our example has only two ‘Paul’s. Imagine this example with a database with 50 thousand ‘Paul’s. The query functionality would either not work or slow to a crawl. Can it be fixed? Sure!
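Here is a rough Python sketch of what the filter has to do for that multi-record search. The dictionary layout (token to word), the field names and the `translate_query` helper are all my own assumptions for illustration, not how any particular product works.

```python
# Local dictionary (token -> word), using the tokens from the example.
token_to_word = {
    "rtah": "Paul", "hgat": "Ramsey",
    "fasp": "Paul", "nasd": "Jones",
    "yhav": "Tim", "imfa": "Jones",
}

# What the cloud application stores: tokens only.
cloud_rows = [
    {"firstname": "rtah", "lastname": "hgat"},
    {"firstname": "fasp", "lastname": "nasd"},
    {"firstname": "yhav", "lastname": "imfa"},
]


def tokens_for(word):
    """Every token ever issued for this word (case-insensitive), sorted."""
    return sorted(t for t, w in token_to_word.items() if w.lower() == word.lower())


def translate_query(field, word):
    """Build the OR-list the filter has to send to the cloud."""
    return " or ".join(f"{field} = '{t}'" for t in tokens_for(word))


def run_query(field, word):
    """Simulate the cloud side matching rows against the token list."""
    wanted = set(tokens_for(word))
    return [row for row in cloud_rows if row[field] in wanted]


query = translate_query("firstname", "Paul")
matches = run_query("firstname", "Paul")
print(query)
print(len(matches), "matching records")
```

The OR-list gains one term for every stored ‘Paul’, so the query grows in step with the data.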

We can fix the performance problem by just using the same token for every ‘Paul’ encountered by the system (and for every ‘Jones’, and so on).

Problem solved. Now if I ask:

“Show me all the records of people named ‘Paul’”

The filter can translate it simply into:

“Show me all the records that have firstname ‘rtah’”

And no matter how many ‘Paul’s there are in the system it will work fine.

Just one (big) problem. Always substituting the same token for the same word turns “tokenization” from an uncrackable system into a trivial substitution cipher, like the ones you used in Grade 4 to write secret messages to your friends (only using words as the substitution elements instead of letters).
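To see why, here is a small Python sketch of the classic attack on a substitution cipher, frequency analysis, applied to word tokens. The token counts and the name ranking are invented for illustration; a real attacker would bring something like census name frequencies.

```python
from collections import Counter

# Tokenized first-name column as the cloud provider sees it.
# Deterministic tokens: the same name always gets the same token.
# (Tokens and counts are invented for illustration.)
cloud_column = ["rtah"] * 5 + ["yhav"] * 3 + ["qzzk"] * 1

# Public knowledge the attacker brings: names ranked from most to
# least common (a made-up ranking standing in for census data).
names_by_popularity = ["Paul", "Tim", "Zebedee"]


def guess_dictionary(column, ranked_names):
    """Pair the most frequent token with the most common name, and so on."""
    ranked_tokens = [token for token, _ in Counter(column).most_common()]
    return dict(zip(ranked_tokens, ranked_names))


guessed = guess_dictionary(cloud_column, names_by_popularity)
print(guessed)
```

No dictionary was stolen here; the guess comes entirely from counting how often each token appears.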

And things get even worse [1] as you add other, very common features people expect from their CRM software:

  • If you want to retrieve records in sorted order, then the tokens in the CRM must have the same sort order as the words they stand in for.
  • If you want to do substring matching (“give me all the names that start with ‘p’”) then the token internal structure must also reflect the internal structure of the original word.
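The sorting requirement can be sketched in a few lines of Python. The scheme below (tokens are zero-padded ranks in the sorted vocabulary) is a deliberately naive one I made up for illustration, not any vendor's method, but any token that sorts like its word leaks in the same way.

```python
import bisect

names = ["Adams", "Jones", "Paul", "Ramsey", "Tim"]

# Hypothetical order-preserving scheme: token = zero-padded (rank * 10)
# of the word in the sorted vocabulary.
sortable_token = {
    name: f"{rank * 10:04d}" for rank, name in enumerate(sorted(names), 1)
}

# The cloud can now sort correctly without seeing any names...
cloud_sorted = sorted(sortable_token.values())

# ...but an attacker who learns a few (token, word) pairs can bracket
# every other token alphabetically.
known = sorted(
    [(sortable_token["Adams"], "Adams"), (sortable_token["Ramsey"], "Ramsey")]
)


def bracket(unknown_token):
    """Return the known words just below and just above the unknown token."""
    known_tokens = [t for t, _ in known]
    i = bisect.bisect_left(known_tokens, unknown_token)
    lower = known[i - 1][1] if i > 0 else None
    upper = known[i][1] if i < len(known) else None
    return lower, upper


bounds = bracket(sortable_token["Jones"])
print(bounds)
```

Every extra known pair narrows the bracket, so the more records an attacker can recognize, the less the remaining tokens hide.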

None of this has stopped tokenization software vendors (like CipherCloud, one of the vendors being used by the BC government) from claiming to be able to both provide the magic unbreakability of tokenization while still supporting all the features of the backend CRM.

Cryptography buffs, interested in how CipherCloud could substantiate the claims it was making, started looking at the material it published in its manual and demonstrated at trade shows. Based on the publicly available material, one writer concluded:

The observed encryption has significant weaknesses, most of them inherent to a scheme that wants to encrypt data, while enabling the original application to perform operations such as search and sorting on the encrypted data without changing that application. There might be some advanced techniques (homomorphic encryption and the likes) that avoid these weaknesses, but at least the software demoed in the video does not use them.

In response, the company hit the discussion site hosting that analysis with a DMCA takedown notice. This is not the action of a company that is confident in its methods.

Tomorrow, I’ll look at what the Freedom of Information Commissioner has said about “tokenization” and where we are going from here.

[1] Yes, my equality example is very simplified for teaching purposes, and there are some papers out there on “fully homomorphic encryption”, but note that FHE is still an area of research, and in any event (see tomorrow’s post), wouldn’t meet the BC Information Commissioner’s standard for extra-territorial storage of personal information.

Tokenization and Your Private Data (2)

So, (Day 1) the BC government’s vendors (and thus, by extension, the BC government) are hot to trot to use the cloud CRM to store the personal data of BC citizens. But, BC privacy law does not allow that. Whatever will the government do?

Enter stage left: “tokenization”. The CIO has recommended tokenization technology for Ministries looking to use the cloud CRM and other cloud services to manage private information:

Using tokenization – a method of substituting specified data fields for arbitrary values – these solutions allow for the use of foreign-based services while remaining within the residency-based restrictions of FOIPPA.
Bette-Jo Hughes, Oct 2, 2013

Tokenization is a strategy that takes every word in an input text, and replaces it with a random substitution “token”, and keeps track of the relationship between words and tokens. So, the input to a tokenization process would be N words, and the output would be N random numbers, and an N-entry dictionary matching the words to the numbers that replaced them.

Cryptography buffs will note that this is just a one-time pad, an old but unbreakable scheme for encoding messages, only operating word-by-word instead of letter-by-letter.

This seems like a nice trick!

Input          Dictionary        Output
Paul Ramsey    Paul = rtah       rtah hgat
               Ramsey = hgat
Paul Jones     Paul = fasp       fasp nasd
               Jones = nasd
Tim Jones      Tim = yhav        yhav imfa
               Jones = imfa
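For the programmers in the audience, here is a minimal Python sketch of that scheme. The `Tokenizer` class, its method names and the four-letter lowercase tokens are my own assumptions, chosen to match the example; this is an illustration, not any vendor's API.

```python
import secrets
import string


class Tokenizer:
    """Hypothetical tokenizing filter: a fresh random token per stored word."""

    def __init__(self, token_length=4):
        self.token_length = token_length
        self.token_to_word = {}  # the locally stored dictionary

    def _new_token(self):
        # Draw random tokens until we find one not already in use.
        while True:
            token = "".join(
                secrets.choice(string.ascii_lowercase)
                for _ in range(self.token_length)
            )
            if token not in self.token_to_word:
                return token

    def tokenize(self, text):
        """Replace every word with a brand-new random token."""
        tokens = []
        for word in text.split():
            token = self._new_token()
            self.token_to_word[token] = word
            tokens.append(token)
        return " ".join(tokens)

    def detokenize(self, tokenized):
        """Translate tokens coming back from the cloud into words."""
        return " ".join(self.token_to_word[t] for t in tokenized.split())


tokenizer = Tokenizer()
names = ("Paul Ramsey", "Paul Jones", "Tim Jones")
stored = [tokenizer.tokenize(name) for name in names]
restored = [tokenizer.detokenize(row) for row in stored]
```

Note that the two ‘Paul’s get different tokens, and that `token_to_word` ends up with one entry per stored word: the dictionary is as big as the data.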

If you are clever, you can put a tokenizing filter between your users and an American cloud service, and have the tokenizer replace the words you send to the service with tokens, and replace the tokens the service sends you with words. So the data stored at the service will be gobbledegook, but what you see on your screen will be words. Magic!

If all we wanted to do was just store data securely somewhere outside of Canada, and then get it back, “tokenization” would be a grand idea, but there’s a hitch.

  • First, storing tokenized data means storing 3 times the volume of the original (one copy of tokens stored in the cloud, and a locally stored dictionary that contains both the originals and the tokens). You get no benefit from the cloud from a storage standpoint (in fact it’s worse, you’re storing twice as much local data); and you get no redundancy benefit, since if you lose your local copy of the dictionary the cloud data becomes meaningless.
  • Second, and most importantly, this whole exercise isn’t about storing data, it’s about making use of a customer relationship management (CRM) system, and secure tokenization, as described above, is not consistent with using a CRM effectively.

Tomorrow, we’ll discuss why this most excellent “tokenization” magic doesn’t work if you want to use it inside a CRM (or any other system that expects its data to have meaning).

Tokenization and Your Private Data (1)

One morning this winter, while I was sipping my coffee at the cafe below our office, a well-dressed man and woman sat down at the table next to me, and started talking. Turns out, they were my favourite kind of people — IT people! They were going to bid on the Integrated Decision Making project, and were talking about my favourite systems integrator, Deloitte.

“Is Deloitte trying to bring ICM and Siebel into this project?” she asked.

“No, not anymore” he replied “now they are really pushing”

Now this was interesting! Chastened by their failure to shoehorn social services case management into a CRM, Deloitte has adroitly pivoted and is trying to shoehorn natural resource permitting into … a cloud CRM.

(I should parenthetically point out that, unsurprisingly, the SALES people in our company find a cloud CRM very useful in coordinating and tracking their SALES activities.)

Certainly pushing a platform that is actually growing in usage makes more sense than pushing one that end-of-lifed a decade ago, but still, again with the CRM?

Deloitte isn’t being coy with their plans; they are selling them to the highest levels of the government. On October 7, 2013, the BC CIO spent two and a half hours enjoying the hospitality of Deloitte and the CRM vendor at a “BC government executive luncheon” on the topic “Innovation, Transformation and Cloud Computing in the Public Sector”.

And there’s another wrinkle. The CRM vendor is a US-based cloud service provider, and our Freedom of Information and Protection of Privacy Act (FOIPPA) says that personal data must be stored in Canada. The vendor is also a US legal entity, which means it is subject to the PATRIOT Act, which allows authorities to access personal data without notifying the subject of the search. That is also not allowed by BC’s FOIPPA.

What is an ambitious system integrator with a hammer suitable for every nail to do? Not change hammers! That would be silly. Far better to try and get an exemption or figure out a workaround. Workarounds add nice juicy extra complexity to the hammer, which can only help billable hours.

More on the workaround, tomorrow.

Keynote @ FME User Conference

FME was one of the first geospatial tools I learned at the start of my career, back in the mid-90s, and getting invited to keynote the quintennial FME Users Conference this year was quite an honour, so I wrote up a special keynote just for them.

When is an IT project just an IT project?

And when is it something more?

Every year, I report on the progress of IT outsourcing in BC (news flash: it keeps going up, 2011, 2012, 2013) and marvel at the sums we lavish on international consultancies, fees that largely march offshore, generating no local innovation or economic growth.

Last fall, I came across a news release from the Ministry of Health, describing an $842 MILLION “Clinical and Systems Transformation Project”. I now realize I’ve not been tracking a significant seam of IT spending: the systems being commissioned by the five regional health authorities and their central services arm, the Provincial Health Services Authority.

Indeed, a quick perusal of the 2012/13 PHSA suppliers list shows a $50M spend on IBM, and an $11M spend on HP in just one year. That’s enough to change my annual spending tracker quite a bit!

So, IBM won the new “Clinical and Systems Transformation Project”, worth $842 MILLION over 10 years. I wonder what that RFP looked like? I asked for it, and was refused, so I FOI’ed it, and it came back. It’s 500 pages long. Have a look.

Fun sidebar: On page 186, in the “economic model” of the RFP, they direct that “proponents are to include 4% growth per year in infrastructure (e.g. storage capacity, network bandwidth, processing capacity, etc.) needs over the Term.” Any readers see a problem modelling IT capacity requirements at 4% growth per year over 10 years? Hint: A 2003 iMac shipped with 256MB of RAM; a 2013 iMac ships with 8GB of RAM: that’s 32 times more capacity. 4% compounding over 10 years generates only about a 50% increase in capacity over a decade. Think those terms will need to be renegotiated?
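If you want to check the arithmetic, a few lines of Python will do it. The only inputs are the 4% figure from the RFP and the two iMac RAM sizes from the hint; the variable names are mine.

```python
years = 10
rfp_growth = 1.04 ** years          # 4% compounding per year
imac_growth = (8 * 1024) / 256      # 8 GB vs 256 MB of RAM

# Annual growth rate that would actually produce a 32x increase.
implied_annual_rate = imac_growth ** (1 / years) - 1

print(f"RFP model: {rfp_growth:.2f}x over {years} years")
print(f"iMac RAM:  {imac_growth:.0f}x over {years} years")
print(f"Implied annual growth: {implied_annual_rate:.0%}")
```

The RFP model works out to about 1.48x over the decade, while the iMac comparison implies growth of roughly 41% a year.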

It’s a long read, but fortunately there’s a really interesting bit right away, in the Mandatory Requirements:

Proponent is willing and able to transition any Public Sector union agreements relevant to the Managed Services to their organization, if required

Whoa! This isn’t just an IT systems agreement after all, it’s an outsourcing deal.

The government seems to have learned little from the experience of BC Hydro outsourcing to Accenture or Medical Services Plan to Maximus, or from reports by the Auditor General, or even their own consultants who reviewed outsourcing from 2001-2010 and noted that:

  • Contracts were structured towards a specific solution or specific outputs rather than a desired outcome
  • Contracts were negotiated in isolation and gave the same scope of services to multiple vendors
  • The procurement process resulted in contracts that while defined, are no longer what is required
  • Risk transfer objectives were not met
  • There was no consolidated vendor management
  • There was no central management of the deals or the benefits achieved

The “Alternative Service Delivery Secretariat” wound down in 2010, but the government is still hard at it, now quietly preparing to outsource the clinical systems of three health authorities to IBM, for $84M a year over 10 years. Significant portions of critical government operations are being transferred beyond direct government control for very long periods of time.

Perhaps the managers who pushed this solution didn’t trust their own staff, or themselves, to successfully bring an ambitious project to conclusion. They didn’t want to “take the risk” so they took the “safe” option. They need to spend some time behind the velvet curtain in organizations like IBM or Accenture: the only results that matter to those organizations are the quarterly results.

There will be some good people in them, and some bad ones, but the level of competence or capability won’t be orders of magnitude better than you could build yourself in-house. And as organizations, as corporations, they have only one bottom line, and it’s theirs, not ours.