Tokenization and Your Private Data (4)

Recapping:

  • (Day 1) The government is interested in using the salesforce.com CRM and other USA cloud applications, but the BC FOIPPA Act does not allow it.
  • (Day 2) So, the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.
  • (Day 3) But, using truly secure tokenization renders CRMs basically useless, so software vendors are flogging less secure forms of tokenization hoping that people won’t notice the reduced security levels because they still call it “tokenization”.

The BC CIO guidance on using USA cloud services has a certain breathless enthusiasm (is there any innovation more exciting than vendor innovation?) for the tokenization products vendors are bringing to market:

Vendors have begun to address this “data-residency” issue in innovative ways. As an example, Force.com, and CypherCloud offer solutions that allow sensitive or personal information to remain in Canada. Using tokenization – a method of substituting specified data fields for arbitrary values – these solutions allow for the use of foreign-based services while remaining within the residency-based restrictions of FOIPPA.
– BC OCIO, Data Residency and Tokenization

And the guidance released by BC’s Office of the Information & Privacy Commissioner (OIPC) at first glance appears to similarly swallow claims about tokenization hook, line and sinker.

Public bodies may comply with FIPPA provided that the personal information is adequately tokenized and the crosswalk table is secured in Canada.
– BC OIPC, Updated guidance on the storage of information outside of Canada by public bodies

However, the OIPC guidance has one small but important difference: the word “adequately”.

I met with a lawyer from the OIPC’s office to discuss tokenization, and he was clear that the OIPC understood the very important difference between fully randomized tokenization (basically unbreakable, and “adequate”) and any other tokenization (potentially trivially breakable, and perhaps not “adequate”). This is reassuring, because the difference is not immediately obvious, and the tokenization software vendors are doing everything in their power to obscure the difference in their marketing materials.

It is not reassuring that the OIPC has opened the door to “tokenization” at all. The OIPC is sufficiently anal retentive about personal information that they have ruled that no form of standard encryption is secure enough for storing personal information outside Canada, because “encryption may be deciphered given sufficient computer analysis”. That’s right, the OIPC scoffs at your AES-256-encrypted data, but is OK with “adequate” tokenization, for some undefined values of “adequate”.

The OIPC guidance spends two paragraphs on “re-identification” of data (the practice of mixing tokenized and un-tokenized fields in records), and spends five more on the legal and physical security of the tokenization crosswalk table (dictionary), but spends only one word (“adequately”) on whether or not the tokenization dictionary is full of junk.

The OIPC told me that, because fully random tokenization completely obscured the original data[1], they had to rule that fully tokenized personal data was no longer “personal information” and thus not covered by the Act. This strikes me as very lawyerly, but also very dangerous, since it opens the door for government to consider technical “tokenization” solutions from vendors that are likely far less secure than conventional approaches (like AES-256) that the OIPC has already rejected.

I’ll close with the good news: all plans to store personal data outside Canada are still subject to case-by-case review by the OIPC, there is thus far no blanket approval for systems that claim they “tokenize”, and the OIPC can still issue further guidance based on research that is going on right now. I’m not lighting my hair on fire, yet. But the door is cracked open, and the snake-oil salesmen are laying out their wares, so let’s keep an eye on them.

[1] Again, implementation matters. At a minimum, even completely random word-based tokenization can leak information about how many words are in each field. Some implementations also don’t encode punctuation, so they leak symbols (“Smith & Wesson” becomes “faerqb & gabedfsara”) and other non-word entities. Depending on the input data, these small leakages can be significant.
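To see the leak for yourself, here is a minimal Python sketch of that kind of naive word-level tokenizer. The token format and names are invented for illustration; no vendor’s actual product is being described.

    import random
    import re
    import string

    def leaky_tokenize(text, crosswalk):
        # Only runs of letters get replaced; punctuation, spacing and the
        # number of words all pass straight through to the "tokenized" output.
        def substitute(match):
            token = "".join(random.choices(string.ascii_lowercase,
                                           k=random.randint(4, 10)))
            crosswalk[token] = match.group(0)
            return token
        return re.sub(r"[A-Za-z]+", substitute, text)

    crosswalk = {}
    print(leaky_tokenize("Smith & Wesson", crosswalk))
    # e.g. "faerqb & gabedfsara" -- the "&" and the word count survive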

PostScript

In re-reading my series of posts, I think I have been overly harsh on the cloud security vendors, because there are really two questions here, which have very different answers:

  • Is less-than-perfect tokenization better than nothing? Yes, it’s a lot better than nothing. Even with less-than-perfect tokenization, employees of the cloud software companies can’t just casually read records in the database, and an entity wanting to break the security of the records would need to extract a pretty big corpus of records, analyze it for information leaks, and then use those leaks to break in.
  • Is less-than-perfect tokenization acceptable for BC? No, because of the FOIPPA law, and because the Commissioner has already set a very very very high bar by not allowing standard symmetric encryption (which can be very very secure) to be used to host personal data outside of Canada.

More on this tomorrow.

Tokenization and Your Private Data (3)

To recap:

  • (Day 1) The government is interested in using the salesforce.com CRM and other USA cloud applications, but the BC FOIPPA Act does not allow it.
  • (Day 2) So the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.

BUT, and it’s a big BUT, storing securely tokenized data makes cloud applications mostly useless.

As we saw yesterday, secure tokenization replaces every input word with a completely random token. This is done in practice with a tokenization server that translates words to tokens and vice versa.

The tokenization server also has to translate user queries into tokenized equivalents. So if the user asked:

“Show me the record for ‘paul’ ‘ramsey’”

The filter would translate it into this query for the server:

“Show me the record for ‘rtah’ ‘hgat’”
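To make the word-for-token swap concrete, here is a minimal Python sketch of what the filter is doing, assuming (for the moment) a crosswalk with a single token for each of these words. The names and query shape are invented for illustration, not any vendor’s actual API.

    # Hypothetical crosswalk for the record in question.
    crosswalk = {"paul": "rtah", "ramsey": "hgat"}

    def tokenize_query(words, crosswalk):
        # Swap each plaintext query word for its stored token before the
        # query is forwarded to the cloud CRM.
        return [crosswalk[word] for word in words]

    print(tokenize_query(["paul", "ramsey"], crosswalk))  # ['rtah', 'hgat']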

Hm, the magic still seems to be working. But what about a search that returns more than one record?

“Show me all the records of people named ‘Paul’”

This is harder. In a secure tokenization system, there’s a unique token for every word ever stored, even repeat occurrences of the same word. So the tokenizer now has to ask:

“Show me all the records that have firstname ‘rtah’ or ‘fasp’”
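In a fully random scheme the crosswalk maps each word to every token ever issued for it, so the filter has to expand a one-word search into an OR over all of those tokens. Another hypothetical Python sketch (the query syntax is made up for illustration):

    # Hypothetical crosswalk: one word maps to every token ever issued for it.
    crosswalk = {"paul": ["rtah", "fasp"], "tim": ["yhav"]}

    def expand_query(word, crosswalk):
        # Every token ever issued for this word has to be OR'd together.
        return " OR ".join("firstname = '%s'" % t for t in crosswalk[word])

    print(expand_query("paul", crosswalk))
    # firstname = 'rtah' OR firstname = 'fasp'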

Our example has only two ‘Paul’s. Imagine a database with 50 thousand ‘Paul’s. The query functionality would either not work at all or slow to a crawl. Can it be fixed? Sure!

We can fix the performance problem by just using the same token for every ‘Paul’ encountered by the system (and for every ‘Jones’, and so on).

Problem solved! Now if I ask:

“Show me all the records of people named ‘Paul’”

The filter can translate it simply into:

“Show me all the records that have firstname ‘rtah’”

And no matter how many ‘Paul’s there are in the system, it will work fine.

Just one (big) problem. Always substituting the same token for the same word turns “tokenization” from an uncrackable system into a trivial substitution cipher, like the ones you used in Grade 4 to write secret messages to your friends (only using words as the substitution elements instead of letters).
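Deterministic tokenization is easy to sketch, and so is the reason it worries me. A hypothetical Python illustration (the data and token format are invented):

    import random
    import string

    def new_token():
        return "".join(random.choices(string.ascii_lowercase, k=4))

    def tokenize_deterministic(words, crosswalk):
        # Same word in, same token out: that's what makes queries fast, and
        # also what turns the output into a word-level substitution cipher.
        output = []
        for word in words:
            if word not in crosswalk:
                crosswalk[word] = new_token()
            output.append(crosswalk[word])
        return output

    crosswalk = {}
    print(tokenize_deterministic(["paul", "ramsey"], crosswalk))
    print(tokenize_deterministic(["paul", "jones"], crosswalk))
    # 'paul' gets the same token both times, so token frequencies mirror
    # name frequencies -- exactly what frequency analysis feeds on.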

And things get even worse [1] as you add other, very common features people expect from their CRM software:

  • If you want to retrieve records in sorted order, then the tokens in the CRM must have the same sort order as the words they stand in for (see the sketch after this list).
  • If you want to do substring matching (“give me all the names that start with ‘p’”) then the token internal structure must also reflect the internal structure of the original word.
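Here’s the sketch I promised above. One naive way to make tokens sort like the words they replace is to hand out random tokens in sorted order against a sorted vocabulary. This is purely illustrative (a real product would have to handle new words arriving over time), but it shows the cost: the tokens now reveal the relative ordering of the hidden words.

    import random
    import string

    def order_preserving_tokens(vocabulary):
        # Assumes the whole vocabulary is known up front; a real system would
        # need something cleverer, which leaks in subtler ways.
        words = sorted(set(vocabulary))
        tokens = sorted("".join(random.choices(string.ascii_lowercase, k=8))
                        for _ in words)
        # The i-th smallest word gets the i-th smallest token, so sorting the
        # tokens sorts the hidden words too.
        return dict(zip(words, tokens))

    print(order_preserving_tokens(["ramsey", "jones", "adams"]))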

None of this has stopped tokenization software vendors (like CipherCloud, one of the vendors being used by the BC government) from claiming to provide both the magic unbreakability of tokenization and full support for all the features of the backend CRM.

Cryptography buffs, interested in how CipherCloud could substantiate the claims it was making, started looking at the material it published in its manual and demonstrated at trade shows. Based on the publicly available material, one writer concluded:

The observed encryption has significant weaknesses, most of them inherent to a scheme that wants to encrypt data, while enabling the original application to perform operations such as search and sorting on the encrypted data without changing that application. There might be some advanced techniques (homomorphic encryption and the likes) that avoid these weaknesses, but at least the software demoed in the video does not use them.

In response, the company slapped the discussion site with a DMCA takedown notice. This is not the action of a company that is confident in its methods.

Tomorrow, I’ll look at what the Information and Privacy Commissioner has said about “tokenization” and where we are going from here.

[1] Yes, my equality example is very simplified for teaching purposes, and there are some papers out there on “fully homomorphic encryption”, but note that FHE is still an area of research and, in any event (see tomorrow’s post), it wouldn’t meet the BC Information Commissioner’s standard for extra-territorial storage of personal information.

Tokenization and Your Private Data (2)

So, (Day 1) the BC government’s vendors (and thus, by extension, the BC government) are hot to trot to use the salesforce.com cloud CRM to store the personal data of BC citizens. But, BC privacy law does not allow that. Whatever will the government do?

Enter stage left: “tokenization”. The CIO has recommended tokenization technology for Ministries looking to use salesforce.com and other cloud services to manage private information:

Using tokenization – a method of substituting specified data fields for arbitrary values – these solutions allow for the use of foreign-based services while remaining within the residency-based restrictions of FOIPPA.
Bette-Jo Hughes, Oct 2, 2013

Tokenization is a strategy that takes every word in an input text, replaces it with a random substitution “token”, and keeps track of the relationship between words and tokens. So, the input to a tokenization process would be N words, and the output would be N random tokens, plus an N-entry dictionary matching the words to the tokens that replaced them.

Cryptography buffs will note that this is just a one-time pad, an old but unbreakable scheme for encrypting messages, only operating word-by-word instead of letter-by-letter.

This seems like a nice trick!

Input          Dictionary                    Output
Paul Ramsey    Paul = rtah, Ramsey = hgat    rtah hgat
Paul Jones     Paul = fasp, Jones = nasd     fasp nasd
Tim Jones      Tim = yhav, Jones = imfa      yhav imfa
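For the programmers in the audience, here is a minimal Python sketch of that fully random scheme. The token format is invented for illustration; a real product would look different.

    import random
    import string

    def secure_tokenize(text):
        # Every word gets a fresh random token, even repeats, so the stored
        # output carries no statistical trace of the input -- the word-level
        # equivalent of a one-time pad.
        dictionary = []
        output = []
        for word in text.split():
            token = "".join(random.choices(string.ascii_lowercase, k=4))
            dictionary.append((word, token))
            output.append(token)
        return output, dictionary

    output, dictionary = secure_tokenize("Paul Ramsey Paul Jones Tim Jones")
    print(output)      # e.g. ['rtah', 'hgat', 'fasp', 'nasd', 'yhav', 'imfa']
    print(dictionary)  # the crosswalk that has to stay in Canada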

If you are clever, you can put a tokenizing filter between your users and American web sites like SF.com, and have the tokenizer replace the words you send to SF.com with tokens, and replace the tokens SF.com sends you with words. So the data at SF.com will be gobbledegook, but what you see on your screen will be words. Magic!

If all we wanted to do was just store data securely somewhere outside of Canada, and then get it back, “tokenization” would be a grand idea, but there’s a hitch.

  • First, storing tokenized data means storing three times the volume of the original (one copy of tokens stored at salesforce.com, and a locally stored dictionary that contains both the originals and the tokens). You get no storage benefit from the cloud (in fact it’s worse: you’re storing twice as much data locally), and you get no redundancy benefit, since if you lose your local copy of the dictionary the cloud data becomes meaningless.
  • Second, and most importantly, this whole exercise isn’t about storing data, it’s about making use of a customer relationship management (CRM) system, salesforce.com, and secure tokenization, as described above, is not consistent with using salesforce.com effectively.

Tomorrow, we’ll discuss why this most excellent “tokenization” magic doesn’t work if you want to use it inside a CRM (or any other system that expects its data to have meaning).

Tokenization and Your Private Data (1)

One morning this winter, while I was sipping my coffee at the cafe below our office, a well-dressed man and woman sat down at the table next to me, and started talking. Turns out, they were my favourite kind of people — IT people! They were going to bid on the Integrated Decision Making project, and were talking about my favourite systems integrator, Deloitte.

“Is Deloitte trying to bring ICM and Siebel into this project?” she asked.

“No, not anymore,” he replied, “now they are really pushing SalesForce.com.”

Now this was interesting! Chastened by their failure to shoehorn social services case management into a CRM, Deloitte has adroitly pivoted and is trying to shoehorn natural resource permitting into … a cloud CRM.

(I should parenthetically point out that, unsurprisingly, the SALES people in our company find SALESforce.com very useful in coordinating and tracking their SALES activities.)

Certainly pushing a platform that is actually growing in usage makes more sense than pushing one that end-of-lifed a decade ago, but still, again with the CRM?

Deloitte isn’t being coy with their plans; they are selling them to the highest levels of the government. On October 7, 2013, the BC CIO spent two and a half hours enjoying the hospitality of Deloitte and Salesforce.com at a “BC government executive luncheon” on the topic “Innovation, Transformation and Cloud Computing in the Public Sector”.

And there’s another wrinkle. SF.com is a US-based cloud service provider, and our Freedom of Information and Protection of Privacy Act (FOIPPA) says that personal data must be stored in Canada. SF.com is also a US legal entity, which means it is subject to the PATRIOT Act, which allows authorities to access personal data without notifying the subject of the search. That is also not allowed by BC’s FOIPPA.

What is an ambitious system integrator with a hammer suitable for every nail to do? Not change hammers! That would be silly. Far better to try and get an exemption or figure out a workaround. Workarounds add nice juicy extra complexity to the hammer, which can only help billable hours.

More on the workaround, tomorrow.

Keynote @ FME User Conference

FME was one of the first geospatial tools I learned at the start of my career, back in the mid-90s, and getting invited to keynote the quintennial FME Users Conference this year was quite an honour, so I wrote up a special keynote just for them.