Some Privacy is More Private Than Others

One of the things that struck me in researching the long and tortuous story of how the government is trying to move British Columbians’ private data into offshore cloud computing services was the odd choice of the pilot project for the whole scheme: STADD.

What’s STADD? It’s “Services to Adults with Developmental Disabilities”.

That’s right, adults with developmental disabilities are the subjects of the BC government’s experiment to see “hmm, I wonder if we can offshore private data using fancy tokenization software”.

Let me put some icing on the cake.

The BC Liberal caucus has to manage information about the citizens who access services via their constituency offices. These are their “customers” and they use a “customer relationship management” (CRM) system to hold the information.

Are they storing this personal information offshore? Are they trying to shoehorn it into salesforce.com using tokenization software to avoid FOIPPA restrictions and protect their constituents from the PATRIOT Act?

No, that would be risky, that’s the kind of thing that STADD can pilot. The BC Liberal caucus uses a product called “Maximizer CRM”. Designed, built and hosted in… Vancouver, British Columbia.

FOSS4G 2014 in Portland, Oregon, September 2014

Just a quick public service announcement for blog followers in the Pacific Northwest and environs: you’ve got a once in a not-quite-lifetime opportunity to attend the “Free and Open Source Software for Geospatial” (aka FOSS4G) conference this year in nearby Portland, Oregon, a city so hip they have trouble seeing over their pelvis.

Anyone in the GIS / mapping world should take the opportunity to go, to learn about what technology the open source world has available for you, to meet the folks writing the software, and to learn from other folks like you who are building cool things.

September 8th-13th, be there and be square.

Tokenization and Your Private Data (5)

Recapping (last time):

  • (Day 1) The government is interested in using the salesforce.com CRM and other USA cloud applications, but the BC FOIPPA Act does not allow it.
  • (Day 2) So, the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.
  • (Day 3) But, using truly secure tokenization renders CRMs basically useless, so software vendors are flogging less secure forms of tokenization hoping that people won’t notice the reduced security levels because they still call it “tokenization”.
  • (Day 4) And, the BC Information & Privacy Commissioner distinguishes between “encryption” (which is considered inadequate protection for personal information held outside Canada) and “tokenization” (which is considered adequate (but only where the “tokenization” itself is “adequate” (which seems to mean “fully random”))).

While this series on tokenization has been a bomb with regular folks (my post on the BCTF and social media got 10x the traffic), one category of readers has really taken notice: tokenization vendors. I’ve gotten a number of emails, and some educational comments as well. (Hi guys!)

For the love of the vendors, I’ll repeat yesterday’s postscript. I think I have been overly harsh on the cloud security vendors, because there are really two questions here, which have very different answers:

  • Is less-than-perfect tokenization better than nothing? Yes, it’s a lot better than nothing. Even with less-than-perfect tokenization, employees of the cloud software companies can’t just casually read records in the database, and an entity wanting to break the security of the records would need to extract a pretty big corpus of records to analyze them to find information leaks and use them to break in.
  • Is less-than-perfect tokenization acceptable for BC? No, because of the FOIPPA law, and because the Commissioner has already set a very very very high bar by not allowing standard symmetric encryption (which can be very very secure) to be used to host personal data outside of Canada.

It’s worth re-visiting the two key phrases in the OIPC guidance, which are:

Tokenization is distinct from encryption; while encryption may be deciphered given sufficient computer analysis, tokens cannot be decoded without access to the crosswalk table.

What I take from this is that the OIPC is saying that “encryption” is vulnerable (it “may be deciphered”), and “tokenization” is not (it “cannot be decoded”). Now, as discussed on day 3, the “cannot be decoded” part is only true for a very small sub-set of “tokenization”, the kind that uses fully random tokens. And the OIPC is aware of this, though they only barely acknowledge it:

Public bodies may comply with FIPPA provided that the personal information is adequately tokenized and the crosswalk table is secured in Canada.

If you take “adequately tokenized” to mean tokenized such that “tokens cannot be decoded without access to the crosswalk table”, then you’re talking about an extremely restrictive definition of tokenization. A lot more restrictive than what vendors are talking about when they come to sell you tokenization.

The vendors who are phoning me and commenting here are worried that readers will see my critique and think “huh, tokenization is insecure”. And that’s not what I’m saying. What I’m saying is:

Practical use of tokenization in a USA cloud CRM is not consistent with the British Columbia OIPC’s incredibly narrow definition of an acceptable level of data security for personal information stored in foreign jurisdictions or under foreign control.
– Paul Ramsey, Just Now

If you’re just looking for a reasonable level of surety that your data in a cloud service cannot be easily poked and prodded by a third party (or the cloud service itself), and you don’t mind adding the extra level of complexity of interposing a tokenization service/server into your interactions with the cloud service, then by all means, a properly configured tokenization system would seem to fit the bill nicely.
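To make that concrete, here is a minimal sketch of what “interposing a tokenization service” looks like. Everything here is invented for illustration (the class name, the record layout, the token format, the cloud_crm API); a real product would proxy HTTP traffic rather than wrap an object, but the shape is the same: only random tokens ever leave Canada, and the crosswalk table stays home.

    import secrets

    class TokenizationGateway:
        """Sits between your users and the cloud CRM (hypothetical API)."""

        def __init__(self, cloud_crm):
            self.cloud = cloud_crm     # the external USA-hosted service
            self.crosswalk = {}        # token -> value, stored in Canada

        def _new_token(self):
            return secrets.token_hex(8)   # fully random, carries no meaning

        def save_record(self, record_id, record):
            # Replace every sensitive value with a fresh random token.
            tokenized = {}
            for field, value in record.items():
                token = self._new_token()
                self.crosswalk[token] = value
                tokenized[field] = token
            self.cloud.save(record_id, tokenized)  # only tokens go offshore

        def fetch_record(self, record_id):
            # Pull tokens back and translate them using the local table.
            tokenized = self.cloud.fetch(record_id)
            return {f: self.crosswalk[t] for f, t in tokenized.items()}

Without the crosswalk table, the cloud side holds nothing but random hex.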

YMMV.

Tokenization and Your Private Data (4)

Recapping:

  • (Day 1) The government is interested in using the salesforce.com CRM and other USA cloud applications, but the BC FOIPPA Act does not allow it.
  • (Day 2) So, the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.
  • (Day 3) But, using truly secure tokenization renders CRMs basically useless, so software vendors are flogging less secure forms of tokenization hoping that people won’t notice the reduced security levels because they still call it “tokenization”.

The BC CIO guidance on using USA cloud services has a certain breathless enthusiasm (is there any innovation more exciting than vendor innovation?) for the tokenization products vendors are bringing to market:

Vendors have begun to address this “data-residency” issue in innovative ways. As an example, Force.com, and CypherCloud offer solutions that allow sensitive or personal information to remain in Canada. Using tokenization – a method of substituting specified data fields for arbitrary values – these solutions allow for the use of foreign-based services while remaining within the residency-based restrictions of FOIPPA.
– BC OCIO, Data Residency and Tokenization

And the guidance released by BC’s Office of the Information & Privacy Commissioner (OIPC) at first glance appears to similarly swallow claims about tokenization hook, line and sinker.

Public bodies may comply with FIPPA provided that the personal information is adequately tokenized and the crosswalk table is secured in Canada.
– BC OIPC, Updated guidance on the storage of information outside of Canada by public bodies

However, the OIPC guidance has one small but important difference: the word “adequately”.

I met with a lawyer from the OIPC to discuss tokenization, and he was clear that the OIPC understood the very important difference between fully randomized tokenization (basically unbreakable, and “adequate”) and any other tokenization (potentially trivially breakable, and perhaps not “adequate”). This is reassuring, because the difference is not immediately obvious, and the tokenization software vendors are doing everything in their power to obscure the difference in their marketing materials.

It is not reassuring that the OIPC has opened the door to “tokenization” at all. The OIPC is sufficiently anal retentive about personal information that they have ruled that no form of standard encryption is secure enough for storing personal information outside Canada, because “encryption may be deciphered given sufficient computer analysis”. That’s right, the OIPC scoffs at your AES-256 encoded data, but is OK with “adequate” tokenization, for some undefined values of “adequate”.
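For contrast, here is roughly what that “inadequate” conventional approach looks like: a sketch using AES-256 in GCM mode via the Python cryptography package (the field value is invented, and key management is waved away).

    # pip install cryptography
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)   # 256-bit key, kept in Canada
    aesgcm = AESGCM(key)

    nonce = os.urandom(12)                      # unique per encryption
    plaintext = b"Paul Ramsey"
    ciphertext = aesgcm.encrypt(nonce, plaintext, None)

    # The nonce and ciphertext are what would be stored offshore.
    # Recovering the plaintext requires the key:
    assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext

Brute-forcing a random 256-bit key is not a realistic attack by any known means, so “may be deciphered given sufficient computer analysis” is doing an awful lot of work in the OIPC’s reasoning.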

The OIPC guidance spends two paragraphs on “re-identification” of data (the practice of mixing tokenized and un-tokenized fields in records), and spends five more on the legal and physical security of the tokenization crosswalk table (dictionary), but spends only one word (“adequately”) on whether or not the tokenization dictionary is full of junk.

The OIPC told me that, because fully random tokenization completely obscured the original data[1], they had to rule that fully tokenized personal data was no longer “personal information” and thus not covered by the Act. This strikes me as very lawyerly, but also very dangerous, since it opens the door for government to consider technical “tokenization” solutions from vendors that are likely far less secure than conventional approaches (like AES-256) that the OIPC has already rejected.

I’ll close with the good news: all plans to store personal data outside Canada are still subject to case-by-case review by the OIPC, there is thus far no blanket approval for systems that claim they “tokenize”, and the OIPC can still issue further guidance based on research that is going on right now. I’m not lighting my hair on fire, yet. But the door is cracked open, and the snake-oil salesmen are laying out their wares; let’s keep an eye on them.

[1] Again, implementation matters. At a minimum, even completely random word-based tokenization can leak information about how many words are in each field. Some implementations also don’t encode punctuation, so they leak symbols (“Smith & Wesson” becomes “faerqb & gabedfsara”) and other non-word entities. Depending on the input data, these small leakages can be significant.
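A toy version of that leak, with invented implementation details (one token per alphabetic word, everything else passed through):

    import secrets

    def tokenize_words(text, crosswalk):
        out = []
        for word in text.split():
            if word.isalpha():
                token = secrets.token_hex(4)
                crosswalk[token] = word
                out.append(token)
            else:
                out.append(word)   # '&', digits, etc. leak through unchanged
        return " ".join(out)

    crosswalk = {}
    print(tokenize_words("Smith & Wesson", crosswalk))
    # something like: '1d9ec2b7 & 77be02ad' -- the '&' and the
    # two-words-plus-symbol shape survive tokenization intact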

PostScript

In re-reading my series of posts, I think I have been overly harsh on the cloud security vendors, because there are really two questions here, which have very different answers:

  • Is less-than-perfect tokenization better than nothing? Yes, it’s a lot better than nothing. Even with less-than-perfect tokenization, employees of the cloud software companies can’t just casually read records in the database, and an entity wanting to break the security of the records would need to extract a pretty big corpus of records to analyze them to find information leaks and use them to break in.
  • Is less-than-perfect tokenization acceptable for BC? No, because of the FOIPPA law, and because the Commissioner has already set a very very very high bar by not allowing standard symmetric encryption (which can be very very secure) to be used to host personal data outside of Canada.

More on this tomorrow.

Tokenization and Your Private Data (3)

To recap:

  • (Day 1) The government is interested in using the salesforce.com CRM and other USA cloud applications, but the BC FOIPPA Act does not allow it.
  • (Day 2) So, the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.

BUT, and it’s a big BUT, storing securely tokenized data makes cloud applications mostly useless.

As we saw yesterday, secure tokenization replaces every input word with a completely random token. This is done in practice with a tokenization server that translates words to tokens and vice versa.
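Here is a minimal sketch of such a server, with invented names and hex tokens standing in for the ‘rtah’ and ‘hgat’ of the worked example below:

    from collections import defaultdict
    import secrets

    class SecureTokenizer:
        """Fully random tokenization: every occurrence of every word
        gets its own fresh token (hypothetical API)."""

        def __init__(self):
            self.crosswalk = {}               # token -> word
            self.by_word = defaultdict(list)  # word  -> every token issued

        def tokenize(self, word):
            token = secrets.token_hex(4)      # completely random
            self.crosswalk[token] = word
            self.by_word[word].append(token)
            return token

        def detokenize(self, token):
            return self.crosswalk[token]

        def tokens_for(self, word):
            # Query translation needs EVERY token ever issued for a
            # word -- the seed of the problem described below.
            return self.by_word[word]

    t = SecureTokenizer()
    t.tokenize("Paul"); t.tokenize("Paul")
    print(t.tokens_for("Paul"))   # two different tokens for the same word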

The tokenization server also has to translate user queries into tokenized equivalents. So if the user asked:

“Show me the record for ‘paul’ ‘ramsey’”

The filter would translate it into this query for the server:

“Show me the record for ‘rtah’ ‘hgat’”

Hm, the magic still seems to be working. But what about a search that returns more than one record?

“Show me all the records of people named ‘Paul’”

This is harder. In a secure tokenization system, every occurrence of every word ever stored gets its own unique token, even repeat occurrences of the same word. So the tokenizer now has to ask:

“Show me all the records that have firstname ‘rtah’ or ‘fasp’”

Our example has only two ‘Paul’s. Imagine a database with 50 thousand of them. The query would either not work at all or slow to a crawl. Can it be fixed? Sure!

We can fix the performance problem by just using the same token for every ‘Paul’ encountered by the system (and for every ‘Jones’, and so on).

Problem solved, now if I ask:

“Show me all the records of people named ‘Paul’”

The filter can translate it simply into:

“Show me all the records that have firstname ‘rtah’”

And no matter how many ‘Paul’s there are in the system it will work fine.

Just one (big) problem. Always substituting the same token for the same word turns “tokenization” from an uncrackable system into a trivial substitution cipher, like the ones you used in Grade 4 to write secret messages to your friends (only using words as the substitution elements instead of letters).
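A toy demonstration of why that’s fatal. The records are invented, but the attack is bog-standard frequency analysis: count the tokens, line the counts up against public name-frequency statistics, and the “uncrackable” tokens fall out one by one.

    from collections import Counter
    import secrets

    # Deterministic tokenization: same word, same token, every time.
    crosswalk = {}
    def tokenize(word):
        if word not in crosswalk:
            crosswalk[word] = secrets.token_hex(4)
        return crosswalk[word]

    # A tokenized surname column extracted from the cloud database:
    surnames = ["Smith"] * 40 + ["Jones"] * 25 + ["Ramsey"] * 2
    leaked = [tokenize(s) for s in surnames]

    # The attacker never sees the crosswalk table, only the counts:
    for token, count in Counter(leaked).most_common(2):
        print(token, count)   # the top token is almost certainly the
                              # most common surname in the population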

And things get even worse [1] as you add other, very common features people expect from their CRM software:

  • If you want to retrieve records in sorted order, then the tokens in the CRM must have the same sort order as the words they stand in for.
  • If you want to do substring matching (“give me all the names that start with ‘p’”) then the token’s internal structure must also reflect the internal structure of the original word. (A sketch of the sort-order leak follows this list.)
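Here is a toy version of the sort-order requirement, with an invented token-assignment scheme (real products use cleverer order-preserving constructions, but the leak is the same):

    import secrets

    # Assign tokens so that token sort order matches word sort order.
    vocabulary = sorted(["Adams", "Jones", "Ramsey", "Smith", "Zhu"])
    tokens = sorted(secrets.token_hex(4) for _ in vocabulary)
    crosswalk = dict(zip(vocabulary, tokens))

    # Now sorting the tokenized column sorts the hidden plaintexts too,
    # so an attacker with no crosswalk table still learns every
    # record's alphabetical rank:
    assert sorted(crosswalk.values()) == [crosswalk[w] for w in vocabulary]

Each CRM feature you preserve is another property of the plaintext the tokens have to carry, and every such property is information an attacker gets for free.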

None of this has stopped tokenization software vendors (like CipherCloud, one of the vendors being used by the BC government) from claiming to provide both the magic unbreakability of tokenization and full support for all the features of the backend CRM.

Cryptography buffs, interested in how CipherCloud could substantiate the claims it was making, started looking at the material it published in its manual and demonstrated at trade shows. Based on the publicly available material, one writer concluded:

The observed encryption has significant weaknesses, most of them inherent to a scheme that wants to encrypt data, while enabling the original application to perform operations such as search and sorting on the encrypted data without changing that application. There might be some advanced techniques (homomorphic encryption and the likes) that avoid these weaknesses, but at least the software demoed in the video does not use them.

In response, the company slapped the discussion site with a DMCA takedown notice. This is not the action of a company that is confident in its methods.

Tomorrow, I’ll look at what the Information & Privacy Commissioner has said about “tokenization” and where we are going from here.

[1] Yes, my equality example is very simplified for teaching purposes, and there are some papers out there on “fully homomorphic encryption”, but note that FHE is still an area of research, and in any event (see tomorrow’s post), wouldn’t meet the BC Information Commissioner’s standard for extra-territorial storage of personal information.