Tokenization and Your Private Data (3)
02 Jul 2014
To recap:
- (Day 1) The government is interested in using the salesforce.com CRM and other USA cloud applications, but the BC FOIPPA Act does not allow it,
- (Day 2) So the BC CIO has recommended “tokenization” systems to make personal information 100% obscured before storage in USA cloud applications.
BUT, and it’s a big BUT, storing securely tokenized data makes cloud applications mostly useless.
As we saw yesterday, secure tokenization replaces every input word with a completely random token. This is done in practice with a tokenization server that translates words to tokens and vice versa.
The tokenization server also has to translate user queries into tokenized equivalents. So if the user asked:
“Show me the record for ‘paul’ ‘ramsey’”
The tokenization server, acting as a filter, would translate it into this query for the cloud application:
“Show me the record for ‘rtah’ ‘hgat’”
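To make the moving parts concrete, here is a minimal Python sketch of such a tokenization server. The class and method names (TokenVault, tokenize, detokenize, tokens_for) are made up for this post and are not any vendor's API; the only assumption is that the server keeps a private lookup table between words and random tokens.

```python
import secrets


class TokenVault:
    """Toy tokenization server: a fresh random token for every stored word.

    Hypothetical illustration only, not any vendor's implementation.
    """

    def __init__(self):
        self._word_for = {}    # token -> original word (never leaves the server)
        self._tokens_for = {}  # word -> every token ever issued for that word

    def tokenize(self, word):
        """Issue a new random token for this occurrence of the word."""
        token = secrets.token_hex(8)
        while token in self._word_for:  # retry on the (very unlikely) collision
            token = secrets.token_hex(8)
        self._word_for[token] = word
        self._tokens_for.setdefault(word, []).append(token)
        return token

    def detokenize(self, token):
        """Translate a token coming back from the cloud into the real word."""
        return self._word_for[token]

    def tokens_for(self, word):
        """All tokens a query for this word has to be translated into."""
        return list(self._tokens_for.get(word, []))


vault = TokenVault()
first = vault.tokenize("paul")
last = vault.tokenize("ramsey")
print(vault.detokenize(first), vault.detokenize(last))  # paul ramsey
```

The cloud application only ever sees the tokens; the lookup table stays on the tokenization server.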
Hm, the magic still seems to be working. But what about a search that returns more than one record?
“Show me all the records of people named ‘Paul’”
This is harder. In a secure tokenization system, every stored occurrence of a word gets its own unique token, even when the same word has been stored before. So the tokenizer now has to ask:
“Show me all the records that have firstname ‘rtah’ or ‘fasp’”
Our example has only two ‘Paul’s. Imagine this example with a database with 50 thousand ‘Paul’s. The query functionality would either not work or slow to a crawl. Can it be fixed? Sure!
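A rough Python sketch shows how quickly the translated query grows. The helper names (store_names, translate_filter) and the filter syntax are invented for illustration, assuming one fresh random token per stored name.

```python
import secrets


def store_names(names):
    """Issue a fresh random token per stored name; return a word -> tokens map."""
    issued = {}
    for name in names:
        issued.setdefault(name, []).append(secrets.token_hex(8))
    return issued


def translate_filter(issued, name):
    """Rewrite "firstname is name" as an OR over every token issued for name."""
    tokens = issued.get(name, [])
    return " or ".join(f"firstname '{token}'" for token in tokens)


# Hypothetical database: 50 thousand Pauls and one Mary.
issued = store_names(["paul"] * 50_000 + ["mary"])
query = translate_filter(issued, "paul")
print(query.count(" or ") + 1)  # number of OR terms: 50000
```

One equality test in the user's head becomes fifty thousand terms on the wire, and the cloud application has to evaluate every one of them.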
We can fix the performance problem by just using the same token for every ‘Paul’ encountered by the system (and for every ‘Jones’, and so on).
Problem solved. Now if I ask:
“Show me all the records of people named ‘Paul’”
The filter can translate it simply into:
“Show me all the records that have firstname ‘rtah’”
And no matter how many ‘Paul’s there are in the system it will work fine.
Just one (big) problem. Always substituting the same token for the same word turns “tokenization” from an uncrackable system into a trivial substitution cipher, like the ones you used in Grade 4 to write secret messages to your friends (only using words as the substitution elements instead of letters).
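Here is a small Python sketch of the classic attack on a substitution cipher, frequency analysis, applied to tokens. The token column and the name frequencies are made-up data, and guess_mapping is a hypothetical helper; a real attacker would use public name statistics and a much larger column.

```python
from collections import Counter


def rank_by_frequency(items):
    """Return the distinct items, most common first."""
    return [item for item, _ in Counter(items).most_common()]


def guess_mapping(token_column, public_name_frequencies):
    """Pair the most common token with the most common name, and so on down."""
    tokens_ranked = rank_by_frequency(token_column)
    names_ranked = sorted(
        public_name_frequencies,
        key=public_name_frequencies.get,
        reverse=True,
    )
    return dict(zip(tokens_ranked, names_ranked))


# Made-up "firstname" column as stored in the cloud: same word, same token.
stolen_column = ["rtah"] * 5 + ["qzzv"] * 3 + ["mmkd"]
# Made-up public statistics on how common each first name is.
name_frequencies = {"paul": 0.05, "mary": 0.03, "zebedee": 0.001}

print(guess_mapping(stolen_column, name_frequencies))
# {'rtah': 'paul', 'qzzv': 'mary', 'mmkd': 'zebedee'}
```

No key and no access to the tokenization server are needed, only the stored tokens and some knowledge of what the data probably looks like.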
And things get even worse [1] as you add other, very common features people expect from their CRM software:
- If you want to retrieve records in sorted order, then the tokens in the CRM must have the same sort order as the words they stand in for.
- If you want to do substring matching (“give me all the names that start with ‘p’”) then the token internal structure must also reflect the internal structure of the original word.
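The sorting requirement can be sketched in a few lines of Python. This is a toy scheme invented for this post (order_preserving_tokens is a hypothetical helper that numbers the words by rank), not how any particular product works, but any scheme that lets the CRM sort tokens correctly has to leak the same thing.

```python
def order_preserving_tokens(words):
    """Assign each distinct word a token that sorts the same way the word does."""
    ranked = sorted(set(words))
    return {word: f"{rank * 1000:08d}" for rank, word in enumerate(ranked)}


names = ["paul", "adam", "zoe", "mary"]
tokens = order_preserving_tokens(names)

# The CRM can sort by token and get the records in name order...
print(sorted(tokens, key=tokens.get))  # ['adam', 'mary', 'paul', 'zoe']

# ...but so can anyone else. An attacker who learns a single pair
# ("paul" and its token) now knows which tokens hide names before "paul".
known = tokens["paul"]
before_paul = [token for token in tokens.values() if token < known]
print(len(before_paul))  # 2
```

Each known word narrows down all the others, and prefix matching leaks even more, since tokens for names starting with the same letters must visibly share a beginning.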
None of this has stopped tokenization software vendors (like CipherCloud, one of the vendors being used by the BC government) from claiming to be able to both provide the magic unbreakability of tokenization while still supporting all the features of the backend CRM.
Cryptography buffs, interested in how CipherCloud could substantiate the claims it was making, started looking at the material it published in its manual and demonstrated at trade shows. Based on the publicly available material, one writer concluded:
The observed encryption has significant weaknesses, most of them inherent to a scheme that wants to encrypt data, while enabling the original application to perform operations such as search and sorting on the encrypted data without changing that application. There might be some advanced techniques (homomorphic encryption and the likes) that avoid these weaknesses, but at least the software demoed in the video does not use them.
In response, the company slapped the site hosting that discussion with a DMCA takedown notice. This is not the action of a company that is confident in its methods.
Tomorrow, I’ll look at what the Freedom of Information Commissioner has said about “tokenization” and where we are going from here.
[1] Yes, my equality example is very simplified for teaching purposes, and there are some papers out there on “fully homomorphic encryption”, but note that FHE is still an area of research, and in any event (see tomorrow’s post), wouldn’t meet the BC Information Commissioner’s standard for extra-territorial storage of personal information.