Tokenization and Your Private Data (2)
01 Jul 2014

So, as covered on Day 1, the BC government’s vendors (and thus, by extension, the BC government) are hot to trot to use the salesforce.com cloud CRM to store the personal data of BC citizens. But BC privacy law does not allow that. Whatever will the government do?
Enter stage left: “tokenization”. The CIO has recommended tokenization technology for Ministries looking to use salesforce.com and other cloud services to manage private information:
Using tokenization – a method of substituting specified data fields for arbitrary values – these solutions allow for the use of foreign-based services while remaining within the residency-based restrictions of FOIPPA.
— Bette-Jo Hughes, Oct 2, 2013
Tokenization is a strategy that takes every word in an input text, and replaces it with a random substitution “token”, and keeps track of the relationship between words and tokens. So, the input to a tokenization process would be N words, and the output would be N random numbers, and an N-entry dictionary matching the words to the numbers that replaced them.
Cryptography buffs will note that this is just a one-time pad, an old but unbreakable scheme for encoding messages, only operating word-by-word instead of letter-by-letter.
This seems like a nice trick!
Input | Dictionary | Output |
---|---|---|
Paul Ramsey | Paul = rtah, Ramsey = hgat | rtah hgat |
Paul Jones | Paul = fasp, Jones = nasd | fasp nasd |
Tim Jones | Tim = yhav, Jones = imfa | yhav imfa |
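To make the table concrete, here is a minimal sketch of the scheme in Python. The function names (`random_token`, `tokenize`, `detokenize`) and the four-letter token format are assumptions made for illustration, not how any particular vendor product works. Every word occurrence gets a fresh random token, so "Paul" is replaced by a different token each time it appears, and the dictionary ends up with one entry per input word.

```python
import secrets
import string


def random_token(length=4):
    """Return a random lowercase token such as 'rtah' (hypothetical format)."""
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))


def tokenize(text, dictionary=None):
    """Replace every word occurrence in text with a fresh random token.

    Returns the tokenized text and the dictionary mapping token -> word.
    The dictionary is keyed by token because the same word gets a
    different token on every occurrence.
    """
    dictionary = {} if dictionary is None else dictionary
    tokens = []
    for word in text.split():
        token = random_token()
        while token in dictionary:  # re-draw on the rare collision
            token = random_token()
        dictionary[token] = word
        tokens.append(token)
    return " ".join(tokens), dictionary


def detokenize(tokenized, dictionary):
    """Reverse the substitution using the locally held dictionary."""
    return " ".join(dictionary[token] for token in tokenized.split())


original = "Paul Ramsey Paul Jones Tim Jones"
tokenized, dictionary = tokenize(original)
restored = detokenize(tokenized, dictionary)
```

Six words in, six random tokens out, and a six-entry dictionary. Without the dictionary, the token string carries no information about the names beyond how many words there were.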
If you are clever, you can put a tokenizing filter between your users and American web sites like SF.com, and have the tokenizer replace the words you send to SF.com with tokens, and replace the tokens SF.com sends you with words. So the data at SF.com will be gobbledegook, but what you see on your screen will be words. Magic!
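The filter arrangement might be sketched roughly as follows. Everything here is hypothetical: `TokenizingFilter` stands in for the vendor's proxy, and `FakeCloudCRM` is a toy in-memory stand-in for the remote service, not the salesforce.com API. The point is only to show where the words live and where the tokens live.

```python
import secrets


class TokenizingFilter:
    """Hypothetical local proxy sitting between users and a cloud service."""

    def __init__(self):
        # token -> original word; this vault never leaves the local network
        self._vault = {}

    def outbound(self, text):
        """Swap each word for a fresh random token before it is sent out."""
        tokens = []
        for word in text.split():
            token = secrets.token_hex(4)
            while token in self._vault:  # re-draw on the rare collision
                token = secrets.token_hex(4)
            self._vault[token] = word
            tokens.append(token)
        return " ".join(tokens)

    def inbound(self, text):
        """Swap known tokens back to words; pass anything else through."""
        return " ".join(self._vault.get(t, t) for t in text.split())


class FakeCloudCRM:
    """Toy stand-in for the remote service: it only ever sees tokens."""

    def __init__(self):
        self.records = {}

    def save(self, key, value):
        self.records[key] = value

    def load(self, key):
        return self.records[key]


proxy = TokenizingFilter()
crm = FakeCloudCRM()

crm.save("contact-1", proxy.outbound("Paul Ramsey"))
stored_remotely = crm.load("contact-1")       # gobbledegook
shown_to_user = proxy.inbound(stored_remotely)  # words again
```

Note that the vault in the local proxy is the only thing that gives the remote data any meaning.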
If all we wanted to do was store data securely somewhere outside of Canada, and then get it back, “tokenization” would be a grand idea, but there are two hitches.
- First, storing tokenized data means storing three times the volume of the original: one copy of the tokens at salesforce.com, plus a locally stored dictionary that contains both the original words and the tokens. You get no benefit from the cloud from a storage standpoint (in fact it’s worse, since you’re storing twice as much local data), and you get no redundancy benefit, since if you lose your local copy of the dictionary the cloud data becomes meaningless.
- Second, and most importantly, this whole exercise isn’t about storing data. It’s about making use of a customer relationship management (CRM) system, salesforce.com, and secure tokenization, as described above, is not consistent with using salesforce.com effectively.
Tomorrow, we’ll discuss why this most excellent “tokenization” magic doesn’t work if you want to use it inside a CRM (or any other system that expects its data to have meaning).