02 August 2007

Data v. Information

I had fun with a previous post regarding a data visualization tool called Many Eyes, from IBM alphaWorks Services. There are some nice graphing templates available, but pretty graphs alone do not a wonderful experience make. OpenOffice Calc and Microsoft Excel can produce a multitude of graphs in a variety of canned formats, but do they really help one understand the data being presented?

Are they capable, as tools, of transforming data into information? The distinction may or may not be subtle, but the implications are huge. We're generally overrun with data and consider much of it to be throw-away. Information, however, is data with some sort of context applied to it, and one holds onto it as long as possible: the context (the transform or function applied to a data set) increases the data's value and elevates it to information.

Consider a couple of simple examples:

What does this string of data mean, if anything: 011903124555555
  1. Well, it could be a random string of 15 digits and not very interesting (highly likely).
  2. Out-of-country phone dialing number? (yes, US Embassy in Turkey)
  3. Credit card number? (close to the Visa/MasterCard format, which is 16 digits, but not a valid number)
  4. USPS/FedEx/UPS/DHL tracking number? (UPS if you drop their "1Z" prefix)
  5. US social security number? (Massachusetts SSN with some cruft appended to the end).
  6. Product SKU? (I seem to recall that there are standardized SKU formats)
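
To make the point concrete, here is a minimal Python sketch that reads the same 15 digits under three of the contexts above. The function names and field layouts are my own assumptions for illustration (a US-style 011 exit code followed by a 2-digit country code and 3-digit area code, and a 9-digit SSN followed by leftover digits). The Luhn checksum is the standard check used for payment card numbers.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def as_us_international_call(digits: str) -> dict:
    """Assumed layout: 3-digit exit code, 2-digit country, 3-digit area, rest."""
    return {
        "exit_code": digits[0:3],
        "country": digits[3:5],
        "area": digits[5:8],
        "subscriber": digits[8:],
    }


def as_ssn_with_cruft(digits: str) -> dict:
    """Assumed layout: first nine digits are an SSN, the remainder is cruft."""
    return {"ssn": digits[:9], "cruft": digits[9:]}


raw = "011903124555555"

# The same data, three different contexts, three different "meanings".
readings = {
    "phone": as_us_international_call(raw),
    "card_checksum_ok": luhn_valid(raw),
    "ssn": as_ssn_with_cruft(raw),
}
for context, meaning in readings.items():
    print(context, meaning)
```

None of these readings is "right" on the strength of the digits alone; each function simply imposes a context and returns whatever falls out.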
We just don't know, without any context applied to it. Now, what if we thought about another string of digits in the context of identity theft:

  • 034011234,Last,First,Acct#

Huh... that looks important and should probably be protected. Maybe it's a person with an account number and a Massachusetts SSN on record. The problem, though, is what happens if the suspect data is changed to:

  • Acct#,034011234,Last,First

It could become meaningless: simply reordering the data elements changes the transformation, and the context may no longer be identifiable, leaving the data as just data. There's a good chance, however, that in this specific case the context could still be inferred. What happens if we eliminate the comma delimiters and just spew a line of text in the hope that it will be properly caught and processed?

  • Acct#034011234LastFirst

Here we have an example where Acct# and SSN have been concatenated and probably lose meaning outside of the process that knows to stop reading the Acct# field after X characters and read the next nine characters as the SSN. First and last names can be extremely difficult to distinguish without capitalization and/or localized knowledge of standard names. Michael Smith may mean nothing to a non-English speaker.
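
A rough Python sketch of that difference, under stated assumptions: the helper names are hypothetical, the field order matches the reordered example above, and the account field width (the "X" above) is taken to be 5 characters only because the placeholder "Acct#" happens to be that long.

```python
SSN_WIDTH = 9  # a US SSN is nine digits


def parse_delimited(line: str, order=("acct", "ssn", "last", "first")) -> dict:
    """Split on commas; the assumed field order supplies the context."""
    return dict(zip(order, line.split(",")))


def parse_fixed_width(line: str, acct_width: int) -> dict:
    """Slice by position; only a reader who knows acct_width gets it right."""
    ssn_end = acct_width + SSN_WIDTH
    return {
        "acct": line[:acct_width],
        "ssn": line[acct_width:ssn_end],
        "names": line[ssn_end:],  # no delimiter left to separate last from first
    }


delimited = parse_delimited("Acct#,034011234,Last,First")
right_width = parse_fixed_width("Acct#034011234LastFirst", acct_width=5)
wrong_width = parse_fixed_width("Acct#034011234LastFirst", acct_width=4)

print(delimited)
print(right_width)
print(wrong_width)  # off by one character, and the "SSN" is no longer an SSN
```

With the delimiters gone, the width of the first field is the only context left. Get it wrong by a single character and every later field shifts with it, and even the correct width leaves the two names fused together.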

So what does this mean from a practical point of view? Without waxing philosophical, from an information security and protection standpoint, it is an extremely compelling reason to give serious consideration to Translucent Databases, which I will cover in a future post.
