What is randomization?
Privacy-Preserving Data Mining relies on the notion that one's personal data
can be protected by being scrambled or randomized prior to being communicated.
By applying this technique, a retailer could generate highly accurate data models
without ever seeing personal information.
Suppose a Web user decides to enter a piece of personal data, e.g., age, salary,
or weight. Upon entry, that number, say an age of 30, is immediately scrambled, or
'randomized', by IBM software: the software takes the number that was entered and
adds a random value to it (the value may be positive or negative).
This randomization step is performed independently for every user who opts
to enter their age. So, a 30-year-old's age may be randomized to 42, while a
34-year-old's entry may be randomized to 28. The randomization differs for every
single user. What does not change is the allowed range of the randomization,
and that range is directly linked to the desired level of privacy.
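
The randomization step described above can be sketched in a few lines of Python.
This is only an illustration: the text does not say how the random value is
drawn, so the uniform distribution, the function name randomize, and the
parameter name spread (the allowed range, here plus or minus 20) are all
assumptions.

```python
import random

def randomize(value, spread, rng=random):
    """Return value plus a random offset in [-spread, +spread].

    Assumption: the offset is drawn uniformly; the source only says that
    a random value within an allowed range is added or subtracted.
    """
    return value + rng.uniform(-spread, spread)

# Each user's entry is randomized independently, with the same allowed range.
rng = random.Random(0)  # seeded only to make the sketch repeatable
ages = [30, 34]
randomized = [randomize(age, spread=20, rng=rng) for age in ages]
print(randomized)
```

Only the randomized values would be communicated; the original ages never
leave the user's machine in this sketch.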
A larger randomization range increases the uncertainty about each value and thus
the personal privacy of the users. At the same time, however, a larger range can
reduce the accuracy of the results ultimately produced by a data mining algorithm
that uses the randomized data as input.
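
One rough way to see this trade-off is to compare different ranges on made-up
data. The sketch below assumes uniform noise (the text does not specify the
distribution) and uses the error of a simple average as a stand-in for
"accuracy"; the names noise_std and mean_estimate_error, and the synthetic
ages, are hypothetical.

```python
import math
import random

def noise_std(spread):
    # Standard deviation of uniform noise on [-spread, +spread]:
    # a wider range means more uncertainty about any single value.
    return spread / math.sqrt(3)

def mean_estimate_error(true_values, spread, rng):
    # Absolute error of the average when only randomized values are seen.
    noisy = [v + rng.uniform(-spread, spread) for v in true_values]
    return abs(sum(noisy) / len(noisy) - sum(true_values) / len(true_values))

rng = random.Random(1)  # seeded only to make the sketch repeatable
ages = [rng.randint(18, 80) for _ in range(1000)]  # synthetic data
for spread in (5, 20, 50):
    print(spread, round(noise_std(spread), 2),
          mean_estimate_error(ages, spread, rng))
```

The per-value uncertainty grows in direct proportion to the range, while the
error of an aggregate such as the average tends to grow with it as well,
although any single run can deviate from that trend.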
Once the randomized data has been collected from a large number of users, the
privacy-preserving data mining software takes the distribution of the randomized
values and reconstructs an estimate of what the true distribution looked like.
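
The text does not say how the reconstruction is carried out. One possible
approach, shown here purely as a sketch, is an iterative Bayesian update over a
fixed list of candidate values. It assumes uniform noise with a known range;
the function reconstruct, its parameters, and the synthetic data are all
hypothetical.

```python
import random

def reconstruct(randomized, candidates, spread, iterations=30):
    """Estimate P(true value = c) for each c in candidates.

    Assumes uniform noise on [-spread, +spread], so the noise density is
    constant inside the range and a 0/1 indicator is enough.
    """
    def possible(w, c):
        return 1.0 if abs(w - c) <= spread else 0.0

    probs = [1.0 / len(candidates)] * len(candidates)  # flat starting guess
    for _ in range(iterations):
        updated = [0.0] * len(candidates)
        for w in randomized:
            # Posterior over candidates for this one randomized value.
            weights = [p * possible(w, c) for p, c in zip(probs, candidates)]
            total = sum(weights)
            if total == 0.0:
                continue  # no candidate can explain this value
            for i, wt in enumerate(weights):
                updated[i] += wt / total
        norm = sum(updated)
        if norm == 0.0:
            break  # nothing could be explained; keep the previous estimate
        probs = [u / norm for u in updated]
    return probs

rng = random.Random(0)  # seeded only to make the sketch repeatable
spread = 20
true_ages = [rng.choice([25, 30, 35, 60]) for _ in range(500)]  # synthetic
randomized = [a + rng.uniform(-spread, spread) for a in true_ages]
candidates = list(range(0, 101, 5))
estimate = reconstruct(randomized, candidates, spread)
for c, p in zip(candidates, estimate):
    print(c, round(p, 3))
```

The output is an estimated share of users at each candidate age, computed
without ever looking at true_ages; how close it comes to the real distribution
depends on the number of users and on the randomization range.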