IBM Research

Privacy-preserving data mining

 


Project overview
A data mining model that extracts statistical data while safeguarding personally identifiable data

Traditional data mining techniques require access to precise information in individual records, which prohibits their use in cases where privacy is an issue. This project, developed at IBM Almaden Research Center, allows users to randomize information in their records, then applies novel algorithms to compensate for the randomization at the aggregate level, thus preserving privacy at the individual level while still building accurate data mining models.

The goal of privacy-preserving data mining is to develop accurate models without access to precise information in individual data records, thus resolving the conflict between privacy and data mining.

Business value
Privacy-enhanced data mining offers a competitive advantage:
» Reduced privacy risk: Valuable statistics can be extracted without exposing personally identifiable information.
» Higher quality: Mining of personal data is limited by regulations. When data is anonymized first, regulations allow a larger set of data to be mined.
     
What is randomization?

Privacy-Preserving Data Mining relies on the notion that one's personal data can be protected by being scrambled or randomized prior to being communicated. By applying this technique, a retailer could generate highly accurate data models without ever seeing personal information.

A Web user decides to enter a piece of personal data, such as age, salary, or weight. Upon entry, that number (say, age 30) is immediately scrambled, or 'randomized', by IBM software: the software takes the original number that was input and adds or subtracts a random value.

This randomization step is performed independently for every user who opts to enter their age. So, a 30-year-old's age may be randomized to 42, while a 34-year-old's entry may be randomized to 28. The randomization differs for every single user. What does not change is the allowed range of the randomization, and that range is directly linked to the desired level of privacy.
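The randomization step can be illustrated with a minimal Python sketch. It is not the IBM software: the function name `randomize`, the `spread` parameter, and the choice of uniformly distributed noise are assumptions made for illustration, since the page only says that a random value within an allowed range is added or subtracted.

```python
import random

def randomize(value, spread, rng=random):
    """Return value plus a random offset drawn from [-spread, +spread].

    Assumption: the offset is uniformly distributed; the source text
    only states that a random value within an allowed range is used.
    """
    return value + rng.uniform(-spread, spread)

# Each user's entry is randomized independently, with the same allowed range.
rng = random.Random(0)  # seeded only so the sketch is repeatable
ages = [30, 34]
randomized = [randomize(age, spread=15, rng=rng) for age in ages]
```

A larger `spread` widens the range in which a user's true value could lie, which corresponds to a higher level of privacy.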

A larger randomization range increases the uncertainty, and therefore the personal privacy, of the users. At the same time, larger randomizations can reduce the accuracy of the results that are ultimately produced by a data mining algorithm that uses the randomized data as input.
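The trade-off can be made concrete with a small sketch, assuming (hypothetically) that the noise is uniform on [-spread, +spread]. Under that assumption, the width of the interval hiding each true value grows linearly with the spread, and so does the noise-induced standard error of an average computed from the randomized values. Both function names are illustrative.

```python
import math

def privacy_interval_width(spread):
    """Width of the interval in which a user's true value could lie."""
    return 2 * spread

def noise_std_error_of_mean(spread, n_users):
    """Standard error that the noise alone adds to an average over n_users.

    Uniform noise on [-spread, +spread] has variance spread**2 / 3, so the
    mean of n_users independent noise terms has standard deviation
    spread / sqrt(3 * n_users).
    """
    return spread / math.sqrt(3 * n_users)

# Doubling the spread doubles both the privacy interval and the error.
narrow = noise_std_error_of_mean(15, 1000)
wide = noise_std_error_of_mean(30, 1000)
```

The same formula shows why a large number of users helps: the error shrinks with the square root of `n_users` while each individual's privacy interval stays the same.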

Once the randomized data has been collected from a large number of users, the privacy-preserving data mining software takes the randomized distribution and reconstructs what the true distribution might have looked like.
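The page does not describe the reconstruction algorithm itself. One common way to approach this kind of problem is an iterative Bayesian update over binned values, sketched below. The function names, the bin midpoints, and the uniform noise model are all assumptions made for illustration, and the sketch should not be read as the IBM implementation.

```python
import random

def noise_density(offset, spread):
    """Density of uniform noise on [-spread, +spread] at the given offset."""
    return 1.0 / (2 * spread) if abs(offset) <= spread else 0.0

def reconstruct(randomized, midpoints, spread, iterations=30):
    """Estimate the share of true values in each bin from randomized values."""
    n_bins = len(midpoints)
    probs = [1.0 / n_bins] * n_bins  # start from a flat guess
    for _ in range(iterations):
        totals = [0.0] * n_bins
        used = 0
        for w in randomized:
            # Weight of each bin: chance the noise produced w from that bin,
            # times the current estimate of the bin's share.
            weights = [noise_density(w - m, spread) * p
                       for m, p in zip(midpoints, probs)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # no bin can explain this value; skip it
            used += 1
            for j, weight in enumerate(weights):
                totals[j] += weight / norm
        if used == 0:
            break
        probs = [t / used for t in totals]
    return probs

# Hypothetical demo: most users are in their twenties, some in their fifties.
rng = random.Random(1)
true_ages = [rng.uniform(20, 30) if rng.random() < 0.7 else rng.uniform(50, 60)
             for _ in range(1000)]
spread = 15
randomized = [age + rng.uniform(-spread, spread) for age in true_ages]
midpoints = [5 + 10 * j for j in range(10)]  # bins centered at 5, 15, ..., 95
estimated = reconstruct(randomized, midpoints, spread)
```

Only the randomized values and the noise range enter `reconstruct`; the individual true ages are never used, which is the point of the approach.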

   