Organizations increasingly need to differentiate themselves by how effectively they generate business insight. A key source of underused insight is the operational data that powers enterprises. Consolidating, enriching and analyzing that data has proven difficult for many reasons.

Solutions such as corporate data lakes have struggled to deliver on their promised benefits. This is due in part to uncertainty about how regulations such as GDPR affect the moving, combining and processing of data, and in part to the difficulty of desensitizing data without breaking its utility and integrity.

Our research looks at technology and methodologies for desensitizing data so that its utility is maximized while it remains compliant with a particular regulation.

High assurance desensitization

Our research into high assurance desensitization looks at technology for creating tokens or pseudonyms whose properties make them suitable for pseudonymizing complex data models, where data semantics and data integrity across distributed data sources need to be maintained.

The technology supports semantic-preserving and semantic-mapping encryption for stateless tokenization, or memory-backed HMAC tables for stateful tokenization. In both cases, hardware security modules can be used for key management, as required in highly regulated environments such as banking.
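
The difference between the two tokenization modes can be illustrated with a minimal Python sketch. It is not the technology described above: the function and class names are hypothetical, a truncated HMAC stands in for the semantic-preserving encryption (so, unlike encryption, the stateless token here cannot be reversed), and the key is generated in software rather than held in a hardware security module.

```python
import hashlib
import hmac
import secrets


def stateless_token(key: bytes, field: str, value: str) -> str:
    """Derive a token from the key and the value alone; nothing is stored.

    Assumption: HMAC-SHA256 is used as a stand-in for an encryption-based
    scheme. The same (field, value) pair always yields the same token, so
    joins across data sources still work.
    """
    message = f"{field}:{value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:32]


class StatefulTokenizer:
    """Hypothetical in-memory token table indexed by an HMAC of the value."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._token_by_index = {}  # HMAC(value) -> random token
        self._value_by_token = {}  # random token -> original value

    def tokenize(self, value: str) -> str:
        # The table is keyed by an HMAC so the lookup index is not the raw value.
        index = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).digest()
        token = self._token_by_index.get(index)
        if token is None:
            # Random token: it carries no information about the value.
            token = secrets.token_hex(16)
            self._token_by_index[index] = token
            self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._value_by_token[token]


# Assumption: in a regulated deployment this key would live in an HSM.
demo_key = secrets.token_bytes(32)
demo_tokenizer = StatefulTokenizer(demo_key)
```

In this sketch the stateless variant needs only the key to reproduce a token on any system, while the stateful variant needs access to the shared table but allows the original value to be recovered by lookup.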

Use cases

Helping enterprise data lakes or legacy systems to become GDPR-compliant.

Extracting business insights from production data.

Preparing data to be sent for cloud processing.

Unlinkable pseudonyms

In our ever more digital society, personal data is increasingly being collected, processed, maintained and exchanged in electronic form. When data collection and operations happen in a distributed fashion, it is often important that different data sets of the same user can be associated.

Typical examples of such distributed yet linkable data sets are health records or governmental databases. Many countries, including the US, Belgium, Denmark, and Sweden, use a nationwide social security number for linkage.

Although the use of such unique identifiers across the entire system easily allows the various entities to correlate their records, it poses serious risks to data security and user privacy.

For one thing, it is difficult, if not impossible, to control and limit the exchange of records between entities. Moreover, any data breach reveals fully identifiable and linkable personal information.

We are researching solutions to this problem in the form of entity-specific identifiers (called pseudonyms), which are unlinkable per se. For cases where there is a legitimate need to link a user’s various records, these pseudonyms are established via a central entity that can convert between pseudonyms of the same user on a case-by-case basis. This conversion is done in such a way that the converter learns nothing about the contents of what it is converting.
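
To make the idea of convertible, entity-specific pseudonyms more concrete, here is a toy Python sketch based on keyed modular exponentiation. Everything in it is an assumption made for illustration: the class and function names are hypothetical, the prime modulus is a convenience choice rather than a vetted parameter, and the construction is not the scheme under research. In particular, this converter sees the pseudonyms it converts, so the "learns nothing" property described above is not implemented here.

```python
import hashlib
import secrets
from math import gcd

# Assumption: 2**255 - 19 is a well-known prime, used only as a toy modulus.
P = 2**255 - 19


def hash_to_group(uid: str) -> int:
    """Map a user identifier to an integer in [2, P - 2]."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 3) + 2


def new_entity_key() -> int:
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 1)
        if k > 1 and gcd(k, P - 1) == 1:
            return k


class Converter:
    """Hypothetical central entity holding one secret exponent per entity."""

    def __init__(self) -> None:
        self._keys = {}

    def register(self, entity: str) -> None:
        self._keys[entity] = new_entity_key()

    def issue(self, uid: str, entity: str) -> int:
        # Pseudonym of uid at this entity: H(uid) ** k_entity mod P.
        return pow(hash_to_group(uid), self._keys[entity], P)

    def convert(self, nym: int, source: str, target: str) -> int:
        # Raising to k_target / k_source (mod P - 1) strips the source key
        # and applies the target key, without needing the user identifier.
        inverse = pow(self._keys[source], -1, P - 1)  # needs Python 3.8+
        exponent = (self._keys[target] * inverse) % (P - 1)
        return pow(nym, exponent, P)
```

In this sketch, two entities holding pseudonyms of the same user cannot relate them without the converter's keys, while the converter can map one pseudonym to the other on request without being given the user identifier or the associated records.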

Ask the experts

Michael Osborne

IBM Research scientist

Anja Lehmann

IBM Research scientist

Tamas Visegrady

IBM Research scientist