What is the Challenge?

Successful enterprise data processing systems tend to outgrow their well-managed walled gardens. This happens because they need additional resources and tools found outside the garden, e.g. those offered by a Cloud provider, or because they start to connect to other enterprise systems. Traditional metadata catalogs do not scale well to handle this organic growth, because they depend on static, predefined processes to manage and curate data within a well-defined perimeter. This makes enterprise data management an intractable problem, as the data is constantly being moved and processed between these heterogeneous systems by a large number of independently managed tools.

Significant value can be unlocked by gathering, linking and enriching metadata at enterprise scale. Costs can be reduced by, for example, identifying cold data, removing unnecessary data duplication or scheduling workflows better. New value-added services can be introduced with the guarantee that sensitive data is used as intended and is properly managed and protected.

This is analogous to the ways in which enterprises have unlocked hidden value in their data through big data processing. The processing of heterogeneous metadata at scale is in fact a type of big data problem, with the same classic five Vs: velocity, value, variety, volume and veracity.

Pathfinder data-centric enterprise metadata management

Pathfinder

Automated discovery and metadata collection

Extensive heterogeneous metadata sources

Metadata enriched and linked creating Data Map

Event-based notification of key changes

Supporting a wide range of data management applications

Pathfinder is an event-based system that dynamically collects metadata from data catalogs, data stores (databases, file systems) and systems that transform and process data. The metadata is linked, enriched and stored in data maps tailored for data-management applications.

Introducing the Enterprise Data Map

The metadata is extracted from source systems into a location where it is gathered, linked and enriched. This can be thought of as analogous to a data lake in a big data system. As most of the value in the metadata is gained through understanding the complex relationships between entities, it is stored as a graph that allows the entire data of the enterprise to be mapped. Questions about where data is stored, how those storage systems are protected and where data flows can then be answered at the level of the enterprise rather than of a single isolated processing platform. Just as in a classic data lake, metadata stored in the graph can be enriched by multiple independent processes that add new relationships and entities that were not in the raw data, e.g. that data pipelines resemble each other and could possibly be merged to save costs, or that data at a certain classification is not being properly handled, exposing the company to possible litigation. The map is open and scalable, meaning that advanced machine learning techniques can be brought to bear to extract latent information hidden in the complexity.
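To make the graph idea concrete, here is a minimal Python sketch of a data map, assuming a toy in-memory structure. The class `DataMap`, the relation names `stored_in` and `flows_to`, and all entity names are hypothetical illustrations and do not come from Pathfinder itself. It shows how a question such as "where does this data flow, and where do the copies end up?" becomes a graph traversal.

```python
from collections import defaultdict, deque


class DataMap:
    """Toy in-memory graph of metadata entities and typed relationships."""

    def __init__(self):
        self.entities = {}               # entity id -> attribute dict
        self.edges = defaultdict(list)   # source id -> [(relation, target id)]

    def add_entity(self, entity_id, **attrs):
        self.entities.setdefault(entity_id, {}).update(attrs)

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbours(self, source, relation):
        return [t for r, t in self.edges[source] if r == relation]

    def downstream(self, start, relation="flows_to"):
        """Breadth-first walk: every entity reachable from start via relation."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            node = queue.popleft()
            for target in self.neighbours(node, relation):
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


# Hypothetical entities: a table that is copied to a topic and then to an object.
data_map = DataMap()
data_map.add_entity("customers_table", kind="dataset")
data_map.add_entity("customers_topic", kind="dataset")
data_map.add_entity("customers_object", kind="dataset")
data_map.add_entity("warehouse_db", kind="datastore")
data_map.add_entity("archive_bucket", kind="datastore")
data_map.relate("customers_table", "stored_in", "warehouse_db")
data_map.relate("customers_object", "stored_in", "archive_bucket")
data_map.relate("customers_table", "flows_to", "customers_topic")
data_map.relate("customers_topic", "flows_to", "customers_object")

# Where does the table flow, and in which stores do its copies end up?
copies = data_map.downstream("customers_table")
stores = [s for c in copies for s in data_map.neighbours(c, "stored_in")]
print(copies)
print(stores)
```

An enrichment process in this picture would simply call `add_entity` or `relate` to add facts that no single source system reported. A production data map would presumably sit on a scalable graph store rather than Python dictionaries.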

We enable the creation of the Enterprise Data Map with a system called Pathfinder.

How Does Pathfinder Work?

We use a new event-based approach to data management that discovers what is actually occurring on the systems. Metadata is collected from heterogeneous data processing and storage systems, distributed through streaming, and combined and analyzed by specific data management enrichment processes. The metadata is extracted from sources by collectors, serialized into a graph of entities and relationships, and stored as a sequence of events in a change log. The events are generated and propagated in real time. The enrichment processes consume these events and then write new metadata back to the change log. The compacted log contains the canonical set of metadata for the enterprise and its evolution over time. It can be materialized in different processing systems for specific uses.
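The change-log pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Pathfinder's implementation: `ChangeLog`, `Event`, `enrich_cold_data` and the `days_since_last_read` attribute are all hypothetical names, and a plain list stands in for the streaming system. The sketch shows a collector appending events, an enrichment process consuming them and writing new metadata back, and compaction reducing the log to the latest state per key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    key: str                # identifies one entity or relationship
    value: Optional[dict]   # new state; None marks a deletion (tombstone)


class ChangeLog:
    """Append-only list of events; a stand-in for a streaming log."""

    def __init__(self):
        self.events = []

    def append(self, key, value):
        self.events.append(Event(key, value))

    def read_from(self, offset):
        return self.events[offset:]

    def compacted(self):
        """Keep only the latest value per key and drop deleted keys."""
        latest = {}
        for event in self.events:
            if event.value is None:
                latest.pop(event.key, None)
            else:
                latest[event.key] = event.value
        return latest


def enrich_cold_data(log, offset, threshold_days=365):
    """Hypothetical enrichment process: tags datasets as cold or not.

    Consumes events from `offset`, writes enriched metadata back to the
    same log, and returns the offset to resume from on the next run.
    """
    pending = {}
    for event in log.read_from(offset):
        pending[event.key] = event.value   # later events override earlier ones
    for key, value in pending.items():
        if value is not None and value.get("kind") == "dataset" and "cold" not in value:
            days = value.get("days_since_last_read", 0)
            log.append(key, {**value, "cold": days > threshold_days})
    return len(log.events)


log = ChangeLog()
# Events a collector might emit (invented values).
log.append("dataset:sales", {"kind": "dataset", "days_since_last_read": 400})
log.append("dataset:tmp", {"kind": "dataset", "days_since_last_read": 3})
log.append("dataset:tmp", None)   # the collector observed a deletion

next_offset = enrich_cold_data(log, 0)
state = log.compacted()
print(state)
```

Because the enrichment process writes to the same log it reads from, other processes can consume its output in exactly the same way, and any of them can rebuild the current state by replaying or compacting the log.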

Metadata Lake Architecture

Pathfinder Compliance Scenario

The screenshots made with the Pathfinder GUI and the accompanying explanations illustrate how a set of linked, enriched metadata can provide insight into a complex data compliance scenario.

End-to-end view
By collecting, linking, and enriching a rich set of metadata, we get a complete view of a test environment on IBM Cloud. Through our GUI we see the datasets (purple), the data stores (IBM Db2 database at the left, IBM Cloud Object Storage (COS) bucket at the right), the systems processing the data (orange), and the infrastructure on which this test scenario is running (Kubernetes on IBM Cloud in Germany and OpenShift on IBM Cloud in the US). The initial dataset is the table in the red circle, which is stored in a Db2 database on IBM Cloud in Germany. It gets copied into a Kafka topic (middle), and then finally into a COS bucket in the US.

Source dataset details
By selecting the details view for the source dataset, we can see a subset of the metadata associated with that data, including the data classification, Sensitive Personal Information (SPI). In our test setup, when this dataset is copied, there are no transformations that would cause this classification to change.

Target dataset details
Finally, we focus on the target dataset created in the COS bucket on IBM Cloud in the US. In the alerts view we see that this copy of the table has not been cataloged, and that there is a risk of violating the European Union's General Data Protection Regulation (GDPR), since sensitive personal information was copied from a data store in the European Union (Germany) to a data store outside the EU. From the metadata collection, we can see that this dataset is being processed in a Jupyter notebook in the US, so there is also a potential issue associated with any models created by incorporating that data.
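The two alerts in this scenario can be thought of as simple rules evaluated over linked metadata. The Python sketch below is only an illustration of that idea. The `Dataset` fields, the `alerts_for_copy` function and the dataset names are hypothetical, the country set is deliberately incomplete, and a real GDPR assessment involves far more than a country check.

```python
from dataclasses import dataclass

# Deliberately incomplete set of EU country codes, for illustration only.
EU_COUNTRIES = {"DE", "FR", "IT"}


@dataclass(frozen=True)
class Dataset:
    name: str
    classification: str   # e.g. "SPI" for Sensitive Personal Information
    country: str          # country of the data store holding the dataset
    cataloged: bool


def alerts_for_copy(source, target):
    """Return alert messages for a copy of `source` that produced `target`."""
    alerts = []
    if not target.cataloged:
        alerts.append(f"{target.name}: not cataloged")
    leaves_eu = source.country in EU_COUNTRIES and target.country not in EU_COUNTRIES
    if source.classification == "SPI" and leaves_eu:
        alerts.append(
            f"{target.name}: SPI copied from {source.country} to {target.country}, "
            "possible GDPR risk"
        )
    return alerts


# Hypothetical stand-ins for the Db2 table in Germany and its copy in the US.
source = Dataset("db2_table", "SPI", "DE", cataloged=True)
target = Dataset("cos_object", "SPI", "US", cataloged=False)

alerts = alerts_for_copy(source, target)
for alert in alerts:
    print(alert)
```

The rule can only fire because the classification of the source, the location of both data stores and the copy relationship are linked in one place. No single source system holds all of these facts.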

The rich collection of metadata, giving us an end-to-end view of datasets, where they are stored, how they are processed, and many other details, enables a wide range of data management applications that address issues faced by all global organizations.

News & Publications

S. Rooney, L. Garcés-Erice, D. Bauer, P. Urbanetz,
“Pathfinder: Building the Enterprise Data Map,”
Big Data, 2021.

EnterpriseDataMap Blog, “Metadata as Big Data - Introducing Pathfinder”

Daniel Bauer

Niels Pardon

Enrico Toniato

Peter Urbanetz