Scientific documents such as papers, reports and patents, as well as other professional documents such as financial or medical reports, very often include numerous diagrams. These diagrams illustrate, in graphical form, the data sets that explain, describe or emphasize the textual content of those documents.

Such data sets can be generated by experiments, measurements, observations or other means, and are depicted in these technical diagrams so that the reader can extract the message they convey quickly and efficiently.

We aim to develop image analytics that builds semantic understanding from technical diagrams.

—Maria Gabrani, IBM scientist

With the emergence of Internet searches and archival storage, together with the speed at which new scientific documents are constantly being created, it is of great value to have tools that can not only scan numerous documents and extract their main scientific information automatically, but also present this information in a concise and meaningful way.

However, for a document to be completely and thoroughly analyzed, its diagrams also need to be processed in order to extract the key information presented by the depicted data sets.

The problem is that such graphs are typically stored as bitmap images, the data sets are often very noisy, and the graphic symbols used to depict the data, such as lines, markers and labelling, often overlap, intersect or otherwise override each other.

Cognitive analysis for extracting knowledge from graphics in documents

At IBM Research – Zurich, we are developing computational techniques based on image processing and machine learning to extract the data sets — and in turn the information they represent — automatically. From the taxonomy of various diagrams, we are currently focusing on line and scatter plots.
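
As a rough illustration of what extracting a data set from a scatter plot involves, the sketch below finds marker blobs in a tiny binary bitmap and converts their pixel centroids to data coordinates. It is not the method used at IBM Research: the function names (`marker_centroids`, `pixel_to_data`), the toy bitmap and the axis calibration are all assumptions made for illustration, and the example ignores the noise and overlap problems described above.

```python
from collections import deque


def marker_centroids(bitmap):
    """Return the (row, col) centroid of each 4-connected blob of 1-pixels."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill collects every pixel of this blob.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((cy, cx))
    return centroids


def pixel_to_data(pixel, pixel_range, data_range):
    """Linearly map a pixel coordinate onto a (linear) data axis."""
    (p0, p1), (d0, d1) = pixel_range, data_range
    return d0 + (pixel - p0) * (d1 - d0) / (p1 - p0)


# Hypothetical 5x6 bitmap: one 2x2 marker and one single-pixel marker.
bitmap = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Assumed calibration: columns 0..5 span x = 0..10; rows 4..0 span y = 0..8
# (image rows grow downward, so the row range is reversed).
points = [
    (pixel_to_data(cx, (0, 5), (0.0, 10.0)), pixel_to_data(cy, (4, 0), (0.0, 8.0)))
    for cy, cx in marker_centroids(bitmap)
]
```

In a real diagram the axis calibration itself would have to be recovered from tick marks and labels, and overlapping markers would not separate into clean blobs, which is where the machine-learning component becomes necessary.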

Conventional information retrieval focuses on text. Our challenge is to extract non-textual information from a variety of sources.

Numerical data

Information in scientific diagrams and tables is quantitative and often not contained in text.

Codified data

Information is often codified in diagrams. This is common in clinical studies.

Multimodality

Text + charts + tables = Far more complete information.

Error correction

Numerical data within documents can be used to error-correct extracted text and numerical data.
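
One simple way to picture this kind of cross-checking is to compare a value read from the text with the same quantity recovered from a chart, and flag disagreements for review. The sketch below is a hypothetical illustration only; the function name `cross_check` and the 5% tolerance are assumptions, not part of the project described here.

```python
import math


def cross_check(text_value, chart_value, rel_tol=0.05):
    """Return True when two independently extracted values agree within rel_tol."""
    return math.isclose(text_value, chart_value, rel_tol=rel_tol)


# A chart reading of 12.3 supports a text value of 12.5 ...
agrees = cross_check(12.5, 12.3)
# ... while 72.5 suggests an extraction error (for example a misread digit).
conflicts = not cross_check(12.5, 72.5)
```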

Context

Numerical data can be used across related documents for context and comparison.

Business impact

Cognitive businesses base their decision-making process on insights extracted from the vast amount of available data.

Visual comprehension

Computers are blind to graphics in documents.

Data extraction

A tool that can automatically scan through a surfeit of documents, extract the main scientific information, and present it in a concise and meaningful way is of great value.

Comparison

Comparison between data sets across different documents enables

  • Competitive study analysis,
  • Establishment of trends.

Error correction

Increases confidence in extracted information.

Automatic graph generation

Valuable addition to document image analysis, document understanding and information retrieval in digital documents.

Ultimate goal

Semantic understanding of technical diagrams.

Challenges

Diversity

Diagram taxonomy:
10 types of diagrams plus flowcharts

Variability

The amount of data acquired from public sources is huge

Complexity

Overlapping, overwriting, noisy

Lack of truth

No labels, no exact values