What is predictive maintenance?

Predictive maintenance helps determine the condition of in-service equipment in order to predict when maintenance should be performed. Its main promise is just-in-time scheduling of corrective maintenance, which alleviates the major drawbacks of both reactive and preventive maintenance.

The key is “the right information at the right time.” When it is known which equipment needs maintenance, work can be scheduled better, and costly unplanned shutdowns are replaced by fewer, shorter planned shutdowns. This improves equipment reliability, minimizes potential data loss and reduces maintenance costs. In addition, it extends the useful lifetime of equipment and optimizes the handling of spare parts.

Unlike preventive maintenance, predictive maintenance relies on the actual condition of the equipment — rather than on expected end-of-life statistics — to predict the future trend of the equipment’s condition. The condition of the equipment is evaluated by periodic or continuous non-invasive monitoring while the equipment is in service, thus minimizing disruption of normal operations. Statistical models are used to detect even minor anomalies and failure patterns and to determine at what point in the future maintenance action will be required.
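As a minimal sketch of the idea in the preceding paragraph, the following Python fits a straight line to a monitored condition indicator and extrapolates when it would cross a maintenance threshold. The linear degradation model, the function names and the threshold are illustrative assumptions, not the statistical models described here.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("need at least two paired readings")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("readings must cover at least two distinct times")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_until_threshold(
    times: Sequence[float], values: Sequence[float], threshold: float
) -> Optional[float]:
    """Estimate how long after the last reading the fitted trend reaches
    `threshold`. Returns None when the indicator is not rising, and 0.0
    when the fitted trend has already reached the threshold."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None  # no upward degradation trend to extrapolate
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - times[-1])
```

For readings 1.0, 2.0, 3.0 and 4.0 taken at times 0 to 3 and a threshold of 10.0, the fitted line rises by one unit per time step, so the estimate is 6 time units after the last reading.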

Industry solutions

At IBM, we apply predictive maintenance solutions to any type of equipment, ranging from IT components to cash machines, wind turbines or even aircraft. For instance, real-time telemetry data can be used to predict the remaining useful life of an aircraft engine or the failure of electric pumps used for extraction in the oil and gas industry. The same approach can even be used to forecast energy demand in power grids.

Ultimately, our goals are to predict where, when and why failures are likely to occur, and to quickly identify their primary drivers as part of the analysis process.


Use case

Predictive maintenance for IT equipment

In recent years, downtime costs for IT equipment have increased significantly to thousands of dollars per minute, especially in data centers. To maintain reliability when a piece of equipment experiences a failure, sophisticated defense and redundancy mechanisms have been put in place, such as clusters for servers, RAID for storage or disaster recovery environments.

The use of such mechanisms raises the question of whether predictive maintenance is still needed for IT equipment. From a business perspective, it offers two major benefits. First, even if an IT component is redundant, replacing it still requires downtime, and planned downtime is always preferable to unplanned downtime because it reduces the risk of consecutive, correlated failures, which can lead to loss of redundancy. Second, it takes time to rebuild redundancy after a replacement. Therefore, it is worthwhile to understand the risk of multiple failures.
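To make the risk of multiple failures concrete, the following Python sketch estimates the probability that at least one more component fails while redundancy is being rebuilt. It assumes independent components with a constant failure rate (exponential lifetimes); the function name and parameters are hypothetical. Because the failures discussed above are often correlated, the result should be read as an optimistic lower bound rather than a realistic estimate.

```python
import math


def prob_additional_failure(
    remaining_units: int, mtbf_hours: float, window_hours: float
) -> float:
    """Probability that at least one of `remaining_units` components fails
    within `window_hours`, assuming independent exponential lifetimes with
    mean time between failures `mtbf_hours` (an illustrative assumption)."""
    if remaining_units < 0 or mtbf_hours <= 0 or window_hours < 0:
        raise ValueError("invalid parameters")
    expected_failures = remaining_units * window_hours / mtbf_hours
    # 1 - exp(-x), computed with expm1 to stay accurate for very small x
    return -math.expm1(-expected_failures)
```

Under these assumptions, doubling the rebuild window roughly doubles the risk when the probability is small, which is one reason why the time needed to rebuild redundancy matters.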


IT example 1

Predictive maintenance for DRAM

Failures of dynamic random access memory (DRAM), although rare, represent a major concern due to potential data loss. These failures are usually preceded by memory errors, which accumulate over time. Memory errors are events that lead to the logical state of one or multiple bits being read differently from how they were last written. They can be caused by electrical or magnetic interference or by hardware problems, or they can result from corruption along the data path.

If errors corrupt bits randomly, without leaving any physical damage, they can be handled through error correction codes implemented by the manufacturer. However, if errors corrupt the same bits repeatedly, they become uncorrectable and lead to failures and eventually to the replacement of the affected dual in-line memory module. Such errors impact machine uptime and data availability, and incur significant labor costs and revenue loss.
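The distinction between random and repeated errors can be illustrated with a small Python sketch that counts error events per memory location and flags locations hit more than once. The event format (a module identifier paired with a cell address), the function name and the repeat limit are hypothetical; real error logs and the criteria for uncorrectable errors depend on the platform.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical event record: (module identifier, cell address)
ErrorEvent = Tuple[str, int]


def repeated_error_locations(
    events: Iterable[ErrorEvent], min_repeats: int = 2
) -> List[ErrorEvent]:
    """Return, in sorted order, the locations that appear at least
    `min_repeats` times in the error log. Locations seen only once are
    treated as transient errors and are not reported."""
    counts = Counter(events)
    return sorted(loc for loc, n in counts.items() if n >= min_repeats)
```

For a log containing two errors at address 0x10 of one module and single errors elsewhere, only the 0x10 location is returned, marking it as a candidate for closer monitoring.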

Our focus is to build intelligent statistical models that can predict uncorrectable errors just in time (days or weeks in advance). This enables us to reach two primary goals. First, it allows support delivery teams to schedule the maintenance work while minimizing operating costs. Second, it extends the lifetime of the DRAM for as long as possible before maintenance needs to happen. Our models are based on a deep understanding of which other error types and sensor metrics are indicative of uncorrectable errors and how their trends affect the occurrence of these errors.


IT example 2

Predictive maintenance for disks

Disks are among the most frequently failing components in IT environments. Even with defense and redundancy mechanisms such as RAID in place, correlated failures of disks in the same unit occur quite frequently. Factors such as temperature, duty cycles or intensive workloads affect both the reliability and the performance of disks and can in turn lead to failures.

Disk failures can be either predictable or unpredictable. On the one hand, unpredictable failures — ranging from electronic components becoming defective to sudden crashes due to improper handling — cannot be foreseen by monitoring. On the other hand, predictable failures mainly appear due to wear-and-tear that typically progresses over months or years. These latter failures can be tackled by predictive failure analysis.

Disk manufacturers already use SMART monitoring metrics in embedded predictive models. However, these models are typically threshold-based and designed to avoid false alarms, and therefore have weak predictive power. At IBM Research – Zurich, we study in depth how SMART metrics evolve over time, understand their degradation patterns and use this knowledge to build sophisticated statistical models that can predict failures with high accuracy.
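The difference between a threshold rule and a trend-based view can be sketched in a few lines of Python. The example compares a fixed limit on a SMART counter, such as the reallocated sector count, with a check on how fast the counter has grown recently. The function names, the limit, the window and the sample values are all illustrative assumptions, and the trend rule is a deliberately simple stand-in for the statistical models mentioned above.

```python
from typing import Sequence


def threshold_alarm(values: Sequence[float], limit: float) -> bool:
    """Conservative rule: alarm only once the latest reading reaches `limit`."""
    return values[-1] >= limit


def trend_alarm(values: Sequence[float], window: int, max_rate: float) -> bool:
    """Trend rule: alarm when the average increase per reading over the
    last `window` readings exceeds `max_rate`."""
    if window < 2 or len(values) < window:
        return False  # not enough history to estimate a rate
    recent = values[-window:]
    rate = (recent[-1] - recent[0]) / (window - 1)
    return rate > max_rate


# Made-up daily readings of a counter that is starting to accelerate
counts = [0, 0, 1, 3, 6, 10]
print(threshold_alarm(counts, limit=50))           # limit not reached yet
print(trend_alarm(counts, window=4, max_rate=1.0))  # growth is already steep
```

With these sample values the threshold rule stays silent because the counter is far below its limit, while the trend rule fires because the counter grew by three per reading on average over the last four readings.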

Use case

Predictive maintenance for ATMs

Automated teller machines (ATMs) are electronic telecommunication devices that enable customers of financial institutions to perform cash withdrawals and deposits as well as many other bank-related functions. ATM functions often require a complex interplay of multiple individual components, such as the card reader, dispenser and receipt printer. As those components usually consist of very delicate electronic and mechanical parts, they tend to fail due to high utilization, harsh climate conditions or misuse. In many cases, a single broken component can render an ATM inoperable, which may not only lead to dissatisfied customers but also cause damage to the financial institution.

Predictive maintenance for ATMs is designed to assess the health status and to model the aging of individual components. Predicting component failures not only allows efficient scheduling of maintenance, but also increases the availability of ATMs. As predictive maintenance for ATMs allows cost savings over routine or time-based preventive maintenance, it has become an increasingly important topic for most banks.

Our services research group applies machine learning and time series analysis to build predictive models for ATMs and other Internet-of-Things devices. By putting predictive models into production, we help our customers increase device availability, decrease maintenance costs and identify the root cause of outages.

Ask the experts

Ioana Giurgiu

IBM Research scientist

Roy Assaf

Post-doctoral researcher