Modern data storage systems are extremely large, comprising tens to hundreds of storage nodes. In such systems, node failures are daily events, and safeguarding data against them poses a serious design challenge.
Data redundancy, in the form of replication or advanced erasure codes, is used to protect data from node failures. Because redundant data is stored across several nodes, the copies held on surviving nodes can be used to rebuild the data lost on failed nodes.
As these rebuild processes take time to complete, additional nodes may fail while a rebuild is still in progress. This may eventually lead to a situation in which some of the data is irrecoverably lost from the system.
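To make the rebuild-window risk concrete, the following is a minimal sketch that estimates the chance of at least one further node failure before a rebuild finishes. It assumes independent, exponentially distributed node lifetimes, which is a common modeling simplification and not something stated above; the function name and parameters (`n_nodes`, `mttf_hours`, `rebuild_hours`) are hypothetical.

```python
import math


def prob_failure_during_rebuild(n_nodes, mttf_hours, rebuild_hours):
    """Probability that at least one surviving node fails during a rebuild.

    Assumption (illustrative only): node lifetimes are independent and
    exponentially distributed with mean `mttf_hours`.
    """
    survivors = n_nodes - 1
    # Aggregate failure rate of all surviving nodes, in failures per hour.
    aggregate_rate = survivors / mttf_hours
    return 1.0 - math.exp(-aggregate_rate * rebuild_hours)


# Hypothetical numbers: 100 nodes, 100,000-hour mean time to failure.
slow = prob_failure_during_rebuild(100, 100_000, rebuild_hours=10.0)
fast = prob_failure_during_rebuild(100, 100_000, rebuild_hours=0.5)
print(f"10 h rebuild:  {slow:.4%}")
print(f"0.5 h rebuild: {fast:.4%}")
```

Under this assumption the probability grows with both the number of surviving nodes and the rebuild duration, which is why shortening the rebuild is a natural lever for reliability.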
Our activities in storage system reliability investigate novel methods to address issues encountered in large-scale storage installations. In particular, we focus on the occurrence of data loss and on methods to improve reliability without sacrificing performance and storage efficiency.
We have shown that spreading the redundant data corresponding to the data on each node across a larger number of other nodes, combined with a distributed and intelligent rebuild process, increases the system’s mean time to data loss (MTTDL) and reduces its expected annual fraction of data loss (EAFDL).
In particular, declustered placement, which spreads the redundant data corresponding to each node equally across all other nodes of the system, can achieve significantly higher MTTDL and lower EAFDL values than other placement schemes, especially for large storage systems.
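The difference between placement schemes can be illustrated with a small sketch for two-way replication. It is a toy model only: the declustered layout here stores one block per unordered pair of nodes, and the comparison layout (called "clustered" here) groups nodes into mirrored pairs. The function names and both layouts are assumptions made for illustration, not the schemes evaluated in the work described above.

```python
from collections import Counter
from itertools import combinations


def declustered_placement(n_nodes):
    """One block per unordered node pair, so replicas spread over all nodes."""
    return list(combinations(range(n_nodes), 2))


def clustered_placement(n_nodes, blocks_per_pair):
    """Nodes form mirrored pairs (0,1), (2,3), ...; n_nodes must be even."""
    return [
        (i, i + 1)
        for i in range(0, n_nodes, 2)
        for _ in range(blocks_per_pair)
    ]


def critical_blocks_per_survivor(placement, failed):
    """Count, per surviving node, the blocks that lost a replica on `failed`."""
    counts = Counter()
    for a, b in placement:
        if a == failed:
            counts[b] += 1
        elif b == failed:
            counts[a] += 1
    return dict(counts)


n = 6
print(critical_blocks_per_survivor(declustered_placement(n), failed=0))
print(critical_blocks_per_survivor(clustered_placement(n, n - 1), failed=0))
```

In the declustered layout every survivor holds an equal share of the blocks that lost a replica, so all survivors can take part in the rebuild. In the mirrored layout a single node holds all of them.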
We have also developed enhanced recovery schemes for geo-replicated cloud storage systems, where network bandwidth between sites is typically scarcer than bandwidth within a site and can become a bottleneck for recovery operations.
Figure 1. Example of the distributed rebuild model for a two-way replicated system. When one node fails, the critical data blocks are equally spread across the n-1 surviving nodes. The distributed rebuild process creates replicas of these critical blocks by copying them from one surviving node to another in parallel.
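A back-of-the-envelope sketch of why the distributed rebuild in Figure 1 shortens the rebuild window is given below. The bandwidth model is an assumption introduced here: each node has a fixed rebuild bandwidth, and during a distributed rebuild each survivor splits that bandwidth evenly between reading its own critical blocks and writing the replicas it receives. The function names, parameters, and numbers are hypothetical.

```python
def rebuild_time_single_source(capacity_tb, bandwidth_tb_per_h):
    """Baseline: all critical data sits on one node and is copied from it."""
    return capacity_tb / bandwidth_tb_per_h


def rebuild_time_distributed(capacity_tb, bandwidth_tb_per_h, n_nodes):
    """Distributed rebuild across the n-1 survivors, working in parallel.

    Assumption (illustrative only): each survivor devotes half of its
    bandwidth to reading critical blocks and half to writing new replicas.
    """
    survivors = n_nodes - 1
    critical_per_node_tb = capacity_tb / survivors
    return critical_per_node_tb / (bandwidth_tb_per_h / 2)


# Hypothetical system: 10 TB per node, 0.5 TB/h rebuild bandwidth, 101 nodes.
print(rebuild_time_single_source(10, 0.5))      # hours
print(rebuild_time_distributed(10, 0.5, 101))   # hours
```

Under these assumptions the distributed rebuild time shrinks roughly in proportion to the number of surviving nodes, which narrows the window in which a further failure could hit critical data.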