Performance and reliability

Modern data storage systems are extremely large and consist of several tens or hundreds of storage nodes. In such systems, node failures are daily events, and safeguarding data against them poses a serious design challenge. Data redundancy, in the form of replication or advanced erasure codes, is used to protect data from node failures. Because redundant data is stored across several nodes, the redundant data on surviving nodes can be used to rebuild the data lost on failed nodes. As these rebuild processes take time to complete, additional nodes may fail while a rebuild is in progress. This may eventually lead to a situation in which some of the data is irrecoverably lost from the system.
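To make the risk of a further failure during rebuild concrete, the following minimal sketch computes the mean time to data loss for the simplest possible case: a single mirrored pair of nodes, modeled as a three-state Markov chain. The failure rate `lam`, the rebuild rate `mu`, the assumption of exponentially distributed times, and the function name are illustrative assumptions, not values or models taken from the publications listed on this page.

```python
def mttdl_mirrored_pair(lam: float, mu: float) -> float:
    """Mean time to data loss of two mirrored nodes (illustrative model).

    Assumptions: each node fails independently at rate lam, a failed node
    is rebuilt at rate mu, and all times are exponentially distributed.
    States: both nodes up -> one node up (rebuilding) -> data lost.
    """
    if lam <= 0 or mu < 0:
        raise ValueError("lam must be positive and mu non-negative")
    # Mean absorption times T0 (both up) and T1 (one up) satisfy
    #   T0 = 1 / (2 * lam) + T1
    #   T1 = 1 / (lam + mu) + mu / (lam + mu) * T0
    # Solving for T0 gives the closed form below.
    return (3 * lam + mu) / (2 * lam ** 2)


if __name__ == "__main__":
    lam = 1 / 10_000.0  # hypothetical: one failure per 10,000 hours per node
    for rebuild_hours in (1.0, 10.0, 100.0):
        mu = 1 / rebuild_hours
        mttdl = mttdl_mirrored_pair(lam, mu)
        print(f"rebuild {rebuild_hours:6.1f} h -> MTTDL {mttdl:.3e} h")
```

In this simple model, a faster rebuild (larger `mu`) directly increases the mean time to data loss, which is why the duration of the rebuild process matters.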

Our activities in storage system reliability investigate novel methods to address issues encountered in large-scale storage installations. In particular, we focus on the occurrence of data loss and on methods to improve reliability without sacrificing performance or storage efficiency.

We have shown that spreading the redundant data for each node across a larger number of other nodes, combined with a distributed and intelligent rebuild process, improves the system's mean time to data loss (MTTDL). In particular, declustered placement, which spreads the redundant data for each node equally across all other nodes of the system, can achieve a significantly higher MTTDL than other placement schemes, especially in large storage systems.
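A rough way to see why spreading redundancy can help is to compare how long a rebuild takes under the two extremes. The sketch below is a back-of-the-envelope model, not the analysis from the publications: it assumes replication, a fixed per-node rebuild bandwidth, a clustered scheme in which a single surviving partner copies the whole lost node, and a declustered scheme in which all `n - 1` surviving nodes rebuild in parallel while splitting their bandwidth evenly between reading and writing. All function names and numbers are hypothetical.

```python
def rebuild_hours_clustered(node_data_tb: float, bandwidth_tb_per_h: float) -> float:
    """One surviving partner copies the whole lost node (assumed model)."""
    return node_data_tb / bandwidth_tb_per_h


def rebuild_hours_declustered(n: int, node_data_tb: float,
                              bandwidth_tb_per_h: float) -> float:
    """All n - 1 survivors rebuild in parallel (assumed model).

    Each survivor handles node_data_tb / (n - 1) of the lost data and is
    assumed to use half of its bandwidth for reading and half for writing.
    """
    if n < 2:
        raise ValueError("need at least two nodes")
    share = node_data_tb / (n - 1)
    return share / (bandwidth_tb_per_h / 2)


if __name__ == "__main__":
    data_tb, bw = 12.0, 0.5  # hypothetical node size and rebuild bandwidth
    print(f"clustered: {rebuild_hours_clustered(data_tb, bw):.2f} h")
    for n in (4, 16, 64):
        hours = rebuild_hours_declustered(n, data_tb, bw)
        print(f"declustered, n={n:3d}: {hours:.2f} h")
```

A shorter rebuild narrows the window in which further failures can cause data loss. Under declustered placement, however, more nodes share data with the failed node, so rebuild time alone does not determine the MTTDL; the sketch illustrates only one side of that trade-off.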

Selected publications

  1. W. Bux, X.-Y. Hu, I. Iliadis, and R. Haas,
    "Scheduling in Flash-Based Solid-State Drives - Performance Modeling and Optimization,"
    Proc. of the 20th Annual IEEE Int’l Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Washington, DC, pp. 459-468, August 2012.
  2. V. Venkatesan and I. Iliadis,
    "A General Reliability Model for Data Storage Systems,"
    International Conference on Quantitative Evaluation of Systems (QEST) 2012.
  3. V. Venkatesan, I. Iliadis, and R. Haas,
    "Reliability of Data Storage Systems under Network Rebuild Bandwidth Constraints,"
    IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2012.
  4. V. Venkatesan, I. Iliadis, C. Fragouli, and R. Urbanke,
    "Reliability of Clustered vs. Declustered Replica Placement in Data Storage Systems,"
    IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2011. Winner of Best Paper award.
  5. V. Venkatesan, I. Iliadis, X.-Y. Hu, R. Haas, and C. Fragouli,
    "Effect of Replica Placement on the Reliability of Large-Scale Data Storage Systems,"
    IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2010.
  6. I. Iliadis, R. Haas, X.-Y. Hu, and E. Eleftheriou,
    "Disk Scrubbing Versus Intradisk Redundancy for RAID Storage Systems,"
    ACM Trans. Storage, vol. 7, no. 2, pp. 1-42, July 2011.
  7. A. Dholakia, E. Eleftheriou, X.-Y. Hu, I. Iliadis, J. Menon, and KK Rao,
    "A New Intra-disk Redundancy Scheme for High-Reliability RAID Storage Systems in the Presence of Unrecoverable Errors,"
    ACM Trans. Storage, vol. 4, no. 1, pp. 1-42, May 2008.