Increasingly large storage subsystems must be enhanced to protect against errors.

—Haris Pozidis, IBM scientist

As digital data storage becomes increasingly vast and sensitive, particularly for businesses, new challenges are emerging in ensuring reliable and secure data availability. Storage subsystems must be enhanced to protect against the errors that can occur in increasingly large storage systems, while sustaining the enormous throughput and low latency offered by solid-state storage.

Future storage systems must scale to enable Big Data analytics and cognitive applications while still being cost-efficient by storing data according to its value. This requires flexible and easy-to-use, multi-tiered storage systems that incorporate technologies such as flash, hard-disk drives and magnetic tapes. Furthermore, caching technologies and flash memory management in virtualized and distributed storage environments must be advanced.

New storage-class memory technologies such as phase change memory (PCM) must be integrated into existing storage stacks.

All-flash arrays

Solid-state persistent memory such as flash has been introduced into the enterprise environment because it improves on disk in several respects, most notably I/O performance and power efficiency. To reduce the total cost, triple-level-cell (TLC) flash memory technology is typically employed today instead of the conventional single-level-cell (SLC) and multi-level-cell (MLC) technologies, at the cost of considerably lower reliability and a modest latency penalty.
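The density/reliability trade-off can be made concrete with a back-of-the-envelope sketch: each extra bit per cell doubles the number of threshold-voltage levels that must fit within the same voltage window, shrinking the margin between adjacent levels. The 6 V window used below is an illustrative assumption, not a measured device parameter.

```python
# Illustrative only: how bits per cell affect the number of
# threshold-voltage levels and the nominal spacing between them.

def levels_and_margin(bits_per_cell, voltage_window=6.0):
    """Return (number of levels, nominal spacing between adjacent levels).

    voltage_window is a hypothetical usable threshold-voltage range in volts.
    """
    levels = 2 ** bits_per_cell
    margin = voltage_window / (levels - 1)
    return levels, margin

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3)]:
    levels, margin = levels_and_margin(bits)
    print(f"{name}: {levels} levels, ~{margin:.2f} V between adjacent levels")
```

The shrinking inter-level margin is what makes TLC more sensitive to noise and distortion, and hence less reliable, than SLC or MLC at the same process node.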

The ever-increasing storage density of NAND flash memory devices requires significant advances in flash management and signal processing to address growing endurance, retention, and data-integrity issues. Furthermore, new non-volatile memory (NVM) technologies such as phase change memory (PCM) are expected to induce significant changes from the server and storage architectures up to the middleware and application design as they are introduced into the existing storage/memory hierarchy.

Our activities focus on the advanced use of solid-state NVMs in enterprise-class systems. We are designing and evaluating holistic approaches to sustained high I/O operation rates and low latency, as well as to error detection and correction from the low-level data block up to the array level. In addition, we are investigating the potential for synergies between the various layers (devices, controller, file systems, virtualized systems, and applications).

Advanced flash management

Our mission is to enable the latest and next generations of NAND flash technologies and new emerging non-volatile memory (NVM) technologies for enterprise storage systems through advanced flash management schemes. These schemes can be placed at appropriate locations inside a solid-state storage array, on top of existing consumer-grade storage devices, or a combination thereof. We construct intelligent flash management functions that exploit the increasing spread of device characteristics at the page, block, and chip level, as well as the uneven wear-out of flash blocks and cells (which can be workload-induced or driven by the garbage-collection algorithms), thereby achieving optimal wear leveling.
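As an illustration of this idea, the toy allocator below bins blocks by a measured health metric (raw bit-error rate here; lower is healthier) and steers hot data to the healthiest blocks so that all blocks reach end-of-life at roughly the same time. The class, its two-bin policy, and the numbers are hypothetical simplifications, not actual flash-management code.

```python
# Toy sketch of a health-binning placement policy.

class HealthBinningAllocator:
    def __init__(self, block_health):
        # block_health: {block_id: raw_bit_error_rate}; lower BER = healthier.
        ranked = sorted(block_health.items(), key=lambda kv: kv[1])
        mid = len(ranked) // 2
        self.healthy = ranked[:mid]   # best half of the blocks -> hot data
        self.worn = ranked[mid:]      # worst half -> cold data

    def allocate(self, data_is_hot):
        pool = self.healthy if data_is_hot else self.worn
        if not pool:                  # fall back if the preferred bin is empty
            pool = self.healthy or self.worn
        block_id, _ = pool.pop(0)
        return block_id

alloc = HealthBinningAllocator({0: 3e-3, 1: 1e-3, 2: 9e-3, 3: 5e-3})
print(alloc.allocate(data_is_hot=True))   # healthiest block (lowest BER) first
```

Writing hot (frequently rewritten) data to healthy blocks and cold data to worn blocks equalizes wear, instead of the conventional approach of equalizing program/erase cycle counts.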

We focus on techniques that do not impact the data-path processing of host read and write operations and that achieve the lowest possible latency characteristics throughout the lifetime of the storage device. We are further designing and evaluating data reduction schemes such as compression and deduplication to improve the overall cost per gigabyte of storage capacity and to reduce write amplification. These techniques lead to a significant increase in overall metadata, for which we are investigating adequate management architectures that combine current and next-generation volatile and NVM technologies. We utilize findings from large-scale characterization of existing non-volatile memory devices combined with different approaches, including modeling, simulations, and evaluation on real flash cards and SSDs.
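A minimal sketch of how inline data reduction cuts physical writes: chunks are fingerprinted, duplicates are dropped, and unique chunks are compressed before being written. A real array would add reference counting and persistent metadata; all names here are illustrative.

```python
import hashlib
import zlib

# Toy inline data-reduction pipeline: dedupe by content hash, then
# compress what remains. Not a production design.

class ReducingStore:
    def __init__(self):
        self.by_hash = {}          # fingerprint -> compressed chunk
        self.logical_writes = 0
        self.physical_bytes = 0

    def write_chunk(self, chunk):
        self.logical_writes += 1
        fp = hashlib.sha256(chunk).digest()
        if fp not in self.by_hash:           # new content: compress and store
            comp = zlib.compress(chunk)
            self.by_hash[fp] = comp
            self.physical_bytes += len(comp)
        return fp                            # metadata maps LBA -> fingerprint

store = ReducingStore()
block = b"A" * 4096
store.write_chunk(block)
store.write_chunk(block)   # duplicate content: no new physical write
print(store.logical_writes, len(store.by_hash), store.physical_bytes)
```

Because duplicate and compressible data never reaches the medium, fewer physical pages are programmed per logical write, which directly lowers write amplification.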

Flash signal processing

To guarantee high degrees of data integrity and availability, enterprises must cope with the reliability degradation that comes with continued technology-node shrinkage and the use of MLC/TLC technology. We are designing signal-processing and coding algorithms and schemes that enhance the reliability of MLC/TLC NAND flash memory and thus enable its use in enterprise storage systems and servers. Our work includes advanced characterization and testing of flash memory chips to assess their raw performance and to extract and understand the various noise and distortion sources present in the writing and reading processes.

We are also developing comprehensive models of the NAND flash channel based on experimental data. These models are then used to guide the design of advanced signal processing schemes to mitigate the effects of such impairments as cell-to-cell interference, program and read disturb and distribution shifts due to cycling and/or data retention.

Error correction codes are integral modules of flash controllers in storage systems. Historically, BCH codes have been used to correct errors in flash chips. However, the error-correcting power required of these codes has been increasing with every flash technology generation. The industry is quickly approaching a regime of diminishing performance gains in return for large increases in complexity, and thus silicon area and cost. In an effort to reverse this trend, alternative approaches to ECC design have recently been introduced in flash. These approaches are typically geared towards the use of soft information. However, extracting soft information from NAND flash chips requires multiple read operations and thus increases latency, which is at a premium for enterprise applications.
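The latency cost of soft information follows directly from how it is obtained: each additional read at a slightly shifted threshold narrows down the voltage region a cell falls in, and that region maps to a log-likelihood ratio (LLR) for a soft-decision decoder. The sketch below illustrates this with assumed Gaussian level distributions; it is not a description of any particular controller.

```python
import math
from statistics import NormalDist

# Assumed (illustrative) threshold-voltage distributions for bit=1 / bit=0.
LO, HI = NormalDist(1.0, 0.2), NormalDist(2.0, 0.2)

def region_llr(thresholds, above):
    """LLR of the voltage region pinned down by a sequence of reads.

    above[i] is True if the cell read as being above thresholds[i].
    """
    lo_edge, hi_edge = -math.inf, math.inf
    for thr, is_above in zip(thresholds, above):
        if is_above:
            lo_edge = max(lo_edge, thr)   # cell lies above this threshold
        else:
            hi_edge = min(hi_edge, thr)   # cell lies below this threshold
    p1 = LO.cdf(hi_edge) - (LO.cdf(lo_edge) if lo_edge > -math.inf else 0.0)
    p0 = HI.cdf(hi_edge) - (HI.cdf(lo_edge) if lo_edge > -math.inf else 0.0)
    return math.log(p1 / p0)

# One read yields only a sign (hard decision); three reads also yield a
# magnitude, i.e. how trustworthy the bit is, at 3x the read latency.
hard = region_llr([1.5], [False])
soft = region_llr([1.4, 1.5, 1.6], [True, False, False])
print(f"hard LLR: {hard:.2f}, borderline-cell soft LLR: {soft:.2f}")
```

The magnitude information is what lets soft-decision codes (e.g. LDPC-style decoders) outperform hard-decision BCH, but every extra sensing pass adds directly to read latency.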

Our work is geared towards addressing all the tradeoffs involved in selecting proper coding schemes and verifying their correction performance, which are critical tasks for the controller design in flash-based storage systems.

Software-defined storage

As data is becoming the world’s new natural resource, the capability to store and get value out of it becomes critical to the success of businesses and organizations. To that end, Software-Defined Storage (SDS) plays a key role by offering the required flexibility, scalability, cost efficiency and agility. SDS decouples storage functions from hardware and implements all the storage system intelligence in software that can run on general-purpose, off-the-shelf hardware components, as well as on virtualized cloud resources. SDS enables storage systems to be shaped and sized to best fit the needs of particular use cases and workloads. Client APIs enable different pieces of storage infrastructure to be managed as a single entity and automate provisioning, policies and monitoring.

Data is becoming the world’s new natural resource.

—Ioannis Koltsidas, IBM scientist

Our approach is not limited to prototype systems inspired by forward-thinking ideas. We aim to develop practical SDS systems that can be used in real-world production environments. Our research spans multiple types of SDS systems, including block storage, file storage, object and NoSQL storage on top of diverse storage media such as Flash, phase-change memory, disk and magnetic tape. We build on open storage protocols such as NVMe, NVMe over Fabrics and OpenCAPI, open storage formats such as LTFS, and new storage access paradigms such as direct user-space I/O. Using such building blocks, we are developing systems for both enterprise data centers and Cloud environments.

Software-defined NVMe Flash

The NVM Express (NVMe) family of protocols and interfaces introduces exciting new opportunities for software-defined storage systems. Replacing legacy protocols designed for HDDs, PCIe NVMe drives enable more direct access to storage. Software systems can take advantage of NVMe to reach unprecedented levels of performance scalability and CPU efficiency and to achieve extremely low-latency access. NVMe over Fabrics, the version of the protocol for network fabrics, extends these capabilities across the network, enabling access to remote storage resources over the same interface with almost the same performance as local, direct-attached storage.

Our research leverages these new interfaces and technologies to build the next generation of Flash-based storage systems. Our goal is to revisit storage system architectures and redesign storage functions in ways that enable composable systems to be built from disaggregated storage resources with low latency and high performance scalability. We strive to separate the control-path functions cleanly from the data-path functions in order to give storage clients more control when accessing storage resources. The scope of our research extends beyond raw performance to rich data services, cost-efficient storage techniques and workload-optimized data mobility.

SALSA
Unified SDS for low-cost SSDs and SMR disks

As data volumes continue to grow, cost-efficient storage devices are becoming increasingly important. Two prime examples of such devices are low-cost Flash SSDs and shingled magnetic recording (SMR) HDDs. Low-cost commodity SSDs offer ample read performance with high IOPS and low latency. However, they suffer from poor performance under mixed read/write workloads and poor endurance. SMR disks feature significant cost benefits over traditional HDDs. However, they require that specific write patterns be adopted, which introduces additional complexity and performance variation for general-purpose workloads.

SoftwAre Log-Structured Array (or SALSA for short) is a unified software stack optimized for low-cost SSDs and SMR HDDs. SALSA uses software intelligence to mitigate the limitations of commodity devices.

By shifting the complexity from the hardware controller of the devices to software running on the host, SALSA not only reduces costs, but also takes advantage of the ample host resources to manage the device resources more effectively. For Flash-based SSDs, SALSA elevates their performance and endurance to meet the requirements of modern data centers. For host-managed SMR HDDs, SALSA offers a conventional block interface and controls the data placement on the devices to improve their read and write performance. SALSA provides redundancy, storage virtualization and data reduction, which allow the user to pool multiple devices and create storage volumes with improved performance, reliability and cost. Most importantly, SALSA exposes a standard block interface so that it can be used by file systems and applications with no modification.
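The core log-structured mechanism can be sketched in a few lines: writes append to the tail of a log, a mapping table records where each logical block currently lives, and the count of still-live blocks per segment is what a greedy garbage collector would use to pick victims. This toy omits garbage collection itself and everything else a real system such as SALSA implements; it is not SALSA's actual code.

```python
# Toy log-structured translation layer: append-only writes plus a
# logical-to-physical map. Assumes free segments remain (no GC here).

SEG_BLOCKS = 4

class LogStructuredArray:
    def __init__(self, num_segments):
        self.segments = [[] for _ in range(num_segments)]  # physical log
        self.map = {}                  # lba -> (segment, offset)
        self.open = 0                  # segment currently being filled

    def write(self, lba, data):
        seg = self.segments[self.open]
        if len(seg) == SEG_BLOCKS:                 # segment full: open next
            self.open = (self.open + 1) % len(self.segments)
            seg = self.segments[self.open]
        seg.append((lba, data))
        self.map[lba] = (self.open, len(seg) - 1)  # old copy becomes garbage

    def read(self, lba):
        s, off = self.map[lba]
        return self.segments[s][off][1]

    def live_count(self, s):
        # Blocks still referenced by the map; a greedy GC would reclaim
        # the segment with the fewest live blocks first.
        return sum(1 for off, (lba, _) in enumerate(self.segments[s])
                   if self.map.get(lba) == (s, off))

lsa = LogStructuredArray(num_segments=3)
for i in range(4):
    lsa.write(lba=i, data=f"v{i}")
lsa.write(lba=0, data="v0-new")   # overwrite: old copy of block 0 is garbage
print(lsa.read(0), lsa.live_count(0))
```

Because every host write becomes a sequential append, random-write-intolerant media (low-cost SSDs, host-managed SMR zones) only ever see the access pattern they handle well.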

NoSQL key/value storage and caching

Cloud and mobile applications employ data models that are vastly different from traditional enterprise ones. Storing and retrieving key/value (K/V) pairs has become one of the most pervasive data models because it affords simplicity, generality and scalability.

Our research focuses on technologies that enable fast, efficient and cost-effective NoSQL K/V storage on NVMe-attached storage media such as Flash-based and 3DXP-based SSDs. We have developed uDepot (pronounced “micro-depot”), a K/V storage and caching engine that offers microsecond-latency access to storage.

uDepot is an NVMe-optimized K/V store that has been built from the ground up to be lean, scalable and efficient. To that end, uDepot implements a new I/O access paradigm that facilitates zero-copy data transfers, polling-based I/O request completion, and user-space I/O that avoids system calls and context switches in the data path, while minimizing the end-to-end I/O amplification both in terms of the number of I/O operations and the number of bytes read and written. uDepot can be used either as a K/V store that is embedded in the application and runs in the application context, or as a scale-out distributed K/V cache that can be accessed using the Memcache protocol.
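The indexing approach behind such engines can be illustrated with a minimal key/value store in the same spirit (though not uDepot's actual implementation): values go to an append-only log with one sequential write each, and an in-memory hash table maps each key to the offset and length of its latest value, so a lookup costs a single storage read.

```python
import os
import tempfile

# Teaching sketch: append-only value log plus an in-memory hash index.

class TinyKV:
    def __init__(self, path):
        self.log = open(path, "ab+")
        self.index = {}                        # key -> (offset, length)

    def put(self, key, value):
        self.log.seek(0, os.SEEK_END)
        off = self.log.tell()
        self.log.write(value)                  # one sequential log append
        self.index[key] = (off, len(value))    # tiny in-memory index update

    def get(self, key):
        off, length = self.index[key]          # index hit: no extra I/O
        self.log.seek(off)
        return self.log.read(length)           # a single read from the log

    def close(self):
        self.log.close()

with tempfile.TemporaryDirectory() as d:
    kv = TinyKV(os.path.join(d, "values.log"))
    kv.put(b"key", b"hello")
    kv.put(b"key", b"world")   # update appends; index points to the new copy
    print(kv.get(b"key").decode())
    kv.close()
```

Keeping the data path to one device access per operation is what allows the end-to-end latency to approach that of the NVMe medium itself.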

Distributed shared storage
For large-scale scientific computing

The Human Brain Project, a flagship project funded by the European Commission, aims at understanding the human brain through advanced simulation and multi-scale modelling. Distributed shared storage (DSS) serves as a network-attached, shared data store for large-scale distributed human brain simulations. It brings distributed, network-attached NVMe storage close to the application. Using remote direct memory access (RDMA) and direct user-space storage access technologies, applications achieve high-speed, low-latency, byte-granular access to a unified shared storage pool.


In a distributed setup running on IBM Power® systems with direct-attached NVMe drives connected via a 100-Gbit/s InfiniBand® fabric, the DSS prototype demonstrates a storage access throughput of tens of millions of I/O operations per second (IOPS) while sustaining a cumulative I/O bandwidth of tens of gigabytes per second (GB/s). As it bypasses the legacy block I/O layering of the operating system, DSS is able to deliver the low response times of NVMe-attached drives at the distributed application level. Distributed shared storage will soon become available as open-source software.

Host-side Flash-based caching

In a typical enterprise IT environment, where servers store data in one or more SAN storage systems, caching technologies in the server are critical to achieve low-latency and high-throughput data access.

We are studying systems in which the servers use solid-state storage devices based on Flash and newer memory technologies for caching data from the SAN. We are developing a novel caching framework that exploits synergies between servers and storage. The system employs advanced caching algorithms to identify which data on each server is hot (accessed often by applications running on that server) and which data is cold (not frequently accessed). The hot data is stored in the local caches of the server so that it can be served to applications and users at very low latency.
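A toy version of such a hot/cold split is sketched below, assuming a simple admit-on-second-access filter and LRU eviction; the products' actual algorithms are more sophisticated, and all names are illustrative.

```python
from collections import Counter, OrderedDict

# Toy host-side cache: a frequency filter decides which SAN blocks are
# hot enough to admit to the local Flash cache; LRU order handles eviction.

class HostCache:
    def __init__(self, capacity, admit_after=2):
        self.capacity = capacity
        self.admit_after = admit_after
        self.freq = Counter()           # per-block access counts
        self.cache = OrderedDict()      # block -> data, in LRU order

    def access(self, block, fetch_from_san):
        if block in self.cache:                    # hot hit: serve locally
            self.cache.move_to_end(block)
            return self.cache[block]
        data = fetch_from_san(block)               # miss: go to the SAN
        self.freq[block] += 1
        if self.freq[block] >= self.admit_after:   # proven hot: admit
            self.cache[block] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)     # evict coldest entry
        return data

cache = HostCache(capacity=2)
san_reads = []
fetch = lambda b: san_reads.append(b) or f"data{b}"
for b in [7, 7, 7, 9]:
    cache.access(b, fetch)
print(len(san_reads))   # block 7 is admitted after 2 misses; its 3rd access hits
```

The admission filter keeps one-off ("cold") accesses from polluting the Flash cache, which matters both for hit rate and for the endurance of the caching device.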

Our research focuses on high performance and high scalability in all aspects of the system, but also addresses such aspects as reliability and endurance. Our caching technology has been integrated into the IBM DS8000® Easy Tier Server® and IBM AIX® 7.2 Flash Cache products, and our implementation for the Linux® platform has been released under the iostash open-source project.


Software-defined cold storage

Cold-storage technologies are becoming critical for dealing with exploding data volumes, and tape storage is the most promising technology when it comes to storing vast amounts of data for the long term with high reliability and low cost. In the past few years, the Linear Tape File System (LTFS) has introduced a standardized open format for data stored on tape, and open-source implementations have enabled users to access tape using non-proprietary components.

Our research builds on LTFS and focuses on making tape-based storage as user-friendly as possible. We are developing the data-path and control-path components that enable users to read data from and write data to tape in a completely transparent way. In other words, we aim to spare the user the task of managing robotics, tape drives and tape cartridges.
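Conceptually, the transparent data path works like stub-and-recall tiering: a migrated file leaves a small stub behind in the disk file system, and a read through the stub triggers a recall from tape before the data is returned. The sketch below is a hypothetical illustration, not IBM Spectrum Archive's interface.

```python
# Conceptual stub-and-recall sketch; all names are illustrative.

class TieredFS:
    def __init__(self, tape):
        self.tape = tape        # tape_id -> bytes (stand-in for an LTFS tier)
        self.disk = {}          # path -> data, or a ("STUB", tape_id) marker

    def migrate(self, path, tape_id):
        self.tape[tape_id] = self.disk[path]
        self.disk[path] = ("STUB", tape_id)   # data leaves disk; stub remains

    def read(self, path):
        entry = self.disk[path]
        if isinstance(entry, tuple) and entry[0] == "STUB":
            _, tape_id = entry
            self.disk[path] = self.tape[tape_id]   # transparent recall
            entry = self.disk[path]
        return entry            # the caller never sees the stub

fs = TieredFS(tape={})
fs.disk["/data/report"] = b"cold results"
fs.migrate("/data/report", tape_id="T001")
print(fs.read("/data/report"))   # recalled from tape on demand
```

Because the namespace entry never disappears, applications keep using ordinary file operations while the data itself moves between the disk and tape tiers.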

We aim to make tape-based storage as user-friendly as possible.

—Ioannis Koltsidas, IBM scientist

Our work, which forms the core of IBM Spectrum Archive®, enables enterprise file systems to use a tape backend as a bottomless cold tier in a completely transparent manner. In addition, we focus on object-based cold storage, i.e., enabling OpenStack Swift to use tape storage transparently. Our Swift High-Latency Middleware provides that capability in a way that can be extended to other types of high latency media as well.

Publications

[1] Roman A. Pletka, Sasa Tomic,
“Health-Binning: Maximizing the Performance and the Endurance of Consumer-Level NAND Flash,”
SYSTOR, 4:1-4:10, 2016.

[2] T. Parnell, C. Duenner, T. Mittelholzer, N. Papandreou,
“Capacity of the MLC NAND Flash Channel,”
IEEE J. Selected Areas in Communications, Special Issue on Channel Modeling, Coding and Signal Processing for Novel Physical Memory Devices and Systems, 34(9) 2354-2365, 2016.

[3] N. Papandreou, T. Parnell, T. Mittelholzer, H. Pozidis, T. Griffin, G. Tressler, T. Fisher, C. Camp,
“Effect of read disturb on incomplete blocks in MLC NAND flash arrays,”
in Proc. IEEE Int’l Memory Workshop (IMW), Paris, France, 2016.

[4] T. Mittelholzer, T. Parnell, N. Papandreou, H. Pozidis,
“Improving the error-floor performance of binary half-product codes,”
in Proc. Int’l Symposium on Information Theory and its Applications (ISITA), Monterey, CA, 2016.

[5] Ilias Iliadis, Yusik Kim, Slavisa Sarafijanovic, Vinodh Venkatesan,
“Performance Evaluation of a Tape Library System,”
MASCOTS, 59-68, 2016.

[6] Ilias Iliadis, Jens Jelitto, Yusik Kim, Slavisa Sarafijanovic, Vinodh Venkatesan,
“ExaPlan: Queueing-Based Data Placement and Provisioning for Large Tiered Storage Systems,”
MASCOTS, 218-227, 2015.

[7] T. Mittelholzer, T. Parnell, N. Papandreou, H. Pozidis,
“Symmetry-based subproduct codes,”
in Proc. 2015 IEEE Int’l Symposium on Information Theory (ISIT), pp. 251-255, 2015.

[8] T. Parnell, C. Duenner, T. Mittelholzer, N. Papandreou, H. Pozidis,
“Endurance limits of MLC NAND flash,”
in Proc. 2015 IEEE Int’l Conference on Communications (ICC), pp. 376-381, 2015.

[9] T. Parnell, N. Papandreou, T. Mittelholzer, H. Pozidis,
“Performance of cell-to-cell interference mitigation in 1y-nm MLC flash memory,”
in Proc. 15th Non-Volatile Memory Technology Symposium (NVMTS), pp. 1-4, 2015.

[10] N. Papandreou, T. Parnell, H. Pozidis, T. Mittelholzer, E. Eleftheriou, C. Camp, T. Griffin, G. Tressler, A. Walls,
“Enhancing the Reliability of MLC NAND Flash Memory Systems by Read Channel Optimization,”
ACM Transactions on Design Automation of Electronic Systems (TODAES) 20(4), 62, 2015.

[11] Ioannis Koltsidas, Slavisa Sarafijanovic, Martin Petermann, Nils Haustein, Harald Seipp, Robert Haas, Jens Jelitto, Thomas Weigold, Edwin R. Childers, David Pease, Evangelos Eleftheriou,
“Seamlessly integrating disk and tape in a multi-tiered distributed file system,”
ICDE, 1328-1339, 2015.

[12] Nikolas Ioannou, Ioannis Koltsidas, Roman Pletka, Sasa Tomic, Radu Stoica, Thomas Weigold, Evangelos Eleftheriou,
“SALSA: Treating the Weaknesses of Low-Cost Flash in Software,”
Non-Volatile Memories Workshop, 2015.

[13] T. Parnell,
“Flash Controller Design: Enabling Sub-20nm Technology and Beyond,”
in Proc. Int’l Memory Workshop (IMW), Taipei, Taiwan, 2014.

[14] N. Papandreou, T. Parnell, H. Pozidis, T. Mittelholzer, E. Eleftheriou, C. Camp, T. Griffin, G. Tressler, A. Walls,
“Using Adaptive Read Voltage Thresholds to Enhance the Reliability of MLC NAND Flash Memory Systems,”
in Proc. 24th ACM Great Lakes Symp. on VLSI (GLSVLSI), Houston, TX, 2014.

[15] I. Iliadis,
“Rectifying pitfalls in the performance evaluation of flash solid-state drives,”
Performance Evaluation 79, 235-257, 2014.

[16] Sangeetha Seshadri, Paul Muench, Lawrence Chiu, Ioannis Koltsidas, Nikolas Ioannou, Robert Haas, Yang Liu, Mei Mei, Stephen Blinick,
“Software Defined Just-in-Time Caching in an Enterprise Storage System,”
IBM Journal of Research and Development 58(2/3), 2014.

[17] T. Parnell, N. Papandreou, T. Mittelholzer, H. Pozidis,
“Modelling of the threshold voltage distributions of sub-20nm NAND flash memory,”
in Proc. IEEE Global Communications Conference (GLOBECOM), pp. 2351-2356, 2014.

[18] Hyojun Kim, Ioannis Koltsidas, Nikolas Ioannou, Sangeetha Seshadri, Paul Muench, Clement L. Dickey, Lawrence Chiu,
“How Could a Flash Cache Degrade Database Performance Rather Than Improve It? Lessons to be Learnt from Multi-Tiered Storage,”
INFLOW, 2014.

[19] Hyojun Kim, Ioannis Koltsidas, Nikolas Ioannou, Sangeetha Seshadri, Paul Muench, Clement L. Dickey, Lawrence Chiu,
“Flash-Conscious Cache Population for Enterprise Database Workloads,”
ADMS@VLDB, 45-56, 2014.

[20] N. Papandreou, Th. Antonakopoulos, U. Egger, A. Palli, H. Pozidis, E. Eleftheriou,
“A Versatile Platform for Characterization of Solid-State Memory Channels,”
in Proc. 2013 IEEE 18th Int’l Conf. on Digital Signal Processing (DSP 2013), 2013.

[21] W. Bux, X.-Y. Hu, I. Iliadis, R. Haas,
“Scheduling in Flash-Based Solid-State Drives — Performance Modeling and Optimization,”
in Proc. 20th Annual IEEE Int’l Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Washington, DC, pp. 459-468, 2012.

[22] P. Bonnet, L. Bouganim, I. Koltsidas, S.D. Viglas,
“System Co-Design and Data Management for Flash Devices,”
Proc. VLDB Endowment 4(12), 37th Int’l Conf. on Very Large Data Bases (VLDB 2011), Seattle, WA, pp. 1504-1505, 2011.

[23] X.-Y. Hu, R. Haas, E. Eleftheriou,
“Container Marking: Combining Data Placement, Garbage Collection and Wear Levelling for Flash,”
in Proc. 2011 IEEE 19th Int’l Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2011), Singapore, pp. 237-247, 2011.

[24] I. Koltsidas, S.D. Viglas,
“Data Management over Flash Memory,”
in Proc. 2011 Int’l Conf. on Management of Data (SIGMOD), Athens, Greece, pp. 1209-1212, 2011.

[25] I. Koltsidas, S.D. Viglas,
“Spatial Data Management over Flash Memory,”
in “Advances in Spatial and Temporal Databases,” Proc. 12th Int’l Symp. on Spatial and Temporal Databases (SSTD 2011), Minneapolis, MN, Lecture Notes in Computer Science, vol. 6849 (Springer), pp. 449-453, 2011.

[26] I. Koltsidas, S.D. Viglas,
“Designing a Flash-Aware Two-Level Cache,”
in Proc. Advances in Databases and Information Systems (ADBIS), Vienna, Austria, Lecture Notes in Computer Science, vol. 6909 (Springer), pp. 153-169, 2011.

[27] W. Bux, I. Iliadis,
“Performance of Greedy Garbage Collection in Flash-Based Solid-State Drives,”
Performance Evaluation 67(11), 1172-1186, 2010.

[28] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, R. Pletka,
“Write Amplification Analysis in Flash-Based Solid State Drives,”
in Proc. Israeli Experimental Systems Conference (SYSTOR 2009), Haifa, Israel, Article 10, 2009.