Tiered storage for big data

A dominant characteristic of modern HPC and enterprise workloads is the continuous, rapid growth of the data volumes that organizations need to store and process. Traditional business models are transforming, with more and more companies and organizations basing mission-critical aspects of their business on the ability to store and process vast amounts of information. In other words, for many companies big data is becoming the core of their business.

These developments pose new challenges to storage systems. Firstly, companies want to protect all their data continuously and for a long time, while optimizing storage costs and access requirements for the potential value of the data. Secondly, in contrast to traditional backup and archive systems, today's so-called Active Archives are repositories for the purpose of monetizing or preserving the intrinsic value of the data. This means that all data must always be online (available), e.g. for direct customer access or analytics applications. Thirdly, when repositories scale to billions of files, traditional backup and disaster recovery approaches hit their scalability limits (e.g. meta-data growth, increasing backup/recovery windows, etc.). Finally, companies are increasingly building on open standards to protect their investments.

GLUES system

To tackle these challenges we are designing and building a GPFS-LTFS integration system that integrates tapes formatted in accordance with the Linear Tape File System (LTFS) standard into the General Parallel File System (GPFS) as a tape storage tier for migration and backup. GPFS is IBM's highly scalable disk cluster file system. Seamlessly integrating LTFS tapes into GPFS makes tape look like disk, makes it easy to use, and creates a common namespace across disk and tape. Flexible migration policies allow administrators to optimize cost, access time, and power consumption by moving data between disk and tape. When it comes to big data, tape is the most efficient storage medium whenever applications can live with the resulting access latency. GPFS-LTFS integration makes tape easy to use and scales disk clusters to truly big active archives at low cost.
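As a rough sketch of what such a migration policy might look like, GPFS's built-in policy language can define an external pool for the tape tier and a rule that moves cold files to it. The pool names, script path, thresholds, and age cutoff below are hypothetical illustrations, not the actual configuration of the system described here:

```
/* Declare the LTFS tape tier as an external pool, serviced by a
   (hypothetical) migration script provided by the integration layer. */
RULE EXTERNAL POOL 'ltfstape'
     EXEC '/opt/glues/bin/ltfs-migrate'

/* When the disk pool exceeds 80% utilization, migrate files not
   accessed for 30 days, oldest first, until it drops to 60%. */
RULE 'tier_to_tape' MIGRATE
     FROM POOL 'system'
     THRESHOLD(80,60)
     WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))
     TO POOL 'ltfstape'
     WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
```

By tuning the threshold and age criteria, an administrator can trade off disk cost and power consumption against the access latency incurred when a file has to be recalled from tape.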

Key features of the GPFS-LTFS integration approach are:

  • Global namespace: Common global namespace across disk and tape at GPFS level.

  • Multi-node and multi-library support: Multiple GPFS nodes and multiple tape libraries across multiple locations can be connected.

  • Open format on portable media: Increased flexibility due to LTFS as an open standard.

  • Simplified infrastructure: Tapes remain self-contained, including all meta-data.

  • Disaster recovery and import/export: Global namespace can be recreated quickly from the meta-data on the tapes.

  • Flexibility and scalability: Cost/performance efficiency can be adjusted by disk/tape ratio and migration policies. System scales with the number of nodes, drives, and tapes.

  • Simplified tape management: Makes tape management transparent to the user by handling cartridge pooling, reclamation, reconciliation, resource scheduling, replicas, fill policies, etc.

  • Scalable backup: Stores backup meta-data on disk and also on the LTFS tapes.