Data volumes are growing at annual rates of 50% or more, while the cost of storing them must stay at a manageable level. Meeting both demands requires efficient data storage, including compression and data deduplication, as well as careful placement of data on the best-suited storage medium. To find the optimal data placement across storage tiers (e.g. SSD, disk, tape), as shown at right in terms of cost and performance, it is crucial to understand how the data is accessed. Knowledge of these access patterns can be used not only for data placement but also for intelligent caching and prefetching of data, which was one of the research goals of the DOME project.
Magnetic tape is the storage medium that offers by far the lowest cost of ownership for long-term storage. An analysis of the tape market showed that most of today’s tape drive and media characteristics (capacity, throughput, longevity, cost, etc.) satisfy customer requirements for use cases such as backup and archival storage. However, there is strong demand for improvements in tape manageability and usability, for example a non-proprietary, simple, and cost-effective integration of tape storage into the tiered storage hierarchy or even into cloud storage systems. This seamless tape integration can be achieved by using the open Linear Tape File System (LTFS) format developed by IBM together with IBM’s General Parallel File System (GPFS).
Such integration enables tape systems to play an important role in active archives. In an active archive, data can be seamlessly migrated to the most appropriate storage tier (e.g. SSD, HDD, tape), and it remains online and accessible to users from every tier through a common file system that presents all tiers in a single name space. Big Data analytics has become a significant driver of large storage capacity requirements and of the demand for highly optimized and responsive storage systems. At the same time, data analytics can also be an enabler for optimized data management and storage systems.
In the field of tiered archival storage, we address new storage requirements posed by companies and organizations whose operations and mission-critical businesses depend on the ability to store and process vast amounts of data efficiently and cost-effectively:
- Data needs to be easily available through a standard interface and via a single name space.
- Data needs to be protected continuously and stored for a long time.
- Storage costs and access requirements need to be optimized based on time-varying data usage or value.
- The system should scale to a very large number of files or data objects.
A storage system that satisfies these requirements is also called a scalable active archive. Our research on this topic focuses on integrating solid-state drive, disk, and tape tiers under a single name space, and on providing additional management functions for moving data between the tiers. To provide a single name space, reliability, scalability, and data management, we leverage IBM’s General Parallel File System (GPFS) technology and OpenStack Swift. To add a reliable, low-cost storage tier, we integrate the open-standard Linear Tape File System (LTFS) technology.
Tiered storage combines different types of storage media, preferably under a single name space and using a standard interface, equipped with data lifecycle management functions for migrating data between different storage tiers.
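The idea of a lifecycle management function can be illustrated with a minimal Python sketch. It is not the GPFS or LTFS policy mechanism; the `FileRecord` type, the `target_tier` and `plan_migrations` functions, and the 7-day and 90-day thresholds are all hypothetical and chosen only for illustration. The sketch assigns each file to a tier based on how recently it was accessed and lists the migrations needed, while the file path in the single name space stays unchanged.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SSD = "ssd"
    HDD = "hdd"
    TAPE = "tape"


@dataclass
class FileRecord:
    path: str               # path in the single name space (does not change on migration)
    tier: Tier              # tier that currently holds the data
    days_since_access: int  # age of the most recent access, in days


# Hypothetical thresholds, assumed for illustration only.
HOT_DAYS = 7    # accessed within the last week: place on SSD
WARM_DAYS = 90  # accessed within roughly three months: place on disk


def target_tier(record: FileRecord) -> Tier:
    """Pick a tier from access recency alone (a deliberately simple rule)."""
    if record.days_since_access <= HOT_DAYS:
        return Tier.SSD
    if record.days_since_access <= WARM_DAYS:
        return Tier.HDD
    return Tier.TAPE


def plan_migrations(records):
    """Return (path, source, destination) for every file on the wrong tier."""
    plan = []
    for record in records:
        destination = target_tier(record)
        if destination is not record.tier:
            plan.append((record.path, record.tier, destination))
    return plan


if __name__ == "__main__":
    files = [
        FileRecord("/archive/a.dat", Tier.SSD, 2),
        FileRecord("/archive/b.dat", Tier.SSD, 30),
        FileRecord("/archive/c.dat", Tier.HDD, 400),
    ]
    for path, source, destination in plan_migrations(files):
        print(f"{path}: {source.value} -> {destination.value}")
```

A real policy would typically weigh more than recency, for example file size, the cost of each tier, and predicted future accesses; the single-threshold rule here only shows the overall shape of such a function.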