METIS is a memory-assisted, efficient, high-performance storage subsystem built on holistic, end-to-end, hardware-supported memory and storage abstractions tailored to the demands of deep learning (DL) applications on high-performance computing (HPC) systems. The project adopts novel main memory compression architectures, uses the physical memory freed by compression to build a cooperative in-memory I/O cache, architects a high-performance NVMe burst buffer as the backend for that cache, and explores comprehensive power models to capture the impact of the I/O redesign.
The fundamental novelty and scientific value of this research can be summarized in four tightly coupled research thrusts.
Thrust 1: Design a transparent main memory compression architecture to expand the OS file cache. This requires architectural and OS support for using the freed physical memory to cache file pages.
Thrust 2: Design a cooperative memory cache. This requires designing efficient data structures and local memory management techniques, enabling a transparent global view and RDMA-enabled data paths, and designing cooperative protocols and eviction policies.
Thrust 3: Build an NVMe burst buffer that persistently writes bursty data objects evicted from DRAM, by designing high-performance checkpointing support and developing a dynamic load balancing scheme tuned specifically for deep learning workloads. We also propose to build endurance-optimized data management techniques to extend NVMe SSD lifespan.
Thrust 4: Explore power modelling and profiling of METIS by extending the PowerPack power measurement framework to the METIS FPGA implementation and evaluating METIS power-performance in simulation, emulation, and direct measurement environments.
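To make the relationship between Thrusts 2 and 3 concrete, the following is a minimal, illustrative sketch of a cooperative cache whose evictions spill first to a peer node's memory and only then to a burst buffer. It is not the METIS design: the class names (`CacheNode`, `CooperativeCache`, `BurstBuffer`), the LRU policy, and the first-peer-with-room placement rule are all assumptions chosen for brevity, and RDMA transfers and NVMe writes are replaced by plain in-process dictionary operations.

```python
from collections import OrderedDict


class BurstBuffer:
    """Hypothetical stand-in for the NVMe tier: a dict of evicted pages."""

    def __init__(self):
        self.pages = {}

    def write(self, key, data):
        self.pages[key] = data

    def read(self, key):
        return self.pages.get(key)


class CacheNode:
    """One node's in-memory cache, kept in LRU order (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # least recently used entry first

    def has_room(self):
        return len(self.pages) < self.capacity


class CooperativeCache:
    """Toy global view over several nodes backed by a burst buffer."""

    def __init__(self, nodes, burst_buffer):
        self.nodes = nodes
        self.bb = burst_buffer

    def put(self, node_id, key, data):
        # Keep a single copy of each key: drop any older copy first.
        for n in self.nodes:
            n.pages.pop(key, None)
        self.bb.pages.pop(key, None)

        node = self.nodes[node_id]
        if not node.has_room():
            victim_key, victim_data = node.pages.popitem(last=False)
            self._evict(node_id, victim_key, victim_data)
        node.pages[key] = data

    def _evict(self, src_id, key, data):
        # Prefer a peer with free memory (an RDMA write in a real system).
        for i, peer in enumerate(self.nodes):
            if i != src_id and peer.has_room():
                peer.pages[key] = data
                return
        # No peer has room: persist the page to the burst buffer.
        self.bb.write(key, data)

    def get(self, node_id, key):
        local = self.nodes[node_id]
        if key in local.pages:
            local.pages.move_to_end(key)  # refresh LRU position
            return local.pages[key]
        for i, peer in enumerate(self.nodes):
            if i != node_id and key in peer.pages:
                return peer.pages[key]  # remote hit
        return self.bb.read(key)  # None if the key was never stored
```

In this sketch a lookup falls through three tiers in order: local memory, peer memory, then the burst buffer. A real implementation would also need concurrency control, failure handling, and a smarter placement policy, none of which is modelled here.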
Provide an I/O solution that better serves the increasingly popular Deep Learning workloads on modern HPC systems.
Maximize the amount of memory available and accessible to HPC applications.
Build cross-stack improvements for HPC systems across architecture, OS, and distributed system paradigms on the foundation of advanced memory compression.
Evaluate the energy efficiency gains delivered by these innovations in memory compression and distributed caching.
Incorporate collaboration among universities, industry, and national laboratories, together with classroom integration, to adopt and validate METIS as a practical solution for DL in HPC.