Postdoctoral Researcher - Dynamic Memory Mapping
Develop memory management strategies for modern HPC+AI systems
What you will do
The convergence of AI, HPC & Big Data analytics is being accelerated by the proliferation of modern compute workflows that combine different methodologies & techniques to solve complex problems. These domains now arguably run the same types of data- and compute-intensive workloads on HPC hardware, whether on niche supercomputers, on small institutional clusters or in the cloud.
Distributed scaling, occupancy and bandwidth issues plague all these domains as well. Currently, there are four major trends in this converged domain. First, the average size of application datasets is rapidly increasing. Read-only input matrices that used to be on the order of megabytes or low-order gigabytes are growing into the double-digit gigabyte range and beyond. Second, applications are required to be increasingly accurate. This trend leads to larger working set sizes in memory as the resolution of stored and computed data becomes finer. Third, no matter how close accelerators are to the CPU, memory address spaces are still not coherent, and automated memory management systems do not yet reach the performance of hand-crafted solutions for HPC/AI applications. Fourth, while the physical memory size of accelerators is growing, it is not growing at the same rate as the working set sizes of applications. Consequently, future HPC systems will rely heavily on efficient memory management for accelerators to handle future working set sizes, and considerable research will be essential in this field. A confluence of HW-SW co-design choices optimized for these converged scenarios will therefore be necessary. Memory management and mapping form a crucial part of this.
The requirement for dynamic memory mapping strategies (in unsupervised and semi-supervised online training, dynamic graph analytics, data analytics, sparse linear algebra and databases) only compounds the above-mentioned issues. In conventional HPC systems, the memory management sub-system runs as a separate service, or as part of the runtime management sub-system, on a Service Node, and it controls memory allocation on the Computational Nodes. It deals with the following issues:
- choosing the most suitable memory according to the allocated processing elements;
- enabling concurrent, thread-safe memory allocation and deallocation while avoiding fragmentation;
- performing translation from virtual to physical addresses, and vice versa;
- performing runtime optimization.
Almost all accelerator/GPU-level memory managers offer the standard malloc/free interface & operate on a block of memory with a configurable size. They also follow a similar approach of splitting the available memory into large blocks (mostly of fixed size) & using these to serve the individual allocation requests. These resources are managed using lists, queues or even hashing. Such schemes are far from optimal. A few approaches have been proposed over the last decade & these need to be evaluated on a level playing field and on state-of-the-art hardware to answer the question of whether dynamic memory mapping and management is as slow as commonly thought. This also involves thoroughly evaluating compute resource allocation (task/process-based, thread-based as well as warp/wave-based), performance scaling, fragmentation and real-world performance, using custom and synthetic workloads as well as standard benchmarks, if any exist.
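As a rough illustration of the block-splitting approach described above, the following host-side Python sketch models a pool divided into fixed-size blocks, with a queue of free block indices serving malloc/free requests. It is a toy model only, not any particular GPU memory manager: the class name BlockPool, its parameters and the internal_waste helper are hypothetical, and it returns byte offsets rather than real device pointers.

```python
from collections import deque


class BlockPool:
    """Toy fixed-size block allocator over a contiguous pool (offsets only).

    Hypothetical sketch: names and behavior are assumptions for illustration.
    """

    def __init__(self, pool_bytes, block_bytes):
        self.block_bytes = block_bytes
        self.num_blocks = pool_bytes // block_bytes
        # Queue of free block indices, served in FIFO order.
        self.free_blocks = deque(range(self.num_blocks))
        self.in_use = set()

    def malloc(self, nbytes):
        """Return a byte offset into the pool, or None if the request fails."""
        if nbytes <= 0 or nbytes > self.block_bytes or not self.free_blocks:
            return None
        block = self.free_blocks.popleft()
        self.in_use.add(block)
        return block * self.block_bytes

    def free(self, offset):
        """Return the block containing `offset` to the free queue."""
        block = offset // self.block_bytes
        if block not in self.in_use:
            raise ValueError("double free or invalid offset")
        self.in_use.remove(block)
        self.free_blocks.append(block)

    def internal_waste(self, nbytes):
        """Bytes left unused when a request of `nbytes` occupies one block."""
        return self.block_bytes - nbytes


pool = BlockPool(pool_bytes=1024, block_bytes=256)
first = pool.malloc(100)    # occupies a full 256-byte block
second = pool.malloc(256)
too_big = pool.malloc(300)  # larger than one block, so it is rejected
```

Even this small model shows two of the costs mentioned above: a 100-byte request wastes 156 bytes of its block (internal fragmentation), and any request larger than one block cannot be served at all without a further mechanism.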
Following this, novel memory management strategies must be proposed for these converged domains (with a particular emphasis on mapping). This must result in guidelines on the best usage scenario for each strategy. There should also be insights into the infrastructure interfaces required to integrate any of the tested and proposed memory managers into an application and to switch between them for benchmarking purposes.
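One possible shape for such a switching interface is sketched below in Python, purely as an assumption about how it could look: a common abstract malloc/free interface, two deliberately simple stand-in managers, and a registry that lets a benchmark select a manager by name. All names here (MemoryManager, BumpManager, FixedBlockManager, MANAGERS, run_workload) are hypothetical and do not refer to an existing framework.

```python
from abc import ABC, abstractmethod
import time


class MemoryManager(ABC):
    """Hypothetical common interface every candidate manager implements."""

    @abstractmethod
    def malloc(self, nbytes): ...

    @abstractmethod
    def free(self, handle): ...


class BumpManager(MemoryManager):
    """Bump-pointer allocation; free is a no-op, so memory is never reused."""

    def __init__(self, pool_bytes):
        self.pool_bytes = pool_bytes
        self.top = 0

    def malloc(self, nbytes):
        if self.top + nbytes > self.pool_bytes:
            return None
        offset = self.top
        self.top += nbytes
        return offset

    def free(self, handle):
        pass


class FixedBlockManager(MemoryManager):
    """Serves every request from a list of free fixed-size blocks."""

    def __init__(self, pool_bytes, block_bytes=256):
        self.block_bytes = block_bytes
        self.free_offsets = list(
            range(0, pool_bytes - block_bytes + 1, block_bytes)
        )

    def malloc(self, nbytes):
        if nbytes > self.block_bytes or not self.free_offsets:
            return None
        return self.free_offsets.pop()

    def free(self, handle):
        self.free_offsets.append(handle)


# Registry: the benchmark picks a manager by name, not by concrete class.
MANAGERS = {"bump": BumpManager, "fixed": FixedBlockManager}


def run_workload(name, pool_bytes, sizes):
    """Allocate and immediately free each size; report how many were served."""
    manager = MANAGERS[name](pool_bytes)
    served = 0
    start = time.perf_counter()
    for nbytes in sizes:
        handle = manager.malloc(nbytes)
        if handle is not None:
            served += 1
            manager.free(handle)
    elapsed = time.perf_counter() - start
    return {"manager": name, "served": served, "seconds": elapsed}


results = [run_workload(name, 1024, [128] * 10) for name in MANAGERS]
```

Because the workload only sees the abstract interface, the same allocation trace can be replayed against each registered manager. In this toy setup the bump manager runs out of space after eight 128-byte requests, since it never reuses freed memory, while the fixed-block manager serves all ten.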
This project is an initiative of the Compute Systems Architecture Unit (CSA). CSA researches emerging workloads and their performance on large-scale supercomputer architectures for next-generation artificial intelligence (AI) and high-performance computing (HPC) applications. The team is responsible for algorithm research, runtime management innovations, performance modeling, architecture simulation and prototyping for these future applications and the systems that will execute them, with the goal of improving performance, energy efficiency and total cost of ownership by multiple orders of magnitude.