Hierarchical Resource Management
Current system software is straining to manage the increased complexity and parallelism of today’s largest systems. The Argo project utilizes the divide-and-conquer paradigm to hierarchically manage resources in a scalable fashion. Resource managers at all levels of the system coordinate using a distributed communication backplane.
At the global system level, where nodes are usually managed as an unordered pool, Argo introduces the concept of enclaves: collections of uniformly configured nodes managed as single entities. Typically, each job started by the batch scheduler is mapped to its own enclave. Workflow systems can subdivide enclaves and assign the newly created enclaves' resources to individual components of the workflow. Enclaves provide a natural hierarchy, giving users more control over how their computation progresses, and they support resource naming so that workflow tools can seamlessly connect related computations. The Argo team works with ECP application teams on the required APIs.
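The enclave hierarchy described above can be sketched as a tree of node sets. The following Python model is illustrative only; the `Enclave` class and `subdivide()` method are hypothetical names used for the sketch, not part of the Argo API.

```python
# Illustrative sketch (not the Argo API): enclaves as a tree of node sets.

class Enclave:
    """A collection of uniformly configured nodes, managed as one entity."""

    def __init__(self, name, nodes, parent=None):
        self.name = name
        self.nodes = set(nodes)      # node IDs owned exclusively by this enclave
        self.parent = parent
        self.children = []

    def subdivide(self, name, nodes):
        """Carve a child enclave out of this enclave's nodes,
        e.g. for one component of a workflow."""
        nodes = set(nodes)
        if not nodes <= self.nodes:
            raise ValueError("child must use a subset of parent nodes")
        child = Enclave(name, nodes, parent=self)
        self.nodes -= nodes          # parent gives up exclusive ownership
        self.children.append(child)
        return child

    def path(self):
        """Hierarchical name, usable for connecting related computations."""
        return self.name if self.parent is None else self.parent.path() + "/" + self.name

# A batch job mapped to its own enclave, then split by a workflow system:
job = Enclave("job42", range(8))
sim = job.subdivide("sim", range(6))
viz = job.subdivide("viz", [6, 7])
```

The hierarchical `path()` names give workflow tools a stable handle on each subcomputation, mirroring the resource-naming role of enclaves described above.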
At the node level, Argo takes advantage of the cgroups mechanism of the Linux kernel to introduce custom Compute Containers. Noisy system services are moved aside to a separate ServiceOS partition, leaving the bulk of node resources for exclusive use by HPC applications. Argo supports multiple Compute Containers on one node to accommodate complex workloads or in situ analysis components. It currently manages CPU cores and memory; support for other resources (interconnect, power) is planned. In Linux, memory management normally works at the NUMA node level, so Argo implemented Finer-Grained Memory Nodes, which present logical memory nodes at arbitrary granularity and give applications more control over data placement. The resource managers at each level are flexible and allow resizing at run time to support dynamic resource management, fault tolerance, and so on. They decentralize resource management and have the potential to improve performance isolation between jobs, as well as between system services and application processes.
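As a rough illustration of this node-level partitioning, the sketch below models only the bookkeeping of carving Compute Containers out of a node's cores, in the spirit of Linux cpuset cgroups. On a real system this would involve writing cpuset.cpus and cpuset.mems files under /sys/fs/cgroup; the `NodePartition` class here is purely hypothetical.

```python
# Illustrative sketch: partitioning a node's cores between a ServiceOS
# and Compute Containers. Only the bookkeeping is modeled; no cgroup
# filesystem operations are performed.

class NodePartition:
    def __init__(self, total_cores, service_cores):
        # Reserve a few cores for noisy system services (the ServiceOS).
        self.service = set(range(service_cores))
        self.free = set(range(service_cores, total_cores))
        self.containers = {}

    def create_container(self, name, ncores):
        """Give a Compute Container exclusive use of ncores free cores."""
        if ncores > len(self.free):
            raise ValueError("not enough free cores")
        cores = set(sorted(self.free)[:ncores])
        self.free -= cores
        self.containers[name] = cores
        return cores

    def resize(self, name, ncores):
        """Grow or shrink a container at run time (dynamic management)."""
        self.free |= self.containers.pop(name)
        return self.create_container(name, ncores)

node = NodePartition(total_cores=64, service_cores=2)
app = node.create_container("app", 56)        # bulk of the node for the HPC app
insitu = node.create_container("in_situ", 4)  # in situ analysis alongside
```

The `resize()` method corresponds to the run-time resizing mentioned above, which underpins dynamic resource management and fault tolerance.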
Deep Memory Hierarchy and Nonvolatile Storage
Exploiting the prominent role of byte- and block-addressable non-volatile memory (NVM) in exascale node architectures, the UMap page fault handler offers new capabilities for memory-mapped out-of-core computing with large data sets. UMap is a user-space library that provides an efficient mmap()-like interface to a large virtual address range. Page faults triggered by application threads accessing data mapped to the selected range are handled via the Linux userfaultfd protocol, an asynchronous, message-oriented kernel-user communication mechanism that avoids the context-switch penalty of traditional signal-based fault handlers.
The multi-threaded UMap handler maintains a page buffer to cache populated pages. Handlers can be specialized to application needs. For example, the “Astro” UMap handler maps in a collection of FITS format telescope data files to form a virtual 3D cube of sky survey time series imagery. Application threads access the 3D cube to calculate median values of image patches to locate asteroids and other transient objects.
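The handler's page-buffering behavior can be modeled in a few lines. The sketch below is a toy stand-in for UMap's actual userfaultfd-based implementation: faulted pages are populated by a caller-supplied callback (standing in for, e.g., a FITS file reader) and cached in a fixed-size buffer with least-recently-used eviction. All names here are hypothetical.

```python
# Toy model of a UMap-style handler's page buffer. The real UMap is a
# C/C++ user-space library built on userfaultfd; this only illustrates
# the caching behavior of the page buffer.

from collections import OrderedDict

class PageBufferHandler:
    def __init__(self, populate, buffer_pages):
        self.populate = populate        # callback: fetch a page from backing store
        self.buffer = OrderedDict()     # page number -> page data, in LRU order
        self.capacity = buffer_pages
        self.faults = 0

    def access(self, page):
        """Return the page's data, faulting it in if not buffered."""
        if page in self.buffer:
            self.buffer.move_to_end(page)    # hit: mark most recently used
            return self.buffer[page]
        self.faults += 1                     # miss: "page fault" to the store
        data = self.populate(page)
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)  # evict least recently used page
        self.buffer[page] = data
        return data

handler = PageBufferHandler(populate=lambda p: b"page-%d" % p, buffer_pages=2)
handler.access(0)
handler.access(1)
handler.access(0)   # hit: no new fault
handler.access(2)   # miss: evicts page 1, the least recently used
```

Specialized handlers such as the "Astro" example above would differ mainly in the populate callback, which maps application-specific file formats into the virtual address range.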
The UMap code base is open source and available at https://github.com/LLNL/umap.
Power Management
Power will be one of the most constrained resources at exascale, requiring careful management and cooperation at all levels. Even on the soon-to-be-delivered Argonne Aurora system, power will be capped at the global system level, with the capability to slosh power across portions of the machine in order to increase performance based on global optimization metrics.
The Global Resource Manager must track power allocations given to computing jobs. The concept of enclaves comes in handy again at this point, because the Enclave Resource Manager can keep track of the power allocation within a job. A naive implementation would hand out an equal chunk to every node in an enclave. In reality, imperfect load balance between the nodes, as well as silicon process variation, may necessitate some—possibly dynamic—variance in order to improve overall resource utilization and maximize the performance on the critical path. On each node, the Node Resource Manager needs to implement and enforce the power budget. An appropriate communication backplane will be required in order to coordinate power management actions across these different entities.
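A weighted-distribution policy, going beyond the naive equal split, might look like the following sketch. The `distribute_power` function and its parameters are assumptions for illustration, not an Argo interface.

```python
# Hypothetical sketch of an Enclave Resource Manager distributing a job's
# power budget. Nodes get shares weighted by demand (e.g. recent load
# combined with a process-variation efficiency factor), subject to a
# per-node floor.

def distribute_power(enclave_watts, demands, floor_watts):
    """demands: per-node weights. Every node gets at least floor_watts;
    the remaining budget is split proportionally by weight."""
    n = len(demands)
    if enclave_watts < n * floor_watts:
        raise ValueError("budget cannot cover the per-node floor")
    spare = enclave_watts - n * floor_watts
    total = sum(demands)
    return [floor_watts + spare * d / total for d in demands]

# Four nodes; node 2 is on the critical path and gets a larger share.
caps = distribute_power(enclave_watts=1000,
                        demands=[1.0, 1.0, 2.0, 1.0],
                        floor_watts=100)
```

Recomputing the demand weights periodically would give the dynamic variance mentioned above, steering spare power toward the critical path while keeping the enclave within its global allocation.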
COOLR-MSR is our unified measurement and control stack for on-node power. It provides seamless access to the relevant hardware features of modern processors, coordinates with the OS, and provides glue layers to existing APIs such as PAPI and TAU. We have also developed a software power-capping mechanism based on feedback control, dynamic expansion and shrinking of the core sets used by some runtime systems, and a temperature-aware thread management mechanism. Once integrated, these techniques will enable comprehensive dynamic power management in cooperation with runtimes and power-aware applications.
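The feedback-control idea behind software power capping can be illustrated with a small simulation. This is not the COOLR-MSR code; the controller structure, gain value, and simulated node model are all assumptions made for the sketch.

```python
# Hedged sketch of software power capping via feedback control: the
# controller nudges the enforced cap so that measured power converges
# to the target.

def power_cap_loop(target_watts, measure, apply_cap, steps, gain=0.5):
    """Each step, move the cap by gain * (target - measured power)."""
    cap = target_watts
    history = []
    for _ in range(steps):
        power = measure()                     # read current power draw
        cap += gain * (target_watts - power)  # lower the cap if over target
        apply_cap(cap)                        # enforce via hardware controls
        history.append(power)
    return history

# Simulated node that always draws 10 W above its currently enforced cap:
state = {"cap": 100.0}
readings = power_cap_loop(
    target_watts=100.0,
    measure=lambda: state["cap"] + 10.0,
    apply_cap=lambda c: state.update(cap=c),
    steps=20,
)
# readings converge toward the 100 W target from above.
```

In a real deployment the measure and enforcement callbacks would be backed by processor power interfaces rather than a simulated node, and the controller would coordinate with the runtime-level mechanisms described above.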