Current state-of-the-art HPC operating systems are essentially node operating systems wired together to support collective job launch and user authentication, to monitor reliability-availability-serviceability (RAS) data, and to share a high-speed interconnect. This legacy design lacks a logical handle, and the associated computational resources, to support our hierarchical view. To remedy this, we propose a new first-class construct called an enclave. Enclaves are related to the existing notion of a “job partition,” the set of dedicated resources allocated to a computation by the job scheduler and system resource manager. Multiple enclaves may run on their dedicated resources within a system. Enclaves provide a framework for supporting software components that can act on the collection of allocated resources to dynamically manage power, respond to faults, and tune performance. We also extend the enclave concept to the full system: an exascale platform would have (at least) three layers of management: system, enclave, and node.
The management systems on current petascale platforms provide “set and forget” configuration of hardware and “detect and reject” responses to system faults. To actively manage power budgets, learn from performance data, tune system performance, and dynamically respond to fault events, information and control must flow vertically up and down the levels of the system hierarchy, with appropriate optimizations and responses available at each level. Our innovation is a hierarchical framework that supports global optimization and machine learning to build a goal-based, self-aware control system for exascale platforms. The hierarchical design of Argo, with system, enclave, and node views, allows a cross-cutting approach. For example, a whole-system power budget can be apportioned into budgets for each enclave; optimization components running at the enclave level can then manage power targets for individual nodes. Likewise, faults at the node level can be handled within the enclave first, based on the application’s resilience strategy, while simultaneously being shared with the whole-system view to ensure that fault patterns across applications and hardware sets can be learned. We also use goal-based system management to control power and resilience and to dynamically optimize performance.
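The hierarchical apportioning described above can be sketched as follows. This is an illustrative sketch only, not the Argo implementation; all class and function names (`Node`, `Enclave`, `apportion_system`, and the even-split policy) are hypothetical, and a real enclave-level optimizer would weight nodes by measured power/performance tradeoffs rather than splitting evenly.

```python
# Illustrative sketch (hypothetical names, not Argo's API): a whole-system
# power budget apportioned hierarchically into enclave and node budgets.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    power_cap_w: float = 0.0  # set by the enclave-level manager


@dataclass
class Enclave:
    name: str
    nodes: list
    budget_w: float = 0.0

    def apportion(self):
        # Enclave-level policy: split the enclave budget evenly across
        # nodes.  A real optimizer could instead weight nodes by their
        # observed power/performance behavior.
        per_node = self.budget_w / len(self.nodes)
        for n in self.nodes:
            n.power_cap_w = per_node


def apportion_system(total_budget_w, enclaves, shares):
    # System-level policy: divide the machine-wide budget among enclaves
    # by normalized shares, then let each enclave manage its own nodes.
    total_share = sum(shares)
    for enclave, share in zip(enclaves, shares):
        enclave.budget_w = total_budget_w * share / total_share
        enclave.apportion()


# Example: a 20 MW system budget split 3:1 between two enclaves.
e1 = Enclave("climate", [Node(f"n{i}") for i in range(4)])
e2 = Enclave("materials", [Node(f"m{i}") for i in range(2)])
apportion_system(20_000_000, [e1, e2], shares=[3, 1])
print(e1.budget_w, e1.nodes[0].power_cap_w)  # 15000000.0 3750000.0
print(e2.budget_w, e2.nodes[0].power_cap_w)  # 5000000.0 2500000.0
```

The key design point is that each level sees only its own span of control: the system manager knows enclave shares, and the enclave manager knows its nodes, matching the vertical flow of control described above.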
- libMSR/msr-safe — a node-local API to measure and cap power
- PuPIL — maximizes performance under a node-level power cap with no prior knowledge of the application
- LEO — learns Pareto-optimal power/performance tradeoffs at the node level
- POET/Bard — given a model of power/performance tradeoffs, meets performance or power constraints optimally
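To make the constrained-optimization idea behind the last tool concrete, the sketch below picks, from a table of known configurations, the lowest-power one that still meets a performance constraint. This is only a toy illustration of the POET/Bard-style problem, not their actual API or algorithm; the function name and the configuration table are invented for the example.

```python
# Illustrative sketch of constrained power/performance selection
# (hypothetical function and data, not the POET/Bard implementation).

def min_power_config(configs, perf_target):
    """configs: list of (name, speedup, power_w) tuples.
    Return the lowest-power configuration whose speedup meets
    perf_target, or None if the constraint is infeasible."""
    feasible = [c for c in configs if c[1] >= perf_target]
    return min(feasible, key=lambda c: c[2]) if feasible else None


# Made-up model of three node configurations.
configs = [
    ("low",  0.6,  60.0),
    ("mid",  0.9,  95.0),
    ("high", 1.0, 140.0),
]
print(min_power_config(configs, 0.85))  # ('mid', 0.9, 95.0)
print(min_power_config(configs, 1.1))   # None (constraint infeasible)
```

In practice such a table would come from a learned model (as in LEO), and the selection would run continuously as a feedback controller rather than once.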