Node OS/R Requirements
The hardware architecture of exascale nodes is expected to be significantly more complex than that of current nodes, particularly with respect to the heterogeneity of resources (accelerators, multiple memory tiers) and the increased level of parallelism. Because the HPC market continues to stay ahead of the rest of the computing industry in these areas, there is a danger that mainstream operating systems will not be ready to “just work” by the time the first exascale systems arrive.
The era of applications consisting solely of uniform MPI tasks running in unison is coming to an end. A more sophisticated programming model and runtime are needed to exploit all the parallelism in emerging and future systems, possibly requiring new modes of interaction with the operating system. Compound applications, consisting of multiple semi-independent components (e.g., simulation and data analysis), are gaining traction; such workloads require more flexible and dynamic management of resources than is typically offered on contemporary systems.
It is thus prudent to create a customized OS kernel that will provide the necessary capabilities to HPC applications and runtimes. To avoid expending massive resources reinventing the wheel, we strive to leverage existing code bases and techniques in the Argo Node OS/R design.
Our approach to the scalability problem is divide and conquer. We partition key node resources (CPU cores, memory) into a number of smaller, autonomous groups, each sized to scale well and to suit the task at hand. This provides a more manageable environment and more predictable performance.
The underlying OS is Linux, and we are taking advantage of containers (specifically, the control groups mechanism) for partitioning. The containers are also specialized by function:
ServiceOS – small container (currently occupying a single core) responsible for overall node management, interrupt handling, running management services such as the dynamic resource manager, local storage manager, system call forwarding infrastructure, local backplane agent, agents of global services, etc.
Compute Containers – responsible for running application code. The size of an individual container depends on application needs, but is generally no larger than a single NUMA node. Compute Containers can provide different capabilities, ranging from a setup optimized for highly parallel HPC jobs (predictable CPU scheduling; large, prefaulted memory pages; fine-grained resource management off-loaded to the low-level Argo runtime system) to a fairly standard Linux environment for legacy and non-highly-parallel workloads. A single application can run across multiple, possibly heterogeneous, containers because we do not enable isolation (namespaces) by default. The container mechanism could also be used as a lightweight substitute for virtualization for jobs with particular environment requirements.
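The partitioning described above can be illustrated with a minimal sketch. It assumes a hypothetical node topology given as a map from NUMA node ID to core IDs, reserves the lowest-numbered core for the ServiceOS, and creates one Compute Container per NUMA node. The names plan_containers, cpuset_list, serviceos, and computeN are illustrative and are not taken from the Argo code base; the resulting strings use the range-list syntax accepted by the cpuset.cpus and cpuset.mems files of the Linux control groups cpuset controller, but actually writing them there is left out.

```python
from typing import Dict, List


def cpuset_list(ids: List[int]) -> str:
    """Format IDs in the range-list syntax used by cpuset files, e.g. '1-3,8'."""
    ids = sorted(set(ids))
    parts = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the next ID is consecutive.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def plan_containers(numa_cores: Dict[int, List[int]],
                    service_cores: int = 1) -> Dict[str, Dict[str, str]]:
    """Return a hypothetical partition plan: one ServiceOS group plus one
    compute group per NUMA node (assumed policy, for illustration only)."""
    all_cores = sorted(c for cores in numa_cores.values() for c in cores)
    if not 1 <= service_cores < len(all_cores):
        raise ValueError("need at least one ServiceOS core and one compute core")
    reserved = set(all_cores[:service_cores])
    # The ServiceOS uses memory local to the cores it was given.
    service_mems = [n for n, cores in numa_cores.items() if reserved & set(cores)]
    plan = {"serviceos": {"cpus": cpuset_list(list(reserved)),
                          "mems": cpuset_list(service_mems)}}
    for node in sorted(numa_cores):
        cpus = [c for c in numa_cores[node] if c not in reserved]
        if cpus:  # a node fully consumed by the ServiceOS gets no compute group
            plan[f"compute{node}"] = {"cpus": cpuset_list(cpus),
                                      "mems": cpuset_list([node])}
    return plan


if __name__ == "__main__":
    topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}  # hypothetical two-socket node
    for name, resources in plan_containers(topology).items():
        print(name, resources)
```

Keeping each compute group within one NUMA node mirrors the sizing guideline stated above; a real deployment would derive the topology from the system rather than hard-code it.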
We need to efficiently provision heterogeneous memory resources to disparate computing elements. In particular, the introduction of NVRAM into the memory hierarchy presents a number of opportunities and challenges. We are leveraging the DI-MMAP subsystem that implements a transparent DRAM cache for NVRAM regions, with extensions such as huge pages.
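To make the idea of a DRAM cache in front of NVRAM concrete, the following is a minimal user-space sketch. It is not DI-MMAP and does not reflect its eviction policy, which is not described here; least-recently-used eviction is an assumption chosen for brevity, and DramPageCache and read_backing are hypothetical names. The slow tier is modeled as a callable that returns the contents of a page.

```python
from collections import OrderedDict
from typing import Callable


class DramPageCache:
    """Fixed-size page cache with LRU eviction (assumed policy)."""

    def __init__(self, capacity_pages: int,
                 read_backing: Callable[[int], bytes]) -> None:
        if capacity_pages < 1:
            raise ValueError("cache must hold at least one page")
        self.capacity = capacity_pages
        self.read_backing = read_backing  # stands in for an NVRAM read
        self.pages: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_page(self, page_no: int) -> bytes:
        if page_no in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_no)  # mark as most recently used
            return self.pages[page_no]
        self.misses += 1
        data = self.read_backing(page_no)    # slow path: fetch from backing tier
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data


if __name__ == "__main__":
    cache = DramPageCache(2, lambda n: bytes([n % 256]) * 4)
    for page in (0, 1, 0, 2, 1):
        cache.get_page(page)
    print(cache.hits, cache.misses, list(cache.pages))
```

The hit and miss counters are the kind of statistic one would compare across cache sizes and access patterns when evaluating such a layer.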
The impact of a multilevel memory hierarchy on application performance, as well as on the power draw of the system, is still unclear. We are investigating memory management policies using the HMsim simulation infrastructure.
We are also extending NUMA support to enable memory partitioning at granularities smaller than a physical NUMA node. Physical memory is partitioned into logical blocks exposed as fake NUMA nodes. We have full control over the size of each logical block and can also influence the physical-to-virtual mapping. This feature is called Finer-Grained Memory Nodes (FGMN).
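The block layout can be sketched as simple address arithmetic. The code below is illustrative only and is not the FGMN implementation: it assumes one physical NUMA node is described by a base address and a size, and splits it into logical nodes of a requested block size, with a shorter final block if the sizes do not divide evenly. LogicalNode and split_node are hypothetical names.

```python
from typing import List, NamedTuple

GIB = 1 << 30


class LogicalNode(NamedTuple):
    node_id: int        # ID of the logical (fake) NUMA node
    physical_node: int  # physical NUMA node that backs it
    start: int          # physical address of the first byte of the block
    size: int           # block size in bytes


def split_node(physical_node: int, base: int, size: int,
               block_size: int, first_id: int = 0) -> List[LogicalNode]:
    """Split one physical NUMA node into consecutive logical blocks."""
    if block_size <= 0 or size <= 0:
        raise ValueError("sizes must be positive")
    blocks: List[LogicalNode] = []
    offset = 0
    while offset < size:
        length = min(block_size, size - offset)  # last block may be shorter
        blocks.append(LogicalNode(first_id + len(blocks), physical_node,
                                  base + offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    # Hypothetical 10 GiB node split into 4 GiB logical blocks.
    for block in split_node(physical_node=0, base=0, size=10 * GIB,
                            block_size=4 * GIB):
        print(block)
```

Each logical block could then be assigned to a container independently of the others, which is the finer granularity the feature is meant to provide.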
Changing application requirements and system capabilities may necessitate new interfaces to efficiently couple the two layers. For example, the set of resources used by an application can change dynamically based on new requests from the application or from the system (e.g., a fault event or an enforcement of a power envelope). In any case, the application, the runtime, and the operating system need to cooperate on making an optimal redistribution.
Whereas one of the key tasks of mainstream operating systems is to fairly distribute limited resources among competing jobs, in HPC there is typically just one job per node. We are investigating how replacing adversarial system policies with cooperative ones, eliminating spurious permission checks, etc., benefits performance and what trade-offs it involves.
Our resource partitioning approach raises the non-trivial question of which OS services to run where (on a Compute core, on the ServiceOS core(s), or on dedicated external resources such as I/O forwarding nodes). There is a possible trade-off between performance and implementation complexity here. We are studying the use of system calls in existing HPC applications to determine which system services are worthwhile to optimize.
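One simple way to approach such a study is to tally system calls from a trace. The sketch below assumes the trace was captured with strace in its default text format, optionally with per-process prefixes and without timestamps; it is not the tooling used in the project, and count_syscalls is a hypothetical name. Calls are counted at entry, so the "resumed" half of an interrupted call is not counted a second time, and signal and exit lines are ignored.

```python
import re
from collections import Counter
from typing import Iterable

# Optional "[pid N] " or "N " prefix, then the system call name and "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines: Iterable[str]) -> Counter:
    """Count system call entries in strace-style text output."""
    counts: Counter = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # skips "<... resumed>", "--- SIG ---" and "+++ exited +++"
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
        '[pid  4242] read(3, "abc", 4096) = 3',
        '4242 read(3,  <unfinished ...>',
        '4242 <... read resumed>"abc", 4096) = 3',
        '+++ exited with 0 +++',
    ]
    for name, n in count_syscalls(sample).most_common():
        print(f"{name:10s} {n}")
```

Ranking the calls by frequency gives a first indication of which services matter most; weighing them by time spent would need the timing options of the tracer and is not covered here.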