BEACON and EXPOSE: Backplanes for the ARGO System

ARGO, a new architecture for exascale systems, requires extensible interfaces and mechanisms for gathering system-wide events and performance data in order to actively control enclaves and nodes. Current systems lack both a logical construct for enclaves and a backplane for sharing data, which precludes the simple construction of active components. For example, a control system on a platform today may recognize precursors to a node failure but is unable to notify the application and suggest possible resolutions. Similarly, it is difficult to correlate CPU or memory usage with an application's power consumption.

Our innovation is a two-part backplane linking the levels of the system architecture: node, enclave, and entire system. The Backplane for Event And Control Notification (BEACON) is a lightweight framework that provides interfaces for gathering and sharing event information, on the basis of which components can take appropriate action, as well as other supplementary services in the ARGO system.

The Exascale Performance Observation and introspection backplane (EXPOSÉ) manages performance data, which is used vertically at all levels in ARGO.

The BEACON Backplane

BEACON is based on publish-subscribe frameworks. Its main idea is a set of backplane end-points (also called BEEPs), placed across the node, enclave, and system levels, that are responsible for detecting and generating information (including, but not limited to, faults), which is then propagated via BEACON throughout the system. Other BEEPs that subscribe to this information can generate appropriate response actions if needed. The actual response actions, however, are initiated and performed by the various entities themselves, not by the BEACON framework. We expect this approach to provide a comprehensive method for detecting, disseminating, and handling information on a system-wide basis.

The figure below shows the placement of the BEACON backplane in the overall ARGO system.

beacon-fig1

Logical Representation of BEACON

Logically, BEACON will function as a distributed, shared messaging bus for transferring information between enclave-level publisher and subscriber entities. BEACON will present a publish/subscribe service model at its interface.
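The publish/subscribe model described above can be illustrated with a small in-process sketch. This is not the BEACON API: the class name ToyBus, its methods, and the topic strings are all hypothetical, and a real backplane would be distributed across nodes and enclaves rather than living in one process.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# A handler receives the topic name and the published payload.
Handler = Callable[[str, Any], None]


class ToyBus:
    """Hypothetical in-process stand-in for a publish/subscribe bus."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler for every future event on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> int:
        """Deliver `payload` to all subscribers; return how many were notified."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


bus = ToyBus()
received = []

# A subscribing end-point decides for itself how to react; the bus only delivers.
bus.subscribe("node/fault", lambda topic, payload: received.append((topic, payload)))

delivered = bus.publish("node/fault", {"node": "n042", "kind": "ecc_error"})
ignored = bus.publish("node/power", {"node": "n042", "watts": 180.0})
```

The bus never acts on an event itself. It only routes the event to whoever subscribed, which mirrors the division of responsibility between BEACON and the BEEPs.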

In addition to providing event transport, BEACON may provide some or all of the following services, as needed:

  • Response Managers – which manage responses and coordinate different BEEPs by following recovery plans,
  • Translators – which translate events so that they can be understood semantically between BEEPs,
  • Loggers – which log external events and re-publish them if necessary, and
  • Query Managers – which manage queries within the BEACON framework.
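As a rough illustration of how two of these services might sit on top of the event transport, the sketch below models a Logger and a Translator as ordinary subscribers. Everything here is an assumption made for the example: the ToyBus class, the service classes, the topic names, and the field mapping are hypothetical and do not come from BEACON. Response Managers and Query Managers are not shown.

```python
from collections import defaultdict


class ToyBus:
    """Hypothetical in-process bus used only for this illustration."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        # Iterate over a copy so handlers may subscribe while we deliver.
        for handler in list(self.handlers[topic]):
            handler(event)


class Logger:
    """Records every event on one topic and can re-publish them later."""

    def __init__(self, bus, topic):
        self.bus = bus
        self.records = []
        bus.subscribe(topic, self.records.append)

    def replay(self, out_topic):
        for event in list(self.records):
            self.bus.publish(out_topic, event)


class Translator:
    """Renames event fields so that another end-point can interpret them."""

    def __init__(self, bus, in_topic, out_topic, mapping):
        self.bus = bus
        self.out_topic = out_topic
        self.mapping = mapping
        bus.subscribe(in_topic, self._on_event)

    def _on_event(self, event):
        translated = {self.mapping.get(key, key): value for key, value in event.items()}
        self.bus.publish(self.out_topic, translated)


bus = ToyBus()
logger = Logger(bus, "vendor/fault")
translator = Translator(bus, "vendor/fault", "argo/fault", {"temp_c": "temperature_celsius"})

seen, replayed = [], []
bus.subscribe("argo/fault", seen.append)
bus.subscribe("replay/fault", replayed.append)

bus.publish("vendor/fault", {"node": "n7", "temp_c": 91})
logger.replay("replay/fault")
```

Modelling the services as subscribers keeps the transport itself minimal, which is one plausible reading of "lightweight framework"; the actual BEACON design may place these services differently.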

The figure below shows a simplified architecture diagram of BEACON and its services.

beacon-services-fig1

BEACON and its services

The EXPOSE Backplane

While global information access is important in an exascale environment,
the information must also be processed in situ (aggregated, reduced or
filtered, analyzed, and interpreted) in order to determine a course of
action, if any. EXPOSE (Exascale Performance and Observability
Environment) is a framework for developing in situ data introspection and
analysis services. EXPOSE interfaces to BEACON to retrieve data on
subscribed topics of interest and to publish results, control, and other
data back to consumers. Different services can be developed for specific
purposes within the exascale environment, depending on the requirements of
the problem.
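The subscribe, reduce, and publish pattern described above could look
roughly like the following sketch. It is not EXPOSE code: the ToyBus class,
the WindowedMeanService class, the topic names, the window size, and the
threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque
from statistics import mean


class ToyBus:
    """Hypothetical in-process bus standing in for the event backplane."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self.handlers[topic]):
            handler(event)


class WindowedMeanService:
    """Hypothetical in situ service: subscribe, reduce, filter, publish."""

    def __init__(self, bus, in_topic, out_topic, window=3, threshold=0.9):
        self.bus = bus
        self.out_topic = out_topic
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        bus.subscribe(in_topic, self.on_sample)

    def on_sample(self, value):
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        average = mean(self.samples)
        # Filter: publish only when a full window exceeds the threshold.
        if window_full and average > self.threshold:
            self.bus.publish(self.out_topic, {"mean": average, "window": len(self.samples)})


bus = ToyBus()
service = WindowedMeanService(bus, "node/cpu_util", "enclave/cpu_alert")

alerts = []
bus.subscribe("enclave/cpu_alert", alerts.append)

for utilization in [0.5, 1.0, 1.0, 1.0]:
    bus.publish("node/cpu_util", utilization)
```

Only the fourth sample produces an alert, because it is the first point at
which the three-sample window is both full and above the threshold. Raw
samples never leave the service, which is the point of reducing in situ.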

For instance, power management is an important requirement in exascale
platforms, both to use power efficiently and to prevent over-allocation
and consumption beyond a power budget. The EXPOSE service concept can be
used to manage global power information. Node-level measurements of power
consumption could be published to BEACON for an enclave power manager
(EPM) to subscribe to. The EPM could then aggregate the power data within
its enclave and send the results to a global power manager (GPM) for
evaluation. The GPM could then send power control information back through
BEACON to node-level resource managers. Together, the EPM and GPM would be
implemented as an EXPOSE service.
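The power-management flow above can be sketched end to end as follows. This
is a toy model under stated assumptions, not the ARGO implementation: the
ToyBus class, both manager classes, the topic names, and the wattages are
hypothetical. The proportional scaling policy is also an assumption, since
the text does not say how the GPM decides. For brevity the GPM publishes
one cap per enclave, which node-level resource managers in that enclave
would subscribe to.

```python
from collections import defaultdict


class ToyBus:
    """Hypothetical in-process bus standing in for the event backplane."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self.handlers[topic]):
            handler(event)


class EnclavePowerManager:
    """Aggregates the latest per-node power samples within one enclave."""

    def __init__(self, bus, enclave):
        self.bus = bus
        self.enclave = enclave
        self.latest = {}  # node name -> most recent watts
        bus.subscribe(f"power/node/{enclave}", self.on_sample)

    def on_sample(self, sample):
        self.latest[sample["node"]] = sample["watts"]

    def report(self):
        total = sum(self.latest.values())
        self.bus.publish("power/enclave", {"enclave": self.enclave, "watts": total})


class GlobalPowerManager:
    """Compares total draw with a budget and publishes per-enclave caps."""

    def __init__(self, bus, budget_watts):
        self.bus = bus
        self.budget = budget_watts
        self.totals = {}  # enclave name -> reported watts
        bus.subscribe("power/enclave", self.on_report)

    def on_report(self, report):
        self.totals[report["enclave"]] = report["watts"]

    def evaluate(self):
        total = sum(self.totals.values())
        if total <= self.budget:
            return  # within budget, no control action needed
        for enclave, watts in self.totals.items():
            # Assumed policy: scale every enclave by the same factor.
            cap = watts * self.budget / total
            self.bus.publish(f"power/control/{enclave}", {"enclave": enclave, "cap_watts": cap})


bus = ToyBus()
epms = {name: EnclavePowerManager(bus, name) for name in ("e0", "e1")}
gpm = GlobalPowerManager(bus, budget_watts=800.0)

caps = {}
for name in epms:
    # Stand-in for node-level resource managers receiving control messages.
    bus.subscribe(f"power/control/{name}", lambda msg: caps.update({msg["enclave"]: msg["cap_watts"]}))

bus.publish("power/node/e0", {"node": "n0", "watts": 200.0})
bus.publish("power/node/e0", {"node": "n1", "watts": 300.0})
bus.publish("power/node/e1", {"node": "n2", "watts": 500.0})

for epm in epms.values():
    epm.report()
gpm.evaluate()
```

With a combined draw of 1000 W against an 800 W budget, each enclave in
this example is capped at 400 W. Only enclave totals travel upward to the
GPM, so the global manager never has to handle per-node samples.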

Introspection of node-level, enclave-level, and global performance data
(captured by the TAU Performance System), its runtime analysis, and its
visualization have been demonstrated. EVPath technology is being used to
build the EXPOSE framework to align it with Hobbes activities. Together,
BEACON and EXPOSE will make it possible to develop scalable distributed
services in exascale systems that are efficient, dynamic, retargetable,
and responsive.
