MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems

Name: MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems
Start: 2020-09-14T00:00:00Z
End: 2020-09-17T00:00:00Z
Location: Kobe, Japan (Virtual)

Jie Li, Ghazanfar Ali, Ngan Nguyen, Jon Hass, Alan Sill, Tommy Dang, Yong Chen

CLUSTER 2020

Abstract

Understanding the status of high-performance computing platforms and correlating applications to resource usage provide insight into the interactions among platform components. A lot of efforts have been devoted into developing monitoring solutions; however, a large-scale HPC system usually requires a combination of methods/tools to successfully monitor all metrics, which will lead to a huge effort in configuration and monitoring. Besides, monitoring tools are often left behind in the procurement of large-scale HPC systems. These challenges have motivated the development of a next-generation out-of-the-box monitoring tool that can be easily deployed without losing informative metrics. In this work, we introduce MonSTer, an “out-of-the-box” monitoring tool for high-performance computing platforms. MonSTer uses the evolving specification Redfish to retrieve sensor data from Baseboard Management Controller (BMC), and resource management tools such as Univa Grid Engine (UGE) or Slurm to obtain application information and resource usage data. Additionally, it also uses a time-series database (e.g. InfluxDB) for data storage. MonSTer correlates applications to resource usage and reveals insightful knowledge without having additional overhead on the application and computing nodes. This paper presents the design and implementation of MonSTer, as well as experiences gained through real-world deployment on the 467-node Quanah cluster at Texas Tech University’s High Performance Computing Center (HPCC) over the past year.

Date

Sep 14, 2020 12:00 AM — Sep 17, 2020 12:00 AM

Event

CLUSTER 2020

Location

Kobe, Japan (Virtual)

HPC Monitoring Redfish InfluxDB Data Analytics Time-series Data Analytics

MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems

Abstract

Jie Li

Ph.D. candidate in Computer Science