Experiences of Storing and Querying Monitoring Data of Large-scale High Performance Computing Platforms


Monitoring high performance computing (HPC) platforms is critical to the successful operation of a facility: it has proven useful for characterizing system behavior, identifying failure trends, and detecting the types of hardware that cause cluster instability. As we move toward exascale systems, monitoring large-scale HPC platforms becomes more challenging because both the scale and the complexity of the platforms increase. Vendors and administrators may have their own customized collection tools; however, they all face the same challenge: handling an ever-growing volume of data points. In this research, we examine time series databases (TSDBs), which are popular in DevOps monitoring and real-time data analysis. We also share our experience and efforts in improving the performance of monitoring the Quanah cluster at the Texas Tech High Performance Computing Center.
