Monitoring Operating System Status (on a Raspberry Pi cluster)


Monitoring high performance computing (HPC) platforms has proven useful for characterizing system behavior, identifying failure trends, and detecting certain types of hardware that cause cluster instability. In our previous studies, we explored collecting data from Baseboard Management Controllers (BMCs) and from resource managers (e.g., UGE, Slurm) via dedicated APIs. However, these data alone are not sufficient for evaluating and inspecting the systems. In this study, we explore collecting data from the operating system side, present a monitoring tool called “Glances”, and integrate it into our monitoring toolset. Furthermore, to check the usability of “Glances” and to test the portability of our monitoring toolset, we build a Raspberry Pi cluster for the experiment. A live demo will be given during the talk.
