Predicting Abnormal Workloads in HPC Systems


Abnormal workloads in High-Performance Computing (HPC) systems can be defined as workloads exit abnormally and are cancelled by the user or terminated by the job scheduler or operating system due to software or hardware related problems. These anomalous workloads, especially those that consumed significant computational resources in time and space, affect the efficiency of HPC system and thus limit the amount of scientific work that can be achieved. While we are approaching towards the exaFLOP performance goal, anomaly detection of workloads will become increasingly important as system scale and complexity increase. However, predicting anomalous workloads is a non-trivial task. There is no publicly available, labeled dataset of abnormal workloads, nor are there publicly accepted methodologies for predicting workload anomalies. Furthermore, even though HPC monitoring metrics record the performance state of the system, analyzing and extracting insightful information from this massive metrics can be daunting. In this study, we analyze job accounting data collected from a production HPC cluster and use these data to train machine learning models to predict the abnormal workloads. Experimental results show that our prediction model can achieve 97.0% precision and reduce CPU time, integrated memory usage, and IO by up to 24.18%, 27.29%, and 60.37%, respectively.

Download slides here