Seminar, DISCL, Lubbock, Texas
The current resource allocation method in HPC allocates resources to applications in units of nodes where every node is identical. On the other hand, the demands of HPC applications vary significantly in terms of resource usage. The gap between the coarse-grained resource and the varying demands makes the system risk in substantially underutilization, even though the nodes allocation rate seems high. In this work, we perform a large-scale analysis of metrics sampled in NERSC’s Perlmutter, a production HPC system containing both CPU-only nodes and GPU-accelerated nodes. This data-driven analysis gives us a holistic view of how compute resources are used and what characteristics of applications have regarding utilizing the HPC resources. The insights derived from the analysis not only help us evaluate the current architecture design and therefore have impacts on making future procurement decisions, but also motives the research on emerging system architecture design, such as disaggregated memory, to improve the sub-system resource utilization.