Talks and presentations

See a map of all the places I've given a talk!

A Holistic View of Resource Utilization on Perlmutter

August 03, 2022

Seminar, DISCL, Lubbock, Texas

The current resource allocation method in HPC allocates resources to applications in units of nodes, where every node is identical. The resource demands of HPC applications, however, vary significantly. The gap between this coarse-grained allocation and the varying demands puts the system at risk of substantial underutilization, even when the node allocation rate appears high. In this work, we perform a large-scale analysis of metrics sampled on NERSC’s Perlmutter, a production HPC system containing both CPU-only nodes and GPU-accelerated nodes. This data-driven analysis gives us a holistic view of how compute resources are used and how applications behave with respect to resource utilization. The insights derived from the analysis not only help us evaluate the current architecture design and inform future procurement decisions, but also motivate research on emerging system architectures, such as disaggregated memory, that improve sub-system resource utilization.
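
As a rough illustration of the kind of aggregation such an analysis relies on, the sketch below rolls per-node samples up to per-job summaries with pandas; the CSV layout and column names are hypothetical and do not reflect Perlmutter's actual telemetry schema.

```python
# Hypothetical sketch: aggregate per-node metric samples into per-job summaries.
# The input file and its columns (job_id, node, mem_used_gb, gpu_util_pct) are
# placeholders, not the real Perlmutter telemetry schema.
import pandas as pd

samples = pd.read_csv("node_samples.csv")

per_job = samples.groupby("job_id").agg(
    peak_mem_gb=("mem_used_gb", "max"),
    mean_gpu_util=("gpu_util_pct", "mean"),
    nodes=("node", "nunique"),
)

# Fraction of jobs whose peak memory fits in a quarter of a 256 GB node --
# the kind of statistic that motivates disaggregated-memory designs.
print((per_job["peak_mem_gb"] < 64).mean())
```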

SST and Cycle-accurate Simulation of xBGAS

May 03, 2022

Seminar, DISCL, Lubbock, Texas

As the HPC community enters the exaFLOP regime, designing and building these computers is becoming increasingly difficult. Beyond the traditional challenges of performance and scaling, system architects must also consider power consumption, cost, programmability, and more. Overcoming these challenges requires a holistic approach that not only makes changes to subcomponents, such as memory and processors, but also concurrently changes the programming model and applications, i.e., a hardware/software co-design approach. However, it is usually impractical to construct hardware prototypes to explore the vast design space. Simulators, in this scenario, become indispensable tools for guiding design decisions. In this talk, I will present the Structural Simulation Toolkit (SST), an open, modular, parallel, multi-scale simulation framework for HPC architectural exploration. More specifically, I will give an overview of SST and explain how it works. I will also give demos of simulating a multi-level memory system. The goal of exploring SST is to implement a cycle-accurate simulation framework for xBGAS, which has been a major gap in previous xBGAS work.
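
For readers unfamiliar with SST, a configuration is an ordinary Python script that instantiates components and wires them together with links. The sketch below outlines a single CPU model, an L1 cache, and a memory controller; the element types, parameters, and port names are illustrative and vary across sst-elements versions, so check `sst-info` on your installation before running anything like it.

```python
# Illustrative SST configuration: one CPU model, an L1 cache, a memory controller.
# Component types, parameters, and port names are version-dependent placeholders.
import sst

cpu = sst.Component("cpu0", "miranda.BaseCPU")
cpu.addParams({"clock": "2GHz"})

l1 = sst.Component("l1cache", "memHierarchy.Cache")
l1.addParams({"cache_size": "32KiB", "associativity": "8",
              "access_latency_cycles": "4", "L1": "1"})

mem = sst.Component("memctrl", "memHierarchy.MemController")
mem.addParams({"clock": "1GHz"})

# Wire the components together; latencies and port names are placeholders.
sst.Link("cpu_l1").connect((cpu, "cache_link", "1ns"), (l1, "high_network_0", "1ns"))
sst.Link("l1_mem").connect((l1, "low_network_0", "50ns"), (mem, "direct_link", "50ns"))
```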

Lightweight Checkpointing of Loop-Based Kernels Using Disaggregated Memory

February 02, 2022

Seminar, DISCL, Lubbock, Texas

Checkpoint/Restart (C/R) is commonly used to provide fault tolerance in HPC systems, where the increasing number of hardware components and the complexity of the software stack raise overall fault rates. Meanwhile, the ever-growing power consumption of HPC systems has fostered the exploration of powering them with renewable energy. Due to the variable nature of renewable energy sources, such as wind and solar, jobs running on HPC systems need to be suspended or migrated to fit within the available power. C/R, in such cases, is also indispensable. Typically, C/R techniques periodically save the status of running processes to checkpoint files stored on a parallel file system. However, this process is expensive; a single checkpoint can take on the order of tens of minutes, rendering checkpointing less practical. We propose a lightweight checkpointing technique that uses disaggregated memory as the checkpoint target for loop-based kernels. Disaggregated memory is a concept similar to disaggregated storage; it provides a shared memory pool for compute nodes. With xBGAS-enabled remote memory accessibility, we hope to checkpoint the necessary data objects in remote memory with minimal overhead and to restart the application from any available compute node. In this seminar talk, we will summarize current Checkpoint/Restart techniques, present the concept of disaggregated memory, and discuss the proposed C/R technique, which we name CRaaS (C/R as a Service). Finally, we will use a loop-based kernel to illustrate the design of the CRaaS APIs.
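
To make the idea concrete, here is a purely hypothetical sketch of what a CRaaS-style interface around a loop-based kernel could look like; the function names (craas_checkpoint, craas_restore) and the dict standing in for the remote memory pool are invented for illustration and are not the actual APIs discussed in the talk.

```python
# Hypothetical CRaaS-style sketch: checkpoint only the live data objects of a
# loop-based kernel. A dict stands in for the xBGAS-addressable remote memory pool.
import numpy as np

remote_pool = {}  # stand-in for a disaggregated memory pool

def craas_checkpoint(key, arrays, iteration):
    remote_pool[key] = {"iter": iteration, "data": [a.copy() for a in arrays]}

def craas_restore(key):
    ckpt = remote_pool.get(key)
    return (ckpt["iter"], ckpt["data"]) if ckpt else (0, None)

# Loop-based kernel: checkpoint the working array every 100 iterations, and
# resume from the last checkpoint if one exists.
grid = np.zeros(1024)
start, saved = craas_restore("jacobi1d")
if saved is not None:
    grid = saved[0]
for it in range(start, 1000):
    grid[1:-1] = 0.5 * (grid[:-2] + grid[2:])
    if it % 100 == 0:
        craas_checkpoint("jacobi1d", [grid], it)
```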

Deep Learning and Monitoring Metrics to Image Encoding for Detecting Applications in HPC systems

November 23, 2021

Seminar, DISCL, Lubbock, Texas

Knowledge of the applications running in HPC systems not only allows designing better resource-aware scheduling policies to improve system efficiency, but also provides the opportunity to detect and prohibit malicious programs, such as bitcoin-mining and password-cracking applications. To this end, in our previous research, we proposed a method to detect and identify applications based on statistical features extracted from resource consumption metrics. However, our previous experimental results showed that the random forest model using statistical features only worked relatively well on some specific applications; the overall accuracy was not promising. In this seminar, I will talk about our recent efforts to enhance the detection model. Specifically, we are exploring encoding time-series monitoring metrics into images and taking advantage of deep Convolutional Neural Networks (CNNs) to classify the resulting two-dimensional images. Using encoded time-series data and CNNs within a unified framework is expected to boost the performance of application detection and classification. Our results show that the proposed methodology achieves an average accuracy of 87%; for some specific applications, it achieves an accuracy of more than 95%.
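
The sketch below shows the general shape of the approach: a window of multi-channel metrics is treated as a two-dimensional image and fed to a small CNN. The encoding used here (simply stacking normalized metric channels over time) and the network architecture are placeholders, not the exact design used in this work.

```python
# Sketch: classify (metrics x time) "images" of monitoring data with a small CNN.
# The encoding and network shape are illustrative placeholders.
import torch
import torch.nn as nn

n_metrics, n_steps, n_apps = 16, 64, 10

class MetricCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * (n_metrics // 4) * (n_steps // 4), n_apps)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One fake batch: 32 jobs, each a (metrics x time) image normalized to [0, 1].
images = torch.rand(32, 1, n_metrics, n_steps)
logits = MetricCNN()(images)
print(logits.shape)  # torch.Size([32, 10])
```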

Advanced Visualization and Data Analysis of HPC Cluster and User Application Behavior

November 12, 2021

Conference, The International Conference for High Performance Computing, Networking, Storage, and Analysis, Saint Louis, Missouri

This work presents cutting-edge visualization, monitoring, and management solutions that help administrators understand the status of high-performance computing platforms and provide insight into the interactions among platform components. Benefiting from the greatly increased level of detail available from modern baseboard management controllers through Redfish Telemetry, and from real-time correlation via API and CLI interfaces to HPC job schedulers, this work provides much greater detail than previous similar projects.
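
As a hedged illustration of one of the data sources, the snippet below pulls a telemetry metric report from a node's BMC over Redfish; the BMC address, credentials, and report contents are placeholders.

```python
# Sketch: read Redfish Telemetry metric reports from a BMC. Address and
# credentials are placeholders; the MetricValues structure follows the
# standard Redfish MetricReport schema.
import requests

BMC = "https://bmc-node001.example"   # placeholder BMC address
auth = ("user", "password")           # placeholder credentials

reports = requests.get(f"{BMC}/redfish/v1/TelemetryService/MetricReports",
                       auth=auth, verify=False).json()
for member in reports.get("Members", []):
    report = requests.get(f"{BMC}{member['@odata.id']}", auth=auth, verify=False).json()
    for value in report.get("MetricValues", []):
        print(value.get("MetricId"), value.get("MetricValue"), value.get("Timestamp"))
```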

Detecting and Identifying Applications by Job Signatures

September 14, 2021

Seminar, DISCL, Lubbock, Texas

As HPC systems enter the exaFLOP era, their scale and complexity have increased significantly over the past few years. Administrators need to understand not only how the hardware is performing, but also the typical applications that use the system. In addition, resource contention and energy consumption increase with computational capability. HPC administrators and researchers need to understand the characteristics of running applications and design better resource-aware scheduling policies to improve system efficiency. Moreover, unauthorized applications, such as bitcoin-mining programs, could take advantage of the high computing capability, consuming computing hours that are supposed to be used for scientific discovery. Therefore, knowing which applications are running helps administrators block such malware proactively. To address these challenges, it is necessary to detect applications and develop management strategies based on knowledge of the applications. However, this is a non-trivial task if users do not specify the name of the application in their job submission scripts. In this research, we propose approaches to detect and identify applications through job signatures built from monitoring metrics. Specifically, we exploit monitoring metrics collected from LDMS to build job signatures in two ways: extracting statistical features from time-series data and representing multi-dimensional time-series data as images. Then, we explore several classification algorithms and evaluate their performance in classifying job signatures.
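
A minimal sketch of the first approach (statistical features plus a classifier) is shown below; the synthetic metrics stand in for real LDMS time series, and the random forest is one of several classifiers that could be evaluated.

```python
# Sketch: build a "job signature" from simple statistics of each metric's time
# series, then classify signatures. Synthetic data stands in for LDMS metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def signature(job_metrics):
    # job_metrics: (n_metrics, n_samples) time series for one job
    return np.concatenate([job_metrics.mean(axis=1), job_metrics.std(axis=1),
                           job_metrics.min(axis=1), job_metrics.max(axis=1)])

# Synthetic "jobs": 200 jobs, 8 metrics, 120 samples each, 5 application classes.
X = np.stack([signature(rng.normal(size=(8, 120))) for _ in range(200)])
y = rng.integers(0, 5, size=200)

clf = RandomForestClassifier(n_estimators=100).fit(X[:150], y[:150])
print(clf.score(X[150:], y[150:]))
```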

Predicting Abnormal Workloads in HPC Systems

May 08, 2021

Seminar, DISCL, Lubbock, Texas

Abnormal workloads in High-Performance Computing (HPC) systems can be defined as workloads that exit abnormally: they are cancelled by the user or terminated by the job scheduler or operating system due to software- or hardware-related problems. These anomalous workloads, especially those that consume significant computational resources in time and space, affect the efficiency of HPC systems and thus limit the amount of scientific work that can be achieved. As we approach the exaFLOP performance goal, anomaly detection of workloads will become increasingly important as system scale and complexity increase. However, predicting anomalous workloads is a non-trivial task. There is no publicly available, labeled dataset of abnormal workloads, nor are there publicly accepted methodologies for predicting workload anomalies. Furthermore, even though HPC monitoring metrics record the performance state of the system, analyzing and extracting insightful information from this massive volume of metrics can be daunting. In this study, we analyze job accounting data collected from a production HPC cluster and use these data to train machine learning models to predict abnormal workloads. Experimental results show that our prediction model can achieve 97.0% precision and reduce CPU time, integrated memory usage, and IO by up to 24.18%, 27.29%, and 60.37%, respectively.
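
The sketch below illustrates how the problem can be framed as binary classification over accounting-style features, evaluated by precision; the feature names, the synthetic data, and the choice of classifier are assumptions made for illustration only.

```python
# Sketch: predict abnormal jobs from accounting-style features and report
# precision. Features, data, and the classifier are illustrative placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_jobs = 2000
# hypothetical features: requested cores, requested walltime, queue wait, past failure rate
X = rng.normal(size=(n_jobs, 4))
y = (rng.random(n_jobs) < 0.1).astype(int)  # ~10% abnormal, mimicking class imbalance

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print(precision_score(y_te, model.predict(X_te), zero_division=0))
```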

The IEEE Cluster2020 Experience, MonSTer Review and Future Work

September 23, 2020

Seminar, DISCL, Lubbock, Texas

MonSTer is an out-of-the-box monitoring tool for high performance computing systems that has been in development for over a year. After several rounds of iterations and optimizations, we have achieved up to 25x performance improvements over the initial implementation, allowing for near real-time acquisition and visualization of monitoring data. This research study has been published at the IEEE Cluster 2020 conference and was presented to the community last week. In this talk, I will share some of my experience from attending the conference and review the work we have done. In addition, I would like to discuss several research directions in the context of “Integrated visualizing, monitoring, and managing HPC systems”.

MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems

September 16, 2020

Conference, The 22nd IEEE International Conference on Cluster Computing, Kobe, Japan

Understanding the status of high-performance computing platforms and correlating applications to resource usage provide insight into the interactions among platform components. A lot of effort has been devoted to developing monitoring solutions; however, a large-scale HPC system usually requires a combination of methods and tools to successfully monitor all metrics, which leads to a huge configuration and monitoring effort. Besides, monitoring tools are often left behind in the procurement of large-scale HPC systems. These challenges have motivated the development of a next-generation, out-of-the-box monitoring tool that can be easily deployed without losing informative metrics.

Monitoring Operating System Status (on a Raspberry Pi cluster)

July 22, 2020

Seminar, DISCL, Lubbock, Texas

Monitoring high performance computing (HPC) platforms has been very useful in determining system behaviors and failure trends, and in detecting certain types of hardware that cause cluster instability. In our previous studies, we explored collecting data from Baseboard Management Controllers (BMCs) and from resource managers (e.g., UGE, Slurm) via dedicated APIs. However, these data are not sufficient for evaluating and inspecting the systems. In this study, we explore collecting data from the operating system side, present a monitoring tool called “Glances”, and integrate it into our monitoring toolset. Furthermore, to check the usability of “Glances” and to test the portability of our monitoring toolset, we build a Raspberry Pi cluster for the experiments. A live demo will be given during the talk.
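
For reference, Glances can run in web-server mode (glances -w) and expose a REST API, so a toolset can poll each Raspberry Pi node over HTTP. The sketch below assumes the Glances 3.x API path; the host names and response fields shown are placeholders to illustrate the idea.

```python
# Sketch: poll OS-level metrics from Glances running in web-server mode on each
# Raspberry Pi node. Hostnames are placeholders; the API path assumes Glances 3.x.
import requests

nodes = ["pi-node01", "pi-node02"]  # placeholder hostnames
for node in nodes:
    base = f"http://{node}:61208/api/3"
    cpu = requests.get(f"{base}/cpu").json()
    mem = requests.get(f"{base}/mem").json()
    print(node, "cpu %:", cpu.get("total"), "mem %:", mem.get("percent"))
```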

Experiences of Storing and Querying Monitoring Data of Large-scale High Performance Computing Platforms

April 15, 2020

Seminar, DISCL, Lubbock, Texas

Monitoring high performance computing (HPC) platforms has been very useful in determining system behaviors and failure trends, and in detecting certain types of hardware that cause cluster instability. It is critical to the successful operation of a facility. As we move to exascale systems, monitoring large-scale HPC platforms becomes more challenging as their scale and complexity increase. Vendors and administrators may have their own customized collection tools; however, they all face the same challenge: dealing with the ever-growing number of data points. In this research, we will dive into time series databases (TSDBs), which are popular in DevOps monitoring and real-time data analysis. We will also share our experience and efforts in improving the performance of monitoring the Quanah cluster at the Texas Tech High Performance Computing Center.
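
As a representative example of the TSDB workflow, the sketch below writes and queries a sample with the InfluxDB 1.x Python client; the database name, measurement, and tags are placeholders, and other TSDBs would look broadly similar.

```python
# Sketch of the TSDB write/query cycle, using InfluxDB 1.x as a representative
# example. Database, measurement, and tag names are placeholders.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="hpc_monitoring")

# Write one sample: tags identify the node, fields hold the measured values.
client.write_points([{
    "measurement": "node_power",
    "tags": {"host": "cpu-1-1"},
    "fields": {"watts": 215.0},
}])

# Query recent points back, e.g., for a dashboard panel.
result = client.query("SELECT mean(watts) FROM node_power WHERE time > now() - 1h GROUP BY host")
print(list(result.get_points()))
```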

Collecting and Storing Telemetry Metrics from RedRaider Cluster

February 27, 2020

Seminar, DISCL, Lubbock, Texas

With the deployment of the RedRaider Cluster and the new telemetry model adopted by iDRAC9, collecting and storing telemetry metrics is encountering new challenges from three directions. First, the RedRaider Cluster (including the Quanah, Nocona, and Matador partitions) has an ever larger number of nodes to monitor, 728 in total. Second, iDRAC9 provides more than 180 metrics, far more than iDRAC8. Third, iDRAC9 implements a new telemetry model, a push model for telemetry reports, which requires a different approach to obtaining the metrics. In this talk, I will summarize the approaches we have explored to address these challenges and present preliminary results from our tests and experiments.
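
To illustrate the push model, the sketch below is a minimal HTTP listener that could receive telemetry reports POSTed by iDRAC9 once a Redfish subscription points at it; the port and the parsing are placeholders, with real reports following the Redfish MetricReport schema.

```python
# Sketch: a minimal listener for pushed telemetry reports. Port and parsing are
# placeholders; real iDRAC9 reports carry a list of MetricValues.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class TelemetryHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        report = json.loads(body or b"{}")
        for value in report.get("MetricValues", []):
            print(value.get("MetricId"), value.get("MetricValue"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), TelemetryHandler).serve_forever()
```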

Profiling Power Consumption of Jobs with SLURM

January 29, 2020

Seminar, DISCL, Lubbock, Texas

Power, temperature, and performance have all become first-order design constraints for High Performance Computing platforms. These three factors influence each other and affect both architecture and resource-scheduling design. In our previous research, we proposed a methodology for profiling power consumption at the job level based on the assumption that power usage is proportional to CPU usage. However, this methodology lacks accuracy and cannot be applied to applications that mainly stress the network. On the other hand, the resource and job management system (e.g., SLURM) has knowledge of both the underlying resources and the jobs' needs, making it the best candidate for monitoring and controlling the power consumption of jobs. In this talk, we will discuss the SLURM architecture and a framework proposed by Yiannis et al. built on top of SLURM, which enables per-job energy accounting with power profiling capabilities, along with parameters for energy control features based on static frequency scaling of the CPUs.
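
If SLURM's acct_gather_energy plugin (e.g., RAPL or IPMI) is configured, per-job energy shows up in the accounting records and can be pulled with sacct, as the hedged sketch below illustrates; otherwise the energy field is simply empty.

```python
# Sketch: read per-job energy accounting from SLURM via sacct. Requires an
# acct_gather_energy plugin to be configured; otherwise ConsumedEnergyRaw is blank.
import subprocess

out = subprocess.run(
    ["sacct", "--allusers", "--starttime", "today",
     "--format=JobID,JobName,Elapsed,ConsumedEnergyRaw", "--parsable2", "--noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    jobid, name, elapsed, energy = line.split("|")
    if energy:
        print(f"{jobid} ({name}): {energy} J over {elapsed}")
```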

PIMS-A Lightweight Processing-In-Memory Accelerator for Stencil Computations

October 01, 2019

Conference, The International Symposium on Memory Systems, Washington, D.C.

Stencil computation is a classic computational kernel present in many high-performance scientific applications, such as image processing and partial differential equation (PDE) solvers. A stencil computation sweeps over a multi-dimensional grid and repeatedly updates values associated with points using the values from neighboring points. Stencil computations often employ large datasets that exceed cache capacity, leading to excessive accesses to the memory subsystem. As such, 3D stencil computations on large grid sizes are memory-bound. In this paper, we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation using three different grid sizes with six categories of orders indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.
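
For readers unfamiliar with the kernel class, the sketch below is a 7-point 3D Jacobi stencil: each interior point is updated from its six face neighbors, and repeated sweeps over a grid that exceeds cache capacity are what make the computation memory-bound.

```python
# A 7-point 3D Jacobi stencil: every interior point becomes the average of its
# six face neighbors, and the whole grid is swept repeatedly.
import numpy as np

def jacobi7(grid):
    new = grid.copy()
    new[1:-1, 1:-1, 1:-1] = (
        grid[:-2, 1:-1, 1:-1] + grid[2:, 1:-1, 1:-1] +
        grid[1:-1, :-2, 1:-1] + grid[1:-1, 2:, 1:-1] +
        grid[1:-1, 1:-1, :-2] + grid[1:-1, 1:-1, 2:]
    ) / 6.0
    return new

grid = np.random.rand(128, 128, 128)
for _ in range(10):        # repeated sweeps over the grid
    grid = jacobi7(grid)
```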

Monitoring Power Usage of Jobs Running on Quanah Cluster

September 18, 2019

Seminar, DISCL, Lubbock, Texas

Advanced power measurement capabilities are becoming available on large-scale high performance computing (HPC) deployments. Measuring power/energy usage and its variation during real workloads will enable us to evaluate the potential benefits of incorporating power data into job scheduling and resource management decisions. Several approaches to power measurement exist today, primarily in-band and out-of-band measurements. In this talk, we will discuss several power profiling techniques on modern HPC platforms and give a demo of our current implementation for monitoring the power usage of jobs running on the Quanah cluster. While this is still work in progress, we present the current state of our research to show what we are trying to learn, what we are analyzing and how, and what else we need to accomplish to make further progress.
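
As an example of an out-of-band measurement, the sketch below polls instantaneous node power from a BMC's Redfish Chassis/Power resource; the chassis path, credentials, and host name are placeholders and differ across vendors.

```python
# Sketch: out-of-band power polling via Redfish. The BMC address, credentials,
# and chassis path are placeholders; PowerControl/PowerConsumedWatts is the
# standard Redfish field for current draw.
import time
import requests

BMC = "https://bmc-compute-1-1.example"   # placeholder
auth = ("user", "password")               # placeholder

for _ in range(5):
    power = requests.get(f"{BMC}/redfish/v1/Chassis/System.Embedded.1/Power",
                         auth=auth, verify=False).json()
    watts = power["PowerControl"][0]["PowerConsumedWatts"]
    print(f"{time.strftime('%H:%M:%S')}  {watts} W")
    time.sleep(10)
```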