Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

Why I Believe

less than 1 minute read

Published:

Some thoughts from George Varghese, a CS professor and a Christian, on how he can be a professing intellectual and still be a Christian.

Software Architecture Patterns

1 minute read

Published:

Even though I am a researcher working on HPC-related topics and software engineering is not my focus, I still find that some basic knowledge of software architecture patterns helps in designing and implementing research ideas, and in drawing architecture diagrams for papers. Here, I share a brief introduction to the existing architecture patterns that I found, for future reference.

portfolio

publications

MAC: Memory access coalescer for 3D-stacked memory

Published in Proceedings of the 48th International Conference on Parallel Processing, 2019

In this paper we propose MAC (Memory Access Coalescer), a coalescing unit for the 3D-stacked memory. We discuss the design and implementation of MAC, in the context of a custom designed cache-less architecture targeted at data-intensive, irregular applications. Through a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average. It improves the performance of the memory system by 60.73% on average for a large set of irregular workloads.

Recommended citation: Wang, Xi, Antonino Tumeo, John D. Leidel, Jie Li, and Yong Chen. "MAC: Memory access coalescer for 3D-stacked memory." In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019. https://artlands.github.io/files/wang-icpp-2019.pdf
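As a rough illustration of what a memory access coalescer does, the sketch below merges raw requests that fall into the same aligned memory block into one request. The 64-byte block size, the function names, and the efficiency definition used here (the fraction of raw requests eliminated) are assumptions made for illustration; they are not taken from the paper.

```python
from collections import OrderedDict

BLOCK_BYTES = 64  # assumed block granularity, not from the paper


def coalesce(addresses, block_bytes=BLOCK_BYTES):
    """Group byte addresses by the aligned block they fall into.

    Returns an ordered mapping: block base address -> list of raw addresses.
    """
    blocks = OrderedDict()
    for addr in addresses:
        base = addr - (addr % block_bytes)  # align down to the block boundary
        blocks.setdefault(base, []).append(addr)
    return blocks


def coalescing_efficiency(addresses, block_bytes=BLOCK_BYTES):
    """Fraction of raw requests eliminated by coalescing (illustrative metric)."""
    if not addresses:
        return 0.0
    merged = coalesce(addresses, block_bytes)
    return 1.0 - len(merged) / len(addresses)


if __name__ == "__main__":
    raw = [0, 8, 16, 64, 72, 4096, 24, 128]
    print(dict(coalesce(raw)))
    print(coalescing_efficiency(raw))
```

Irregular workloads tend to scatter accesses across many blocks, which is why grouping by block is only a starting point for a real coalescing unit.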

PIMS: a lightweight processing-in-memory accelerator for stencil computations

Published in Proceedings of the International Symposium on Memory Systems, 2019

In this paper we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation using three different grid sizes with six categories of orders indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.

Recommended citation: Li, Jie, Xi Wang, Antonino Tumeo, Brody Williams, John D. Leidel, and Yong Chen. "Pims: a lightweight processing-in-memory accelerator for stencil computations." In Proceedings of the International Symposium on Memory Systems, pp. 41-52. 2019. https://artlands.github.io/files/li-memsys-2019.pdf

Mtsad: Multivariate time series abnormality detection and visualization

Published in IEEE International Conference on Big Data (Big Data), 2019

This paper introduces an approach to analyzing and visualizing high-dimensional time series, focusing on identifying multivariate observations that are significantly different from the others. We also propose a prototype, called MTSAD, to guide users when interactively exploring abnormalities in large time series.

Recommended citation: Pham, Vung, Ngan Nguyen, Jie Li, Jon Hass, Yong Chen, and Tommy Dang. "Mtsad: Multivariate time series abnormality detection and visualization." In 2019 IEEE International Conference on Big Data (Big Data), pp. 3267-3276. IEEE, 2019. https://artlands.github.io/files/pham-bigdata-2019.pdf

RadarViewer: Visualizing the dynamics of multivariate data

Published in Practice and Experience in Advanced Research Computing (PEARC), 2020

This showcase presents a visual approach based on clustering and superimposing to construct a high-level overview of sequential event data while balancing the amount of information and the cardinality in it. We also implement an interactive prototype, called RadarViewer, that allows domain analysts to simultaneously analyze sequence clustering, extract useful distribution patterns, and drill into multiple levels of detail to accelerate the analysis. RadarViewer is demonstrated through case studies with real-world temporal datasets of different sizes.

Recommended citation: Nguyen, Ngan, Jon Hass, Yong Chen, Jie Li, Alan Sill, and Tommy Dang. "RadarViewer: Visualizing the dynamics of multivariate data." In Practice and Experience in Advanced Research Computing, pp. 555-556. 2020. https://artlands.github.io/files/ngan-pearc-2020.pdf

MonSTer: an out-of-the-box monitoring tool for high performance computing systems

Published in IEEE International Conference on Cluster Computing (CLUSTER), 2020

In this work, we introduce MonSTer, an “out-of-the-box” monitoring tool for high-performance computing platforms. MonSTer uses the evolving Redfish specification to retrieve sensor data from the Baseboard Management Controller (BMC), and resource management tools such as Univa Grid Engine (UGE) or Slurm to obtain application information and resource usage data. It also uses a time-series database (e.g., InfluxDB) for data storage.

Recommended citation: Li, Jie, Ghazanfar Ali, Ngan V. T. Nguyen, Jon Hass, Alan Sill, Tommy Dang and Yong Chen. “MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems.” 2020 IEEE International Conference on Cluster Computing (CLUSTER) (2020): 119-129. https://artlands.github.io/files/li-cluster-2020.pdf
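To make the BMC-to-database path more concrete, the sketch below converts a payload shaped like a Redfish Thermal resource (a `Temperatures` array with `Name` and `ReadingCelsius` members) into InfluxDB line-protocol strings. The measurement name `temperature`, the tag and field names, the node name, and the function names are assumptions for illustration; MonSTer's actual storage schema is not described here, and no network access is performed.

```python
def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def thermal_to_lines(node, payload, timestamp_ns):
    """Turn a Redfish-Thermal-shaped dict into line-protocol strings.

    Sensors without a reading (ReadingCelsius is null or missing) are skipped.
    """
    lines = []
    for sensor in payload.get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        if reading is None:
            continue
        lines.append(
            "temperature,node={},sensor={} celsius={} {}".format(
                escape_tag(node),
                escape_tag(sensor.get("Name", "unknown")),
                float(reading),
                timestamp_ns,
            )
        )
    return lines


if __name__ == "__main__":
    # Hypothetical payload; a real one would come from the BMC's Redfish service.
    sample = {
        "Temperatures": [
            {"Name": "CPU1 Temp", "ReadingCelsius": 54},
            {"Name": "Inlet Temp", "ReadingCelsius": None},
        ]
    }
    for line in thermal_to_lines("cpu-1-1", sample, 1600000000000000000):
        print(line)
```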

HAM: Hotspot-Aware Manager for Improving Communications with 3D-Stacked Memory

Published in IEEE Transactions on Computers, 2021

In this article, we propose a novel Hotspot-Aware Manager (HAM) infrastructure for 3D-stacked memory devices capable of optimizing memory access streams via request aggregation, hotspot detection, and in-memory prefetching. We present the HAM design and implementation, and simulate it on a system using RISC-V embedded cores with attached HMC devices. We extensively evaluate HAM with over 12 benchmarks and applications representing diverse irregular memory access patterns. The results show that, on average, HAM reduces redundant requests by 37.51 percent and increases the prefetch buffer hit rate by 4.2 times, compared to a baseline streaming prefetcher. On the selected benchmark set, HAM provides performance gains of 21.81 percent on average (up to 34.28 percent), and power savings of 35.07 percent over a standard 3D-stacked memory.

Recommended citation: Wang, Xi, Antonino Tumeo, John D. Leidel, Jie Li, and Yong Chen. "HAM: Hotspot-Aware Manager for Improving Communications with 3D-Stacked Memory." IEEE Transactions on Computers 70, no. 6 (2021): 833-848. https://artlands.github.io/files/wang-tc-2021.pdf

talks

Monitoring Power Usage of Jobs Running on Quanah Cluster

Published:

Advanced power measurement capabilities are becoming available on large-scale high performance computing (HPC) deployments. Measurement of power/energy usage and its variation during real workloads will enable us to evaluate the potential benefits of incorporating power data into job scheduling and resource management decisions. There are several existing approaches to providing power measurements today, primarily through in-band and out-of-band measurements. In this talk, we will discuss several power profiling techniques on modern HPC platforms and give a demo of our current implementation for monitoring the power usage of jobs running on the Quanah cluster. While this is still work in progress, we present the current state of our research in order to show what we are trying to learn, what we are analyzing and how we are analyzing it, and what else we need to accomplish to make further progress.

PIMS-A Lightweight Processing-In-Memory Accelerator for Stencil Computations

Published:

Stencil computation is a classic computational kernel present in many high-performance scientific applications, such as image processing and partial differential equation (PDE) solvers. A stencil computation sweeps over a multi-dimensional grid and repeatedly updates values associated with points using the values from neighboring points. Stencil computations often employ large datasets that exceed cache capacity, leading to excessive accesses to the memory subsystem. As such, 3D stencil computations on large grid sizes are memory-bound. In this paper we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation using three different grid sizes with six categories of orders indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.
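The sweep-and-update pattern described above can be sketched with a small 2D example. This is a plain 5-point Jacobi stencil in pure Python, chosen only because it is the simplest instance of the pattern; it is not the PIMS design, and the function names are illustrative.

```python
def jacobi_step(grid):
    """Apply one 5-point stencil update to every interior point of a 2D grid.

    Boundary values are copied unchanged; each interior point becomes the
    average of its four neighbors from the previous grid.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


def sweep(grid, iterations):
    """Repeat the stencil update for a fixed number of iterations."""
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid


if __name__ == "__main__":
    hot_edges = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    print(sweep(hot_edges, 1))
```

Note that every interior value is read by each of its four neighbors' updates, so once the grid no longer fits in cache the same data is fetched from memory repeatedly.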

Profiling Power Consumption of Jobs with SLURM

Published:

Power, temperature and performance have all become first-order design constraints for High Performance Computing platforms. These three features influence each other, and affect the architecture and resource scheduling designs. In our previous research, we proposed a methodology for profiling power consumption at the job level based on the assumption that power usage is proportional to CPU usage. However, this methodology lacks accuracy and cannot be applied to applications that mainly stress the network. On the other hand, the resource and job management system (e.g., SLURM) has knowledge of both the underlying resources and the jobs' needs, and it is the best candidate for monitoring and controlling the power consumption of jobs. In this talk, we will discuss the SLURM architecture and a framework proposed by Yiannis et al. and built upon SLURM, which allows energy accounting per job with power profiling capabilities, along with parameters for energy control features based on static frequency scaling of the CPUs.
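The proportionality assumption mentioned above can be written down in a few lines. The sketch below splits a node's measured power among the jobs on it according to their share of CPU usage; the function name and the input format are assumptions for illustration, not the earlier methodology's actual implementation.

```python
def apportion_power(node_power_watts, job_cpu_usage):
    """Split node power among jobs in proportion to their CPU usage.

    job_cpu_usage maps a job ID to its CPU usage in any consistent unit.
    """
    total = sum(job_cpu_usage.values())
    if total == 0:
        # No CPU activity: this simple model attributes no power to any job.
        return {job: 0.0 for job in job_cpu_usage}
    return {
        job: node_power_watts * usage / total
        for job, usage in job_cpu_usage.items()
    }


if __name__ == "__main__":
    print(apportion_power(300.0, {"job-a": 75.0, "job-b": 25.0}))
```

Because the estimate depends only on CPU usage, a job that mostly exercises the network is attributed little power under this model, which is the limitation the talk points out.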

Collecting and Storing Telemetry Metrics from RedRaider Cluster

Published:

With the deployment of the RedRaider Cluster and the new telemetry model adopted by iDRAC9, collecting and storing telemetry metrics faces new challenges. These challenges come from three aspects. First, the RedRaider Cluster (including the Quanah, Nocona and Matador partitions) has an ever-larger number of nodes to monitor, 728 in total. Second, iDRAC9 provides more than 180 metrics, far more than iDRAC8. Third, iDRAC9 implements a new telemetry model, the push model, in its telemetry reports, which requires a different approach to obtaining the telemetry metrics. In this talk, I will summarize the approaches we’ve explored to address these challenges and present preliminary results from our tests and experiments.

Experiences of Storing and Querying Monitoring Data of Large-scale High Performance Computing Platforms

Published:

Monitoring high performance computing (HPC) platforms has been very useful in determining system behaviors, failure trends, and detecting certain types of hardware that cause cluster instability. It is critical to the successful operation of a facility. As we move to exascale systems, system monitoring for large-scale HPC platforms becomes more challenging as the scale and complexity of the platforms increase. Vendors or administrators may have their own customized collection tools; however, they all face the same challenge: dealing with an ever-growing number of data points. In this research, we will dive into the time series databases (TSDBs) that are popular in DevOps monitoring and real-time data analysis. We will also share our experience and efforts in improving the performance of monitoring the Quanah cluster at the Texas Tech High Performance Computing Center.

Monitoring Operating System Status (on a Raspberry Pi cluster)

Published:

Monitoring high performance computing (HPC) platforms has been very useful in determining system behaviors, failure trends, and detecting certain types of hardware that cause cluster instability. In our previous studies, we explored the collection of data from Baseboard Management Controllers (BMCs) and from resource managers (e.g., UGE, Slurm) via dedicated APIs. However, these data are not sufficient for evaluating and inspecting the systems. In this study, we explore the collection of data from the operating system side, present a monitoring tool called “Glances”, and integrate it into our monitoring toolset. Furthermore, to check the usability of “Glances” and to test the portability of our monitoring toolset, we build a Raspberry Pi cluster for the experiment. A live demo will be given during the talk.

MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems

Published:

Understanding the status of high-performance computing platforms and correlating applications to resource usage provide insight into the interactions among platform components. Much effort has been devoted to developing monitoring solutions; however, a large-scale HPC system usually requires a combination of methods and tools to successfully monitor all metrics, which leads to a huge effort in configuration and monitoring. Besides, monitoring tools are often left behind in the procurement of large-scale HPC systems. These challenges have motivated the development of a next-generation out-of-the-box monitoring tool that can be easily deployed without losing informative metrics.

The IEEE Cluster2020 Experience, MonSTer Review and Future Work

Published:

MonSTer is an out-of-the-box monitoring tool for high performance computing systems that has been in development for over a year. After several rounds of iterations and optimizations, we have achieved up to 25x performance improvements over the initial implementation, allowing for near real-time acquisition and visualization of monitoring data. This research study has been published at the IEEE Cluster 2020 conference and was presented to the community last week. In this talk, I will share some of my experience from attending the conference and review the work we have done. In addition, I would like to discuss several research directions in the context of “Integrated visualizing, monitoring, and managing HPC systems”.

Predicting Abnormal Workloads in HPC Systems

Published:

Abnormal workloads in High-Performance Computing (HPC) systems can be defined as workloads that exit abnormally and are cancelled by the user or terminated by the job scheduler or operating system due to software- or hardware-related problems. These anomalous workloads, especially those that consume significant computational resources in time and space, affect the efficiency of HPC systems and thus limit the amount of scientific work that can be achieved. As we approach the exaFLOP performance goal, anomaly detection of workloads will become increasingly important as system scale and complexity increase. However, predicting anomalous workloads is a non-trivial task. There is no publicly available, labeled dataset of abnormal workloads, nor are there publicly accepted methodologies for predicting workload anomalies. Furthermore, even though HPC monitoring metrics record the performance state of the system, analyzing and extracting insightful information from these massive metrics can be daunting. In this study, we analyze job accounting data collected from a production HPC cluster and use these data to train machine learning models to predict abnormal workloads. Experimental results show that our prediction model can achieve 97.0% precision and reduce CPU time, integrated memory usage, and IO by up to 24.18%, 27.29%, and 60.37%, respectively.

Detecting and Identifying Applications by Job Signatures

Published:

As HPC systems enter the exaFLOP era, their scale and complexity have increased significantly over the past few years. Administrators need to understand not only how the hardware system is performing, but also the typical applications that use the system. In addition, resource contention and energy consumption increase with computation capability. HPC administrators and researchers need to understand the characteristics of running applications and design better resource-aware scheduling policies to improve system efficiency. Moreover, unauthorized applications, such as bitcoin mining programs, could take advantage of the high computing capability, consuming computing hours that are supposed to be used for scientific discoveries. Therefore, knowing which applications are running will help administrators ban such malware in a proactive manner. To address these challenges, it is necessary to detect applications and develop management strategies based on knowledge of the applications. However, this is a non-trivial task if users do not specify the name of the application in their job submission scripts. In this research, we propose approaches to detect and identify applications through job signatures that are built from the monitoring metrics. Specifically, we exploit monitoring metrics collected from LDMS to build job signatures using two approaches: extracting statistical features from time-series data and representing multi-dimensional time-series data as images. Then, we explore several classification algorithms and evaluate their performance in classifying job signatures.
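The first approach above, building a job signature from statistical features, might look roughly like the following sketch. The particular features (mean, standard deviation, minimum, maximum, median), the metric names, and the function names are assumptions for illustration; the abstract does not list the features actually used.

```python
import statistics

FEATURES = ("mean", "std", "min", "max", "median")  # assumed feature set


def extract_features(series):
    """Summarize one metric's time series with a few descriptive statistics."""
    return {
        "mean": statistics.fmean(series),
        "std": statistics.pstdev(series),
        "min": min(series),
        "max": max(series),
        "median": statistics.median(series),
    }


def job_signature(metrics):
    """Flatten per-metric features into one vector, ordered by metric name.

    metrics maps a metric name to its list of samples for a single job.
    """
    signature = []
    for name in sorted(metrics):
        feats = extract_features(metrics[name])
        signature.extend(feats[key] for key in FEATURES)
    return signature


if __name__ == "__main__":
    # Hypothetical samples for one job.
    print(job_signature({"cpu": [1, 2, 3], "mem": [4, 4, 4]}))
```

A fixed-length vector like this can be fed to a standard classifier, with the metric ordering kept stable so that positions mean the same thing for every job.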

Advanced Visualization and Data Analysis of HPC Cluster and User Application Behavior

Published:

This work presents cutting-edge visualization, monitoring, and management solutions for HPC systems to understand the status of high-performance computing platforms and provide insight into the interactions among platform components. Benefiting from the greatly increased level of detail available from modern baseboard management controllers through Redfish Telemetry and real-time correlations via API and CLI interfaces to HPC job schedulers, this work provides much greater detail than previous similar projects.

Deep Learning and Monitoring Metrics to Image Encoding for Detecting Applications in HPC systems

Published:

Knowledge of the applications running in HPC systems not only allows designing better resource-aware scheduling policies to improve system efficiency, but also provides the opportunity to detect and prohibit malicious programs, such as bitcoin mining and password cracking applications. To this end, in our previous research, we proposed a method to detect and identify applications based on statistical features extracted from the resource consumption metrics. However, our previous experimental results showed that the model using the random forest algorithm with statistical features worked relatively well only on some specific applications; the overall accuracy was not promising. In this seminar, I will talk about our recent efforts to enhance the detection model. Specifically, we are exploring encoding time-series monitoring metrics into images and taking advantage of deep Convolutional Neural Networks (CNNs) to classify two-dimensional images. Using encoded time-series data and CNNs within a unified framework is expected to boost the performance of application detection and classification. Our results show that the proposed methodology achieves an average accuracy of 87%; for some specific applications, it achieves an accuracy of more than 95%.
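The abstract does not say which image encoding is used, so the sketch below shows only the simplest possible one as an assumption: each metric becomes one row of pixels, min-max scaled to the 0 to 255 grayscale range, and the rows are stacked into a 2D array. The function name and metric names are hypothetical, and all series are assumed to have the same length.

```python
def encode_as_image(metrics, names):
    """Encode equal-length metric series as rows of 0-255 grayscale pixels.

    metrics maps a metric name to its samples; names fixes the row order.
    A constant series has no range to scale, so it maps to an all-zero row.
    """
    image = []
    for name in names:
        series = metrics[name]
        low, high = min(series), max(series)
        span = high - low
        if span == 0:
            row = [0] * len(series)
        else:
            row = [round(255 * (value - low) / span) for value in series]
        image.append(row)
    return image


if __name__ == "__main__":
    # Hypothetical samples for one job.
    job = {"cpu": [0, 20, 100], "mem": [2, 2, 2]}
    for row in encode_as_image(job, ["cpu", "mem"]):
        print(row)
```

The resulting 2D array has the shape a CNN expects for a single-channel image, with time along one axis and metrics along the other.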

Lightweight Checkpointing of Loop-Based Kernels Using Disaggregated Memory

Published:

Checkpoint/Restart (C/R) is commonly used to provide fault tolerance in HPC systems, where the increasing number of hardware components and the complexity of the software stack increase overall fault rates. Meanwhile, the ever-growing power consumption of HPC systems has fostered the exploration of using renewable energy to power the systems. Due to the variable nature of renewable energy sources, such as wind and solar, jobs running on HPC systems need to be suspended or migrated to fit within the power availability. C/R, in such cases, is also indispensable. Typically, C/R techniques periodically save the status of running processes to checkpoint files stored on a parallel file system. However, this process is expensive; a single checkpoint can take on the order of tens of minutes, rendering checkpointing less practical. We propose a lightweight checkpointing technique that uses disaggregated memory as a checkpoint target for loop-based kernels. Disaggregated memory is a concept similar to disaggregated storage; it provides a shared memory pool for compute nodes. With xBGAS-enabled remote memory accessibility, we hope to checkpoint the necessary data objects in remote memory with minimum overhead and to be able to restart the application from any available compute node. In this seminar talk, we will summarize current Checkpoint/Restart techniques, present the concept of disaggregated memory, and discuss the proposed C/R technique, which we name CRaaS (C/R as a Service). Finally, we will use a loop-based kernel to illustrate the design of the CRaaS APIs.
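The general idea of checkpointing a loop-based kernel to a shared memory pool can be sketched as follows. Everything here is hypothetical: `RemoteMemoryPool` is a plain in-process dictionary standing in for disaggregated memory, and `run_kernel` and its parameters are invented for illustration. None of this is the CRaaS API or xBGAS, which the abstract does not specify.

```python
import copy


class RemoteMemoryPool:
    """Stand-in for a disaggregated memory pool (hypothetical; a dict here)."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = copy.deepcopy(value)  # snapshot, not a reference

    def get(self, key):
        return copy.deepcopy(self._store.get(key))


def run_kernel(data, iterations, pool, key="job-0", interval=2, fail_at=None):
    """Run a loop kernel, checkpointing the loop index and data every `interval`.

    If a checkpoint exists under `key`, execution resumes from it.
    `fail_at` simulates a failure just before the given iteration.
    """
    state = pool.get(key)
    if state is None:
        state = {"next_iter": 0, "data": list(data)}
    for i in range(state["next_iter"], iterations):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["data"] = [x + 1 for x in state["data"]]  # the loop body
        state["next_iter"] = i + 1
        if state["next_iter"] % interval == 0:
            pool.put(key, state)
    return state["data"]


if __name__ == "__main__":
    pool = RemoteMemoryPool()
    try:
        run_kernel([0, 0], 5, pool, fail_at=3)
    except RuntimeError:
        pass  # a restart on any node with access to the pool picks up below
    print(run_kernel([0, 0], 5, pool))
```

Only the loop index and the live data objects are saved, which is what keeps this style of checkpoint small compared with saving whole process images to a parallel file system.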

SST and Cycle-accurate Simulation of xBGAS

Published:

As the HPC community enters the exaFLOP regime, designing and building these computers is becoming increasingly difficult. Besides the traditional challenges of performance and scaling, system architects must also consider challenges in power consumption, cost, programmability, etc. Overcoming these challenges requires a holistic approach that not only makes changes to the subcomponents, such as memory and processor, but also concurrently changes the programming model and applications, i.e., the hardware-software co-design approach. However, it is usually impractical to construct hardware prototypes to explore the vast design space. Simulators, in this scenario, become indispensable tools for guiding design decisions. In this talk, I will present the Structural Simulation Toolkit (SST), an open, modular, parallel, multi-scale simulation framework for HPC architectural exploration. More specifically, I will give an overview of SST and explain how it works. I will also give demos of simulating a multi-level memory system. The goal of exploring SST is to implement a cycle-accurate simulation framework for xBGAS, which has been a major gap in the previous xBGAS work.

teaching
