Jie Li's CV

Jie Li

Postdoctoral Researcher, Department of Computer Science, Texas Tech University

Email: jie[dot]li[at]ttu[dot]edu

Homepage: https://lijie.me

EDUCATION

Doctor of Philosophy, Computer Science, Texas Tech University, Lubbock, Texas May 2024

Dissertation: Optimizing High-Performance Computing Systems: Insights from System Monitoring, Workload Management, and Scheduling Strategies

Master of Science, Computer Science, Texas Tech University, Lubbock, Texas 2019

Thesis: PIMS: A Lightweight Processing-in-Memory Accelerator for Stencil Computations

RESEARCH INTERESTS

My research focuses on high-performance computing (HPC), parallel and distributed computing, and computer architecture. The goal of my research is to develop efficient and scalable computing systems that power AI, big data analytics, and scientific computing. My specific interests include resource management and scheduling in HPC systems, with a focus on disaggregated memory architectures, system-level performance monitoring, operational data analysis, machine learning-based system management, and hardware/software co-design for next-generation HPC platforms.

PROFESSIONAL EXPERIENCE

Research Assistant 2019 – 2024

Data-Intensive Scalable Computing Laboratory (DISCL), TTU, Lubbock, Texas

Research and Publication: Conducted innovative research in High-Performance Computing, Computer Architecture, and Parallel and Distributed Computing. Authored and published research papers in reputable academic conferences and journals.
Professional Development and Networking: Actively engaged in the academic community by attending conferences, workshops, and seminars. Presented research papers and posters at these events.
Mentorship and Education: Mentored both graduate and undergraduate students in their independent research studies. Provided guidance on research topics, project development, and data analysis.
Software Development and Collaboration: Played an integral role in developing and maintaining research software and tools. Wrote, tested, and documented code for various projects. Contributed to open-source software initiatives, fostering collaborative innovation.
Server Administration: Managed two high-end servers (Hugo and Alita) hosted at the High-Performance Computing Center. Oversaw server configuration, maintenance, and software management. Ensured consistent server availability and reliability while troubleshooting issues as they arose.

Graduate Student Intern 2021 – 2023 [Summer]

Lawrence Berkeley National Laboratory (LBNL), Berkeley, California

Data Integration and Analysis: Integrated HPC monitoring data from diverse sources (LDMS, DCGM, Slurm, VictoriaMetrics) for comprehensive analysis of system-wide architectural efficiency, including CPU, GPU, DRAM, and HBM2 resource utilization. Identified critical trends and patterns within the data to drive insights into system performance, with a focus on NERSC’s Cori and Perlmutter.
Machine Learning Expertise: Conducted in-depth statistical analysis of job-level monitoring data. Applied a variety of machine learning models, including SVC, LinearSVC, Decision Tree, and Random Forests, to analyze jobs based on time-series features.
Innovative Data Processing: Pioneered an approach that encodes time-series monitoring data as images and trains a Convolutional Neural Network (CNN) to classify and predict the applications run by jobs with high accuracy.
Simulation and System Design: Designed and implemented a discrete event simulator to study resource management and job scheduling in HPC systems, with a specific focus on systems with disaggregated memory resources.

Graduate Student Programmer 2018 – 2019

Teaching, Learning and Professional Development Center (TLPDC), TTU, Lubbock, Texas

Website Maintenance and Communication: Maintained and updated TLPDC web pages, ensuring a fresh and relevant online presence. Facilitated communication with software application providers to meet product requirements efficiently.
Database Management and Security: Managed the MySQL database with precision, safeguarding valuable data assets. Implemented robust backup strategies to protect against data loss. Proactively addressed and resolved database access issues to maintain uninterrupted operations.

SELECTED PROJECTS

Scheduling and Allocation of Disaggregated Memory Resources in HPC Systems 2023

Designed and implemented a discrete event simulator based on SimPy to study resource management and job scheduling in HPC systems with disaggregated memory resources. The simulator is customizable to various system configurations and scheduling policies.
Devised a performance degradation model based on prior studies to estimate job runtimes when accessing remote memory resources.
Proposed the innovative FM job scheduling policy, tailored for disaggregated memory systems, yielding superior system throughput and bounded slowdown compared to state-of-the-art policies.
Simulated the FM scheduler in a system with one-fourth of the original local memory. The experimental results show that it boosts average memory utilization from 27.97% to 79.24%, with only a marginal 5.52% reduction in average job performance.

Monitoring Data Management and Query Performance Optimization 2021

Investigated and identified performance bottlenecks in InfluxDB. Optimized the database schema design, resulting in a remarkable 71.98% reduction in data volume and a significant 1.76X boost in query performance, enhancing data management efficiency.
Designed and implemented a time series deduplication mechanism. It achieved an impressive average data volume reduction of 70.38% and maintained data accuracy with a minimal error of only 0.74% in reconstruction.
Designed and developed MetricsBuilder, a data access accelerator. MetricsBuilder dramatically improved query performance by up to 25X and reduced data transmission volume by 95% compared to traditional SQL queries, streamlining data retrieval.
Implemented an API using the OpenAPI specification. The API provided efficient data access services to data analysis consumers, including JavaScript data visualization applications and Grafana, ensuring seamless access to valuable insights.

High-Performance Computing System Health Monitoring & Performance Data Collection 2020

Explored mechanisms to acquire health status monitoring data from an HPC cluster via the Integrated Dell Remote Access Controller (iDRAC), enhancing cluster management and efficiency.
Spearheaded the development of a suite of tools for automating iDRAC telemetry report configuration, metric analysis, and TimescaleDB table initialization. Efficiently handled diverse data sources and types, streamlining data processing and analysis.
Designed and implemented a robust system monitoring infrastructure capable of asynchronous collection of health status data through the Redfish API and job accounting data via the Slurm REST API.
The Slurm data collection code has been adopted and merged into Dell’s Omnia project for broader industry utilization (GitHub link: Omnia).

PEER-REVIEWED PUBLICATIONS

[1]

X. Wang, A. Tumeo, J. D. Leidel, Jie Li, and Y. Chen, “MAC: Memory access coalescer for 3D-stacked memory,” in Proceedings of the 48th international conference on parallel processing (ICPP’19), 2019, pp. 1–10. doi: https://doi.org/10.1145/3337821.3337867.

[2]

Jie Li, X. Wang, A. Tumeo, B. Williams, J. D. Leidel, and Y. Chen, “PIMS: A lightweight processing-in-memory accelerator for stencil computations,” in Proceedings of the international symposium on memory systems (MemSys’19), 2019, pp. 41–52. doi: https://doi.org/10.1145/3357526.3357550.

[3]

V. Pham, N. Nguyen, Jie Li, J. Hass, Y. Chen, and T. Dang, “MTSAD: Multivariate time series abnormality detection and visualization,” in 2019 IEEE international conference on big data (BigData’19), IEEE, 2019, pp. 3267–3276. doi: https://doi.org/10.1109/BigData47090.2019.9006559.

[4]

N. Nguyen, J. Hass, Y. Chen, Jie Li, A. Sill, and T. Dang, “RadarViewer: Visualizing the dynamics of multivariate data,” in Practice and experience in advanced research computing (PEARC’20), 2020, pp. 555–556. doi: https://doi.org/10.1145/3311790.3404538.

[5]

Jie Li et al., “MonSTer: An out-of-the-box monitoring tool for high performance computing systems,” in 2020 IEEE international conference on cluster computing (CLUSTER’20), IEEE, 2020, pp. 119–129. doi: https://doi.org/10.1109/CLUSTER49012.2020.00022.

[6]

X. Wang, A. Tumeo, J. D. Leidel, Jie Li, and Y. Chen, “HAM: Hotspot-aware manager for improving communications with 3D-stacked memory,” IEEE Transactions on Computers (IEEE Trans Comput), vol. 70, no. 6, pp. 833–848, 2021, doi: https://doi.org/10.1109/TC.2021.3066982.

[7]

T. Dang, N. Nguyen, J. Hass, Jie Li, Y. Chen, and A. Sill, “The gap between visualization research and visualization software in high-performance computing center,” The Gap between Visualization Research and Visualization Software (VisGap’21), 2021, doi: https://doi.org/10.2312/visgap.20211089.

[8]

T. Dang, N. V. Nguyen, Jie Li, A. Sill, J. Hass, and Y. Chen, “JobViewer: Graph-based visualization for monitoring high-performance computing system,” in 2022 IEEE/ACM international conference on big data computing, applications and technologies (BDCAT’22), IEEE, 2022, pp. 110–119. doi: https://doi.org/10.1109/BDCAT56447.2022.00021.

[9]

Jie Li, B. Cook, and Y. Chen, “ARcode: HPC application recognition through image-encoded monitoring data,” arXiv preprint arXiv:2301.08612, 2023, doi: https://doi.org/10.48550/arXiv.2301.08612.

[10]

Jie Li, G. Michelogiannakis, B. Cook, D. Cooray, and Y. Chen, “Analyzing resource utilization in an HPC system: A case study of NERSC’s Perlmutter,” in International conference on high performance computing (ISC’23), Springer, 2023, pp. 297–316. doi: https://doi.org/10.1007/978-3-031-32041-5_16.

[11]

Jie Li, R. Wang, G. Ali, T. Dang, A. Sill, and Y. Chen, “Workload failure prediction for data centers,” in 2023 IEEE 16th international conference on cloud computing (CLOUD’23), 2023, pp. 479–485. doi: https://doi.org/10.1109/CLOUD60044.2023.00064.

[12]

C. E. Caon, Jie Li, and Y. Chen, “Effective management of time series data,” in 2023 IEEE 16th international conference on cloud computing (CLOUD’23), 2023, pp. 408–414. doi: https://doi.org/10.1109/CLOUD60044.2023.00055.

[13]

T. Dang, N. V. Nguyen, Jie Li, A. Sill, and Y. Chen, “Spiro: Order-preserving visualization in high performance computing monitoring,” in International symposium on visual computing, Springer, 2023, pp. 109–120.

[14]

Jie Li, J. D. Leidel, B. Page, and Y. Chen, “Towards cycle-accurate simulation of xBGAS,” in 2024 international conference on computing, networking and communications (ICNC), IEEE, 2024, pp. 468–472.

[15]

Jie Li et al., “Job scheduling in high performance computing systems with disaggregated memory resources,” in 2024 IEEE international conference on cluster computing (CLUSTER’24), IEEE, 2024, pp. 297–309.

PRESENTATIONS

Conference Presentations

Workload Failure Prediction for Data Centers, CLOUD’23 July 2023
A Holistic View of Resource Utilization on Perlmutter (Poster), SC’22 Nov. 2022
Advanced Visualization and Data Analysis of HPC Cluster and User Application Behavior, SC’21 Nov. 2021
MonSTer: An Out-of-the-Box Monitoring Tool for HPC Systems, CLUSTER’20 Sept. 2020
PIMS: A Lightweight Processing-In-Memory Accelerator for Stencil Computations, MemSys’19 Oct. 2019

Research Seminar Talks

Towards Cycle-Accurate Simulation for xBGAS Apr. 2023
A Holistic View of Resource Utilization on Perlmutter Aug. 2022
SST and Cycle-accurate Simulation of xBGAS May 2022
Lightweight Checkpointing of Loop-Based Kernels Using Disaggregated Memory Feb. 2022
DL and Monitoring Metrics to Image Encoding for Detecting Applications in HPC systems Nov. 2021
Detecting and Identifying Applications by Job Signatures Sept. 2021
Predicting Abnormal Workloads in HPC Systems May 2021
The IEEE Cluster2020 Experience, MonSTer Review and Future Work Sept. 2020
Monitoring Operating System Status on a Raspberry Pi cluster July 2020
Experiences of Storing and Querying Monitoring Data of Large-scale HPC Platforms Apr. 2020
Collecting and Storing Telemetry Metrics from RedRaider Cluster Apr. 2020
Profiling Power Consumption of Jobs with SLURM Jan. 2020
Monitoring Power Usage of Jobs Running on Quanah Cluster Sept. 2019

MENTORING EXPERIENCE

Undergraduate Students (including REU participants)

Mentoring Yusheng Han and Zachary Kay on the topic “Running HPC Applications on the RedRaider Cluster and Analyzing Performance Behaviors”. Independent Study (CS4000) Spring 2022
Mentoring Casey Root on the topic “Monitoring Queue Status via SLURM REST API”. Independent Study (CS4000) Spring 2021

Graduate Students

Mentoring Cristiano Caon on the topic “Investigating the Data Volume Reduction and Query Optimization in Time Series Databases”. Outcomes include a conference publication in CLOUD’23. Independent Study (CS7000) and Master’s Thesis Fall 2022
Mentoring Aniruddh Sanjaysinh Chavda and Huyen Nguyen on the topic “Usage Behavior Analysis with Clustering Job Accounting Data”. Advanced Operating System (CS5379) Spring 2021
Mentoring Ruonan Wu on the topic “Job Accounting Data Analysis for Quanah Cluster”. Advanced Operating System (CS5379) Spring 2021
Mentoring Ashritha Puradamane Balachandra on the topic “Improving Query Performance of InfluxDB”. Advanced Operating System (CS5379) Spring 2020

SERVICES

Paper Reviewer

The Journal of Supercomputing
IEEE International Parallel and Distributed Processing Symposium (IPDPS’23)
IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid’22)
IEEE International Conference on Distributed Computing Systems (ICDCS’22)
The International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC’22)
International Parallel Data Systems Workshop (PDSW’22)
IEEE International Conference on Big Data (BigData’20, BigData’21, BigData’22)
IEEE International Conference on Smart Data Services (SMDS’20)

Volunteer

Student volunteer at SC’21, St. Louis, Missouri 2021
Student volunteer at SC’19, Denver, Colorado 2019
Volunteer at Paul’s Project, Grace Campus, Lubbock, Texas 2019

HONORS AND AWARDS

Best Poster Award, NSF Cloud and Autonomic Computing Industry Advisory Board Conference 2022
Summer Thesis/Dissertation Research Award ($2300), Lubbock, Texas 2019

Last Update: Jan 13, 2025