Jie Li
Research Assistant, Department of Computer Science, Texas Tech University
Email: jie[dot]li[at]ttu[dot]edu
Homepage: https://lijie.me
EDUCATION
Doctor of Philosophy, Computer Science, Texas Tech University, Lubbock, Texas
- Dissertation: Data Collection, Management and Automation of High-performance Computing Systems
Master of Science, Computer Science, Texas Tech University, Lubbock, Texas
- Thesis: PIMS: A Lightweight Processing-in-Memory Accelerator for Stencil Computations
- Summer Thesis/Dissertation Research Award ($2300)
RESEARCH INTERESTS
My research interests lie in the field of High-Performance Computing (HPC), encompassing HPC systems monitoring, automation, and management, operational data analytics, job scheduling, and system architecture. I also have a keen interest in parallel and distributed computing and computer architecture.
PROFESSIONAL EXPERIENCE
Research Assistant
- Research and Publication: Conducted innovative research in High-Performance Computing, Computer Architecture, and Parallel and Distributed Computing. Authored and published research papers in reputable academic conferences and journals.
- Professional Development and Networking: Actively engaged in the academic community by attending conferences, workshops, and seminars. Presented research papers and posters at these events.
- Mentorship and Education: Mentored both graduate and undergraduate students in their independent research studies. Provided guidance on research topics, project development, and data analysis.
- Software Development and Collaboration: Played an integral role in developing and maintaining research software and tools. Wrote, tested, and documented code for various projects. Contributed to open-source software initiatives, fostering collaborative innovation.
- Server Administration: Managed two high-end servers (Hugo and Alita) hosted at the High-Performance Computing Center. Oversaw server configuration, maintenance, and software management. Ensured consistent server availability and reliability while troubleshooting issues as they arose.
Graduate Student Intern
- Data Integration and Analysis: Integrated HPC monitoring data from diverse sources (LDMS, DCGM, Slurm, VictoriaMetrics) for comprehensive analysis of system-wide architectural efficiency, including CPU, GPU, DRAM, and HBM2 resource utilization. Identified critical trends and patterns within the data to drive insights into system performance, with a focus on NERSC’s Cori and Perlmutter.
- Machine Learning Expertise: Conducted in-depth statistical analysis of job-level monitoring data. Applied a variety of machine learning models, including SVC, LinearSVC, Decision Tree, and Random Forests, to analyze jobs based on time-series features.
- Innovative Data Processing: Pioneered a novel approach that encodes time-series monitoring data as images and trains a Convolutional Neural Network (CNN) to classify and predict job applications with high accuracy.
- Simulation and System Design: Designed and implemented a discrete event simulator to study resource management and job scheduling in HPC systems, with a specific focus on systems with disaggregated memory resources.
Graduate Student Programmer
- Website Maintenance and Communication: Maintained and updated TLPDC web pages, ensuring a fresh and relevant online presence. Facilitated communication with software application providers to meet product requirements efficiently.
- Database Management and Security: Managed the MySQL database with precision, safeguarding valuable data assets. Implemented robust backup strategies to protect against data loss. Proactively addressed and resolved database access issues to maintain uninterrupted operations.
SELECTED PROJECTS
Scheduling and Allocation of Disaggregated Memory Resources in HPC Systems
- Designed and implemented a discrete event simulator based on Simpy to study resource management and job scheduling in HPC systems with disaggregated memory resources. Customizable to various system configurations and scheduling policies.
- Devised a performance degradation model based on prior studies to estimate job runtimes when accessing remote memory resources.
- Proposed the innovative FM job scheduling policy, tailored for disaggregated memory systems, yielding superior system throughput and bounded slowdown compared to state-of-the-art policies.
- Simulated the FM scheduler in a system with one-fourth of the original local memory. The experimental results showed that it boosts average memory utilization from 27.97% to 79.24%, with only a 5.52% reduction in average job performance.
Monitoring Data Management and Query Performance Optimization
- Investigated and identified performance bottlenecks in InfluxDB. Optimized the database schema design, reducing data volume by 71.98% and improving query performance by 1.76X.
- Designed and implemented a time series deduplication mechanism. It reduced data volume by 70.38% on average while keeping the reconstruction error to 0.74%.
- Designed and developed MetricsBuilder, a data access accelerator. MetricsBuilder improved query performance by up to 25X and reduced data transmission volume by 95% compared to traditional SQL queries.
- Implemented an API using the OpenAPI specification. The API provided efficient data access services to data analysis consumers, including JavaScript data visualization applications and Grafana, ensuring seamless access to valuable insights.
High-Performance Computing System Health Monitoring & Performance Data Collection
- Explored mechanisms to acquire health status monitoring data from an HPC cluster via the Integrated Dell Remote Access Controller (iDRAC), enhancing cluster management and efficiency.
- Spearheaded the development of a suite of tools for automating iDRAC telemetry report configuration, metric analysis, and TimescaleDB table initialization. Efficiently handled diverse data sources and types, streamlining data processing and analysis.
- Designed and implemented a robust system monitoring infrastructure capable of asynchronous collection of health status data through the Redfish API and job accounting data via the Slurm REST API.
- The Slurm data collection code has been adopted and merged into Dell’s Omnia project for broader industry utilization (GitHub link: Omnia).
PAPERS UNDER REVIEW/PREPRINTS
[3] J. Li et al., “Scheduling and Allocation of Disaggregated Memory Resources in HPC Systems,” submitted to 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2024).
[2] J. Li et al., “Towards Cycle-accurate Simulation of xBGAS,” submitted to 2024 International Conference on Computing, Networking and Communications (ICNC 2024).
[1] J. Li, B. Cook, and Y. Chen, “ARcode: HPC application recognition through image-encoded monitoring data,” arXiv preprint arXiv:2301.08612, 2023, doi: https://doi.org/10.48550/arXiv.2301.08612.
PEER-REVIEWED PUBLICATIONS
[1] X. Wang, A. Tumeo, J. D. Leidel, J. Li, and Y. Chen, “MAC: Memory access coalescer for 3D-stacked memory,” in Proceedings of the 48th international conference on parallel processing (ICPP’19), 2019, pp. 1–10. doi: https://doi.org/10.1145/3337821.3337867.
[2] J. Li, X. Wang, A. Tumeo, B. Williams, J. D. Leidel, and Y. Chen, “PIMS: A lightweight processing-in-memory accelerator for stencil computations,” in Proceedings of the international symposium on memory systems (MemSys’19), 2019, pp. 41–52. doi: https://doi.org/10.1145/3357526.3357550.
[3] V. Pham, N. Nguyen, J. Li, J. Hass, Y. Chen, and T. Dang, “Mtsad: Multivariate time series abnormality detection and visualization,” in 2019 IEEE international conference on big data (BigData’19), IEEE, 2019, pp. 3267–3276. doi: https://doi.org/10.1109/BigData47090.2019.9006559.
[4] N. Nguyen, J. Hass, Y. Chen, J. Li, A. Sill, and T. Dang, “Radarviewer: Visualizing the dynamics of multivariate data,” in Practice and experience in advanced research computing (PEARC’20), 2020, pp. 555–556. doi: https://doi.org/10.1145/3311790.3404538.
[5] J. Li et al., “MonSTer: An out-of-the-box monitoring tool for high performance computing systems,” in 2020 IEEE international conference on cluster computing (CLUSTER’20), IEEE, 2020, pp. 119–129. doi: https://doi.org/10.1109/CLUSTER49012.2020.00022.
[6] X. Wang, A. Tumeo, J. D. Leidel, J. Li, and Y. Chen, “HAM: Hotspot-aware manager for improving communications with 3D-stacked memory,” IEEE Transactions on Computers (IEEE Trans Comput), vol. 70, no. 6, pp. 833–848, 2021, doi: https://doi.org/10.1109/TC.2021.3066982.
[7] T. Dang, N. Nguyen, J. Hass, J. Li, Y. Chen, and A. Sill, “The gap between visualization research and visualization software in high-performance computing center,” The Gap between Visualization Research and Visualization Software (VisGap’21), 2021, doi: https://doi.org/10.2312/visgap.20211089.
[8] T. Dang, N. V. Nguyen, J. Li, A. Sill, J. Hass, and Y. Chen, “JobViewer: Graph-based visualization for monitoring high-performance computing system,” in 2022 IEEE/ACM international conference on big data computing, applications and technologies (BDCAT’22), IEEE, 2022, pp. 110–119. doi: https://doi.org/10.1109/BDCAT56447.2022.00021.
[9] J. Li, G. Michelogiannakis, B. Cook, D. Cooray, and Y. Chen, “Analyzing resource utilization in an HPC system: A case study of NERSC’s Perlmutter,” in International conference on high performance computing (ISC’23), Springer, 2023, pp. 297–316. doi: https://doi.org/10.1007/978-3-031-32041-5_16.
[10] J. Li, R. Wang, G. Ali, T. Dang, A. Sill, and Y. Chen, “Workload failure prediction for data centers,” in 2023 IEEE 16th international conference on cloud computing (CLOUD’23), 2023, pp. 479–485. doi: https://doi.org/10.1109/CLOUD60044.2023.00064.
[11] C. E. Caon, J. Li, and Y. Chen, “Effective management of time series data,” in 2023 IEEE 16th international conference on cloud computing (CLOUD’23), 2023, pp. 408–414. doi: https://doi.org/10.1109/CLOUD60044.2023.00055.
PRESENTATIONS
Conference Presentations
- Workload Failure Prediction for Data Centers, CLOUD’23
- A Holistic View of Resource Utilization on Perlmutter (Poster), SC’22
- Advanced Visualization and Data Analysis of HPC Cluster and User Application Behavior, SC’21
- MonSTer: An Out-of-the-Box Monitoring Tool for HPC Systems, CLUSTER’20
- PIMS: A Lightweight Processing-In-Memory Accelerator for Stencil Computations, MemSys’19
Research Seminar Talks
- Towards Cycle-Accurate Simulation for xBGAS
- A Holistic View of Resource Utilization on Perlmutter
- SST and Cycle-accurate Simulation of xBGAS
- Lightweight Checkpointing of Loop-Based Kernels Using Disaggregated Memory
- DL and Monitoring Metrics to Image Encoding for Detecting Applications in HPC systems
- Detecting and Identifying Applications by Job Signatures
- Predicting Abnormal Workloads in HPC Systems
- The IEEE Cluster2020 Experience, MonSTer Review and Future Work
- Monitoring Operating System Status on a Raspberry Pi cluster
- Experiences of Storing and Querying Monitoring Data of Large-scale HPC Platforms
- Collecting and Storing Telemetry Metrics from RedRaider Cluster
- Profiling Power Consumption of Jobs with SLURM
- Monitoring Power Usage of Jobs Running on Quanah Cluster
MENTORING EXPERIENCE
Undergraduate Students (including REU participants)
- Mentoring Yusheng Han and Zachary Kay on the topic “Running HPC Applications on the RedRaider Cluster and Analyzing Performance Behaviors”. Independent Study (CS4000)
- Mentoring Casey Root on the topic “Monitoring Queue Status via SLURM Rest API”. Independent Study (CS4000)
Graduate Students
- Mentoring Cristiano Caon on the topic “Investigating the Data Volume Reduction and Query Optimization in Time Series Databases”. Outcomes include a conference publication in CLOUD’23. Independent Study (CS7000) and Master’s Thesis
- Mentoring Aniruddh Sanjaysinh Chavda and Huyen Nguyen on the topic “Usage Behavior Analysis with Clustering Job Accounting Data”. Advanced Operating System (CS5379)
- Mentoring Ruonan Wu on the topic “Job Accounting Data Analysis for Quanah Cluster”. Advanced Operating System (CS5379)
- Mentoring Ashritha Puradamane Balachandra on the topic “Improving Query Performance of InfluxDB”. Advanced Operating System (CS5379)
SERVICES
Paper Reviewer
- The Journal of Supercomputing
- IEEE International Parallel and Distributed Processing Symposium (IPDPS’23)
- IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid’22)
- IEEE International Conference on Distributed Computing Systems (ICDCS’22)
- The International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC’22)
- International Parallel Data Systems Workshop (PDSW’22)
- IEEE International Conference on Big Data (BigData’20, BigData’21, BigData’22)
- IEEE International Conference on Smart Data Services (SMDS’20)
Volunteer
- Student volunteer of SC’21, St. Louis, Missouri
- Student volunteer of SC’19, Denver, Colorado
- Volunteer of Paul’s Project, Grace Campus, Lubbock, Texas
Last Update: November 4, 2023