10-13 March 2025
Sands Expo and Convention Centre
Marina Bay Sands, Singapore

Location: Room P5 – Peony Jr 4411-2 (Level 4)

Track Chair: Prof DK Panda

[Peer-Reviewed]

Programme:

Time | Session
01:30pm – 01:35pm | Opening, AI & HPC

– Prof Dhabaleswar K (DK) Panda, Professor & University Distinguished Scholar

01:35pm – 02:15pm | Improving the Efficiency of a Deep Reinforcement Learning-Based Power Management System for HPC Clusters Using Curriculum Learning

Powering down idle nodes in HPC systems can save energy, but improper shutdown timing may degrade the Quality of Service (QoS). We propose a Deep Reinforcement Learning (DRL) agent enhanced with Curriculum Learning (CL) to optimize node shutdown timing. Using Batsim-py, we compare various curriculum strategies, with the easy-to-hard approach achieving the best results: it consumes 3.73% less energy than the existing DRL agent and 4.66% less than the best timeout policy. Additionally, job waiting time is reduced by 9.24%. We also evaluate the model’s generality across diverse scenarios. These findings demonstrate the effectiveness of CL and DRL for HPC power management.

– Prof Muhammad Alfian Amrizal, Assistant Professor, Universitas Gadjah Mada
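
As a rough illustration of the easy-to-hard curriculum idea in the abstract above, the sketch below trains a toy tabular Q-learning shutdown policy on increasingly bursty synthetic workloads. The toy environment, reward values, and Q-learning agent are illustrative assumptions; the actual work uses Batsim-py simulations and a deep RL agent, which are not reproduced here.

```python
# Toy easy-to-hard curriculum for a node-shutdown policy (illustrative only).
import random
from collections import defaultdict

class ToyClusterEnv:
    """One idle node; each step, decide whether to keep it powered on or shut it down."""
    def __init__(self, arrival_prob, horizon=50):
        self.arrival_prob = arrival_prob   # higher arrival probability = harder workload
        self.horizon = horizon

    def reset(self):
        self.t, self.node_on = 0, True
        return int(self.node_on)

    def step(self, action):               # action: 1 = keep on, 0 = shut down
        self.node_on = bool(action)
        job_arrives = random.random() < self.arrival_prob
        reward = 0.0
        if self.node_on:
            reward -= 1.0                  # energy cost of an idle, powered-on node
        if job_arrives and not self.node_on:
            reward -= 5.0                  # QoS penalty: arriving job waits for a reboot
        self.t += 1
        return int(self.node_on), reward, self.t >= self.horizon

def train(q, arrival_prob, episodes=300, eps=0.1, alpha=0.1, gamma=0.95):
    env = ToyClusterEnv(arrival_prob)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice((0, 1)) if random.random() < eps \
                else max((0, 1), key=lambda act: q[(s, act)])
            s2, r, done = env.step(a)
            best_next = max(q[(s2, 0)], q[(s2, 1)])
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2

q_table = defaultdict(float)
for difficulty in (0.1, 0.3, 0.6):         # easy-to-hard: increasingly bursty job arrivals
    train(q_table, difficulty)
print({k: round(v, 2) for k, v in q_table.items()})
```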

02:15pm – 02:55pm | Anomaly Detection in Large-Scale Monitoring Systems using a Language Model

Anomaly detection in large-scale monitoring systems, especially within high-performance computing (HPC), is a significant challenge because disruptions during operation can interrupt workloads and reduce overall efficiency. We propose a novel framework called Anomaly Detection in Large-Scale Monitoring Systems using a Language Model (AD-LM), which uses a language-model-driven workflow for anomaly detection. First, AD-LM applies BERTopic for topic modelling, grouping log entries into meaningful clusters and helping to expose patterns that indicate potential anomalies. Next, a graph-based classification model identifies system failures by capturing key relationships within both HPC and large-scale logs. The framework supports high-speed processing and minimal memory usage, essential qualities in HPC settings. We evaluated AD-LM on three real-world log datasets (Hadoop Distributed File System, BlueGene/L, and Thunderbird) and achieved F1-scores of 0.995, 0.997, and 0.998, respectively, outperforming well-known anomaly-detection benchmarks with little overhead. Our findings confirm AD-LM’s effectiveness for real-time anomaly detection in HPC and large-scale scenarios, underscoring its robustness, adaptability, and efficient resource consumption.

– Mr Supasate Vorathammathorn, Research Assistant, King Mongkut’s University of Technology Thonburi
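
As a rough illustration of AD-LM's first stage, the sketch below groups synthetic HDFS-style log lines into topics with BERTopic and flags lines that fall into the outlier or very small topics as anomaly candidates. The synthetic log lines and the topic-size threshold are assumptions for illustration; the paper's graph-based classification stage is not reproduced here.

```python
# Topic-modelling pass over log lines with BERTopic (illustrative sketch).
from bertopic import BERTopic

# In practice these would be parsed messages from the HDFS / BlueGene/L / Thunderbird
# logs; a real corpus of many lines is needed for BERTopic to form stable topics.
log_lines = [
    f"Received block blk_{i} of size 67108864 from /10.0.0.{i % 20}" for i in range(300)
] + [
    f"PacketResponder 1 for block blk_{i} terminating" for i in range(300)
] + [
    "Exception in receiveBlock for block blk_987 java.io.IOException",
    "Exception in receiveBlock for block blk_988 java.io.IOException",
]

topic_model = BERTopic(min_topic_size=10)
topics, _ = topic_model.fit_transform(log_lines)

# Log lines assigned to the outlier topic (-1) or to very small topics are treated
# here as candidate anomalies to hand on to a downstream classifier.
counts = topic_model.get_topic_info().set_index("Topic")["Count"]
candidates = [line for line, t in zip(log_lines, topics)
              if t == -1 or counts.get(t, 0) < 10]
print(f"{len(candidates)} candidate anomalous log lines")
```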

02:55pm – 03:35pm | Should AI Optimize Your Code? A Comparative Study of Classical Optimizing Compilers Versus Current Large Language Models

This study aims to answer a fundamental question for the compiler community: “Can AI-driven models revolutionize the way we approach code optimization?” This paper presents a comparative analysis of three classical optimizing compilers and two state-of-the-art Large Language Models (LLMs), assessing their respective abilities and limitations in optimizing code for maximum efficiency. Additionally, we introduce a benchmark suite of challenging optimization patterns and an automatic mechanism for evaluating the performance and correctness of the code generated by LLMs. We used three different prompting methodologies to assess the performance of the LLMs: Simple Instruction Prompting (IP), Detailed Instruction Prompting (DIP), and Chain of Thought (CoT).

– Mr Miguel Rosas, Ph.D. Candidate, University of Delaware
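
To make the three prompting methodologies concrete, the sketch below builds example IP, DIP, and CoT prompts around a small C loop. The prompt wording, the sample loop, and the ask_llm placeholder are illustrative assumptions, not the study's actual prompts or evaluation harness.

```python
# Example IP / DIP / CoT prompts for asking an LLM to optimize a code snippet (sketch).
SOURCE = """
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        c[i] += a[i][j] * b[j];
"""

PROMPTS = {
    "IP":  f"Optimize the following C code:\n{SOURCE}",
    "DIP": ("Optimize the following C code for execution time. Preserve its exact "
            "semantics, consider loop interchange, unrolling, and vectorization, and "
            f"return only compilable C code:\n{SOURCE}"),
    "CoT": ("Think step by step: first identify the performance bottlenecks in the "
            "code below, then explain which transformations apply, and finally output "
            f"the optimized C code:\n{SOURCE}"),
}

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; returns the model's code suggestion."""
    raise NotImplementedError

for style, prompt in PROMPTS.items():
    print(f"--- {style} ---\n{prompt}\n")
    # candidate = ask_llm(prompt)   # the candidate would then be compiled, checked, and timed
```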

03:35pm – 04:00pm | Tea Break

04:00pm – 04:05pm | Performance Optimization, Tools, and Energy Efficiency

– Prof Yao Chen, Research Assistant Professor, National University of Singapore

04:05pm – 04:45pm | Taming The Overhead of Hiding Samples in Deep Neural Network Training

Recent empirical evidence indicates that there are performance benefits associated with (1) using larger datasets during the training of deep neural networks (DNNs), and (2) scaling to unprecedented dataset sizes for pre-training attention-based models. However, the downside of using large datasets is the increased cost of training and the pressure on non-compute sub-systems of the supercomputers and clusters used for DNN training (e.g., the file system). In this work, we focus on reducing the total number of training samples while maintaining the accuracy level. A recent online sample hiding approach dynamically hides the least-important samples in a dataset during training to reduce the total amount of computation and the training time while maintaining accuracy. However, estimating the importance of samples introduces a non-trivial additional overhead. In this study, we propose an efficient mechanism to approximate the importance of samples and reduce this overhead. Empirical results on various datasets and models show that our proposed method (ESH) removes most of the overhead; for example, ESH reduces the total training time by up to 27.9% compared to the baseline while hiding 28.8% of the samples on average during training.

– Dr Truong Thao Nguyen, Researcher, National Institute of Advanced Industrial Science and Technology (AIST), Japan
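
As a minimal sketch of the sample-hiding idea, the code below reuses the per-sample losses already computed in each training step as a cheap importance estimate and skips the lowest-loss samples in the following epoch. The toy data, model, and hiding fraction are illustrative assumptions; this is not the paper's ESH mechanism.

```python
# Loss-based sample hiding during training (illustrative PyTorch sketch).
import torch
from torch import nn

# Toy data and model stand in for the real datasets and DNNs.
X, y = torch.randn(1024, 20), torch.randint(0, 2, (1024,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss(reduction="none")      # keep per-sample losses

HIDE_FRACTION, BATCH = 0.3, 64
importance = torch.full((len(X),), float("inf"))        # unseen samples are never hidden
active_idx = torch.arange(len(X))

for epoch in range(5):
    perm = active_idx[torch.randperm(len(active_idx))]
    for start in range(0, len(perm), BATCH):
        idx = perm[start:start + BATCH]
        losses = criterion(model(X[idx]), y[idx])        # per-sample losses from the usual pass
        importance[idx] = losses.detach()                # reused as a cheap importance estimate
        losses.mean().backward()
        opt.step()
        opt.zero_grad()
    keep = int(len(X) * (1 - HIDE_FRACTION))
    active_idx = torch.topk(importance, keep).indices    # hide the lowest-loss samples next epoch
    print(f"epoch {epoch}: next epoch trains on {len(active_idx)} of {len(X)} samples")
```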

04:45pm – 05:25pm | Research and Development of Evaluation Tools on User Job Level Index of HPC Cluster

As the supercomputing Internet expands, the number of users grows rapidly, but proficiency varies widely across disciplines. To enhance user capabilities and optimize cluster resource utilization, this paper proposes a quantifiable evaluation system for user job levels on HPC clusters. Using Shanghai Jiao Tong University’s supercomputer as an example, we detail the system’s design, including indicator selection, data processing, weighting, and index calculation. The indicators reflect job frequency, efficiency, and parallel computing skills. We apply logarithmic and normalization treatments and use the entropy weight method together with the Analytic Hierarchy Process (AHP) for weighting. We also developed an open-source evaluation tool that helps cluster administrators monitor users’ job levels and guides users toward improvement.

– Ms Gao Yiqin, Engineer, Shanghai Jiao Tong University
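
The sketch below shows the standard entropy weight method mentioned in the abstract, applied to a made-up matrix of user indicators (e.g., job frequency, efficiency, parallel-scale usage) to produce indicator weights and a per-user index. The sample data is invented, and the AHP weighting and its combination with the entropy weights are omitted.

```python
# Entropy weight method over a toy user-indicator matrix (illustrative sketch).
import numpy as np

# Rows = users, columns = indicators, already normalized to [0, 1], larger = better.
scores = np.array([
    [0.9, 0.4, 0.7],
    [0.2, 0.8, 0.1],
    [0.6, 0.5, 0.9],
    [0.3, 0.3, 0.2],
])

m, n = scores.shape
p = scores / scores.sum(axis=0)                      # each user's share per indicator
p_safe = np.where(p > 0, p, 1.0)                     # avoid log(0)
entropy = -(p * np.log(p_safe)).sum(axis=0) / np.log(m)
divergence = 1.0 - entropy                           # more spread-out indicators weigh more
weights = divergence / divergence.sum()

user_index = scores @ weights                        # weighted job-level index per user
print("indicator weights:", np.round(weights, 3))
print("user job-level index:", np.round(user_index, 3))
```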

05:25pm – 06:05pm | Smart In-Situ Visualization using Information Entropy-based Viewpoint Selection and Smooth Camera Path Generation

In-situ visualization has received increasing attention as an effective approach to reducing data I/O and storage demands, particularly in HPC-based large-scale simulations, where data is processed online as it is generated rather than stored for later offline visual analysis. This work presents an alternative in-situ visualization approach based on smart visualization, generating a subset of rendering images to assist offline interactive visual analysis tasks. It combines information entropy-based viewpoint selection with smooth camera path interpolation to generate a sequence of time-lapse rendering images that can later be manipulated interactively via a GUI-based viewer.

– Mr Kazuya Adachi, Graduate student, Kobe University, RIKEN R-CCS
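
As a rough sketch of the approach described above, the code below scores candidate viewpoints by the Shannon entropy of a rendered image's grey-level histogram, picks the most informative one, and linearly interpolates camera angles toward it for a smooth path. The render placeholder, the histogram-based entropy measure, and the linear angle interpolation are illustrative assumptions standing in for the in-situ renderer and path generation used in the work.

```python
# Entropy-based viewpoint selection with a simple smooth camera path (sketch).
import numpy as np

def render(theta: float, phi: float) -> np.ndarray:
    """Placeholder: return a greyscale image rendered from spherical angles (theta, phi)."""
    rng = np.random.default_rng(int(theta * 100 + phi))   # fake image for the sketch
    return rng.random((64, 64))

def image_entropy(img: np.ndarray, bins: int = 64) -> float:
    """Shannon entropy of the image's grey-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Evaluate candidate viewpoints on a sphere around the data and keep the most
# informative one; a full implementation would repeat this at each simulation step.
candidates = [(t, p) for t in np.linspace(0, np.pi, 6) for p in np.linspace(0, 2 * np.pi, 12)]
best_view = max(candidates, key=lambda v: image_entropy(render(*v)))

# Smooth camera path: interpolate angles between the previous and the new best
# viewpoint so consecutive time-lapse images do not jump abruptly.
prev_view = (np.pi / 2, 0.0)
steps = 10
path = [tuple(np.asarray(prev_view) + (np.asarray(best_view) - np.asarray(prev_view)) * s / steps)
        for s in range(steps + 1)]
images = [render(th, ph) for th, ph in path]
print(f"best viewpoint: {best_view}, path of {len(images)} frames")
```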

06:05pm | Closing