10-13 March 2025
Sands Expo and Convention Centre
Marina Bay Sands, Singapore

ACCEPTED POSTERS

The following SCA2025 posters will be displayed at the Poster Presentations / Delegate Lounge, Melati Jr Room 4010-4110 (Level 4).

Poster Presentation Timeslots:

  • 11 Mar 12:00pm – 01:00pm
  • 11 Mar 05:00pm – 06:00pm


The winner of the SCA2025 Best Student Poster will be announced at the SCA2025 Papers Breakout Track (Room – P5, Peony Jr) on Wednesday, 12 March 2025, at 03:35pm.

SCA2025 Best Student Poster Award

POS111s1 – A Localized Implicit Method for Accelerating Conjugate Gradient Method in PDE Simulation

Fuma Suenaga

Abstract: We propose an accelerated approximate conjugate gradient (CG) method for PDE simulations. The CG method, widely used in PDE simulations, solves a linear system given by a coefficient matrix and a right-hand-side vector. Previous studies accelerate the CG method with multigrid methods and domain decomposition strategies, which rely on domain knowledge about the simulation model. The proposed method achieves acceleration without domain knowledge. The core idea is to divide a linear system into localized approximate subsystems. To reduce the computational complexity, the proposed method decomposes the coefficient matrix A into block sub-matrices Ak on its diagonal. The worst-case computational complexity of the proposed method, O(NZ(Ak)), is less than that of a naive implicit method, O(NZ(A)), where NZ(A) is the number of non-zero elements in A. The accuracy of the proposed method depends on the distribution of the non-zero elements in the coefficient matrix because of the approximation of the localized boundary conditions. Nearest-neighbor interaction models improve the practicality of the proposed method because such models result in a strongly diagonal distribution. We evaluated the processing time and the accuracy per simulation time step with varying block size B, using a 2-D FEM that calculates the heat conduction of a cylindrical aluminum plate of 5012 computational points.
The steep change in accuracy relative to the gentle increase in processing time suggests that an appropriate choice of B provides a fast approximation of the solution with only a small loss of accuracy.
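
The localized idea above can be sketched as a block-diagonal approximate solve: split A into diagonal blocks Ak, solve each subsystem independently, and observe how the error shrinks as the block size B grows. This is an illustrative sketch; the matrix, sizes, and diagonal shift are assumptions, not the poster's actual FEM system.

```python
import numpy as np

def block_local_solve(A, b, B):
    """Approximately solve A x = b by solving each diagonal block A_k
    independently, ignoring couplings between blocks (localization)."""
    n = len(b)
    x = np.zeros(n)
    for s in range(0, n, B):
        e = min(s + B, n)
        x[s:e] = np.linalg.solve(A[s:e, s:e], b[s:e])  # local subsystem A_k
    return x

# A 1-D heat-conduction-like tridiagonal SPD system (nearest-neighbor coupling),
# so the non-zero elements cluster strongly around the diagonal.
n = 64
A = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

x_exact = np.linalg.solve(A, b)
errs = {}
for B in (8, 16, 32):
    x_approx = block_local_solve(A, b, B)
    errs[B] = np.linalg.norm(x_approx - x_exact) / np.linalg.norm(x_exact)
print(errs)  # the relative error shrinks as the block size B grows
```

Because each block ignores only its interface couplings, fewer and larger blocks cut fewer couplings, which is why the accuracy-versus-cost trade-off hinges on B.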

POS101s1 – A study on energy efficiency influencing factors based on Green500

Hyungwook Shim

Abstract: With the rapid growth of AI-related industries, the need to reduce and optimize energy consumption in large-scale computational resources, such as supercomputers, has become increasingly pressing. This study focuses on supercomputers listed in the Green500, categorizing existing benchmarking evaluation variables into input and output factors. An energy-efficiency objective function was introduced, and data envelopment analysis (DEA) was conducted using the BCC model. The study analyzed the relative efficiency levels among supercomputers and identified factors and levels of potential efficiency improvements.

POS103s1 – Driving AI Excellence: True High-Performance, Distributed Storage Solutions to Empower AI

Madhu Thorat

Abstract: In today’s large data-driven enterprises, vast volumes of data are generated and leveraged by AI to derive actionable insights. However, storing structured and unstructured data across separate file and object storage systems, combined with accessing data through multiple protocols, often leads to data duplication, management challenges, and rising costs. These complexities can also impact data access performance, hindering the progress of AI-driven projects. To address these critical issues, there is a growing need for unified storage solutions that can:
1) Consolidate file and object storage into a single platform, reducing costs and eliminating data duplication.
2) Enable high-performance access to the same data instance via multiple protocols (including NFS, S3) while ensuring consistency.
These requirements can be effectively addressed with High-performance, Distributed file and Object storage solutions, which provide unified access to data in both file and object formats. This approach simplifies data management, boosts performance, and reduces storage costs. Research shows that by 2029, over 80% of unstructured data will be stored on distributed storage solutions, up from 40% in 2024.
This poster presents key considerations and simple design strategies employed by IBM to implement high-performance, distributed storage solutions for AI use cases. Key features include high-performance multi-protocol data access, scalable storage capacity, robust security, and accelerated data processing through a global data platform that integrates both structured and unstructured data. The poster also presents a real-world use case demonstrating how IBM storage solutions effectively support large-scale enterprise infrastructures for AI applications.

POS105s1 – Open Composer: A web application for generating and managing batch jobs on HPC clusters

Masahiro Nakao

Abstract: Using HPC clusters requires extensive prerequisite knowledge, such as Linux commands and job schedulers, which poses a high learning cost for beginners. To address this challenge, we have developed Open Composer, a web application designed to simplify the submission of batch jobs, the primary use case for HPC clusters. Open Composer runs on Open OnDemand, the de facto standard web portal for HPC clusters. Open Composer features automated job script generation using web forms, job submission functionalities, and more. This poster describes the design, development, and usage of Open Composer.

POS106s1 – Development of Efficient HPC Management Methods via Detailed Monitoring for Large Node System

Masamichi Nakamura

Abstract: The JAXA Supercomputing System (JSS) serves a wide range of roles, such as an infrastructure for numerical simulations and a data centre for large-scale data analysis. Its job filling rate in FY2023 was about 95% or more. To promote effective and healthy use of the JSS in future research, we consider not only the job filling rate but also the appropriate use of system components during jobs. Status visualization is of great importance in realizing such improvements.

In this study, we analyzed usage metrics on SORA and RURI, the main systems of the JSS, and applied them to monitor system status across thousands of jobs and nodes. We created several view patterns that capture the entire system status on a single screen.

We analyzed time series of CPU utilization for 5,760 nodes on a sample day. The analysis shows that some nodes remain below 50% CPU utilization for several hours even though they are occupied by jobs. This result demonstrates the importance of monitoring CPU utilization in addition to the job filling rate.

We also created a 100-node-averaged view in which 58 lines represent all 5,760 nodes, allowing information for thousands of nodes to be monitored on a single screen. This view also clearly reveals losses in CPU utilization and contributes to easy-to-understand management of large HPC system status. The view designs are being implemented in our job-management tool, M:Arthur. We plan to present and discuss further results in the poster.
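
As a rough illustration of such an averaged view, hypothetical utilization samples for 5,760 nodes can be collapsed into 58 averaged lines. All data here is synthetic; the real view is built from JSS monitoring metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical CPU-utilization samples: 5,760 nodes x 144 ten-minute intervals
util = rng.uniform(0.0, 100.0, size=(5760, 144))

# Collapse the nodes into 58 averaged lines of roughly 100 nodes each,
# so the status of thousands of nodes fits on a single screen.
groups = np.array_split(util, 58, axis=0)
averaged = np.vstack([g.mean(axis=0) for g in groups])

print(averaged.shape)                  # (58, 144)
under_half = (averaged < 50.0).mean()  # share of line-intervals below 50%
print(f"{under_half:.1%} of averaged samples are below 50% utilization")
```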

POS107s1 – Development of HPC Job Management Methods on GPU Servers Considering Power Consumption

Hiroshi Ito

Abstract: Data centres with GPUs have attracted great attention owing to demands from AI, numerical simulations, and other workloads. High-end GPUs, and the use of multiple GPUs, are well suited to big data and large models. However, GPUs consume a huge amount of electric power, so data centres face difficulties in daily operations. Although sequential job runs are managed by a job scheduler such as Slurm, such schedulers do not consider power consumption. To control power consumption along with calculation time, an appropriate job management tool is essential, and a basic understanding of job-level power consumption is the first step toward realizing it.

In this study, sample simulations are performed with LAMMPS, a molecular dynamics simulation package, and energy-saving conditions are investigated. For the investigation, we created three models consisting of 6,720, 53,760, and 840,000 atoms based on the Fe(OH)3 solvent model included in the LAMMPS package for ReaxFF benchmarking. To measure power consumption, EAR, an energy management framework, is utilized.

In simulations of the three models under two hardware conditions, 32 CPUs and 1 CPU + 1 GPU, the 1 CPU + 1 GPU cases show lower power consumption than the 32-CPU cases at all model sizes. Adding CPUs mainly affects the simulation time with little effect on energy efficiency, whereas the GPU contributes to reducing not only the simulation time but also the energy consumption. These are fruitful results for job management. In the poster, we will present detailed results and our job-management tool, M:Arthur, in which monitoring of EAR results is now being implemented.

POS110s1 – An Auto-Tuning Approach for GPU-Accelerated Out-of-Core Stencil Computation

Yuto Arakawa

Abstract: This poster presents an auto-tuning approach for finding the best parameter values for GPU-accelerated out-of-core stencil computation, which deals with matrices exceeding the capacity of GPU memory. Out-of-core stencil solvers typically deploy a pipelining approach, which can be further accelerated with several optimization techniques: (1) temporal blocking, which iterates computation with a reduced amount of CPU-GPU data transfer; (2) region sharing, which reuses data among different chunks; and (3) lossy compression, which saves PCIe bus bandwidth. However, a tuning process is needed to maximize performance by choosing the best values for execution parameters such as the temporal blocking size, the chunk size, and whether lossy compression is enabled. The proposed approach finds the best values by predicting the execution time with an analytical model that reduces the number of feasible combinations. The key idea behind this reduction is to consider several restrictions on GPU memory usage. Accordingly, our analytical model requires hardware parameters such as the capacity of GPU memory and the effective bandwidths of GPU memory and the PCIe bus. The model predicts GPU computation time and CPU-GPU data transfer time with a linear prediction that uses the bandwidth parameters. In experiments, our approach reduced the combinations to 260 candidates by considering the memory restrictions, and rapidly suggested the best values from predicted times instead of iterating time-consuming executions. The proposed model successfully predicted the execution time with a low error rate of 1.2%.
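
The pruning-then-predicting workflow might be sketched as follows. All hardware numbers, the memory-restriction formula, and the compression ratio are invented placeholders, not the poster's measured parameters.

```python
import itertools

# Hypothetical hardware/problem parameters (illustrative assumptions only)
GPU_MEM    = 16e9   # GPU memory capacity [bytes]
BW_GPU     = 900e9  # effective GPU memory bandwidth [bytes/s]
BW_PCIE    = 12e9   # effective PCIe bandwidth [bytes/s]
DOMAIN     = 64e9   # total grid size [bytes]; exceeds GPU memory (out-of-core)
ITERATIONS = 100    # stencil time steps

chunk_sizes     = [1e9, 2e9, 4e9, 8e9]  # bytes per pipelined chunk
temporal_blocks = [1, 2, 4, 8]          # steps fused per CPU-GPU round trip
compression     = [False, True]         # lossy compression of PCIe traffic

def predict(chunk, tb, comp):
    """Linear analytical model: the slower of compute and transfer
    dominates each pipelined round trip."""
    rounds   = (DOMAIN / chunk) * (ITERATIONS / tb)
    compute  = chunk * tb / BW_GPU                       # tb sweeps per chunk
    transfer = chunk * (0.5 if comp else 1.0) / BW_PCIE  # assumed 2x ratio
    return rounds * max(compute, transfer)

# The GPU-memory restriction prunes infeasible combinations before prediction
# (here: halo storage assumed to grow with the temporal block size).
candidates = [(c, tb, comp)
              for c, tb, comp in itertools.product(chunk_sizes,
                                                   temporal_blocks,
                                                   compression)
              if c * (1 + 0.25 * tb) <= GPU_MEM]

best = min(candidates, key=lambda p: predict(*p))
print(len(candidates), "feasible of", 4 * 4 * 2)
print("best:", best, "predicted:", round(predict(*best), 2), "s")
```

Under these toy numbers the PCIe transfer dominates, so the model favors the largest temporal block with compression enabled; a real model would be calibrated against measured bandwidths.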

POS112s1 – Center-Wide High Performance Communication for QC-HPC Hybrid Heterogeneous Coupling Computing

Shinji Sumimoto

Abstract: Quantum computing (QC) technology has made remarkable progress and is becoming one of the important computing technologies of the future. However, the number of qubits that current quantum computers can offer is limited, so research on collaborative computing between QC and existing high-performance computing (HPC) is expanding. This is called QC-HPC hybrid computing.

In this situation, we are working on the JHPC-quantum project together with the RIKEN Center for Computational Science, SoftBank, and Osaka University, and plan to realize QC-HPC hybrid computing between multiple computer centers by 2026.

In this project, we are developing a cooperative scheduler for QC-HPC coupling among multiple centers and a coupled computing environment for QC-HPC jobs. This paper describes the wide-area center communication library WaitIO-Router, which extends the h3-Open-BDEC software stacks to realize a coupled computing environment for QC-HPC jobs.

POS113s1 – GPU implementation and performance evaluation of AMDKIIT for plane-wave based DFT calculation

Paramita Ghosh

Abstract: AMDKIIT is a density functional theory (DFT) program package that utilizes plane-wave basis sets to perform ab initio molecular dynamics using higher rungs of density functionals. The software suite offers a variety of tools for conducting self-consistent field (SCF) calculations, geometry optimizations, and molecular dynamics simulations. This comprehensive performance evaluation of AMDKIIT focuses on its computational efficiency, scalability, and accuracy. Through benchmark tests on simple molecules and bulk materials, we assess the software’s performance across diverse computational environments. The results highlight AMDKIIT’s robustness in handling systems of varying sizes and complexities, with particular attention given to parallel scalability and execution time. This analysis provides valuable insights into the software’s capabilities, enhancing our understanding of its performance in large-scale simulations. Efforts were also made to enhance computational performance through code optimization. A comprehensive profiling of the base code identified bottlenecks, which were addressed by introducing parallelization with OpenACC to leverage GPU acceleration. As a result, key sections of the code were GPU-enabled, leading to a notable performance boost with a 2x speedup in initial tests.

POS114s1 – A Hardware/Software Co-Design Approach to Profiling RISC-V Accelerators

Guenchul Park

Abstract: The rapid advancement of Large Language Models (LLMs) has spurred significant growth in the development and deployment of large-scale AI applications across diverse fields, creating a demand for high-performance computing (HPC) hardware, including GPUs, NPUs, and DPUs. In response, we are developing RISC-V based accelerators tailored for HPC workloads. To fully leverage the capabilities of these novel accelerators, a robust software stack is essential, with profiling tools playing a critical role in performance optimization. This paper presents the hardware-software co-design of a profiling solution for our RISC-V based accelerator, encompassing the implementation of hardware performance counters for accurate state monitoring and the development of a dedicated profiling library for efficient data acquisition and analysis. This integrated approach provides a fundamental framework for understanding and optimizing the performance characteristics of emerging RISC-V based HPC accelerators.

POS115s1 – DDoS Attack Detection with Time-Series Predictions Using Empirical Dynamic Modeling

Wassapon Watanakeesuntorn

Abstract: Cloud-Edge Continuum Computing Infrastructure is an emerging infrastructure designed to unify cloud and edge clusters into a single computing system. Users can benefit from accessing geo-distributed sensors on edge devices with low latency while seamlessly executing high-performance applications on cloud clusters. However, this infrastructure can be susceptible to cyberattacks from unknown sources. Distributed Denial of Service (DDoS) is a common cyberattack in computer networks that floods the victim with large amounts of traffic from multiple sources to make its service unavailable. DDoS detection is a challenging topic in the field of network security. Deep learning has been successfully applied to detect DDoS in previous studies. However, machine-learning-based models typically require large amounts of training data, making them unsuitable for real-time detection systems. In this research, we propose Empirical Dynamic Modeling (EDM) to detect DDoS attacks in computer networks. EDM is a mathematical framework for modeling nonlinear dynamical systems and can be used to predict their future states. We assume that computer network traffic is a dynamical system that can be modeled and predicted using EDM. We can detect anomalies when the network metrics predicted by an EDM-based model (trained under normal conditions) deviate significantly from actual measurements. Preliminary results indicate that EDM predicts time series faster than an LSTM-based model and achieves higher classification accuracy than an AR-based model.
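
A minimal sketch of the EDM idea, assuming a simplex-projection-style predictor over a time-delay embedding; the traffic series and the anomaly comparison here are synthetic illustrations, not the authors' pipeline.

```python
import numpy as np

def embed(x, E, tau=1):
    """Time-delay embedding of a scalar series into E-dimensional states."""
    return np.column_stack([x[i:len(x) - (E - 1 - i) * tau:tau]
                            for i in range(E)])

def simplex_predict(train, query, E=3, k=4):
    """Predict the value following `query`'s latest state from its nearest
    neighbors on the training manifold (minimal simplex-projection sketch)."""
    M = embed(train, E)
    X, y = M[:-1], train[E:]          # each state -> next observed value
    q = embed(query, E)[-1]
    d = np.linalg.norm(X - q, axis=1)
    nn = np.argsort(d)[:k]
    w = np.exp(-d[nn] / (d[nn].min() + 1e-12))
    return np.sum(w * y[nn]) / w.sum()

rng = np.random.default_rng(1)
t = np.arange(400)
normal = np.sin(0.2 * t) + 0.05 * rng.standard_normal(400)  # baseline traffic
attack = normal.copy()
attack[300:] += 5.0                                         # simulated flood

err_n = abs(simplex_predict(normal[:300], normal[:310]) - normal[310])
err_a = abs(simplex_predict(normal[:300], attack[:310]) - attack[310])
print(err_n, err_a)  # the flooded series deviates far more from the prediction
```

Flagging an attack then reduces to thresholding the prediction error, which is why the model needs only normal-condition data to train.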

POS116s1 – Benchmarking Quantum Computing Simulation Frameworks

Shaobo Zhang

Abstract: Exploration of Quantum Computing (QC) algorithms relies on simulating these algorithms on classical computers, as real quantum hardware is in its early stages. General QC simulation is a computationally and memory-intensive process due to the exponential scaling of quantum states with the number of qubits. Many frameworks have been developed for simulation, and this study compares the performance of commonly used ones to discover their best use cases. We benchmark three frameworks: CUDAQ, Qiskit, and PennyLane. We find that even with the latest Grace Hopper superchip, frameworks make inefficient use of GPU acceleration, with CUDAQ being the most performant. However, CUDAQ lacks some features of the other frameworks, and many QC algorithms have been written in PennyLane or Qiskit. These limitations in performance and available functionality can severely limit the testing of new QC algorithms, and more development is required to make these frameworks HPC-friendly.
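
The exponential cost that motivates such benchmarking can be seen even in a naive NumPy state-vector simulator. This sketch (not one of the benchmarked frameworks) applies a Hadamard to every qubit and times it as the qubit count grows.

```python
import time
import numpy as np

def apply_h_all(state, n):
    """Apply a Hadamard gate to every qubit of an n-qubit state vector."""
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    psi = state.reshape([2] * n)
    for q in range(n):
        psi = np.tensordot(H, psi, axes=([1], [q]))  # contract qubit axis q
        psi = np.moveaxis(psi, 0, q)                 # restore the axis order
    return psi.reshape(-1)

for n in (10, 14, 18):
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0                                     # |00...0>
    t0 = time.perf_counter()
    out = apply_h_all(psi, n)
    dt = time.perf_counter() - t0
    print(f"n={n:2d}  amplitudes={2 ** n:7d}  time={dt:.4f}s")
# Both memory and time grow exponentially with the qubit count, which is
# why efficient GPU use matters for state-vector simulation.
```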

POS120s1 – PIMID: A Full-System Simulator for Processing-in-Memory with Intricacy and Diversity

Yuan He

Abstract: Processing-in-Memory (PIM) has emerged as a promising solution to the longstanding memory wall challenge, driven by the increasing disparity between processor and memory performance. As computational demands continue to grow, conventional memory systems struggle to keep pace, prompting the need for innovative approaches that reduce data movement and improve energy efficiency. PIM offers a way forward by integrating computational capabilities directly into memory systems, but current tools for exploring this paradigm remain limited in their scope and flexibility.

To address these shortcomings, we introduce PIMID, a full-system simulation framework tailored for the comprehensive evaluation of PIM architectures. PIMID stands out with its ability to co-simulate host and memory devices in real time, alongside its support for diverse memory technologies. It also allows detailed, fine-grained configuration of the processing elements (PEs) at various levels, plus advanced support for in-memory networks, which enables precise architectural exploration with complex data communication patterns inside the memory. By offering extensive configurability and support for emerging technologies, PIMID serves as a robust foundation for exploring the next generation of memory-centric computing systems.

POS121s1 – Study of Dark Data Management on HPCI Shared Storage

Hidetomo Kaneyama

Abstract: Large-scale storage systems hold so-called cold data that has not been accessed for an extended period. Among these, data whose contents are unknown even to its owner is called dark data. The HPCI Shared Storage is a critical infrastructure for storing and sharing data generated by supercomputers at universities and research institutes in Japan. As of January 2025, it stored 33 PB of data in 188 million files. Of these, data that has not been accessed for over a year constitutes 90.2% of the total capacity and 88.8% of the total number of files. On this system, the growing volume of cold and dark data has led to a shortage of storage capacity that should ideally be allocated to active research. This issue arises from factors such as a lack of understanding of data content and significance, and insufficient metadata describing the computing environment, which makes data management challenging. Thus, we propose a framework for automated metadata annotation utilizing workflow tools. As a proof of concept, we utilize the open-source workflow tool WHEEL, already used on the supercomputer Fugaku. The proposed framework aims to streamline metadata management in the HPCI Shared Storage system and enhance the usability and accessibility of stored data. Future goals include establishing a data management workflow using this framework, including DOI issuance and support for open science initiatives.

POS122s1 – Codesign and Development Plan of EigenExa3

Toshiyuki Imamura

Abstract: EigenExa, currently in its second iteration, serves as a dense eigenvalue solver meticulously crafted for the cutting-edge Japanese flagship system, Fugaku-NEXT. This advanced software has transcended traditional numerical methods by seamlessly integrating innovative techniques such as DC and MRRR. These enhancements significantly boost the speed and reliability of computations, ensuring high-performance outcomes. As the landscape of computer technology continually evolves, EigenExa stands poised to adapt to emerging hardware advancements. This includes embracing low-precision arithmetic and leveraging highly optimized algorithms tailored for efficient matrix multiplication. Looking ahead, the anticipated release of EigenExa version 3, set for the third quarter of 2026 to mark EigenExa’s 20th anniversary, promises a comprehensive overhaul of its internal numerical framework. The current version will still play a vital role in the iterative refinement scheme by providing initial guesses to facilitate this transition. Furthermore, we are excited to announce the introduction of a single-precision version of EigenExa, along with a concerted effort to streamline block H-matrix techniques. This will enhance our ability to generate initial guesses more efficiently than ever, ultimately paving the way for even more robust computational capabilities.

POS123s1 – Towards Efficient and Advanced Operation of Next-Generation Computing Infrastructure in Japan

Toshihiro Hanawa

Abstract: The Feasibility Study (FS) for next-generation computing infrastructure by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) was started in August 2022. The FS project consists of three research and survey fields: system, new computation principles, and operation technology. We have studied operation technology for HPC in this FS project. Our mission is to propose options for the next-generation computing infrastructure of the post-Fugaku era, considering the needs of science, industry, and society. In addition, we aim to offer a common platform for nationwide use, providing easy, flexible, and seamless access that logically aggregates multiple supercomputer and cloud systems in Japan. This platform contributes to the achievement of the SDGs and supports research DX for all domestic researchers. We have collaborated with researchers and technical staff from the national universities and national institutes that operate Japan’s leading computing resources and academic information networks, in cooperation with vendors involved in cloud and data center operation.

POS125s1 – An MPI-Based Tensor Network Contraction Approach for Quantum Circuit Simulation

Shusen Liu

Abstract: We introduce a highly flexible and computationally efficient framework for simulating quantum circuits through tensor network contraction optimized for deployment on multi-core and heterogeneous CPU clusters utilizing the Message Passing Interface (MPI). Our methodology transforms the contraction of high-order tensors into a series of standard matrix-matrix multiplications, analogous to an einsum-style operation, thereby leveraging optimized linear algebra routines to enhance performance (up to a 1.3x speedup with 24 qubits). This approach ensures broad compatibility across diverse CPU architectures by eliminating dependencies on GPU acceleration, facilitating scalable and portable quantum circuit simulations on a wide range of high-performance computing environments.
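
The reshaping trick described above, turning a high-order tensor contraction into one standard matrix-matrix multiply, can be sketched as follows: a simplified illustration for a gate on the first two qubits, not the authors' MPI implementation.

```python
import numpy as np

def apply_two_qubit_gate(state, gate, n):
    """Apply a 4x4 gate to qubits (0, 1) of an n-qubit state by reshaping
    the tensor contraction into a single standard matrix-matrix multiply."""
    psi = state.reshape(4, 2 ** (n - 2))  # (target qubits) x (other qubits)
    return (gate @ psi).reshape(-1)       # one GEMM performs the contraction

n = 12
rng = np.random.default_rng(0)
state = rng.standard_normal(2 ** n) + 1j * rng.standard_normal(2 ** n)
state /= np.linalg.norm(state)

cnot = np.array([[1, 0, 0, 0],            # CNOT on qubits (0, 1)
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

out_mm = apply_two_qubit_gate(state, cnot, n)

# The same contraction written einsum-style on the order-n state tensor
psi = state.reshape([2] * n)
out_es = np.einsum('abxy,xy...->ab...',
                   cnot.reshape(2, 2, 2, 2), psi).reshape(-1)

print(np.allclose(out_mm, out_es))  # the GEMM and einsum routes agree
```

Routing the contraction through a GEMM is what lets optimized BLAS libraries, rather than generic loops, carry the bulk of the work.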

POS126s1 – Asynchronous scheduling of communication and I/O integrated with Rust

Ryosuke Maeda

Abstract: In ad-hoc file system environments, computing nodes receive a massive number of I/O requests, making the asynchronous and multiplexed handling of both communication and storage I/O an effective strategy. However, creating many native threads can result in overhead from thread management and context switching, as well as negatively impact the primary computation, potentially degrading overall performance. To address this, we propose a method that leverages Rust’s lightweight threading mechanism (async/await) to efficiently schedule communication and storage I/O asynchronously using only a small number of threads.

POS127s1 – Intel Spin-qubit Quantum Simulator Performance Evaluation on Supercomputer Fugaku

Soratouch Pornmaneerattanatri

Abstract: The spin-qubit quantum computer (QC), one of the prominent QC architectures, is currently under development by Intel. It utilizes CMOS manufacturing technology to achieve a small device scale and effectively increased scalability. Qubits of a spin-qubit QC exhibit long coherence times and high-fidelity qubit read-out, making them a compelling QC platform. QC manufacturers commonly provide a quantum simulator that emulates the quantum phenomena specific to their QC designs. Similarly, Intel provides a state-vector quantum simulator (IQS) [1] that facilitates the testing and validation of Intel quantum programs. Superconducting and trapped-ion quantum simulators have already been available to our researchers on the supercomputer Fugaku, which is built on Fujitsu A64FX processors based on the Arm architecture. To extend our support to our researchers, we successfully ported the full-stack Intel Quantum Software Development Kit (IQSDK) to the supercomputer Fugaku. However, this process involved modifying the source code and replacing the software environment tailored to Intel with the software environment of the supercomputer Fugaku. A performance evaluation was performed to measure the impact of these modifications.

POS128s1 – Fostering data-centric research in Japan: Design of the mdx II cloud platform

Tomonori Hayami

Abstract: The mdx II cloud platform, developed through a collaborative initiative among universities and research institutes in Japan, addresses the growing demands for data-centric research across scientific disciplines. Traditional high-performance computing (HPC) systems often struggle to meet the needs of data-driven approaches; thus, mdx II was designed to provide a versatile Infrastructure-as-a-Service (IaaS) environment. This platform features 60 CPU nodes and 7 GPU nodes, equipped with cutting-edge Intel and NVIDIA processors, ensuring sufficient computational power for data-intensive research. Additionally, mdx II incorporates a high-performance all-NVMe Lustre file system and S3-compatible object storage, facilitating efficient data access. The platform supports the entire lifecycle of data-centric research, from collection and sharing to processing and publishing. Launched in November 2024, mdx II aims to empower researchers, including those with limited HPC experience, by providing essential tools for collaboration and data reuse. This paper outlines the requirements, design, and current status of the mdx II platform, highlighting its role in fostering data-centric research in Japan.

POS129s1 – Policy Evaluation Platform for Parallel Multi-Agent Simulation on High Performance Computing Infrastructure

Fukuharu Tanaka

Abstract: In urban planning, accurately modeling human movement and traffic flow is essential for effective decision-making. While data-driven simulations aid in congestion mitigation and evacuation planning, existing platforms struggle to systematically compare policies due to high computational costs. This study proposes a scalable platform leveraging supercomputers to accelerate policy exploration and comparison.
The platform operates in two environments: a local system and high-performance computing infrastructure (HPCI). Locally, real-world data from GPS, cellular networks, LiDAR, and cameras is used to reconstruct pedestrian flow and generate initial simulations. On HPCI, multiple policy simulations run in parallel, significantly improving execution speed. Bayesian optimization or genetic algorithms further enhance policy optimization.
However, simple road network partitioning can hinder agent decision-making, as pedestrians in separate segments may lack access to cross-segment information. To address this, status-sharing mechanisms are necessary. Frequent synchronization, however, reduces efficiency. To mitigate this, we propose a neural network-based method that predicts future cell states and transmits interactions to neighboring cells, minimizing synchronization frequency while maintaining decision accuracy. This approach balances computational efficiency and policy evaluation fidelity in large-scale simulations.

POS130s1 – An API remoting system for accessing large-scale array data on-demand

Keichi Takahashi

Abstract: This poster introduces an API remoting system for on-demand access to large-scale multi-dimensional array data stored in remote high-performance computing environments. As the volume of scientific data continues to grow, efficient data transfer becomes critical, particularly when users require only small subsets of massive datasets. The proposed system utilizes a client-server architecture, where the server manages data storage, and the client interfaces with user applications through a NumPy-like API. This system effectively reduces network bandwidth usage and reduces latency by streaming only the requested data slices. Unlike existing solutions that necessitate data conversion or are limited to specific file formats, this approach allows direct access to various array file formats, including NumPy, HDF5, and netCDF. Future work will focus on enhancing performance through caching, prefetching, and compression, further optimizing the user experience in scientific data analysis.
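
The client-side behavior might look like the following sketch, where a `RemoteArray` proxy with a NumPy-like `__getitem__` fetches only the requested slice; here a memory-mapped local file stands in for the remote server, and all names are hypothetical.

```python
import os
import tempfile
import numpy as np

class RemoteArray:
    """NumPy-like proxy: indexing materializes only the requested slice.
    A memory-mapped local .npy file stands in for the remote server."""
    def __init__(self, path):
        self._arr = np.load(path, mmap_mode='r')  # no bulk read at open time
        self.shape = self._arr.shape
        self.dtype = self._arr.dtype

    def __getitem__(self, key):
        # In a real client this would be one network request carrying only
        # the slice specification, not the whole array.
        return np.asarray(self._arr[key])

# Demo: a hypothetical large on-disk dataset
path = os.path.join(tempfile.mkdtemp(), 'field.npy')
np.save(path, np.arange(1000 * 1000, dtype=np.float32).reshape(1000, 1000))

remote = RemoteArray(path)
tile = remote[10:12, 0:4]     # only these 8 values are actually read
print(remote.shape, tile[0])
```

Because the proxy mimics NumPy indexing, analysis scripts can stay unchanged while the data access pattern shifts from bulk download to on-demand slices.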

POS131s1 – Towards High-Performance and City-Scale Human Behavior Simulation with LLM-Powered Multi-Agent System

Haruki Yonekura

Abstract: As urbanization accelerates, modeling human behavior in city-scale settings plays an increasingly critical role in driving smart-society development, such as Japan’s Society 5.0 concept. Existing simulation approaches face a challenge in balancing realism and scalability, and often depend on oversimplified agent behavior that fails to capture the complexity of human decision-making. In contrast, in this work we extend past studies of LLM-facilitated smart-home simulators to a city-scale simulation in which LLM-powered agents engage in real-life decision processes.
We identify key challenges, including high computational costs, factorially growing interactions between agents, and the demand for efficient environment representation. Our initial benchmarking on the Fugaku supercomputer shows that a 2-bit quantized Llama3-8B model generates text at a pace of only 0.07 tokens per second, necessitating optimizations such as caching and parallel decoding. In addition, to manage long simulations effectively, we introduce a hierarchical inference scheme that leverages modular sub-models specialized for particular urban settings. By combining compressed representations and decentralized inference, our approach aims to maximize both computational efficiency and behavioral realism.
Our work extends the state of the art in digital-twin simulations by integrating multi-agent scalability with high-fidelity decision processes, enabling real-world urban planning and smart-city development in practice.

POS133s1 – Reinforcement Learning by Quantum Computation

Su Thet Htar

Abstract: Reinforcement Learning (RL) has proven effective in solving complex decision-making problems but struggles with computational efficiency in high-dimensional environments. To address these limitations, we propose a quantum-based RL approach that leverages quantum superposition and Grover’s quantum search algorithm. By encoding the classical Markov Decision Process (MDP) within a quantum circuit, our method enables the parallel exploration of multiple agent actions and environment responses, leading to substantial computational speedups. Furthermore, Grover’s algorithm accelerates the trajectory optimization, allowing the quantum agent to identify the most rewarding path in a single iteration. Our results demonstrate that Quantum Reinforcement Learning (QRL) effectively replicates classical MDP behavior while outperforming classical Q-learning algorithms in terms of exploration and decision-making speed, offering a promising and efficient alternative for reinforcement learning tasks.
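
Grover's amplitude amplification, the search primitive the abstract builds on, can be simulated classically in a few lines. This sketch searches for one marked basis state and is an illustration, not the poster's QRL circuit.

```python
import numpy as np

def grover(n_qubits, marked):
    """Classical simulation of Grover search for one marked basis state."""
    N = 2 ** n_qubits
    psi = np.full(N, 1.0 / np.sqrt(N))         # uniform superposition
    iters = int(np.floor(np.pi / 4 * np.sqrt(N)))
    for _ in range(iters):
        psi[marked] *= -1.0                    # oracle: phase-flip the target
        psi = 2.0 * psi.mean() - psi           # diffusion: invert about mean
    return psi

psi = grover(6, marked=37)
probs = psi ** 2
print(int(np.argmax(probs)))  # the marked state dominates the distribution
```

In the QRL setting of the abstract, the oracle would flip the phase of high-reward trajectories instead of a fixed basis state, so that measurement concentrates on the most rewarding path.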

POS134s1 – DataStates: Scalable Data Management in the Age of AI

Bogdan Nicolae

Abstract: Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by introducing powerful AI systems capable of understanding and generating human-like text with remarkable fluency and coherence.
These models, trained on vast amounts of data, can perform a wide range of tasks, such as language translation, text summarization, and knowledge distillation, enabling researchers to navigate complex scientific literature more efficiently.
In a quest to improve the quality of inferences, LLMs are routinely built with billions of parameters, as illustrated by the GPT, LLaMa, BLOOM, and Qwen families of models. Several predictions anticipate that LLMs will soon reach trillion-scale parameter counts, prompting the need to ingest massive amounts of training data and to produce a large number of reusable data artifacts: embeddings, model checkpoints, cached attention computations during inference, etc. High-performance storage, provenance tracking, and reuse of these data artifacts are essential for enabling scalability and cost-effectiveness in the operation of LLMs. This poster introduces DataStates, a scalable lineage-driven data management framework for evolving datasets. We focus on several contributions that highlight the benefit of lineage-driven data management in solving the challenges outlined above.