10-13 March 2025
Sands Expo and Convention Centre
Marina Bay Sands, Singapore

Location: Room P9 – Peony Jr 4411 (Level 4)

Abstract: The convergence of AI and High-Performance Computing (HPC) is revolutionizing deep learning workflows, enabling scalable model training, fine-tuning, and inference. As AI workloads grow in complexity, HPC systems—originally designed for scientific simulations—are evolving with cutting-edge hardware like GPUs and high-speed interconnects, making them ideal for AI-specific tasks. This transformation is fostering new practices in HPC, such as containerized environments, distributed training, and optimized resource management. This tutorial will provide a comprehensive overview of best practices for utilizing HPC platforms for AI development. Key topics include distributed training with PyTorch’s Distributed Data Parallel (DDP), horizontal and vertical scaling approaches, and container technologies like Enroot for reproducibility. Attendees will gain hands-on experience in setting up HPC environments, managing workloads, and scaling models, equipping them to leverage HPC infrastructure for high-performance AI model training.

Workshop URL:
GitHub: https://github.com/snsharma1311/SCA-2025-DistributedTraining (available by 15 February 2025)

Important Notes/Prerequisites:

  1. Participants must bring their own laptops for the hands-on sessions. Please install an SSH client to access the HPC system remotely.
  2. Sample code and references will be available in our GitHub repository (https://github.com/snsharma1311/SCA-2025-DistributedTraining) by 15 February 2025. VPN setup instructions and other prerequisites will also be posted in the repository.
  3. The tutorial is designed for intermediate-level users who are familiar with HPC systems and have experience in AI development. Participants should ideally have a basic understanding of:
    • HPC and AI software stacks
    • Python programming
    • Linux environments
    • PyTorch for deep learning

POC Details: For enquiries, please contact: shashank.sharma@cdac.in

Agenda:

Introduction to AI-focused HPC Setups (30 minutes)
Presenters: Mr. Shashank Sharma / Mr. Anandhu Nair

Theory & Hands-On

Distributed Training Concepts (30 minutes)
Presenter: Mr. Shashank Sharma

Theory

Distributed Training with PyTorch (1 hour)
Presenter: Mr. Kishor Y D

Hands-On

Containerized Training with Enroot & DeepSpeed Demonstration (1 hour)
Presenter: Ms. Sowmya Shree

Hands-On & Demo