Location: Room P9 – Peony Jr 4411 (Level 4)
Abstract: The convergence of AI and High-Performance Computing (HPC) is revolutionizing deep learning workflows, enabling scalable model training, fine-tuning, and inference. As AI workloads grow in complexity, HPC systems—originally designed for scientific simulations—are evolving with cutting-edge hardware like GPUs and high-speed interconnects, making them ideal for AI-specific tasks. This transformation is fostering new practices in HPC, such as containerized environments, distributed training, and optimized resource management. This tutorial will provide a comprehensive overview of best practices for utilizing HPC platforms for AI development. Key topics include distributed training with PyTorch’s Distributed Data Parallel (DDP), horizontal and vertical scaling approaches, and container technologies like Enroot for reproducibility. Attendees will gain hands-on experience in setting up HPC environments, managing workloads, and scaling models, equipping them to leverage HPC infrastructure for high-performance AI model training.
Workshop URL:
GitHub: https://github.com/snsharma1311/SCA-2025-DistributedTraining (Will be up by 15th Feb, 2025)
Important Notes / Prerequisites:
- Participants are required to bring their own laptops for the hands-on sessions. Please install an SSH client for remote access to the HPC system.
- Participants can find sample code and references in our GitHub repository (https://github.com/snsharma1311/SCA-2025-DistributedTraining) by 15-02-2025. We will also keep the repository updated with VPN setup instructions and other prerequisites.
- The tutorial is designed for intermediate-level users who are familiar with HPC systems and have experience in AI development. Ideal participants should have a basic understanding of:
- HPC and AI software stacks
- Python programming
- Linux environments
- PyTorch for deep learning
POC Details: For enquiries, please contact: shashank.sharma@cdac.in
Agenda:
Introduction to AI-focused HPC Setups (30 minutes)
Presenter: Mr. Shashank Sharma / Mr. Anandhu Nair
Theory & Hands-On
- HPC hardware configurations for AI workloads: CPUs, GPUs, and networking.
- Software tools: Environment setup, libraries, virtual environments.
- Job management with SLURM and container technologies.
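As a small taste of how SLURM-managed jobs hand rank information to a training script, the sketch below reads the per-task environment variables SLURM sets for each launched process (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`); the single-process fallback defaults are our own addition so the snippet also runs outside a job.

```python
import os

def slurm_rank_info(env=None):
    """Derive distributed-training rank info from SLURM's per-task
    environment variables, falling back to a single-process default."""
    env = env if env is not None else os.environ
    rank = int(env.get("SLURM_PROCID", 0))        # global rank of this task
    world_size = int(env.get("SLURM_NTASKS", 1))  # total number of tasks
    local_rank = int(env.get("SLURM_LOCALID", 0)) # rank within this node
    return {"rank": rank, "world_size": world_size, "local_rank": local_rank}

if __name__ == "__main__":
    # Outside a SLURM job the defaults describe a single-process run.
    print(slurm_rank_info())
```

A launcher script typically passes these values to the training framework so each process knows which GPU to bind and which data shard to read.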
Distributed Training Concepts (30 minutes)
Presenter: Mr. Shashank Sharma
Theory
- Scaling strategies: Vertical vs. Horizontal scaling.
- Distributed training theory: data, model, and hybrid parallelism.
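The key identity behind data parallelism can be checked in a few lines of plain Python: for a loss that is a mean over samples, averaging per-shard gradients over equal-sized shards reproduces the full-batch gradient. The one-parameter linear model and the data below are invented purely for illustration.

```python
# Data parallelism in miniature: for a 1-parameter model y = w * x with
# mean-squared loss, the full-batch gradient equals the average of the
# per-shard gradients when the shards are equal in size.

def grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad(w, xs, ys)

# Two "workers", each holding half of the batch.
g0 = grad(w, xs[:2], ys[:2])
g1 = grad(w, xs[2:], ys[2:])
averaged = (g0 + g1) / 2

print(full, averaged)  # the two gradients agree
```

Model and hybrid parallelism split the network itself (layers or tensor shards) rather than the batch, which is why they trade this simple averaging step for more intricate communication patterns.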
Distributed Training with PyTorch (1 hour)
Presenter: Mr. Kishor Y D
Hands-On
- Multi-GPU training using PyTorch DDP.
- Code walkthrough and practical exercises with SLURM.
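Ahead of the hands-on, the all-reduce step that DDP performs after each backward pass can be sketched without any GPUs: every replica computes a gradient on its own data shard, the gradients are averaged across replicas, and each replica applies the identical averaged update, so the model copies never diverge. This is a stdlib-only simulation; real DDP wraps the model with `torch.nn.parallel.DistributedDataParallel` and performs the averaging over an NCCL or Gloo backend.

```python
# Simulated DDP step on two replicas of a 1-parameter model w.
# Each replica sees a different data shard; an "all-reduce" averages
# the gradients so every replica applies the same update.

def grad(w, shard):
    # mean gradient of (w*x - y)^2 over this replica's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

shards = [
    [(1.0, 2.0), (2.0, 4.0)],  # replica 0's data
    [(3.0, 6.0), (4.0, 8.0)],  # replica 1's data
]
weights = [0.5, 0.5]  # identical initial copies, as DDP guarantees
lr = 0.01

for step in range(5):
    local_grads = [grad(w, s) for w, s in zip(weights, shards)]
    avg = sum(local_grads) / len(local_grads)  # the all-reduce
    weights = [w - lr * avg for w in weights]  # same update everywhere

assert weights[0] == weights[1]  # replicas remain in lockstep
print(weights[0])
```

In the session, SLURM launches one such process per GPU and DDP handles the gradient averaging transparently inside `loss.backward()`.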
Containerized Training with Enroot & DeepSpeed Demonstration (1 hour)
Presenter: Ms. Sowmya Shree
Hands-On & Demo
- Containerizing deep learning workflows.
- Best practices for Enroot in multi-node environments.
- DeepSpeed demonstration: how to train with fewer resources.
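The "fewer resources" claim behind DeepSpeed's ZeRO optimizer can be made concrete with back-of-the-envelope arithmetic: in mixed-precision Adam training, each parameter costs roughly 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights, momentum, and variance), and the successive ZeRO stages partition the optimizer states, then gradients, then parameters across GPUs. The accounting below follows the estimates in the ZeRO paper; the 7B-parameter model size is an arbitrary example.

```python
# Rough per-GPU memory for mixed-precision Adam training of an
# N-parameter model, using the ZeRO paper's 2 + 2 + 12 bytes/param
# accounting (fp16 params, fp16 grads, fp32 optimizer states).

def per_gpu_gb(n_params, n_gpus, stage):
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params  # bytes
    if stage >= 1:   # ZeRO-1: partition optimizer states
        o /= n_gpus
    if stage >= 2:   # ZeRO-2: also partition gradients
        g /= n_gpus
    if stage >= 3:   # ZeRO-3: also partition parameters
        p /= n_gpus
    return (p + g + o) / 1e9

n = 7_000_000_000  # example: a 7B-parameter model
for stage in range(4):
    print(f"ZeRO-{stage} on 8 GPUs: {per_gpu_gb(n, 8, stage):.1f} GB/GPU")
```

For this example, per-GPU memory drops from about 112 GB with plain data parallelism to about 14 GB under ZeRO-3 on 8 GPUs, which is the sense in which DeepSpeed lets the same model train on smaller allocations.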