Location: Room O5 – Orchid Jr 4211 (Level 4)
Abstract: High-performance networking technologies are generating considerable excitement for building next-generation High-End Computing (HEC) systems for HPC and AI. These systems combine GPGPUs, accelerators, and Data Center Processing Units (DPUs), and serve a variety of application workloads.
This tutorial will provide an overview of these emerging technologies, their architectural features, current market standing, and suitability for designing HEC systems. It will start with a brief overview of the IB, HSE, RoCE, and Omni-Path interconnects, and then present their architectural features in depth. This will be followed by an overview of the emerging NVLink, NVLink2, NVSwitch, EFA, and Slingshot architectures.
We will then present advanced features of commodity high-performance networks that enable performance and scalability, and provide an overview of enhanced offload-capable network adapters such as DPUs/IPUs (Smart NICs), along with their capabilities and features. Next, we will give an overview of software stacks for high-performance networks, such as OpenFabrics Verbs, Libfabric, and UCX, and compare their performance. After that, we will present the challenges in designing MPI libraries for these interconnects, along with solutions and sample performance numbers.
For any enquiries, please contact: Panda, Dhabaleswar <panda@cse.ohio-state.edu>; Subramoni, Hari <subramoni.1@osu.edu>; Michalowicz, Benjamin <michalowicz.2@osu.edu>
Workshop URL: https://nowlab.cse.ohio-state.edu/tutorials/scasia25-hpn/
Agenda:
- Trends in High-End Computing
- Why High-Performance Networking for HPC and AI?
- TCP vs User-level communication protocols
- Requirements (communication, I/O, performance, cost, RAS) from the perspective of designing next generation high-end systems and scalable data centers
- Communication Model and Semantics of High-Performance Networks
- Architectural Overview of High-Performance Networks
- IB, HSE, their Convergence and Features
- Omni-Path Interconnect Architecture
- NVLink and NVSwitch Interconnect Architecture
- AMD Infinity Fabric Interconnect Architecture
- Amazon EFA Interconnect Architecture
- Cray Slingshot Interconnect Architecture
- Overview of Emerging Smart Network Interfaces
- Architectural features and principles of offloading
- Acceleration capabilities for HPC and AI applications
- High-Performance Network Deployments for AI Workloads
- Overview and architectural features of Cerebras WSE
- Overview and architectural features of Habana Gaudi
- Overview of Software Stacks for Commodity High-Performance Networks
- Vendors, Switches, and Host Channel Adapters
- Overview of OpenFabrics Architecture and Convergence
- Pointers to IB, Omni-Path, and HSE Installations
- Sample Case Studies and Performance Numbers
- Hands-on Exercises
- Evaluating and understanding the performance of high-performance networks at the fabric level
- Evaluating and understanding the performance of high-performance networks at the MPI level
- Conclusions, Final Q&A, and Discussion