Distributed AI/ML model training on HPC clusters
This tutorial will cover the basic concepts and fundamentals of High Performance Computing (HPC) and Artificial Intelligence (AI), why HPC is important for AI, and distributed training strategies, with a focus on PyTorch Distributed Data Parallel (DDP). Through hands-on exercises, participants will learn step by step how to scale training from a single GPU to multiple GPUs on a single node, and finally extend to a multi-node distributed environment.
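To give a flavor of what the exercises build toward, below is a minimal sketch of a DDP training loop. The toy linear model and random dataset are placeholders (assumptions for illustration, not the tutorial's actual material); the DDP wrapping, `DistributedSampler` sharding, and `torchrun`-provided environment variables are the standard PyTorch pattern.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model as placeholders for real training data.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    # DistributedSampler gives each rank a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Linear(32, 1).cuda(local_rank)
    # DDP wraps the model; gradients are all-reduced across ranks.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradient synchronization happens here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same script scales across the three stages the tutorial covers: launch it on one GPU with `torchrun --nproc_per_node=1 train.py`, on multiple GPUs of a single node by raising `--nproc_per_node`, and across nodes by adding `--nnodes` and a rendezvous endpoint (`--rdzv_backend=c10d --rdzv_endpoint=HOST:PORT`).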