Zhenhua (Zane) He

from Texas A&M University
Dr. Zhenhua He is an Associate Research Scientist at Texas A&M University High Performance Research Computing (HPRC). With a background in geoscience and computer science, his expertise spans geomechanical modeling and computer vision. At HPRC, Dr. He leads efforts to benchmark novel high performance computing (HPC) technologies for AI and machine learning (ML) workflows and trains researchers nationwide to effectively utilize these advanced systems.

Distributed AI/ML model training on HPC clusters

This tutorial covers the fundamentals of High Performance Computing (HPC) and Artificial Intelligence (AI), why HPC is important for AI, and distributed training strategies, with a focus on PyTorch Distributed Data Parallel (DDP). Through hands-on exercises, participants will learn step by step how to scale training from a single GPU to multiple GPUs on a single node, and finally to a multi-node distributed environment.