Distributed AI/ML model training on HPC clusters
This tutorial will cover the basic concepts and fundamentals of High Performance Computing (HPC) and Artificial Intelligence (AI), why HPC is important for AI, and distributed training strategies, with a focus on PyTorch Distributed Data Parallel (DDP). Through hands-on exercises, participants will learn step by step how to scale training from a single GPU to multiple GPUs on a single node, and finally extend to a multi-node distributed environment.
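To give a flavor of what the exercises build toward, below is a minimal sketch of a DDP training loop. The toy linear model and random dataset are placeholders (assumptions for illustration, not the tutorial's actual material); the DDP wrapping, `DistributedSampler` sharding, and `torchrun`-provided environment variables are the standard PyTorch pattern.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model as placeholders for real training data.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    # DistributedSampler gives each rank a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Linear(32, 1).cuda(local_rank)
    # DDP wraps the model; gradients are all-reduced across ranks.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradient synchronization happens here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same script scales across the three stages the tutorial covers: launch it on one GPU with `torchrun --nproc_per_node=1 train.py`, on multiple GPUs of a single node by raising `--nproc_per_node`, and across nodes by adding `--nnodes` and a rendezvous endpoint (`--rdzv_backend=c10d --rdzv_endpoint=HOST:PORT`).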