High Performance Computing for AI / ML Workloads

Description of the course

In today’s data-driven world, Artificial Intelligence (AI) and Machine Learning (ML) are at the forefront of innovation, solving complex problems across industries. However, training and deploying advanced AI/ML models often require immense computational power and efficiency. High-Performance Computing (HPC) provides the infrastructure and methodologies to meet these demands, enabling faster training, larger-scale experiments, and real-time applications.

This two-day, 10-hour course offers a comprehensive introduction to HPC in the context of AI/ML. Participants will learn about parallel computing, distributed training, and scalable data pipelines while gaining hands-on experience with cutting-edge tools and frameworks. Designed for AI/ML practitioners, researchers, and enthusiasts, the course bridges the gap between theory and application, empowering attendees to leverage HPC for their AI/ML workflows.

By the end of the course, participants will be equipped with the skills to optimize their models for HPC environments, handle large datasets efficiently, and explore advanced techniques in distributed AI/ML. Whether you’re looking to accelerate your projects or prepare for the future of AI at scale, this course is your gateway to mastering HPC for AI/ML.

Fee Structure

Students : Rs. 2,360 (Rs. 2,000 + 18% GST)

College teachers : Rs. 3,540 (Rs. 3,000 + 18% GST)

Industry professionals : Rs. 5,900 (Rs. 5,000 + 18% GST)

Intended Audience

This course is designed for a wide range of participants interested in leveraging High-Performance Computing (HPC) to enhance their AI/ML workflows.

  1. AI/ML Practitioners
    • Data scientists, machine learning engineers, and AI developers who want to optimize their models for large-scale and distributed environments.
    • Professionals aiming to reduce training time, handle large datasets, and improve model performance using HPC infrastructure.
  2. Researchers and Academics
    • Researchers working on computationally intensive AI/ML projects in fields such as genomics, climate modeling, astrophysics, or finance.
    • Academics seeking to explore HPC applications for cutting-edge research and large-scale simulations.
  3. Engineers and IT Professionals
    • Systems engineers and DevOps professionals responsible for deploying and managing HPC infrastructure for AI/ML workloads.
    • Cloud architects looking to integrate HPC capabilities into cloud-based AI/ML solutions.
  4. Graduate and Postgraduate Students
    • Students in AI/ML, computer science, data science, or related fields who want to build hands-on expertise with HPC tools and technologies.
    • Those preparing for research or industry roles involving large-scale AI/ML projects.
  5. Enterprise and Industry Teams
    • Teams from industries such as healthcare, finance, manufacturing, and retail seeking to scale AI/ML applications to handle real-world demands.
    • Organizations aiming to enhance their AI/ML capabilities by integrating HPC for better efficiency and performance.
  6. AI/ML Enthusiasts
    • Enthusiasts eager to explore the intersection of HPC and AI/ML, and gain insights into how advanced computing can drive innovation in the field.

This course is ideal for anyone interested in combining AI/ML expertise with advanced computational skills to unlock the full potential of high-performance technologies.

Eligibility criteria

The course is open to the following categories of participants:

  1. Students
  2. Faculty
  3. Industry professionals

To ensure participants can fully benefit from the course, they should meet the following prerequisites:

1. Programming Knowledge

  • Proficiency in Python, especially for AI/ML applications.
    • Understanding of key libraries like NumPy, Pandas, and Matplotlib.
  • Familiarity with basic shell scripting and command-line tools.

2. AI/ML Foundations

  • Basic knowledge of machine learning and deep learning concepts, including:
    • Training, validation, and testing workflows.
    • Common types of models (e.g., neural networks, decision trees).
  • Experience with frameworks like TensorFlow or PyTorch.

3. Computational Basics

  • An understanding of concepts like:
    • CPUs, GPUs, and memory hierarchies.
    • Basic parallelism and distributed systems.

4. Familiarity with Data Handling

  • Ability to work with datasets using tools like Pandas or SQL.
  • Knowledge of data preprocessing and feature engineering.

5. System and Resource Management (Optional but Helpful)

  • Basic experience with Linux/Unix environments.
  • Familiarity with resource management tools (e.g., SLURM) is a plus.

Recommended but Not Mandatory

  • Prior exposure to HPC environments or cloud platforms (e.g., AWS, Google Cloud, Azure).
  • Basic understanding of performance profiling tools (e.g., TensorBoard, NVIDIA Nsight).

Who Can Join Without These Prerequisites?

Participants without direct experience in all these areas can still benefit if they:

  • Are motivated to quickly learn the basics of AI/ML and parallel computing.
  • Have a technical background (e.g., engineering, computer science) that enables them to grasp the concepts during the course.

If needed, we can also provide pre-course resources or a preparatory session to help participants get up to speed.

Modules of the course - Day 1

Day 1: Foundations of HPC in the AI Era (5 hours)

Session 1: AI/ML Meets HPC – Why It Matters (1 hour)

  • Computational demands of modern AI/ML models
  • Limits of single-node training
  • What HPC brings to the table: throughput, scale, and reproducibility
  • Real-world case studies (AlphaFold, BERT pretraining, CFD+AI)

Session 2: HPC Hardware and Architecture Primer (1.5 hours)

  • Multicore CPUs vs many-core GPUs
  • Memory hierarchy: cache, RAM, shared memory, NUMA
  • Interconnects: PCIe, NVLink, InfiniBand
  • Storage systems for AI: IOPS vs throughput

Session 3: Parallel Computing Fundamentals (1.5 hours)

  • Types of parallelism: data, task, pipeline
  • Concepts of scalability: strong vs weak scaling
  • Speedup and efficiency: Amdahl’s and Gustafson’s laws
  • Communication and synchronization overheads
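The two scaling laws named above can be checked numerically. A minimal sketch (function names are ours, not part of any library) that prints the predicted speedup for a job whose serial fraction is 5%:

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: S(n) = 1 / (s + (1 - s) / n).
    Speedup is capped at 1/s no matter how many workers you add."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

def gustafson_speedup(serial_fraction: float, n_workers: int) -> float:
    """Gustafson's law: S(n) = n - s * (n - 1).
    If the problem grows with the machine, scaling looks far more optimistic."""
    return n_workers - serial_fraction * (n_workers - 1)

if __name__ == "__main__":
    s = 0.05  # assume 5% of the work is inherently serial
    for n in (1, 8, 64, 1024):
        print(f"n={n:5d}  Amdahl={amdahl_speedup(s, n):7.2f}  "
              f"Gustafson={gustafson_speedup(s, n):8.2f}")
```

Even with 1,024 workers, Amdahl's law caps this job's speedup near 1/0.05 = 20x, while Gustafson's law (scaled problem size) predicts speedup close to the worker count.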

Session 4: Hands-on Session I – Simulating AI Workloads on HPC (1 hour)

  • SSH + module environment + resource introspection (htop, nvidia-smi)
  • Batch job submission (e.g., sbatch, queue selection)
  • Toy AI job (e.g., matrix multiplication, logistic regression) in Python with NumPy
  • Measure CPU vs GPU timing using time and profiling basics
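A minimal version of the CPU-side timing exercise might look like the following, assuming NumPy is installed (on a cluster this script would be submitted via sbatch; the GPU comparison would swap in a library such as CuPy):

```python
import time
import numpy as np

def time_matmul(n: int, repeats: int = 3) -> float:
    """Best wall-clock time in seconds for an n x n matrix multiply on CPU."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):  # repeat and keep the best to reduce timing noise
        t0 = time.perf_counter()
        c = a @ b
        best = min(best, time.perf_counter() - t0)
    assert c.shape == (n, n)
    return best

if __name__ == "__main__":
    for n in (256, 512, 1024):
        t = time_matmul(n)
        gflops = 2 * n**3 / t / 1e9  # a dense matmul needs ~2*n^3 floating-point ops
        print(f"n={n:5d}  {t * 1000:8.2f} ms  {gflops:6.1f} GFLOP/s")
```

Converting the time into GFLOP/s makes runs on different hardware directly comparable.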

Modules of the course - Day 2

Day 2: Scaling AI: Strategies and Experiments (5 hours)

Session 5: Data and Model Scaling Without Framework Lock-in (1 hour)

  • Batch size and memory tradeoffs
  • Manual data sharding and batching
  • Naive vs optimized I/O paths (e.g., CSV vs memory-mapped arrays)
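The CSV-versus-binary tradeoff above can be demonstrated in a few lines, assuming NumPy; the exact gap depends on disk speed and dataset size, but text parsing reliably loses to a memory-mapped binary file:

```python
import os
import tempfile
import time
import numpy as np

rows, cols = 5_000, 50
data = np.random.rand(rows, cols).astype(np.float32)
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "data.csv")
npy_path = os.path.join(tmpdir, "data.npy")

np.savetxt(csv_path, data, delimiter=",")  # naive path: every value printed as text
np.save(npy_path, data)                    # optimized path: raw bytes, memory-mappable

t0 = time.perf_counter()
a = np.loadtxt(csv_path, delimiter=",", dtype=np.float32)  # parses every value
t_csv = time.perf_counter() - t0

t0 = time.perf_counter()
m = np.load(npy_path, mmap_mode="r")   # pages are mapped lazily, no parsing
col_mean = float(m[:, 0].mean())       # touches only the bytes of one column
t_mmap = time.perf_counter() - t0

print(f"CSV load:    {t_csv * 1000:8.1f} ms")
print(f"memmap read: {t_mmap * 1000:8.1f} ms")
```

The memory-mapped version also avoids loading columns you never touch, which matters once datasets exceed RAM.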

Session 6: Distributed Training – Concepts and Challenges (1.5 hours)

  • Manual implementation of data parallelism (model replica + averaging)
  • Inter-node communication bottlenecks
  • Fault tolerance and checkpointing as system design
  • Thought experiment: how to train an LLM from scratch with 100 nodes
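The manual data-parallel idea above can be sketched in plain NumPy: each replica holds a copy of the weights and computes a gradient on its own shard, and an averaging step (the role an all-reduce plays on a real cluster) recovers the full-batch gradient. The model and function names here are illustrative, not from any framework:

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)
w = np.zeros(5)

n_replicas = 4                          # pretend each shard lives on its own node
X_shards = np.array_split(X, n_replicas)
y_shards = np.array_split(y, n_replicas)

# Each replica computes a local gradient on its shard; averaging them
# (with equal-sized shards) recovers exactly the gradient a single node
# would have computed on the full batch.
local_grads = [mse_grad(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
avg_grad = np.mean(local_grads, axis=0)
full_grad = mse_grad(w, X, y)
assert np.allclose(avg_grad, full_grad)
```

On real hardware, the cost of this averaging step (network bandwidth and latency between nodes) is precisely the communication bottleneck the session discusses.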

Session 7: Performance Engineering for AI Workloads (1 hour)

  • When to parallelize: cost-benefit analysis
  • Profiling: CPU-bound vs I/O-bound vs memory-bound jobs
  • Hands-on use of tools: perf, nvprof, nvtop, iostat, sar
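The tools listed above (perf, nvprof, and friends) are system-level CLIs; as a Python-side warm-up, the standard library's cProfile already shows where wall-clock time goes in a script. A sketch with two deliberately different workloads (function names and sizes are illustrative, and NumPy is assumed):

```python
import cProfile
import io
import pstats
import numpy as np

def cpu_bound(n=200):
    """Dense linear algebra: lots of arithmetic per byte touched."""
    a = np.random.rand(n, n)
    return np.linalg.eigvals(a @ a.T)

def memory_bound(n=2_000_000):
    """Strided reads over a large array: little arithmetic per byte."""
    a = np.random.rand(n)
    return (a[::2] + a[1::2]).sum()

prof = cProfile.Profile()
prof.enable()
cpu_bound()
memory_bound()
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top 5 entries by cumulative time
```

Once a hotspot is identified here, the session's system tools can tell you whether it is bound by compute, memory bandwidth, or I/O.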

Session 8: Hands-on Session II – AI Model Scaling Experiment (1.5 hours)

  • Run small ML models (e.g., MLP/CNN) with increased data sizes and batch sizes
  • Compare performance on 1-core, 4-core, and GPU configurations
  • Write and run Python code to manually parallelize data preprocessing (e.g., using multiprocessing)
  • Visualize speedup trends
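A starting point for the manual-parallelization exercise above, using the standard library's multiprocessing and NumPy (the preprocessing step here is a stand-in; for chunks this small, process startup and pickling overhead can outweigh the gains, and speedup only appears as per-chunk work grows):

```python
import time
import numpy as np
from multiprocessing import Pool

def preprocess(chunk):
    """Toy preprocessing step: standardize each column of a chunk of rows."""
    return (chunk - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-8)

if __name__ == "__main__":
    data = np.random.rand(200, 5_000)
    chunks = np.array_split(data, 8)

    t0 = time.perf_counter()
    serial = [preprocess(c) for c in chunks]
    t_serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    with Pool(processes=4) as pool:   # 4 worker processes, one chunk at a time
        parallel = pool.map(preprocess, chunks)
    t_parallel = time.perf_counter() - t0

    for s, p in zip(serial, parallel):
        assert np.allclose(s, p)      # same numerical result either way

    print(f"serial:   {t_serial * 1000:7.1f} ms")
    print(f"parallel: {t_parallel * 1000:7.1f} ms")
```

Plotting t_serial / t_parallel against the worker count is exactly the speedup-trend visualization the session asks for.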

Session Details

Course duration – 2 days

Hands-on components

  1. Exploring an HPC Environment
    • Connect to an HPC cluster using SSH.
    • Navigate the file system and explore hardware configurations.
    • Submit and monitor basic tasks on the cluster.
  2. Parallel Computing with Python
    • Write and execute Python scripts utilizing multiprocessing for parallelism.
    • Perform a data-parallel computation using frameworks like Dask or Ray.
  3. Profiling AI/ML Models on CPU vs. GPU
    • Benchmark a sample deep learning model (e.g., MNIST or CIFAR-10) using PyTorch or TensorFlow.
    • Compare training speeds and performance on CPU and GPU.
  4. Running Jobs with a Scheduler (e.g., SLURM)
    • Learn to allocate resources and submit jobs using SLURM or similar schedulers.
    • Optimize resource usage and monitor job progress.
  5. Distributed Training with Deep Learning Frameworks
    • Set up a distributed training pipeline using Horovod or PyTorch Distributed.
    • Train a model on multiple GPUs or nodes and monitor the performance.
  6. Performance Profiling and Debugging
    • Use tools like NVIDIA Nsight, TensorBoard, or PyTorch Profiler to analyze model performance.
    • Identify bottlenecks in memory usage or computation and resolve them.
  7. Building a Scalable Data Pipeline
    • Implement a distributed data preprocessing pipeline using Apache Spark or Dask.
    • Feed processed data into an ML model for training or inference.
  8. Fine-Tuning Pre-Trained Models on HPC
    • Fine-tune a pre-trained model (e.g., BERT or ResNet) on a real-world dataset.
    • Experiment with distributed training and mixed precision to optimize performance.
  9. Case Study: Real-World Application
    • Work in small groups to solve a practical problem using HPC for AI/ML.
    • Present the results, discuss challenges, and share insights with the class.

Certification Criteria

Certification is based on the following:

  1. Full attendance on both days (mandatory)
  2. Passing an assessment comprising MCQs and short-answer questions

Key Learning Outcomes

  1. Understand HPC Fundamentals
    • Grasp the core principles of High-Performance Computing and its role in accelerating AI/ML workloads.
    • Differentiate between types of parallelism (data, task, and model parallelism) and apply them to AI/ML tasks.
  2. Leverage HPC Tools and Frameworks
    • Use industry-standard tools like SLURM, Dask, Horovod, and PyTorch Distributed for managing and scaling AI/ML workloads.
    • Optimize AI/ML workflows with job scheduling and resource management on HPC systems.
  3. Deploy and Optimize Distributed Training Pipelines
    • Implement distributed training strategies for deep learning models using GPUs and HPC clusters.
    • Monitor and profile AI/ML workloads to identify and resolve performance bottlenecks.
  4. Build Scalable Data Pipelines
    • Design and deploy data preprocessing pipelines for large-scale AI/ML applications using tools like Apache Spark and Dask.
    • Handle challenges of large datasets, including distributed storage and efficient data loading.
  5. Gain Practical Hands-On Experience
    • Run AI/ML models on HPC hardware, comparing CPU, GPU, and distributed configurations.
    • Apply learned concepts to real-world case studies, fine-tuning pre-trained models for domain-specific tasks.
  6. Explore Future Trends and Applications
    • Understand emerging trends in HPC, including quantum computing, edge AI, and exascale systems.
    • Explore practical applications of HPC in AI/ML, such as genomics, weather forecasting, and training large language models.

Participants will leave the course with a solid foundation in HPC for AI/ML, ready to apply these skills to accelerate their own projects and explore advanced opportunities in the field.

Course Outcomes

By the end of the course, participants will:

  • Understand the computational anatomy of AI/ML models
  • Be able to profile, optimize, and scale AI workloads without relying heavily on black-box frameworks
  • Gain practical experience running and scaling AI jobs on HPC infrastructure
  • Think critically about resource usage, parallel efficiency, and system bottlenecks

 
