High Performance Computing for AI / ML Workloads
From: ₹2,000.00
Last Date of Registration :
- Profile of the instructor
- Description of the course
- Fee Structure
- Intended Audience
- Eligibility criteria
- Modules of the course - Day 1
- Modules of the course - Day 2
- Session Details
- Hands on components
- Certification Criteria
- Key Learning outcomes
- Course Outcomes
- Reviews (0)
Description of the course
In today’s data-driven world, Artificial Intelligence (AI) and Machine Learning (ML) are at the forefront of innovation, solving complex problems across industries. However, training and deploying advanced AI/ML models often require immense computational power and efficiency. High-Performance Computing (HPC) provides the infrastructure and methodologies to meet these demands, enabling faster training, larger-scale experiments, and real-time applications.
This two-day, 10-hour course offers a comprehensive introduction to HPC in the context of AI/ML. Participants will learn about parallel computing, distributed training, and scalable data pipelines while gaining hands-on experience with cutting-edge tools and frameworks. Designed for AI/ML practitioners, researchers, and enthusiasts, the course bridges the gap between theory and application, empowering attendees to leverage HPC for their AI/ML workflows.
By the end of the course, participants will be equipped with the skills to optimize their models for HPC environments, handle large datasets efficiently, and explore advanced techniques in distributed AI/ML. Whether you’re looking to accelerate your projects or prepare for the future of AI at scale, this course is your gateway to mastering HPC for AI/ML.
Fee Structure
Students : Rs. 2360 (Rs. 2000 + 18% GST)
College teachers : Rs. 3540 (Rs. 3000 + 18% GST)
Industry Professionals : Rs. 5900 (Rs. 5000 + 18% GST)
Intended Audience
This course is designed for a wide range of participants interested in leveraging High-Performance Computing (HPC) to enhance their AI/ML workflows.
- AI/ML Practitioners
- Data scientists, machine learning engineers, and AI developers who want to optimize their models for large-scale and distributed environments.
- Professionals aiming to reduce training time, handle large datasets, and improve model performance using HPC infrastructure.
- Researchers and Academics
- Researchers working on computationally intensive AI/ML projects in fields such as genomics, climate modeling, astrophysics, or finance.
- Academics seeking to explore HPC applications for cutting-edge research and large-scale simulations.
- Engineers and IT Professionals
- Systems engineers and DevOps professionals responsible for deploying and managing HPC infrastructure for AI/ML workloads.
- Cloud architects looking to integrate HPC capabilities into cloud-based AI/ML solutions.
- Graduate and Postgraduate Students
- Students in AI/ML, computer science, data science, or related fields who want to build hands-on expertise with HPC tools and technologies.
- Those preparing for research or industry roles involving large-scale AI/ML projects.
- Enterprise and Industry Teams
- Teams from industries such as healthcare, finance, manufacturing, and retail seeking to scale AI/ML applications to handle real-world demands.
- Organizations aiming to enhance their AI/ML capabilities by integrating HPC for better efficiency and performance.
- AI/ML Enthusiasts
- Enthusiasts eager to explore the intersection of HPC and AI/ML, and gain insights into how advanced computing can drive innovation in the field.
This course is ideal for anyone interested in combining AI/ML expertise with advanced computational skills to unlock the full potential of high-performance technologies.
Eligibility criteria
The following categories of participants can attend the course:
- Students
- Faculty
- Industry professionals
To ensure participants can fully benefit from the course, they should meet the following prerequisites:
1. Programming Knowledge
- Proficiency in Python, especially for AI/ML applications.
- Understanding of key libraries like NumPy, Pandas, and Matplotlib.
- Familiarity with basic shell scripting and command-line tools.
2. AI/ML Foundations
- Basic knowledge of machine learning and deep learning concepts, including:
- Training, validation, and testing workflows.
- Common types of models (e.g., neural networks, decision trees).
- Experience with frameworks like TensorFlow or PyTorch.
3. Computational Basics
- An understanding of concepts like:
- CPUs, GPUs, and memory hierarchies.
- Basic parallelism and distributed systems.
4. Familiarity with Data Handling
- Ability to work with datasets using tools like Pandas or SQL.
- Knowledge of data preprocessing and feature engineering.
5. System and Resource Management (Optional but Helpful)
- Basic experience with Linux/Unix environments.
- Familiarity with resource management tools (e.g., SLURM) is a plus.
Recommended but Not Mandatory
- Prior exposure to HPC environments or cloud platforms (e.g., AWS, Google Cloud, Azure).
- Basic understanding of performance profiling tools (e.g., TensorBoard, NVIDIA Nsight).
Who Can Join Without These Prerequisites?
Participants without direct experience in all these areas can still benefit if they:
- Are motivated to quickly learn the basics of AI/ML and parallel computing.
- Have a technical background (e.g., engineering, computer science) that enables them to grasp the concepts during the course.
If needed, we can also provide pre-course resources or a preparatory session to help participants get up to speed.
Modules of the course - Day 1
Day 1: Foundations of HPC in the AI Era (5 hours)
Session 1: AI/ML Meets HPC – Why It Matters (1 hour)
- Computational demands of modern AI/ML models
- Limits of single-node training
- What HPC brings to the table: throughput, scale, and reproducibility
- Real-world case studies (AlphaFold, BERT pretraining, CFD+AI)
Session 2: HPC Hardware and Architecture Primer (1.5 hours)
- Multicore CPUs vs many-core GPUs
- Memory hierarchy: cache, RAM, shared memory, NUMA
- Interconnects: PCIe, NVLink, InfiniBand
- Storage systems for AI: IOPS vs throughput
Session 3: Parallel Computing Fundamentals (1.5 hours)
- Types of parallelism: data, task, pipeline
- Concepts of scalability: strong vs weak scaling
- Speedup and efficiency: Amdahl’s and Gustafson’s laws
- Communication and synchronization overheads
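The two scaling laws above can be made concrete with a short calculation. The sketch below (an illustrative example, not part of the course materials) assumes a workload where a fraction p of the work parallelizes perfectly across n workers:

```python
# Illustrative sketch: Amdahl's vs Gustafson's law for a job where
# a fraction p of the work parallelizes perfectly across n workers.

def amdahl_speedup(p: float, n: int) -> float:
    """Fixed-size workload: speedup is capped by the serial fraction."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p: float, n: int) -> float:
    """Scaled workload: the parallel part grows with the worker count."""
    return (1.0 - p) + p * n

if __name__ == "__main__":
    p = 0.95  # 95% of the work is parallelizable
    for n in (4, 16, 64):
        print(f"n={n:3d}  Amdahl={amdahl_speedup(p, n):6.2f}  "
              f"Gustafson={gustafson_speedup(p, n):6.2f}")
```

Even with 95% parallel work, Amdahl's law caps the fixed-size speedup below 20x at 64 workers, while Gustafson's scaled-workload view keeps growing; this tension motivates the weak-vs-strong scaling discussion.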
Session 4: Hands-on Session I – Simulating AI Workloads on HPC (1 hour)
- SSH + module environment + resource introspection (htop, nvidia-smi)
- Batch job submission (e.g., sbatch, queue selection)
- Toy AI job (e.g., matrix multiplication, logistic regression) in Python with NumPy
- Measure CPU vs GPU timing using the time command and basic profiling
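A minimal sketch of the kind of toy job used in this session: timing a NumPy matrix multiplication and reporting achieved GFLOP/s. This is CPU-only; on the cluster the same experiment would be repeated on a GPU (e.g. via CuPy). The helper name time_matmul is illustrative, not from the course materials:

```python
# Time an n x n matrix multiplication with NumPy (CPU only) and
# estimate the achieved floating-point throughput.
import time
import numpy as np

def time_matmul(n: int = 1024, repeats: int = 3) -> float:
    """Return the best wall-clock time over several repeats."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    n = 1024
    t = time_matmul(n)
    gflops = 2 * n**3 / t / 1e9  # a matmul costs ~2*n^3 flops
    print(f"{n}x{n} matmul: {t:.4f} s  ({gflops:.1f} GFLOP/s)")
```

Taking the best of several repeats reduces noise from caching and frequency scaling, which is the usual practice when benchmarking small kernels.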
Modules of the course - Day 2
Day 2: Scaling AI: Strategies and Experiments (5 hours)
Session 5: Data and Model Scaling Without Framework Lock-in (1 hour)
- Batch size and memory tradeoffs
- Manual data sharding and batching
- Naive vs optimized I/O paths (e.g., CSV vs memory-mapped arrays)
Session 6: Distributed Training – Concepts and Challenges (1.5 hours)
- Manual implementation of data parallelism (model replica + averaging)
- Inter-node communication bottlenecks
- Fault tolerance and checkpointing as system design
- Thought experiment: how to train an LLM from scratch with 100 nodes
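The "model replica + averaging" idea from this session can be simulated in a single process. The sketch below trains a linear model with data parallelism: each replica holds the same weights, computes a gradient on its own shard, and the gradients are averaged before the shared update (standing in for an all-reduce). All names and sizes are illustrative:

```python
# Single-process simulation of data-parallel training: shard the data,
# compute per-replica gradients, average them, apply one shared update.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(4096, 2))
y = X @ true_w + 0.01 * rng.normal(size=4096)

n_replicas = 4
shards_X = np.array_split(X, n_replicas)
shards_y = np.array_split(y, n_replicas)

w = np.zeros(2)
lr = 0.1
for step in range(200):
    # each replica computes a local MSE gradient on its own shard
    grads = [2 * xs.T @ (xs @ w - ys) / len(ys)
             for xs, ys in zip(shards_X, shards_y)]
    w -= lr * np.mean(grads, axis=0)  # averaged update = simulated all-reduce

print("learned weights:", w)  # should approach [2, -3]
```

In a real multi-node setting the averaging step is where the inter-node communication bottlenecks discussed above appear, since every replica must exchange gradients each step.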
Session 7: Performance Engineering for AI Workloads (1 hour)
- When to parallelize: cost-benefit analysis
- Profiling: CPU-bound vs I/O-bound vs memory-bound jobs
- Hands-on use of tools: perf, nvprof, nvtop, iostat, sar
Session 8: Hands-on Session II – AI Model Scaling Experiment (1.5 hours)
- Run small ML models (e.g., MLP/CNN) with increased data sizes and batch sizes
- Compare performance on 1-core, 4-core, GPU
- Write and run Python code to manually parallelize data preprocessing (e.g., using multiprocessing)
- Visualize speedup trends
Session Details
Course duration – 2 days
Hands on components
- Exploring an HPC Environment
- Connect to an HPC cluster using SSH.
- Navigate the file system and explore hardware configurations.
- Submit and monitor basic tasks on the cluster.
- Parallel Computing with Python
- Write and execute Python scripts utilizing multiprocessing for parallelism.
- Perform a data-parallel computation using frameworks like Dask or Ray.
- Profiling AI/ML Models on CPU vs. GPU
- Benchmark a sample deep learning model (e.g., MNIST or CIFAR-10) using PyTorch or TensorFlow.
- Compare training speeds and performance on CPU and GPU.
- Running Jobs with a Scheduler (e.g., SLURM)
- Learn to allocate resources and submit jobs using SLURM or similar schedulers.
- Optimize resource usage and monitor job progress.
- Distributed Training with Deep Learning Frameworks
- Set up a distributed training pipeline using Horovod or PyTorch Distributed.
- Train a model on multiple GPUs or nodes and monitor the performance.
- Performance Profiling and Debugging
- Use tools like NVIDIA Nsight, TensorBoard, or PyTorch Profiler to analyze model performance.
- Identify bottlenecks in memory usage or computation and resolve them.
- Building a Scalable Data Pipeline
- Implement a distributed data preprocessing pipeline using Apache Spark or Dask.
- Feed processed data into an ML model for training or inference.
- Fine-Tuning Pre-Trained Models on HPC
- Fine-tune a pre-trained model (e.g., BERT or ResNet) on a real-world dataset.
- Experiment with distributed training and mixed precision to optimize performance.
- Case Study: Real-World Application
- Work in small groups to solve a practical problem using HPC for AI/ML.
- Present the results, discuss challenges, and share insights with the class.
Certification Criteria
The following are required for certification:
- Attendance is mandatory
- Assessment through MCQs and short-answer questions
Key Learning outcomes
- Understand HPC Fundamentals
- Grasp the core principles of High-Performance Computing and its role in accelerating AI/ML workloads.
- Differentiate between types of parallelism (data, task, and model parallelism) and apply them to AI/ML tasks.
- Leverage HPC Tools and Frameworks
- Use industry-standard tools like SLURM, Dask, Horovod, and PyTorch Distributed for managing and scaling AI/ML workloads.
- Optimize AI/ML workflows with job scheduling and resource management on HPC systems.
- Deploy and Optimize Distributed Training Pipelines
- Implement distributed training strategies for deep learning models using GPUs and HPC clusters.
- Monitor and profile AI/ML workloads to identify and resolve performance bottlenecks.
- Build Scalable Data Pipelines
- Design and deploy data preprocessing pipelines for large-scale AI/ML applications using tools like Apache Spark and Dask.
- Handle challenges of large datasets, including distributed storage and efficient data loading.
- Gain Practical Hands-On Experience
- Run AI/ML models on HPC hardware, comparing CPU, GPU, and distributed configurations.
- Apply learned concepts to real-world case studies, fine-tuning pre-trained models for domain-specific tasks.
- Explore Future Trends and Applications
- Understand emerging trends in HPC, including quantum computing, edge AI, and exascale systems.
- Explore practical applications of HPC in AI/ML, such as genomics, weather forecasting, and training large language models.
Participants will leave the course with a solid foundation in HPC for AI/ML, ready to apply these skills to accelerate their own projects and explore advanced opportunities in the field.
Course Outcomes
By the end of the course, participants will:
- Understand the computational anatomy of AI/ML models
- Be able to profile, optimize, and scale AI workloads without relying heavily on black-box frameworks
- Gain practical experience running and scaling AI jobs on HPC infrastructure
- Think critically about resource usage, parallel efficiency, and system bottlenecks
Reviews
There are no reviews yet.