AI024 Professional

Introduction to ROCm and HIP Programming: A Practical Tutorial

A practical, modern guide to AMD GPU programming with ROCm and HIP. It covers the full software stack, installation, build workflows, kernel programming, memory management, performance engineering, library usage, CUDA porting, and production debugging practices.

Artificial Intelligence

Course Overview

📚 Content Summary


Master AMD GPU programming and CUDA-to-HIP portability with this technical deep dive.

Author: EvoClass

Acknowledgments: Based on AMD's official ROCm and HIP documentation, including the ROCm, HIP, and ROCm LLVM projects.

🎯 Learning Objectives

  1. Define HIP and its role within the ROCm ecosystem in a single concise sentence.
  2. Distinguish between ROCm (platform), HIP (interface), and ROCm libraries (building blocks).
  3. Identify the hierarchical layers of the ROCm architecture from hardware to application frameworks.
  4. Define the relationship between the HIP SDK and the ROCm platform across different operating systems.
  5. Execute a systematic installation workflow, including support matrix verification and post-installation path configuration.
  6. Compile and run a minimal verification program to troubleshoot common driver and environment access issues.
  7. Understand why a robust build strategy is essential for reconciling source portability with architecture-specific performance.
  8. Implement portable kernel launches using the hipLaunchKernelGGL macro as an alternative to CUDA's triple-angle-bracket syntax.
  9. Configure production-grade CMake projects that target specific ROCm architectures and manage external library dependencies.
  10. Define the anatomy of a HIP kernel and apply the basic execution formula for thread indexing.

🔹 Lesson 1: Introduction to ROCm and HIP Architecture

Overview: This lesson provides a foundational overview of the ROCm platform and the HIP programming interface. It clarifies the relationship between the full ROCm stack, the HIP interface, and high-level libraries, while establishing realistic expectations for CUDA-to-AMD portability and performance engineering.

Learning Outcomes:

  • Define HIP and its role within the ROCm ecosystem in a single concise sentence.
  • Distinguish between ROCm (platform), HIP (interface), and ROCm libraries (building blocks).
  • Identify the hierarchical layers of the ROCm architecture from hardware to application frameworks.

🔹 Lesson 2: Installation and Environment Setup

Overview: This lesson guides GPU developers and HPC engineers through the essential strategies for setting up a HIP-ready environment on both Linux and Windows platforms. It emphasizes a "platform reality" approach in which developers verify hardware/software compatibility before proceeding with a structured installation workflow and final verification with the hipcc compiler driver.

Learning Outcomes:

  • Define the relationship between the HIP SDK and the ROCm platform across different operating systems.
  • Execute a systematic installation workflow, including support matrix verification and post-installation path configuration.
  • Compile and run a minimal verification program to troubleshoot common driver and environment access issues.
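The verification step above can be sketched as a short command sequence. This is an illustrative fragment, not an official install script: it assumes the default Linux prefix /opt/rocm, and the file name hello_hip.cpp is a placeholder.

```shell
# Assumes ROCm is installed under /opt/rocm (the default Linux prefix).
export ROCM_PATH=/opt/rocm
export PATH="$ROCM_PATH/bin:$PATH"

hipcc --version        # confirms the compiler driver is on PATH
rocminfo | grep gfx    # lists the detected GPU agents and their architectures

# Minimal verification program: can the runtime see a device at all?
cat > hello_hip.cpp <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>
int main() {
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        printf("HIP error: %s\n", hipGetErrorString(err));
        return 1;
    }
    printf("HIP devices visible: %d\n", count);
    return 0;
}
EOF
hipcc hello_hip.cpp -o hello_hip && ./hello_hip
```

If hipcc is not found, the PATH export did not take effect; if the program reports a HIP error, the usual suspects are the kernel driver and the user's group membership (render/video), as covered in this lesson.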

🔹 Lesson 3: The Build Toolchain: hipcc and Project Layout

Overview: This lesson explores the essential toolchain and organizational strategies for developing HIP applications on AMD hardware. It transitions the developer from simple command-line builds using the hipcc driver to professional, production-ready project configurations using CMake. Key focus areas include portable kernel launch macros, architecture-specific optimization, and the critical distinction between source-level portability and binary performance.

Learning Outcomes:

  • Understand why a robust build strategy is essential for reconciling source portability with architecture-specific performance.
  • Implement portable kernel launches using the hipLaunchKernelGGL macro as an alternative to CUDA's triple-angle-bracket syntax.
  • Configure production-grade CMake projects that target specific ROCm architectures and manage external library dependencies.
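A production-grade CMake setup of the kind this lesson builds toward might look like the following sketch. The project name, source file, and gfx architecture list are placeholders to adjust for your hardware:

```cmake
# Minimal sketch of a HIP-enabled CMake project (requires CMake >= 3.21,
# which adds first-class HIP language support).
cmake_minimum_required(VERSION 3.21)
project(vecadd LANGUAGES HIP CXX)

# Build device code for specific AMD architectures (placeholders; match
# these to the GPUs you actually target).
set(CMAKE_HIP_ARCHITECTURES gfx90a gfx1100)

# The hip CMake package ships with ROCm and provides imported targets.
find_package(hip REQUIRED)

add_executable(vecadd main.hip)
target_link_libraries(vecadd PRIVATE hip::device)
```

Pinning CMAKE_HIP_ARCHITECTURES is what reconciles source portability with binary performance: the same source builds everywhere, but each binary is compiled for the architectures it will actually run on.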

🔹 Lesson 4: HIP Programming Model and Kernel Development

Overview: This lesson explores the fundamental architecture of HIP kernels, focusing on how work is mapped from logical problems to hardware execution through grids and blocks. It provides a blueprint for robust GPU programming, covering the essential execution formula, performance bottlenecks (memory vs. compute), and the mandatory implementation of error-checking and synchronization for production-ready code.

Learning Outcomes:

  • Define the anatomy of a HIP kernel and apply the basic execution formula for thread indexing.
  • Configure grid and block sizes effectively and implement benchmarking to find optimal throughput.
  • Implement robust error-handling macros and apply synchronization semantics to manage device-host interaction.

🔹 Lesson 5: Memory Management and Data Patterns

Overview: This lesson focuses on the central pillar of GPU programming: memory management. It covers the categorization of memory types (Pageable, Pinned, Device, and Managed), the performance implications of data transfer mechanisms, and the critical role of memory access patterns—specifically coalescing—in achieving peak performance. Students will learn to balance the ease of use provided by managed memory with the explicit control required for high-performance HPC applications.

Learning Outcomes:

  • Differentiate between pageable and pinned host memory and identify when to use each for optimal transfer speed.
  • Implement device memory allocation and unified/managed memory using HIP APIs (hipMalloc, hipHostMalloc, hipMallocManaged).
  • Analyze memory access patterns to ensure coalesced access and avoid performance bottlenecks like strided access.

🔹 Lesson 6: Streams, Events, and Asynchronous Execution

Overview: This lesson transitions developers from a synchronous programming model to a concurrent mindset, focusing on how to maximize GPU utilization through HIP streams and events. It covers the mechanics of overlapping data transfers with kernel execution via chunked pipelines and introduces the trade-offs between stream capture and explicit graph construction. Additionally, it highlights critical production considerations, including the use of graph-safe libraries and high-precision timing on the GPU.

Learning Outcomes:

  • Identify the performance benefits of asynchronous execution and concurrent streams over synchronous execution.
  • Implement chunked pipelines to overlap host-to-device communication with kernel computation.
  • Differentiate between stream capture and explicit graph construction for reducing launch overhead.

🔹 Lesson 7: Performance Engineering on AMD GPUs

Overview: This lesson establishes a scientific framework for optimizing software on AMD hardware, moving beyond guesswork to a systematic, measurement-driven approach. It covers the architectural relationship between Compute Units, wavefronts, and register pressure, while providing practical methodologies for profiling with rocprofv3 and implementing robust benchmarking skeletons.

Learning Outcomes:

  • Implement the 6-step HIP optimization workflow to identify and resolve performance bottlenecks.
  • Analyze the trade-off between register pressure and occupancy to maximize hardware utilization.
  • Execute accurate performance measurements using hardware events and multi-iteration benchmarking best practices.

🔹 Lesson 8: The ROCm Library Ecosystem

Overview: This lesson introduces the "Library-first" engineering philosophy, prioritizing high-performance, pre-built ROCm libraries over custom kernel development. It covers the categorization of the ROCm library stack (Math, FFT, Primitives, and ML/AI) and provides a decision framework for choosing between portable hip* interfaces and AMD-native roc* implementations. Additionally, learners will explore the critical requirements for "graph safety" when integrating libraries into HIP graph-captured workflows.

Learning Outcomes:

  • Apply the "Library-first" engineering principle to justify the use of pre-tested primitives over custom kernels.
  • Distinguish between hip* and roc* libraries based on portability requirements and performance needs.
  • Categorize ROCm libraries into their respective functional domains (Math, FFT, Primitives, ML/AI).

🔹 Lesson 9: Porting CUDA Applications to HIP

Overview: This lesson covers the systematic transition of CUDA source code to the portable HIP C++ framework. Students will learn to execute an incremental porting workflow using automated tools like hipify-perl and hipify-clang, identify critical portability traps such as hardware-specific warpSize assumptions, and implement a rigorous validation process to compare performance and correctness post-migration.

Learning Outcomes:

  • Execute the 6-step incremental porting workflow to minimize debugging overhead.
  • Select and apply the appropriate automated translation tool (hipify-perl vs. hipify-clang) based on source code complexity.
  • Identify and resolve architecture-sensitive "portability traps," specifically those involving warpSize and mechanical translation errors.

🔹 Lesson 10: Debugging, Testing, and Production Practices

Overview: This lesson covers the essential tools and methodologies for moving GPU kernels from development to production on the ROCm platform. It details the use of ROCgdb and AddressSanitizer for error detection, establishes a rigorous four-layer testing strategy, and provides a production checklist to ensure kernel correctness and performance stability.

Learning Outcomes:

  • Use ROCgdb, ltrace, and AddressSanitizer to identify source-level bugs and memory access errors in GPU code.
  • Implement a four-layer testing strategy to validate helpers, kernel correctness, edge cases, and performance regressions.
  • Apply production code patterns and checklists to manage kernel interfaces, documentation, and environment-driven debugging.