CUDA Programming Guide
The official, comprehensive resource for developers to learn the CUDA programming model and how to write high-performance code that executes on NVIDIA GPUs. This guide covers the platform architecture, programming interface, advanced hardware features, and technical specifications.
📚 Content Summary
Master the art of parallel computing with the industry-standard guide to NVIDIA CUDA.
Author: NVIDIA Corporation
Acknowledgments: Copyright © 2007-2024 NVIDIA Corporation & affiliates. All rights reserved.
🎯 Learning Objectives
- Define the roles of the host (CPU) and device (GPU) within a heterogeneous system.
- Explain the SIMT programming model and the hierarchical organization of threads, blocks, and grids.
- Differentiate between PTX (Parallel Thread Execution) and binary code (cubins) and explain how Just-in-Time (JIT) compilation facilitates compatibility.
- Develop and Compile CUDA Kernels: Write `__global__` functions, configure execution with the triple-chevron (`<<<...>>>`) launch syntax, and manage the NVCC compilation workflow.
- Optimize Memory and Data Movement: Distinguish between Unified, Explicit, and Mapped memory models, and implement page-locked host memory for efficient transfers.
- Manage Parallel Execution: Utilize CUDA Streams, Events, and Cooperative Groups to manage asynchronous tasks and synchronize CPU-GPU operations.
- Perform complex pointer arithmetic and identify architectural bottlenecks (von Neumann vs. Harvard).
- Implement advanced CUDA execution patterns, including Programmatic Dependent Kernel Launches and Heterogeneous Batched Memory Transfers.
- Utilize hardware-specific features like Thread Scopes, Asynchronous Proxies, and Pipelines to maximize concurrency.
- Configure and tune Unified Memory performance using prefetching, usage hints, and page size management.
🔹 Lesson 1: CUDA Fundamentals and Architectural Overview
Overview: This lesson introduces the CUDA parallel computing platform and its underlying hardware architecture. It explores how heterogeneous systems utilize both CPUs and GPUs, the SIMT (Single Instruction, Multiple Threads) programming model, and the hierarchy of threads, blocks, and grids. Additionally, it covers the CUDA compilation workflow, including the roles of PTX, cubins, and fatbins in ensuring binary and forward compatibility.
Learning Outcomes:
- Define the roles of the host (CPU) and device (GPU) within a heterogeneous system.
- Explain the SIMT programming model and the hierarchical organization of threads, blocks, and grids.
- Differentiate between PTX (Parallel Thread Execution) and binary code (cubins) and explain how Just-in-Time (JIT) compilation facilitates compatibility.
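The thread/block/grid hierarchy described above can be sketched in a few lines of CUDA C++. This is a minimal illustration, not part of the guide itself: each thread derives a unique global index from its block and thread coordinates, which is the basic idiom every SIMT kernel builds on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its unique global index from the grid hierarchy:
// blockIdx selects the block, blockDim scales it, threadIdx offsets within it.
__global__ void whoAmI(int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = idx;  // guard against out-of-range threads
}

int main() {
    const int n = 8;
    int *d_out, h_out[n];
    cudaMalloc(&d_out, n * sizeof(int));

    // Launch a grid of 2 blocks with 4 threads each: 8 threads total.
    whoAmI<<<2, 4>>>(d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", h_out[i]);
    cudaFree(d_out);
    return 0;
}
```

Compiling this with `nvcc` produces PTX and/or cubins for the targeted compute capabilities; the same source illustrates the host/device split, since `main` runs on the CPU and `whoAmI` on the GPU.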
🔹 Lesson 2: Core GPU Programming and Execution Management
Overview: This lesson covers the fundamental and advanced aspects of GPU programming using CUDA C++. It transitions from basic kernel specification and the NVCC compilation workflow to complex execution management topics, including SIMT kernel design, shared memory bank conflicts, and asynchronous execution using streams and events. Students will learn to balance memory models (Unified vs. Explicit) and optimize hardware occupancy for high-performance computing.
Learning Outcomes:
- Develop and Compile CUDA Kernels: Write `__global__` functions, configure execution with the triple-chevron (`<<<...>>>`) launch syntax, and manage the NVCC compilation workflow.
- Optimize Memory and Data Movement: Distinguish between Unified, Explicit, and Mapped memory models, and implement page-locked host memory for efficient transfers.
- Manage Parallel Execution: Utilize CUDA Streams, Events, and Cooperative Groups to manage asynchronous tasks and synchronize CPU-GPU operations.
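The pieces listed above can be combined in one short sketch (illustrative only, with error checking omitted for brevity): page-locked host memory via `cudaMallocHost` enables truly asynchronous transfers, a stream orders the copy-kernel-copy sequence, and events time it without blocking the CPU until the final synchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // page-locked (pinned) host memory
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s;
    cudaEvent_t start, stop;
    cudaStreamCreate(&s);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Everything below is enqueued on stream s and runs asynchronously.
    cudaEventRecord(start, s);
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, n, 2.0f);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaEventRecord(stop, s);

    cudaEventSynchronize(stop);  // the CPU blocks only here
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("h[0] = %f, elapsed %.3f ms\n", h[0], ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaStreamDestroy(s);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

This is the explicit-memory model; under Unified Memory the two `cudaMemcpyAsync` calls would disappear and the driver would migrate pages on demand instead.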
🔹 Lesson 3: Advanced Memory Logic and Multi-GPU Systems
Overview: This lesson explores the transition from fundamental memory architectures and pointer logic to advanced GPU acceleration techniques. It covers the hardware-level execution models (SIMT, Independent Thread Scheduling), sophisticated synchronization mechanisms (Asynchronous Barriers, Scoped Atomics), and the orchestration of multi-GPU systems using both Runtime and Driver APIs.
Learning Outcomes:
- Perform complex pointer arithmetic and identify architectural bottlenecks (von Neumann vs. Harvard).
- Implement advanced CUDA execution patterns, including Programmatic Dependent Kernel Launches and Heterogeneous Batched Memory Transfers.
- Utilize hardware-specific features like Thread Scopes, Asynchronous Proxies, and Pipelines to maximize concurrency.
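Thread scopes can be illustrated with the libcu++ `cuda::atomic` types. The sketch below is an assumption-laden example, not code from the guide: it assumes a libcu++-capable toolchain and a recent compute capability, and uses a block-scoped atomic in shared memory (coherent only within the block, so it can be serviced close to the SM) feeding a device-scoped atomic that aggregates across blocks.

```cuda
#include <cstdio>
#include <cuda/atomic>
#include <cuda_runtime.h>

__global__ void countThreads(int *result) {
    // Block scope: visibility guaranteed only to threads in this block,
    // which lets the hardware keep the atomic traffic local.
    __shared__ cuda::atomic<int, cuda::thread_scope_block> local;
    if (threadIdx.x == 0) local = 0;
    __syncthreads();

    local.fetch_add(1, cuda::memory_order_relaxed);
    __syncthreads();

    if (threadIdx.x == 0) {
        // Device scope: one thread per block publishes the block total
        // to a counter visible across the whole GPU.
        cuda::atomic_ref<int, cuda::thread_scope_device> global(*result);
        global.fetch_add(local.load(), cuda::memory_order_relaxed);
    }
}

int main() {
    int *result;
    cudaMallocManaged(&result, sizeof(int));
    *result = 0;
    countThreads<<<4, 64>>>(result);
    cudaDeviceSynchronize();
    printf("threads counted: %d\n", *result);  // 4 * 64 = 256
    cudaFree(result);
    return 0;
}
```

Choosing the narrowest correct scope is the point: a `thread_scope_system` atomic would also work here, but would force coherence traffic far beyond what the algorithm needs.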
🔹 Lesson 4: Optimization, Graphs, and Hardware Accelerators
Overview: This lesson covers high-performance CUDA programming techniques, focusing on optimizing data movement and execution flow. It explores the transition from stream-based execution to persistent CUDA Graphs, the granular control of Unified Memory through prefetching and hints, and the utilization of hardware-specific accelerators like the Tensor Memory Accelerator (TMA) and L2 Cache persistence. Additionally, it details advanced synchronization patterns, resource partitioning via Green Contexts, and cross-API interoperability for modern heterogeneous computing.
Learning Outcomes:
- Configure and tune Unified Memory performance using prefetching, usage hints, and page size management.
- Construct, update, and execute CUDA Graphs, including the use of memory nodes and device-side launches.
- Implement advanced synchronization using Asynchronous Barriers and the Producer-Consumer pattern.
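Two of the outcomes above fit naturally in one sketch (illustrative, error checking omitted): Unified Memory is tuned with `cudaMemAdvise` and `cudaMemPrefetchAsync` so the kernel does not fault pages on demand, and a stream-captured CUDA Graph replays the whole launch sequence with a single call per iteration instead of re-issuing each operation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    const int dev = 0;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Unified Memory tuning: state a preferred location, then prefetch
    // the pages to the GPU before the first launch touches them.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev, s);

    // Capture the launch sequence once into a graph...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, s>>>(x, n, 2.0f);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it cheaply: one launch call per whole iteration.
    for (int i = 0; i < 4; ++i) cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);
    printf("x[0] = %f\n", x[0]);  // 1.0f scaled by 2.0f four times

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(x);
    return 0;
}
```

Graph capture pays off when the same dependency structure is launched many times; the per-launch CPU overhead collapses to a single `cudaGraphLaunch`.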
🔹 Lesson 5: Technical Reference and Language Extensions
Overview: This lesson provides a deep technical dive into the CUDA programming model's reference specifications and C++ language extensions. It covers the hardware-software interface via compute capabilities, environment variables for runtime control, and the specific syntax requirements for writing high-performance device code using modern C++ standards, cooperative groups, and specialized hardware intrinsics.
Learning Outcomes:
- Identify hardware constraints and feature sets based on GPU Compute Capability versions.
- Configure the CUDA execution environment and JIT compilation using system-level environment variables.
- Apply C++ language extensions (annotations, lambdas, and templates) while adhering to device-side restrictions.
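The language-extension outcomes above can be condensed into one small example (a sketch, assuming `nvcc` is invoked with `--extended-lambda`): execution-space annotations let a single function compile for both host and device, a kernel template accepts any device-callable functor, and an extended `__device__` lambda is passed straight from host code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __host__ __device__ compiles one definition for both sides.
__host__ __device__ float square(float v) { return v * v; }

// A kernel template applying any device-callable functor element-wise.
template <typename F>
__global__ void apply(float *x, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = f(x[i]);
}

int main() {
    const int n = 4;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = float(i);

    // Extended lambda: the __device__ annotation makes it callable in
    // device code; requires nvcc's --extended-lambda flag.
    apply<<<1, n>>>(x, n, [] __device__ (float v) { return square(v) + 1.0f; });
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) printf("%g ", x[i]);  // 1 2 5 10
    cudaFree(x);
    return 0;
}
```

Device-side restrictions still apply inside the lambda: no exceptions, no RTTI, and only device-callable functions may be invoked from its body.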