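Programming Massively Parallel Processors: A Hands-on Approach

This course is a comprehensive introduction to GPU computing and parallel programming in the CUDA C environment. It covers GPU architecture, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated with real-world case studies such as MRI reconstruction and molecular visualization.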

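Course Overview

📚 Summary

This course covers the fundamentals of GPU computing and parallel programming in the CUDA C environment: GPU architecture, data parallelism, thread management, memory optimization, and advanced performance considerations. Case studies in magnetic resonance imaging (MRI) reconstruction and molecular visualization show the techniques applied to real workloads.

Learn high-performance parallel computing through a hands-on guide to CUDA and GPU architecture.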

Authors: David B. Kirk, Wen-mei W. Hwu

Acknowledgments: Ian Buck, John Nickolls, the NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hegde, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller.

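🎯 Learning Objectives

Distinguish the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
Identify the key components of a modern GPU architecture, including streaming multiprocessors (SMs) and the memory structure.
Apply Amdahl's law to compute theoretical speedup and recognize the impact of sequential bottlenecks.
Compare the architecture of fixed-function pipelines with that of programmable unified processor arrays.
Explain the role of "GPGPU" as a transitional stage and the limitations of the early shader programming model.
Analyze how hardware features such as atomic operations, barrier synchronization, and double-precision support drove the shift toward scalable general-purpose computing.
Identify and exploit the data parallelism in the matrix-matrix multiplication algorithm.
Implement device memory management, including allocation, host-device data transfer, and deallocation.
Build and launch CUDA kernels with appropriate thread indexing and grid/block configurations.
Design multidimensional thread hierarchies (grids and blocks) that map complex data structures onto GPU hardware.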
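
As a quick illustration of the Amdahl's law objective above, the sketch below computes the theoretical speedup 1 / ((1 - p) + p / n) for a program whose parallelizable fraction is p when run on n processors. The function name `amdahl_speedup` is illustrative and not taken from the course material.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: float) -> float:
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the original run time that can be parallelized (0..1).
    n_processors: speedup factor applied to the parallel share (must be positive).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors <= 0:
        raise ValueError("n_processors must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Even with a very large processor count, a 5% sequential share
    # keeps the speedup below 1 / 0.05 = 20.
    for n in (2, 16, 1024, 10**9):
        print(n, amdahl_speedup(0.95, n))
```

The loop at the bottom shows the sequential bottleneck: the speedup approaches, but never reaches, the reciprocal of the serial fraction.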

12 lessons · estimated 36.0 h

Lessons

1 Lesson 1

This lesson explores the evolution of parallel computing, highlighting the "Great Divergence" in which GPUs pulled ahead of CPUs in peak floating-point throughput by prioritizing aggregate throughput over single-thread latency. Students will learn to differentiate between these processing models, understand the impact of the "Power Wall" on CPU design, and analyze how the GPU's transistor budget enables massively parallel computation.

2 Lesson 2

This lesson explores the evolution of GPU architecture, focusing on the "real-time imperative" that necessitated a shift from serial CPU processing to parallel hardware acceleration. Students will learn how early innovations like SLI and the "wide and slow" design philosophy enabled the high-throughput performance required to meet strict frame-time budgets in modern computing.

3 Lesson 3

This lesson explores the CUDA execution model, focusing on the architectural differences between the latency-optimized CPU (Host) and the throughput-optimized GPU (Device). Students will learn how to manage the lifecycle of a CUDA kernel, implement memory allocation using cudaMalloc and cudaMemcpy, and organize threads into grids and blocks to perform parallel computations.

4 Lesson 4

This lesson explores the fundamentals of CUDA kernel execution, focusing on the transition from CPU-based iteration to data-centric GPU parallelism. Students will learn to implement the global indexing formula, manage execution configurations for transparent scalability, and apply boundary guards to ensure safe memory access across multidimensional data.
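
The global indexing formula and boundary guard mentioned above can be sketched in plain Python. This is a host-side emulation, not CUDA code: the nested loops stand in for the blocks and threads that a GPU would run in parallel, and the names `global_index` and `launch_vector_add` are hypothetical.

```python
def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """1D global index, mirroring blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def launch_vector_add(a, b, block_dim):
    """Emulate a 1D vector-add launch with a boundary guard."""
    n = len(a)
    # Ceiling division: enough blocks to cover all n elements.
    grid_dim = (n + block_dim - 1) // block_dim
    out = [0.0] * n
    for block_idx in range(grid_dim):          # each block
        for thread_idx in range(block_dim):    # each thread in the block
            i = global_index(block_idx, block_dim, thread_idx)
            if i < n:                          # boundary guard
                out[i] = a[i] + b[i]
    return out


if __name__ == "__main__":
    # 5 elements with 4 threads per block: 2 blocks, 3 threads idle.
    print(launch_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 4))
```

Because the grid size is derived from the data size, the same emulated kernel works for any block size, which is the idea behind transparent scalability; the guard keeps the surplus threads in the last block from touching memory past the end of the arrays.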

5 Lesson 5

This lesson explores the "Memory Wall" in GPU computing, where computational throughput outpaces memory bandwidth, creating a significant performance bottleneck. Students will learn to mitigate these constraints by implementing shared memory tiling strategies, optimizing data reuse, and managing hardware resource limits to maximize occupancy.
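
The tiling strategy described above can be sketched as a blocked matrix multiplication. This is a sequential Python emulation under simplifying assumptions (square matrices stored as lists of rows); the local lists `a_tile` and `b_tile` stand in for shared-memory buffers, and `tiled_matmul` is an illustrative name rather than course code.

```python
def tiled_matmul(a, b, tile):
    """Multiply two n x n matrices tile by tile."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for row0 in range(0, n, tile):            # tile row of C
        for col0 in range(0, n, tile):        # tile column of C
            for k0 in range(0, n, tile):      # march along the shared dimension
                # Stage one tile of A and one tile of B (the "shared memory" loads).
                a_tile = [row[k0:k0 + tile] for row in a[row0:row0 + tile]]
                b_tile = [row[col0:col0 + tile] for row in b[k0:k0 + tile]]
                # Every staged value is reused for a whole row or column of the tile.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for k in range(len(b_tile)):
                            acc += a_row[k] * b_tile[k][j]
                        c[row0 + i][col0 + j] += acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
    print(tiled_matmul(a, b, tile=2))
```

The slices handle matrix sizes that are not a multiple of the tile width, which plays the same role as a boundary check in a real kernel. On a GPU the tile width is also bounded by the shared memory available per block, which is one of the resource limits that affects occupancy.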

6 Lesson 6

This lesson explores the SIMT execution model, focusing on how hardware organizes threads into 32-thread warps and linearizes them for efficient scheduling. Students will learn to evaluate performance through warp partitioning, branch divergence analysis, and memory access patterns to optimize GPU kernel utilization.
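
To make the linearization step concrete, the sketch below maps a 3D thread index to a linear thread ID (x varies fastest, then y, then z) and then to a warp number, assuming the 32-thread warp size stated above. The helper names are hypothetical.

```python
WARP_SIZE = 32  # warp width stated in the lesson description


def linear_thread_id(thread_idx, block_dim):
    """Row-major linearization of (x, y, z) within a block."""
    x, y, z = thread_idx
    dim_x, dim_y, _dim_z = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_index(thread_idx, block_dim):
    """Warp that a thread falls into after linearization."""
    return linear_thread_id(thread_idx, block_dim) // WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed to hold every thread of a block."""
    dim_x, dim_y, dim_z = block_dim
    threads = dim_x * dim_y * dim_z
    return (threads + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    # A 10 x 10 block has 100 threads: 4 warps, the last only partly filled.
    print(warps_per_block((10, 10, 1)))
    # In a 16 x 16 block, rows y = 0 and y = 1 share warp 0.
    print(warp_index((15, 1, 0), (16, 16, 1)), warp_index((0, 2, 0), (16, 16, 1)))
```

Knowing which threads share a warp is what makes divergence analysis possible: a branch on `threadIdx.x` only splits a warp if threads within the same group of 32 take different paths.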

7 Lesson 7

This lesson explores how Excess Encoding (biased representation) enables high-speed hardware sorting by ensuring that bit patterns maintain a monotonic relationship with their numerical values. By replacing the sign-bit discontinuity of Two's Complement with this biased format, engineers can utilize simple, efficient unsigned comparators to perform rapid operations like Z-buffering in parallel processors.
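
The monotonicity claim can be checked with a small sketch. It assumes a bias of 2^(n-1) - 1 for an n-bit field, which is the convention used for IEEE 754 exponents; the lesson description does not state the bias, so treat that choice and the function names as assumptions.

```python
def excess_encode(value: int, bits: int) -> int:
    """Excess (biased) code: stored pattern = value + bias."""
    bias = (1 << (bits - 1)) - 1          # assumed bias, e.g. 3 for 3 bits
    code = value + bias
    if not 0 <= code < (1 << bits):
        raise ValueError("value out of range for this field width")
    return code


def twos_complement_encode(value: int, bits: int) -> int:
    """Two's complement bit pattern, read back as an unsigned integer."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value out of range for this field width")
    return value & ((1 << bits) - 1)


if __name__ == "__main__":
    values = [-3, -2, -1, 0, 1, 2, 3]
    for v in values:
        print(v,
              format(excess_encode(v, 3), "03b"),
              format(twos_complement_encode(v, 3), "03b"))
```

In the printed table the excess column rises steadily with the value, so an unsigned comparator orders the patterns correctly. The two's complement column jumps from 111 for -1 down to 000 for 0, which is the sign-bit discontinuity the lesson refers to.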

8 Lesson 8

This lesson explores the computational challenges of non-Cartesian MRI reconstruction, where spiral trajectories require iterative solvers or gridding instead of standard Fast Fourier Transforms. Students will learn how to overcome these bottlenecks by leveraging GPU-based massive parallelism, specifically focusing on voxel-to-thread mapping to optimize reconstruction speed for time-sensitive clinical applications like Sodium MRI.

9 Lesson 9

This lesson explores the use of Direct Coulomb Summation (DCS) and GPU acceleration to generate electrostatic potential maps for molecular visualization. Students will learn to optimize rendering pipelines through techniques like loop unrolling and constant memory broadcasting to efficiently handle large-scale atomic data.
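
A minimal sequential sketch of Direct Coulomb Summation follows: for every point on one slice of a regular grid, it sums charge / distance over all atoms. Physical constants are omitted, atoms are assumed not to coincide with grid points, and the function name and argument layout are illustrative rather than taken from the course.

```python
import math


def dcs_slice(atoms, nx, ny, spacing, z):
    """Electrostatic potential on one z-slice of a regular grid.

    atoms: iterable of (x, y, z, charge) tuples.
    Returns a list of ny rows, each holding nx potential values.
    """
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        y = j * spacing
        for i in range(nx):
            x = i * spacing
            total = 0.0
            for ax, ay, az, charge in atoms:
                dx, dy, dz = x - ax, y - ay, z - az
                # Assumes no atom sits exactly on a grid point (distance > 0).
                total += charge / math.sqrt(dx * dx + dy * dy + dz * dz)
            grid[j][i] = total
    return grid


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 1.0, 2.0)]
    print(dcs_slice(atoms, nx=2, ny=1, spacing=1.0, z=0.0))
```

Each grid point is computed independently of the others, which is why the two outer loops map naturally onto GPU threads. The inner loop reads the same atom data for every grid point, which is the access pattern that constant memory broadcasting is meant to serve.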

10 Lesson 10

This lesson explores the transition from sequential processing to parallel computing, emphasizing how computational thinking helps overcome the power wall and frequency limits. Students will learn to evaluate parallel algorithm performance, manage the trade-offs between numerical precision and execution speed, and apply problem decomposition to optimize distributed systems.

11 Lesson 11

This lesson introduces the OpenCL framework as a solution for managing heterogeneous computing environments, where a host CPU orchestrates tasks across diverse accelerators like GPUs and FPGAs. Students will learn to utilize the OpenCL platform layer for hardware discovery, understand the device model's hierarchy, and implement portable, efficient kernels that adapt to different architectural requirements.

12 Lesson 12

This lesson explores the evolution of GPU architecture from graphics-focused designs to the compute-first Fermi generation, which introduced a unified address space, a hardware-managed cache hierarchy, and IEEE 754-2008 compliant floating-point arithmetic. Students will learn how these advances, together with improved thread scheduling, enable complex scientific computing and general-purpose programming beyond regular, grid-structured workloads.