Programming Massively Parallel Processors: A Hands-on Approach

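This course provides a comprehensive introduction to GPU computing and parallel programming in the CUDA C environment. It covers GPU architecture, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated with real-world case studies such as MRI reconstruction and molecular visualization.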

Course Overview

📚 Content Summary

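Master the art of high-performance parallel computing through a practical, hands-on guide, and gain a deep understanding of CUDA and GPU architecture.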

Authors: David B. Kirk, Wen-mei W. Hwu

Acknowledgments: Ian Buck, John Nickolls, the NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hedge, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller.

🎯 Learning Objectives

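Differentiate the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
Identify the key components of modern GPU architecture, including streaming multiprocessors (SMs) and the memory structure.
Apply Amdahl's Law to compute theoretical speedup and identify the impact of sequential bottlenecks.
Contrast the architectural differences between fixed-function pipelines and programmable unified processor arrays.
Explain the role of "GPGPU" as an intermediate stage and the limitations of early shader programming models.
Analyze how native hardware features (such as atomic operations, barrier synchronization, and double-precision support) enabled the shift to scalable general-purpose computing.
Identify and exploit data parallelism in the matrix-matrix multiplication algorithm.
Implement device memory management, including allocation, data transfer between host and device, and deallocation.
Construct and launch CUDA kernels using appropriate thread indexing and grid/block configurations.
Design multidimensional thread hierarchies (grids and blocks) to map complex data structures onto GPU hardware.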
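
The Amdahl's Law objective above can be made concrete with a small calculation. The following is a minimal Python sketch, not course material: the function names `amdahl_speedup` and `speedup_limit` are hypothetical, and it assumes the usual textbook form of the law, where a fraction p of the runtime is accelerated by a factor s and the overall speedup is 1 / ((1 - p) + p / s).

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when `parallel_fraction` of the original runtime
    is accelerated by a factor of `parallel_speedup` (Amdahl's Law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / parallel_speedup)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on the overall speedup as the parallel part becomes
    infinitely fast: 1 / (1 - p)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if parallel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # 90% of the runtime parallelized and accelerated 10x.
    print(round(amdahl_speedup(0.9, 10.0), 2))
    # The remaining 10% sequential part caps the achievable speedup.
    print(round(speedup_limit(0.9), 2))
```

With 90% of the work accelerated 10x, the overall speedup is only about 5.3x, and no amount of parallel hardware pushes it past 10x. That ceiling is the sequential bottleneck the objective refers to.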

12 lessons · Estimated 36.0 h

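Lesson 1: Introduction to Parallel Computing and GPU Architecture

Lesson 2: The Evolution and Future Outlook of GPU Computing

Lesson 3: CUDA Program Structure and Memory Management

Lesson 4: Advanced CUDA Threads and Scheduling

Lesson 5: Memory Optimization and Shared-Memory Tiling

Lesson 6: Performance Analysis and SIMT Execution

Lesson 7: Floating-Point Arithmetic and Numerical Accuracy

Lesson 8: Case Study: Parallelizing MRI Reconstruction

Lesson 9: Case Study: Molecular Visualization and Multi-GPU Execution

Lesson 10: Computational Thinking and Parallel Algorithm Selection

Lesson 11: Introduction to the OpenCL Programming Model

Lesson 12: Modern GPU Features and Future Outlook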

Lessons

Lesson 1

This lesson explores the evolution of parallel computing, highlighting the "Great Divergence" where GPUs surpassed CPUs in performance by prioritizing throughput-oriented architecture over sequential latency. Students will learn to differentiate between these processing models, understand the impact of the "Power Wall" on CPU design, and analyze how GPU transistor budgeting enables massive parallel computation.

Lesson 2

This lesson explores the evolution of GPU architecture, focusing on the "real-time imperative" that necessitated a shift from serial CPU processing to parallel hardware acceleration. Students will learn how early innovations like SLI and the "wide and slow" design philosophy enabled the high-throughput performance required to meet strict frame-time budgets in modern computing.

Lesson 3

This lesson explores the CUDA execution model, focusing on the architectural differences between the latency-optimized CPU (Host) and the throughput-optimized GPU (Device). Students will learn how to manage the lifecycle of a CUDA kernel, allocate device memory with cudaMalloc, transfer data between host and device with cudaMemcpy, and organize threads into grids and blocks to perform parallel computations.

Lesson 4

This lesson explores the fundamentals of CUDA kernel execution, focusing on the transition from CPU-based iteration to data-centric GPU parallelism. Students will learn to implement the global indexing formula, manage execution configurations for transparent scalability, and apply boundary guards to ensure safe memory access across multidimensional data.
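
As an illustration of the global indexing formula and boundary guard mentioned above, here is a minimal Python sketch that emulates a one-dimensional kernel launch on the host. It is an assumption-laden stand-in, not CUDA code: the function name `simulate_vector_add` is hypothetical, and the two nested loops play the role of `blockIdx.x` and `threadIdx.x`, which on a real GPU would run in parallel.

```python
def simulate_vector_add(a, b, block_dim):
    """Emulate a 1D CUDA-style launch of a vector-add kernel.

    Each (block_idx, thread_idx) pair stands in for one GPU thread.
    Returns the result vector and the number of blocks launched.
    """
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    n = len(a)
    # Ceiling division: enough blocks to cover all n elements.
    grid_dim = (n + block_dim - 1) // block_dim
    c = [0] * n
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            # Global index: i = blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            # Boundary guard: the last block may have threads past the end.
            if i < n:
                c[i] = a[i] + b[i]
    return c, grid_dim


if __name__ == "__main__":
    result, blocks = simulate_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 4)
    print(result, blocks)
```

With five elements and four threads per block, two blocks are launched and three threads in the second block fail the guard and do nothing. Because the index depends only on the block and thread coordinates, the same kernel covers any input length when the grid size changes, which is the sense of transparent scalability used in the summary.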

Lesson 5

This lesson explores the "Memory Wall" in GPU computing, where computational throughput outpaces memory bandwidth, creating a significant performance bottleneck. Students will learn to mitigate these constraints by implementing shared memory tiling strategies, optimizing data reuse, and managing hardware resource limits to maximize occupancy.

Lesson 6

This lesson explores the SIMT execution model, focusing on how hardware linearizes the threads of a block and partitions them into 32-thread warps for efficient scheduling. Students will learn to evaluate performance through warp partitioning, branch divergence analysis, and memory access patterns to optimize GPU kernel utilization.
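
The warp partitioning described above can be sketched in a few lines of Python. This is an illustrative model rather than anything the hardware exposes: the helper names `linear_thread_id`, `warp_of`, and `warps_per_block` are hypothetical, and the sketch assumes the standard CUDA ordering in which the x index varies fastest, then y, then z.

```python
WARP_SIZE = 32  # warp width stated in the lesson summary


def linear_thread_id(tx, ty, tz, block_dim):
    """Flatten a 3D thread index into a linear ID, with x varying fastest."""
    bx, by, _bz = block_dim
    return tz * (by * bx) + ty * bx + tx


def warp_of(tx, ty, tz, block_dim):
    """Index of the warp (within its block) that a thread belongs to."""
    return linear_thread_id(tx, ty, tz, block_dim) // WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed for a block; the last one may be partly filled."""
    bx, by, bz = block_dim
    return (bx * by * bz + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    block = (16, 16, 1)
    print(warp_of(15, 1, 0, block))   # last thread of the first warp
    print(warp_of(0, 2, 0, block))    # first thread of the second warp
    print(warps_per_block(block))
```

In a 16x16 block each warp spans two consecutive rows, so a branch on `threadIdx.y` parity would diverge inside every warp, while a branch on `threadIdx.y // 2` would not. A 10x10 block needs four warps, and the last one carries only four active threads.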

Lesson 7

This lesson explores how excess encoding (biased representation), the format used for the exponent field of IEEE floating-point numbers, enables fast hardware comparison and sorting by ensuring that bit patterns maintain a monotonic relationship with their numerical values. By replacing the sign-bit discontinuity of two's complement with this biased format, engineers can use simple, efficient unsigned comparators to perform rapid operations like Z-buffering in parallel processors.
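
The monotonic property described above can be checked directly. The following Python sketch is a hedged illustration: the function names are hypothetical, and it assumes the IEEE-style bias of 2^(bits-1) - 1 (for example, 3 for a 3-bit field and 127 for an 8-bit field).

```python
def excess_encode(value, bits):
    """Encode a signed integer as an unsigned code by adding a fixed bias."""
    bias = (1 << (bits - 1)) - 1
    code = value + bias
    if not 0 <= code < (1 << bits):
        raise ValueError("value out of range for this bit width")
    return code


def twos_complement_encode(value, bits):
    """Encode a signed integer as its two's complement bit pattern."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value out of range for this bit width")
    return value & ((1 << bits) - 1)


def is_monotonic(values, encode, bits):
    """True if sorting by value and sorting by unsigned code agree."""
    codes = [encode(v, bits) for v in sorted(values)]
    return all(lo < hi for lo, hi in zip(codes, codes[1:]))


if __name__ == "__main__":
    values = range(-3, 4)
    for v in values:
        print(v, format(twos_complement_encode(v, 3), "03b"),
              format(excess_encode(v, 3), "03b"))
    print(is_monotonic(values, excess_encode, 3))
    print(is_monotonic(values, twos_complement_encode, 3))
```

In the 3-bit case, excess codes run from 000 for -3 up to 110 for 3, so an unsigned comparator orders them correctly. The two's complement codes jump from 111 for -1 back to 000 for 0, which is the discontinuity the summary refers to.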

Lesson 8

This lesson explores the computational challenges of non-Cartesian MRI reconstruction, where data sampled along spiral trajectories requires iterative solvers or gridding instead of a direct Fast Fourier Transform. Students will learn how to overcome these bottlenecks by leveraging GPU-based massive parallelism, specifically focusing on voxel-to-thread mapping to optimize reconstruction speed for time-sensitive clinical applications like sodium MRI.

Lesson 9

This lesson explores the use of Direct Coulomb Summation (DCS) and GPU acceleration to generate electrostatic potential maps for molecular visualization. Students will learn to optimize the DCS kernel through techniques like loop unrolling and constant memory broadcasting to efficiently handle large-scale atomic data.
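
For orientation, the core DCS computation can be written as a plain sequential reference. This Python sketch makes several assumptions that are not from the course: the function name `dcs_slice` is hypothetical, the Coulomb constant and units are omitted, atoms are `(x, y, z, charge)` tuples, and a grid point that coincides with an atom simply skips that atom to avoid division by zero.

```python
import math


def dcs_slice(atoms, grid_dims, spacing, z):
    """Electrostatic potential on one 2D slice of a regular grid.

    atoms:     iterable of (x, y, z, charge) tuples
    grid_dims: (nx, ny) number of grid points along x and y
    spacing:   distance between neighbouring grid points
    z:         z coordinate of the slice
    Returns potential[j][i] for the grid point (i * spacing, j * spacing, z).
    """
    nx, ny = grid_dims
    potential = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        y = j * spacing
        for i in range(nx):
            x = i * spacing
            total = 0.0
            # Every atom contributes charge / distance to every grid point.
            for ax, ay, az, charge in atoms:
                r = math.sqrt((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2)
                if r > 0.0:  # assumption: skip an atom sitting on the grid point
                    total += charge / r
            potential[j][i] = total
    return potential


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0, 1.0)]
    print(dcs_slice(atoms, (3, 1), 1.0, 0.0))
```

Each grid point's sum is independent of the others, so a GPU version would typically give each thread one or more grid points and loop over the atoms. Since every thread reads the same atom at the same time, the atom list is a natural fit for the constant memory broadcasting mentioned in the summary.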

Lesson 10

This lesson explores the transition from sequential processing to parallel computing, emphasizing how computational thinking helps overcome the power wall and frequency limits. Students will learn to evaluate parallel algorithm performance, manage the trade-offs between numerical precision and execution speed, and apply problem decomposition to optimize distributed systems.

Lesson 11

This lesson introduces the OpenCL framework as a solution for managing heterogeneous computing environments, where a host CPU orchestrates tasks across diverse accelerators like GPUs and FPGAs. Students will learn to utilize the OpenCL platform layer for hardware discovery, understand the device model's hierarchy, and implement portable, efficient kernels that adapt to different architectural requirements.

Lesson 12

This lesson explores the evolution of GPU architecture from graphics-focused designs to the compute-first Fermi generation, which introduced a unified address space, a hardware-managed L1/L2 cache hierarchy, and IEEE 754-2008 compliant floating-point arithmetic. Students will learn how these advancements, together with improved thread scheduling, enable complex scientific computing and general-purpose programming beyond regular, grid-structured workloads.