AI032 Career

Programming Massively Parallel Processors: A Hands-on Approach

This course provides a comprehensive introduction to GPU computing and parallel programming in the CUDA C environment. It covers GPU architecture, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated through real-world examples such as MRI reconstruction and molecular visualization.

4.9

36.0h

569 learners

12 lessons

Artificial Intelligence

Course Overview

📚 Content Summary

Master the art of high-performance parallel computing with practical, hands-on guidance on CUDA and GPU architecture.

Authors: David B. Kirk, Wen-mei W. Hwu

Acknowledgments: Ian Buck, John Nickolls, the NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hedge, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller.

🎯 Learning Objectives

Distinguish the design philosophies and performance trends of multicore CPUs and many-core GPUs.
Identify the main components of a modern GPU architecture, including streaming multiprocessors (SMs) and the memory hierarchy.
Apply Amdahl's Law to compute theoretical speedup and determine the impact of sequential bottlenecks.
Compare the architectural differences between fixed-function pipelines and unified programmable processor arrays.
Explain the role of "GPGPU" as a transitional step and the limitations of the early shader programming model.
Analyze how hardware features such as atomic operations, barrier synchronization, and double-precision support drove the shift to large-scale general-purpose computing.
Identify and exploit data parallelism in matrix multiplication algorithms.
Implement device memory management, including allocation, data transfer between host and device, and deallocation.
Write and launch CUDA kernels using thread indices and appropriate grid/block configurations.
Design multidimensional thread hierarchies (grids and blocks) to map complex data structures onto GPU hardware.
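
As a small worked illustration of the Amdahl's Law objective above, the following Python sketch computes the theoretical overall speedup when only part of a program is accelerated. The function and parameter names are illustrative assumptions, not taken from the course material.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Theoretical overall speedup under Amdahl's Law.

    parallel_fraction: share of the original runtime that can be parallelized (0..1).
    parallel_speedup:  how much faster that share runs (hypothetical parameter name).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is untouched; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Even a 100x faster parallel part is capped by the serial remainder.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, amdahl_speedup(fraction, 100.0))
```

For example, if 95% of the runtime is parallelizable, the overall speedup stays below 20x no matter how fast the parallel part becomes, because the remaining 5% still runs sequentially.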

Lessons: 12 lessons in total · estimated 36.0h

1 Introduction to Parallel Computing and GPU Architecture

2 The Evolution and Future of GPU Computing

3 CUDA Program Structure and Memory Management

4 Advanced CUDA Threads and Scheduling

5 Memory Optimization and Shared Memory Tiling

6 Performance Analysis and SIMT Execution

7 Floating-Point Arithmetic and Numerical Precision

8 Case Study: Parallelizing MRI Reconstruction

9 Case Study: Molecular Visualization and Multi-GPU Execution

10 Computational Thinking and Parallel Algorithm Selection

11 Introduction to the OpenCL Programming Model

12 Modern GPU Features and Future Directions

Lessons

1 Lesson 1

This lesson explores the evolution of parallel computing, highlighting the "Great Divergence" in which GPUs surpassed CPUs in peak throughput by prioritizing throughput-oriented design over low-latency sequential execution. Students will learn to differentiate between these processing models, understand the impact of the "Power Wall" on CPU design, and analyze how GPU transistor budgeting enables massive parallel computation.

2 Lesson 2

This lesson explores the evolution of GPU architecture, focusing on the "real-time imperative" that necessitated a shift from serial CPU processing to parallel hardware acceleration. Students will learn how early innovations like SLI and the "wide and slow" design philosophy enabled the high-throughput performance required to meet strict frame-time budgets in modern computing.

3 Lesson 3

This lesson explores the CUDA execution model, focusing on the architectural differences between the latency-optimized CPU (Host) and the throughput-optimized GPU (Device). Students will learn how to manage the lifecycle of a CUDA kernel, implement memory allocation using cudaMalloc and cudaMemcpy, and organize threads into grids and blocks to perform parallel computations.

4 Lesson 4

This lesson explores the fundamentals of CUDA kernel execution, focusing on the transition from CPU-based iteration to data-centric GPU parallelism. Students will learn to implement the global indexing formula, manage execution configurations for transparent scalability, and apply boundary guards to ensure safe memory access across multidimensional data.
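
The global indexing formula and boundary guard mentioned above can be sketched on the CPU. The Python code below is only an emulation of a one-dimensional kernel launch, not CUDA code: the names `block_idx`, `block_dim`, and `thread_idx` stand in for CUDA's `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, and the function name is a hypothetical choice.

```python
def emulated_vec_add(a, b, block_dim=256):
    """CPU emulation of a 1D vector-add launch (illustrative, not real CUDA)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    c = [0] * n
    # Ceiling division: launch enough blocks to cover all n elements.
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):          # stands in for blockIdx.x
        for thread_idx in range(block_dim):    # stands in for threadIdx.x
            # Global indexing formula: one unique element per thread.
            i = block_idx * block_dim + thread_idx
            # Boundary guard: the last block may have threads past the end.
            if i < n:
                c[i] = a[i] + b[i]
    return c


if __name__ == "__main__":
    # 5 elements with 4 threads per block: 2 blocks, 3 threads are guarded out.
    print(emulated_vec_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], block_dim=4))
```

Because the grid size is derived from the data size rather than from the hardware, the same code covers any input length, which is the idea behind transparent scalability.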

5 Lesson 5

This lesson explores the "Memory Wall" in GPU computing, where computational throughput outpaces memory bandwidth, creating a significant performance bottleneck. Students will learn to mitigate these constraints by implementing shared memory tiling strategies, optimizing data reuse, and managing hardware resource limits to maximize occupancy.
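
To make the tiling idea concrete, here is a minimal Python sketch of a tiled square matrix multiplication. It is a sequential stand-in, assuming small local lists play the role of shared memory; the function name and the `tile` parameter are illustrative and do not come from the course.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply square matrices A and B tile by tile (sequential illustration)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for row0 in range(0, n, tile):
        for col0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                row_end = min(row0 + tile, n)
                col_end = min(col0 + tile, n)
                k_end = min(k0 + tile, n)
                # "Load" phase: copy one tile of A and one of B into small
                # local buffers (a stand-in for shared memory).
                a_tile = [[A[r][k] for k in range(k0, k_end)]
                          for r in range(row0, row_end)]
                b_tile = [[B[k][c] for c in range(col0, col_end)]
                          for k in range(k0, k_end)]
                # Compute phase: every loaded value is reused across the tile.
                for r, a_row in enumerate(a_tile):
                    for c in range(col_end - col0):
                        partial = 0.0
                        for k, a_val in enumerate(a_row):
                            partial += a_val * b_tile[k][c]
                        C[row0 + r][col0 + c] += partial
    return C


if __name__ == "__main__":
    M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(tiled_matmul(M, M, tile=2))
```

In this sketch each value copied into a tile buffer is read several times during the compute phase, which is the data reuse that tiling relies on to reduce traffic to slower memory.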

6 Lesson 6

This lesson explores the SIMT execution model, focusing on how hardware organizes threads into 32-thread warps and linearizes them for efficient scheduling. Students will learn to evaluate performance through warp partitioning, branch divergence analysis, and memory access patterns to optimize GPU kernel utilization.
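
The linearization and warp partitioning described above can be sketched in a few lines of Python. This is an illustration under the assumption that thread indices are linearized with x varying fastest, then y, then z, as in CUDA; the helper names are hypothetical.

```python
WARP_SIZE = 32  # warp width stated in the lesson description


def linear_thread_id(thread_idx, block_dim):
    """Flatten an (x, y, z) thread index; x varies fastest, then y, then z."""
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_of(thread_idx, block_dim):
    """Warp number within the block for a given thread index."""
    return linear_thread_id(thread_idx, block_dim) // WARP_SIZE


def divergent_warps(block_dim, takes_branch):
    """Count warps whose threads disagree on a branch predicate."""
    dim_x, dim_y, dim_z = block_dim
    outcomes = {}
    for z in range(dim_z):
        for y in range(dim_y):
            for x in range(dim_x):
                idx = (x, y, z)
                warp = warp_of(idx, block_dim)
                outcomes.setdefault(warp, set()).add(bool(takes_branch(idx)))
    return sum(1 for seen in outcomes.values() if len(seen) > 1)


if __name__ == "__main__":
    block = (16, 16, 1)
    # Branching on x splits every warp; branching on y at a row boundary
    # that aligns with warp boundaries splits none.
    print(divergent_warps(block, lambda t: t[0] < 8))
    print(divergent_warps(block, lambda t: t[1] < 8))
```

In a 16x16 block each warp spans two consecutive rows, so a condition on `threadIdx.x` diverges inside every warp, while a condition on `threadIdx.y` that falls on a warp boundary does not.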

7 Lesson 7

This lesson explores how Excess Encoding (biased representation) enables high-speed hardware sorting by ensuring that bit patterns maintain a monotonic relationship with their numerical values. By replacing the sign-bit discontinuity of Two's Complement with this biased format, engineers can utilize simple, efficient unsigned comparators to perform rapid operations like Z-buffering in parallel processors.
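
The monotonic property can be checked with a short Python sketch. It assumes the common bias of 2^(bits-1) - 1 (for example, excess-3 for a 3-bit field); the function names are illustrative only.

```python
def excess_encode(value, bits):
    """Biased (excess) code, assuming a bias of 2**(bits - 1) - 1."""
    bias = 2 ** (bits - 1) - 1
    code = value + bias
    if not 0 <= code < 2 ** bits:
        raise ValueError("value not representable in this excess format")
    return code


def twos_complement_encode(value, bits):
    """Two's complement bit pattern read back as an unsigned integer."""
    if not -(2 ** (bits - 1)) <= value < 2 ** (bits - 1):
        raise ValueError("value not representable in two's complement")
    return value & (2 ** bits - 1)


if __name__ == "__main__":
    values = [-3, -2, -1, 0, 1, 2, 3]
    # Sorting by the unsigned excess code preserves numeric order.
    print(sorted(values, key=lambda v: excess_encode(v, 3)))
    # Sorting by the unsigned two's complement pattern puts negatives last.
    print(sorted(values, key=lambda v: twos_complement_encode(v, 3)))
```

With the excess code, an unsigned comparison of bit patterns orders the values correctly. With two's complement, the same unsigned comparison ranks negative values above positive ones because of the sign bit.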

8 Lesson 8

This lesson explores the computational challenges of non-Cartesian MRI reconstruction, where spiral trajectories require iterative solvers or gridding instead of standard Fast Fourier Transforms. Students will learn how to overcome these bottlenecks by leveraging GPU-based massive parallelism, specifically focusing on voxel-to-thread mapping to optimize reconstruction speed for time-sensitive clinical applications like Sodium MRI.

9 Lesson 9

This lesson explores the use of Direct Coulomb Summation (DCS) and GPU acceleration to generate electrostatic potential maps for molecular visualization. Students will learn to optimize rendering pipelines through techniques like loop unrolling and constant memory broadcasting to efficiently handle large-scale atomic data.
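
A minimal sequential sketch of Direct Coulomb Summation follows, assuming each atom is an `(x, y, z, charge)` tuple and omitting physical constants and units. The function name and data layout are assumptions made for illustration, not the course's implementation.

```python
import math


def coulomb_potential_map(atoms, grid_points):
    """Sum charge / distance over all atoms for each grid point.

    atoms:       iterable of (x, y, z, charge) tuples (assumed layout).
    grid_points: iterable of (x, y, z) tuples.
    Physical constants are omitted, and no grid point may coincide with an atom.
    """
    potentials = []
    # On a GPU, each iteration of this outer loop could map to one thread.
    for gx, gy, gz in grid_points:
        total = 0.0
        for ax, ay, az, charge in atoms:
            distance = math.sqrt((gx - ax) ** 2 + (gy - ay) ** 2 + (gz - az) ** 2)
            total += charge / distance
        potentials.append(total)
    return potentials


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0, 1.0), (4.0, 0.0, 0.0, -1.0)]
    grid = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(coulomb_potential_map(atoms, grid))
```

Every grid point reads the same atom list in the same order, which is why a broadcast-friendly store such as constant memory suits the atom data in a GPU version.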

10 Lesson 10

This lesson explores the transition from sequential processing to parallel computing, emphasizing how computational thinking helps overcome the power wall and frequency limits. Students will learn to evaluate parallel algorithm performance, manage the trade-offs between numerical precision and execution speed, and apply problem decomposition to optimize distributed systems.

11 Lesson 11

This lesson introduces the OpenCL framework as a solution for managing heterogeneous computing environments, where a host CPU orchestrates tasks across diverse accelerators like GPUs and FPGAs. Students will learn to utilize the OpenCL platform layer for hardware discovery, understand the device model's hierarchy, and implement portable, efficient kernels that adapt to different architectural requirements.

12 Lesson 12

This lesson explores the evolution of GPU architecture from graphics-focused designs to the compute-first Fermi generation, which introduced unified memory hierarchies and IEEE 754-2008 compliance. Students will learn how these advancements, including hardware-managed caching and improved thread scheduling, enable complex scientific computing and general-purpose programming beyond traditional 2D grid tasks.