AI032 Professional

Programming Massively Parallel Processors: A Hands-on Approach

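This course provides a comprehensive introduction to GPU computing and parallel programming using the CUDA C environment. It covers GPU architecture, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated through real-world case studies such as MRI reconstruction and molecular visualization.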

4.9

36.0h

569 learners

12 lessons

Artificial Intelligence

Course Overview

📚 Content Overview

Master the techniques of high-performance parallel computing through a practical, hands-on guide.

Authors: David B. Kirk, Wen-mei W. Hwu

Acknowledgments: Ian Buck, John Nickolls, the NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hegde, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, Cyril Zeller.

🎯 Learning Objectives

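Clearly distinguish the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
Identify the main components of a modern GPU architecture, such as streaming multiprocessors (SMs) and the memory hierarchy.
Apply Amdahl's law to calculate theoretical speedup and understand the impact of serial bottlenecks.
Contrast the architectural differences between fixed-function pipelines and programmable unified processor arrays.
Explain the role "GPGPU" played as an intermediate step and the limitations of early shader programming models.
Analyze how hardware features such as atomic operations, barrier synchronization, and double-precision support enabled the shift to scalable general-purpose computing.
Identify and exploit data parallelism in the matrix multiplication algorithm.
Implement device memory management (allocation, host-device data transfer, and deallocation).
Build and launch CUDA kernels using appropriate thread indexing and grid/block configurations.
Design multidimensional thread hierarchies (grids and blocks) to map complex data structures onto GPU hardware.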
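
As a rough illustration of the Amdahl's law objective above: if a fraction p of a program's runtime can be parallelized across N processors, the theoretical speedup is S = 1 / ((1 - p) + p / N). The following is a minimal Python sketch of that formula; the function names `amdahl_speedup` and `speedup_limit` are hypothetical and not taken from the course material.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: float) -> float:
    """Theoretical speedup when `parallel_fraction` of the runtime
    is spread evenly over `n_processors` (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_processors <= 0:
        raise ValueError("n_processors must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the processor count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # 95% parallel code on 100 processors: the 5% serial part dominates.
    print(f"{amdahl_speedup(0.95, 100):.2f}x on 100 processors")
    print(f"{speedup_limit(0.95):.2f}x at best")
```

With 95% of the work parallelized, 100 processors give a speedup of about 16.8x, and no processor count can exceed about 20x, which is why the serial portion is treated as the bottleneck.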

12 lessons in total · estimated 36.0h

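1 Foundations of Parallel Computing and GPU Architecture

2 The Evolution and Future Outlook of GPU Computing

3 CUDA Program Structure and Memory Management

4 Advanced CUDA Threads and Scheduling

9 Case Study: Molecular Visualization and Multi-GPU Execution

10 Computational Thinking and Parallel Algorithm Selection

11 Introduction to the OpenCL Programming Model

12 Modern GPU Features and Future Outlook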

Lessons

1 Lesson 1

This lesson explores the evolution of parallel computing, highlighting the "Great Divergence" where GPUs surpassed CPUs in performance by prioritizing throughput-oriented architecture over sequential latency. Students will learn to differentiate between these processing models, understand the impact of the "Power Wall" on CPU design, and analyze how GPU transistor budgeting enables massive parallel computation.

2 Lesson 2

This lesson explores the evolution of GPU architecture, focusing on the "real-time imperative" that necessitated a shift from serial CPU processing to parallel hardware acceleration. Students will learn how early innovations like SLI and the "wide and slow" design philosophy enabled the high-throughput performance required to meet strict frame-time budgets in modern computing.

3 Lesson 3

This lesson explores the CUDA execution model, focusing on the architectural differences between the latency-optimized CPU (Host) and the throughput-optimized GPU (Device). Students will learn how to manage the lifecycle of a CUDA kernel, allocate device memory with cudaMalloc, transfer data between host and device with cudaMemcpy, and organize threads into grids and blocks to perform parallel computations.

4 Lesson 4

This lesson explores the fundamentals of CUDA kernel execution, focusing on the transition from CPU-based iteration to data-centric GPU parallelism. Students will learn to implement the global indexing formula, manage execution configurations for transparent scalability, and apply boundary guards to ensure safe memory access across multidimensional data.
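
The global indexing formula and boundary guard mentioned above can be sketched in plain Python by simulating a one-dimensional launch on the CPU. This is only an illustration under stated assumptions: `vec_add_kernel` and `launch_vec_add` are hypothetical names, and the nested loops stand in for threads that a GPU would run in parallel. In CUDA C the index would be computed as `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def vec_add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
    """Body executed by one simulated thread."""
    # Global index: which element this thread is responsible for.
    i = block_idx * block_dim + thread_idx
    # Boundary guard: the last block may contain threads past the end.
    if i < n:
        c[i] = a[i] + b[i]


def launch_vec_add(a, b, threads_per_block=256):
    """Simulate a 1D grid launch; returns the result and the block count."""
    n = len(a)
    c = [0] * n
    # Ceiling division so every element is covered by some thread.
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            vec_add_kernel(block_idx, threads_per_block, thread_idx, a, b, c, n)
    return c, num_blocks


if __name__ == "__main__":
    result, blocks = launch_vec_add(list(range(10)), [10] * 10, threads_per_block=4)
    print(blocks, result)
```

With 10 elements and 4 threads per block, 3 blocks are launched (12 threads), and the guard keeps the two surplus threads from touching memory outside the arrays.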

5 Lesson 5

This lesson explores the "Memory Wall" in GPU computing, where computational throughput outpaces memory bandwidth, creating a significant performance bottleneck. Students will learn to mitigate these constraints by implementing shared memory tiling strategies, optimizing data reuse, and managing hardware resource limits to maximize occupancy.

6 Lesson 6

This lesson explores the SIMT execution model, focusing on how hardware linearizes the threads of a block and partitions them into 32-thread warps for efficient scheduling. Students will learn to evaluate performance through warp partitioning, branch divergence analysis, and memory access patterns to optimize GPU kernel utilization.
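
To make warp partitioning concrete, here is a small Python sketch assuming the usual CUDA convention that threads in a block are linearized with x varying fastest, then y, then z, and then split into consecutive groups of 32. The helper names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the lesson summary


def linear_thread_id(thread_idx, block_dim):
    """Flatten an (x, y, z) thread index; x varies fastest, then y, then z."""
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_and_lane(thread_idx, block_dim):
    """Return (warp number within the block, lane within the warp)."""
    tid = linear_thread_id(thread_idx, block_dim)
    return tid // WARP_SIZE, tid % WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed for a block, rounding up for a partial warp."""
    dim_x, dim_y, dim_z = block_dim
    total_threads = dim_x * dim_y * dim_z
    return (total_threads + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    print(warp_and_lane((0, 2, 0), (16, 16, 1)))  # first thread of the third row
    print(warps_per_block((10, 10, 1)))
```

In a 16x16 block each warp spans two rows of threads. A 10x10 block has 100 threads and therefore needs 4 warps, the last of which has only 4 active lanes, which is one reason block sizes are often chosen as multiples of 32.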

7 Lesson 7

This lesson explores how Excess Encoding (biased representation) enables high-speed hardware sorting by ensuring that bit patterns maintain a monotonic relationship with their numerical values. By replacing the sign-bit discontinuity of Two's Complement with this biased format, engineers can utilize simple, efficient unsigned comparators to perform rapid operations like Z-buffering in parallel processors.
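
The monotonic property described above can be checked with a small Python sketch. It assumes a 3-bit field with a bias of 2^(3-1) - 1 = 3 (the convention used for IEEE 754 exponents); the function names are hypothetical. Sorting by the raw unsigned bit pattern preserves numerical order for excess codes but not for two's complement.

```python
BITS = 3
BIAS = 2 ** (BITS - 1) - 1  # 3 for a 3-bit field (assumed bias convention)


def excess_encode(value):
    """Excess (biased) code: stored pattern = value + BIAS."""
    code = value + BIAS
    if not 0 <= code < 2 ** BITS:
        raise ValueError("value out of range for this field width")
    return code


def twos_complement_encode(value):
    """Two's complement bit pattern, read back as an unsigned integer."""
    if not -(2 ** (BITS - 1)) <= value < 2 ** (BITS - 1):
        raise ValueError("value out of range for this field width")
    return value & (2 ** BITS - 1)


def order_by_unsigned_pattern(values, encode):
    """Sort values the way a plain unsigned comparator would see them."""
    return sorted(values, key=encode)


if __name__ == "__main__":
    values = [-3, -2, -1, 0, 1, 2, 3]
    for v in values:
        print(f"{v:2d}  excess={excess_encode(v):03b}  "
              f"twos={twos_complement_encode(v):03b}")
    print(order_by_unsigned_pattern(values, excess_encode))
    print(order_by_unsigned_pattern(values, twos_complement_encode))
```

With excess encoding the unsigned comparison yields the values in ascending order, while two's complement places every negative value after the positive ones because of the set sign bit.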

8 Lesson 8

This lesson explores the computational challenges of non-Cartesian MRI reconstruction, where spiral trajectories require iterative solvers or gridding instead of a standard Fast Fourier Transform. Students will learn how to overcome these bottlenecks by leveraging GPU-based massive parallelism, specifically focusing on voxel-to-thread mapping to optimize reconstruction speed for time-sensitive clinical applications like sodium MRI.

9 Lesson 9

This lesson explores the use of Direct Coulomb Summation (DCS) and GPU acceleration to generate electrostatic potential maps for molecular visualization. Students will learn to optimize rendering pipelines through techniques like loop unrolling and constant memory broadcasting to efficiently handle large-scale atomic data.
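
For orientation, Direct Coulomb Summation adds up, for every grid point, the contribution charge / distance from every atom. The following sequential Python sketch computes one 2D slice of such a potential map. It is a simplified illustration, not the course's kernel: the function name `dcs_slice` and the atom tuple layout are hypothetical, physical constants are folded into the charge values, and skipping atoms at zero distance is an assumption made here to avoid division by zero.

```python
import math


def dcs_slice(atoms, nx, ny, spacing, z):
    """Potential on an nx-by-ny grid slice at height z.

    atoms: list of (x, y, z, charge) tuples (hypothetical layout).
    Returns grid[j][i] for the point (i * spacing, j * spacing, z).
    """
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        y = j * spacing
        for i in range(nx):
            x = i * spacing
            potential = 0.0
            # On a GPU each grid point would typically map to one thread,
            # and this atom loop is the part that benefits from unrolling.
            for ax, ay, az, charge in atoms:
                r = math.sqrt((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2)
                if r > 0.0:  # assumption: ignore an atom sitting on the point
                    potential += charge / r
            grid[j][i] = potential
    return grid


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 1.0, 2.0)]
    for row in dcs_slice(atoms, nx=2, ny=2, spacing=1.0, z=0.0):
        print(row)
```

Every grid point reads the same atom data in the same order, which is the access pattern that makes constant memory broadcasting attractive in the GPU version.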

10 Lesson 10

This lesson explores the transition from sequential processing to parallel computing, emphasizing how computational thinking helps overcome the power wall and frequency limits. Students will learn to evaluate parallel algorithm performance, manage the trade-offs between numerical precision and execution speed, and apply problem decomposition to optimize distributed systems.

11 Lesson 11

This lesson introduces the OpenCL framework as a solution for managing heterogeneous computing environments, where a host CPU orchestrates tasks across diverse accelerators like GPUs and FPGAs. Students will learn to utilize the OpenCL platform layer for hardware discovery, understand the device model's hierarchy, and implement portable, efficient kernels that adapt to different architectural requirements.

12 Lesson 12

This lesson explores the evolution of GPU architecture from graphics-focused designs to the compute-first Fermi generation, which introduced unified memory hierarchies and IEEE 754-2008 compliance. Students will learn how these advancements, including hardware-managed caching and improved thread scheduling, enable complex scientific computing and general-purpose programming beyond traditional 2D grid tasks.