AI032 Professional

Programming Massively Parallel Processors: A Hands-on Approach

This course provides a comprehensive introduction to GPU computing and parallel programming using the CUDA C environment. It covers GPU architectures, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated through real-world case studies like MRI reconstruction and molecular visualization.

Rating: 4.9 · Duration: 36h · 569 students · Category: Artificial Intelligence

Course Overview

📚 Content Summary


Master the art of high-performance parallel computing with a practical, hands-on guide to CUDA and GPU architectures.

Authors: David B. Kirk and Wen-mei W. Hwu

Acknowledgments: Ian Buck, John Nickolls, NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hegde, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller.

🎯 Learning Objectives

  1. Distinguish between the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
  2. Identify the key components of a modern GPU architecture, including Streaming Multiprocessors (SMs) and memory structures.
  3. Apply Amdahl's Law to calculate theoretical speedup and identify the impact of sequential bottlenecks.
  4. Contrast the architectural differences between fixed-function pipelines and programmable unified processor arrays.
  5. Explain the role of "GPGPU" as an intermediate step and the restrictions of early shader programming models.
  6. Analyze how hardware features like atomic operations, barrier synchronization, and double-precision support enabled the transition to scalable general-purpose computing.
  7. Identify and exploit data parallelism within matrix-matrix multiplication algorithms.
  8. Implement device memory management including allocation, data transfer between host and device, and deallocation.
  9. Construct and launch CUDA kernels using appropriate thread indexing and grid/block configurations.
  10. Design multidimensional thread hierarchies (Grids and Blocks) to map complex data structures to GPU hardware.

🔹 Lesson 1: Introduction to Parallel Computing and GPU Architectures

Overview: This lesson explores the fundamental shift from sequential to parallel computing, driven by the diverging design philosophies of CPUs and GPUs. Students will examine the "Multicore" versus "Many-core" trajectories, understand the hardware architecture that allows GPUs to achieve massive throughput, and learn the mathematical constraints of speedup via Amdahl's Law.

Learning Outcomes:

  • Distinguish between the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
  • Identify the key components of a modern GPU architecture, including Streaming Multiprocessors (SMs) and memory structures.
  • Apply Amdahl's Law to calculate theoretical speedup and identify the impact of sequential bottlenecks.

🔹 Lesson 2: The Evolution and Future of GPU Computing

Overview: This lesson traces the architectural journey of the Graphics Processing Unit (GPU) from its origins as a specialized fixed-function hardware for rendering triangles to its current state as a powerful, unified, general-purpose parallel processor. Students will explore the shift from rigid graphics pipelines to programmable shaders, the emergence of the GPGPU movement, and the modern scalable architectures that drive current scientific and engineering simulations.

Learning Outcomes:

  • Contrast the architectural differences between fixed-function pipelines and programmable unified processor arrays.
  • Explain the role of "GPGPU" as an intermediate step and the restrictions of early shader programming models.
  • Analyze how hardware features like atomic operations, barrier synchronization, and double-precision support enabled the transition to scalable general-purpose computing.

🔹 Lesson 3: CUDA Program Structure and Memory Management

Overview: This lesson covers the fundamental architecture of a CUDA program, emphasizing the distinction between Host (CPU) and Device (GPU) execution. Students will learn to identify data parallelism in matrix operations, manage separate memory spaces using the CUDA API, and organize parallel execution through a hierarchy of grids, blocks, and threads using the Single-Program, Multiple-Data (SPMD) style.

Learning Outcomes:

  • Identify and exploit data parallelism within matrix-matrix multiplication algorithms.
  • Implement device memory management including allocation, data transfer between host and device, and deallocation.
  • Construct and launch CUDA kernels using appropriate thread indexing and grid/block configurations.
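The host/device structure this lesson describes can be sketched with the canonical vector-addition example (names illustrative; error checking and host-array initialization omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Kernel (runs on the device): each thread computes one element.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];     // guard against out-of-range threads
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);

    // 1. Allocate device memory.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // 2. Copy inputs from host to device.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // 3. Launch: enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);

    // 4. Copy the result back, then free device memory.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```

Every thread executes the same kernel body (the SPMD style); the index computed from the built-in variables determines which data element each thread owns.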

🔹 Lesson 4: Advanced CUDA Threading and Scheduling

Overview: This lesson explores the hierarchical organization of threads in CUDA, focusing on how multidimensional indexing maps to physical data and hardware resources. It details the mechanisms of barrier synchronization and transparent scalability, concluding with the architectural principles of thread assignment and warp-based scheduling used to achieve latency tolerance in high-performance computing.

Learning Outcomes:

  • Design multidimensional thread hierarchies (Grids and Blocks) to map complex data structures to GPU hardware.
  • Implement precise data indexing using built-in CUDA variables (blockIdx, threadIdx, blockDim).
  • Apply barrier synchronization to ensure data integrity while maintaining transparent scalability across different GPU architectures.

🔹 Lesson 5: Memory Optimization and Shared Memory Tiling

Overview: This lesson explores how memory bandwidth and resource constraints act as primary bottlenecks in parallel computing. It details the use of "tiling" to reduce global memory traffic and explains the critical role of synchronization barriers (__syncthreads()) and the strategic choice between registers and shared memory to optimize performance.

Learning Outcomes:

  • Analyze how register and shared memory limits determine the level of parallelism (occupancy) in a kernel.
  • Quantify the reduction in global memory bandwidth consumption achieved through tiling techniques.
  • Identify the necessity of synchronization functions to maintain data integrity during shared memory access.
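The tiling technique and barrier placement this lesson covers can be sketched with the classic tiled matrix multiplication kernel (a simplified version assuming square matrices whose width is a multiple of the tile width):

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiplication: each block computes one TILE_WIDTH x
// TILE_WIDTH tile of P = M * N. Each element of M and N is loaded from
// global memory once per tile instead of once per output element,
// reducing global memory traffic by a factor of TILE_WIDTH.
__global__ void matMulTiled(const float *M, const float *N, float *P,
                            int Width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        // Cooperatively stage one tile of M and one tile of N.
        Ms[threadIdx.y][threadIdx.x] =
            M[row * Width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] =
            N[(t * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();   // all loads complete before any element is used

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();   // all uses complete before the next tile loads
    }
    P[row * Width + col] = acc;
}
```

Note the two barriers: without the first, a thread could read a shared-memory element before its producer has written it; without the second, a fast thread could overwrite a tile a slower thread is still reading. Shared-memory usage per block (2 × 16 × 16 × 4 bytes here) is exactly the kind of resource limit that constrains occupancy.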

🔹 Lesson 6: Performance Analysis and SIMT Execution

Overview: This lesson explores the architectural and algorithmic considerations essential for optimizing CUDA kernels. It transitions from basic execution models—specifically the Single-Instruction, Multiple-Thread (SIMT) unit and warp partitioning—to advanced performance tuning techniques including memory coalescing, tiled matrix multiplication, and the dynamic partitioning of Streaming Multiprocessor (SM) resources.

Learning Outcomes:

  • Analyze the mapping of multi-dimensional thread blocks to the hardware’s linear warp execution order.
  • Evaluate and minimize control flow divergence in parallel reduction algorithms.
  • Optimize global memory bandwidth by implementing memory coalescing and tiled data access patterns.
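The divergence issue in parallel reduction can be sketched as follows (assumes a dynamically allocated shared segment of 2 × blockDim.x floats, e.g. launched as `reduceSum<<<grid, block, 2 * block * sizeof(float)>>>(...)`):

```cuda
// Sum reduction of one 2*blockDim.x-element segment per block. The
// stride starts large and halves each step, so the active threads are
// always a contiguous prefix (0 .. stride-1): whole warps are either
// fully active or fully idle, avoiding intra-warp divergence. (An
// interleaved version with stride = 1, 2, 4, ... instead keeps threads
// of the same warp on different control-flow paths, and the SIMT
// hardware must serialize both paths.)
__global__ void reduceSum(const float *data, float *blockSums) {
    extern __shared__ float partial[];        // 2 * blockDim.x floats
    unsigned t = threadIdx.x;
    unsigned base = blockIdx.x * blockDim.x * 2;
    partial[t] = data[base + t];
    partial[t + blockDim.x] = data[base + t + blockDim.x];

    for (unsigned stride = blockDim.x; stride >= 1; stride >>= 1) {
        __syncthreads();                      // previous step finished
        if (t < stride)
            partial[t] += partial[t + stride];
    }
    if (t == 0) blockSums[blockIdx.x] = partial[0];
}
```

With 256-thread blocks, only the last five steps (stride < 32) leave a partially active warp; the interleaved variant diverges in every warp at every step.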

🔹 Lesson 7: Floating-Point Arithmetic and Numerical Accuracy

Overview: This lesson covers the fundamental architecture of floating-point numbers, focusing on the IEEE 754 standard components: sign, excess-encoded exponent, and normalized mantissa. Students will explore how these bit patterns map to a discrete number line and how the limitations of this representation affect the accuracy of complex algorithms like large-scale summations.

Learning Outcomes:

  • Deconstruct the floating-point format to calculate numeric values from bit patterns using normalized representation and excess encoding.
  • Visualize the distribution of representable numbers on a number line and explain the impact of bit allocation between exponent and mantissa.
  • Quantify numerical inaccuracy using ULP and identify how different rounding modes contribute to error.

🔹 Lesson 8: Case Study: Parallelizing MRI Reconstruction

Overview: This lesson explores the parallelization of advanced Magnetic Resonance Imaging (MRI) reconstruction on GPUs. It focuses on the iterative reconstruction process for non-Cartesian trajectories, specifically optimizing the computationally intensive F^H d kernel through loop transformations, constant memory management, data layout reorganization, and the use of hardware-accelerated trigonometric functions.

Learning Outcomes:

  • Understand the transition from Cartesian FFT-based reconstruction to iterative linear-solver-based algorithms for non-Cartesian k-space data.
  • Apply loop fission and loop interchange to transform sequential C code into a structure suitable for massive CUDA thread mapping.
  • Optimize memory throughput using constant memory chunking and Array-of-Structs (AoS) data layouts.

🔹 Lesson 9: Case Study: Molecular Visualization and Multi-GPU Execution

Overview: This lesson explores the practical application of GPU computing to molecular visualization, specifically using the Direct Coulomb Summation (DCS) method to calculate electrostatic potential maps. Students will progress from a basic kernel implementation to highly optimized versions that leverage instruction unrolling, memory coalescing, and padding.

Learning Outcomes:

  • Implement a Direct Coulomb Summation (DCS) kernel using CUDA constant memory and global memory latency-hiding techniques.
  • Optimize kernel performance through instruction unrolling and the reuse of common coordinate calculations.
  • Apply memory coalescing and padding strategies to align GPU global memory accesses for maximum bandwidth.
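The basic DCS kernel (before unrolling) can be sketched as below. Assumptions for illustration: atom x, y, z, and charge are packed into a `float4` in constant memory, each thread computes one grid point of a 2-D slice, and `z` holds the atom's distance from the slice plane; `rsqrtf` is CUDA's hardware-accelerated reciprocal square root:

```cuda
#define MAX_ATOMS 4000                 // a chunk that fits constant memory
__constant__ float4 atoms[MAX_ATOMS];  // x, y, z, charge (in .w) per atom

// Direct Coulomb Summation: each thread accumulates q_i / r_i over all
// atoms for one grid point. All threads read the same atoms[i] at the
// same time, so constant memory broadcasts it to the whole warp; the
// final store is coalesced when rowWidth is padded to a multiple of the
// thread-block width.
__global__ void dcsKernel(float *energyGrid, int rowWidth,
                          float gridSpacing, int numAtoms) {
    int xi = blockIdx.x * blockDim.x + threadIdx.x;
    int yi = blockIdx.y * blockDim.y + threadIdx.y;
    float x = xi * gridSpacing;
    float y = yi * gridSpacing;

    float energy = 0.0f;
    for (int i = 0; i < numAtoms; ++i) {
        float dx = x - atoms[i].x;
        float dy = y - atoms[i].y;     // dy*dy + dz*dz is reusable when
        float dz = atoms[i].z;         // the loop is unrolled over x
        energy += atoms[i].w * rsqrtf(dx * dx + dy * dy + dz * dz);
    }
    energyGrid[yi * rowWidth + xi] += energy;
}
```

The optimized versions in the lesson unroll this loop so each thread computes several adjacent grid points, reusing the `dy*dy + dz*dz` term across them.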

🔹 Lesson 10: Computational Thinking and Parallel Algorithm Selection

Overview: This lesson explores the transition from sequential thinking to parallel problem-solving by focusing on the goals of parallel programming and the strategic selection of algorithms. Students will learn to decompose problems into parallelizable units, apply computational thinking to bridge the gap between domain science and hardware architecture, and evaluate algorithm performance.

Learning Outcomes:

  • Identify the primary goals of parallel programming and calculate theoretical speedup using Amdahl’s Law.
  • Differentiate between task-level and data-level decomposition and apply atom-centric (scatter) versus grid-centric (gather) strategies.
  • Evaluate and select parallel algorithms based on criteria such as memory bandwidth, computational complexity, and architectural constraints.

🔹 Lesson 11: Introduction to the OpenCL Programming Model

Overview: This lesson introduces OpenCL as a framework for heterogeneous parallel computing, focusing on its data parallelism model and hierarchical hardware abstraction. Students will learn to map OpenCL’s NDRange and memory structures to CUDA equivalents and master the host-side management of devices through a dynamic compilation model.

Learning Outcomes:

  • Map OpenCL parallelism and memory hierarchies to CUDA-specific architectures (e.g., mapping Work-groups to Blocks and Local Memory to Shared Memory).
  • Implement OpenCL kernel functions and manage the host-side execution environment using Contexts and Command Queues.
  • Execute the dynamic compilation workflow to build kernels from source code at runtime.
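The kernel-language side of the CUDA-to-OpenCL mapping can be sketched directly (host-side context, command-queue, and `clBuildProgram` setup omitted; this shows only the index-space correspondence):

```c
// CUDA version for comparison:
//   __global__ void vadd(const float *a, const float *b, float *c, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) c[i] = a[i] + b[i];
//   }
//
// The same kernel in OpenCL C. get_global_id(0) plays the role of
// blockIdx.x * blockDim.x + threadIdx.x; a work-group corresponds to a
// thread block, a work-item to a thread, and __local memory to
// __shared__ memory.
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c,
                   int n) {
    int i = get_global_id(0);   // global index within the NDRange
    if (i < n) c[i] = a[i] + b[i];
}
```

Because OpenCL compiles this source string at runtime via the host API, the same kernel text can target GPUs and CPUs from different vendors.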

🔹 Lesson 12: Modern GPU Features and Future Outlook

Overview: This lesson explores the architectural and functional evolution of GPUs, focusing on the transition toward sophisticated memory management, enhanced kernel execution capabilities, and increased core performance. Students will examine how features like Unified Device Memory Space and kernel-level function calls transition the GPU into a general-purpose processor.

Learning Outcomes:

  • Explain the significance of Memory Architecture Evolution and the move toward a 64-bit Unified Device Memory Space.
  • Analyze how Enhanced Atomic Operations and kernel-level Function Calls enable the implementation of complex data structures and algorithms.
  • Evaluate the performance impacts of Simultaneous Kernel Execution, Double-Precision Speed improvements, and Control Flow Efficiency in modern GPU environments.