AI032 Expert

Programming Massively Parallel Processors: A Hands-on Approach

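This course provides a comprehensive introduction to GPU computing and parallel programming in the CUDA C environment. It covers GPU architecture, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated with real-world case studies such as MRI reconstruction and molecular visualization.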

4.9

36.0h

569 students

12 lessons


Artificial Intelligence


Course Overview

📚 Content Summary


Master the art of high-performance parallel computing through a practical, hands-on guide.

Authors: David B. Kirk, Wen-mei W. Hwu

Acknowledgments: Ian Buck, John Nickolls, the NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hegde, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, Cyril Zeller.

🎯 Learning Objectives

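Distinguish the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
Identify the core components of a modern GPU architecture, such as streaming multiprocessors (SMs) and the memory hierarchy.
Apply Amdahl's Law to calculate theoretical speedup and understand the impact of sequential bottlenecks.
Compare the architectural differences between fixed-function pipelines and programmable unified processor arrays.
Explain the role of GPGPU as an intermediate step and the constraints of the early shader programming model.
Analyze how hardware features such as atomic operations, barrier synchronization, and double-precision support enabled the transition to scalable general-purpose computing.
Identify and exploit data parallelism in the matrix-matrix multiplication algorithm.
Implement device memory management, including allocation, data transfer between host and device, and deallocation.
Create and launch CUDA kernels with correct thread indexing and grid/block configuration.
Design multidimensional thread hierarchies (grids and blocks) to map complex data structures onto GPU hardware.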
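The Amdahl's Law objective above can be made concrete with a small calculation. The following is a minimal Python sketch, not course material: the function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and the model assumes the runtime splits cleanly into a serial part and a perfectly accelerated parallel part.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime is accelerated."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    # New runtime = untouched serial part + accelerated parallel part.
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the parallel part becomes infinitely fast."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if parallel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # 95% of the work runs 100x faster on the GPU; 5% stays sequential.
    print(amdahl_speedup(0.95, 100.0))
    print(amdahl_limit(0.95))
```

With 95% of the runtime accelerated 100x, the overall speedup is only about 16.8x, and no amount of extra parallel hardware can push it past 20x, which is the sequential bottleneck the objective refers to.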

Lessons: 12 in total · estimated 36.0h

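1 Introduction to Parallel Computing and GPU Architecture

2 The Evolution and Future of GPU Computing

3 CUDA Program Structure and Memory Management

4 Advanced CUDA Threading and Scheduling

5 Memory Optimization and Shared Memory Tiling

6 Performance Analysis and SIMT Execution

7 Floating-Point Arithmetic and Numerical Accuracy

8 Case Study: Parallelizing MRI Reconstruction

9 Case Study: Molecular Visualization and Multi-GPU Execution

10 Computational Thinking and Parallel Algorithm Selection

11 Introduction to the OpenCL Programming Model

12 Modern GPU Features and Future Outlook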

Lessons


1 Lesson 1

This lesson explores the evolution of parallel computing, highlighting the "Great Divergence" where GPUs surpassed CPUs in performance by prioritizing throughput-oriented architecture over sequential latency. Students will learn to differentiate between these processing models, understand the impact of the "Power Wall" on CPU design, and analyze how GPU transistor budgeting enables massive parallel computation.

2 Lesson 2

This lesson explores the evolution of GPU architecture, focusing on the "real-time imperative" that necessitated a shift from serial CPU processing to parallel hardware acceleration. Students will learn how early innovations like SLI and the "wide and slow" design philosophy enabled the high-throughput performance required to meet strict frame-time budgets in modern computing.

3 Lesson 3

This lesson explores the CUDA execution model, focusing on the architectural differences between the latency-optimized CPU (Host) and the throughput-optimized GPU (Device). Students will learn how to manage the lifecycle of a CUDA kernel, allocate device memory with cudaMalloc, transfer data between host and device with cudaMemcpy, and organize threads into grids and blocks to perform parallel computations.

4 Lesson 4

This lesson explores the fundamentals of CUDA kernel execution, focusing on the transition from CPU-based iteration to data-centric GPU parallelism. Students will learn to implement the global indexing formula, manage execution configurations for transparent scalability, and apply boundary guards to ensure safe memory access across multidimensional data.
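The global indexing formula and boundary guard mentioned above can be modeled on the CPU. This is a hedged Python sketch rather than CUDA code: the nested loops stand in for the blocks and threads that a GPU would run in parallel, and the names `grid_size` and `launch_vec_add` are hypothetical.

```python
def grid_size(n: int, block_dim: int) -> int:
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_dim - 1) // block_dim


def launch_vec_add(a, b, block_dim=256):
    """Sequentially emulate a 1D vector-add kernel launch."""
    n = len(a)
    out = [0.0] * n
    grid_dim = grid_size(n, block_dim)
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            # Global index: blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            # Boundary guard: the last block may have threads past the end.
            if i < n:
                out[i] = a[i] + b[i]
    return out


if __name__ == "__main__":
    print(grid_size(1000, 256))
    print(launch_vec_add([1, 2, 3], [10, 20, 30], block_dim=2))
```

Because the grid size is derived from the data size rather than hard-coded, the same kernel logic covers any input length, and the guard keeps the surplus threads of the final block from touching memory outside the array.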

5 Lesson 5

This lesson explores the "Memory Wall" in GPU computing, where computational throughput outpaces memory bandwidth, creating a significant performance bottleneck. Students will learn to mitigate these constraints by implementing shared memory tiling strategies, optimizing data reuse, and managing hardware resource limits to maximize occupancy.
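To illustrate why tiling improves data reuse, the sketch below emulates a tiled matrix multiplication in plain Python and counts how many elements are fetched from "global memory". It is an illustrative model under simplifying assumptions (square matrices, a size divisible by the tile width, lists standing in for shared memory); the function name `tiled_matmul` is hypothetical.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices tile by tile; return (product, global_loads)."""
    n = len(a)
    if n % tile != 0:
        raise ValueError("matrix size must be a multiple of the tile width")
    c = [[0.0] * n for _ in range(n)]
    global_loads = 0
    for by in range(0, n, tile):            # one "thread block" per output tile
        for bx in range(0, n, tile):
            for k0 in range(0, n, tile):    # one phase per pair of input tiles
                # Cooperative load: each element is fetched once per block
                # and then reused by every thread in that block.
                a_tile = [[a[by + ty][k0 + tx] for tx in range(tile)]
                          for ty in range(tile)]
                b_tile = [[b[k0 + ty][bx + tx] for tx in range(tile)]
                          for ty in range(tile)]
                global_loads += 2 * tile * tile
                for ty in range(tile):
                    for tx in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[ty][k] * b_tile[k][tx]
                        c[by + ty][bx + tx] += acc
    return c, global_loads


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(tiled_matmul(a, b, tile=1))
    print(tiled_matmul(a, b, tile=2))
```

In this model a tile width of 1 behaves like the untiled kernel, fetching 2·N³ elements, while a tile width of T fetches 2·N³/T, so global memory traffic drops in proportion to the tile width. On real hardware the tile size is bounded by the shared memory and thread limits per block.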

6 Lesson 6

This lesson explores the SIMT execution model, focusing on how hardware organizes threads into 32-thread warps and linearizes them for efficient scheduling. Students will learn to evaluate performance through warp partitioning, branch divergence analysis, and memory access patterns to optimize GPU kernel utilization.
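The linearization and warp partitioning described above can be sketched as a small index calculation. This Python model assumes the usual CUDA ordering, with the x index varying fastest, then y, then z, and the 32-thread warp size stated in the lesson; the helper names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the lesson


def linear_thread_id(tx, ty, tz, block_dim):
    """Flatten a 3D thread index; x varies fastest, then y, then z."""
    bx, by, _bz = block_dim
    return tx + ty * bx + tz * bx * by


def warp_of(tx, ty, tz, block_dim):
    """Index of the warp that a thread falls into within its block."""
    return linear_thread_id(tx, ty, tz, block_dim) // WARP_SIZE


def warps_per_block(block_dim):
    """Warps needed for a block; the last one may be only partly filled."""
    bx, by, bz = block_dim
    threads = bx * by * bz
    return (threads + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    block = (16, 16, 1)
    print(warp_of(0, 2, 0, block))
    print(warps_per_block(block))
    print(warps_per_block((10, 10, 1)))
```

In a 16x16 block each warp spans two rows of threads, so a branch that depends on `threadIdx.y` in steps of two rows does not diverge within a warp, whereas one that depends on odd versus even `threadIdx.x` does. A 10x10 block needs four warps for 100 threads, leaving most lanes of the last warp idle.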

7 Lesson 7

This lesson explores how Excess Encoding (biased representation) enables high-speed hardware sorting by ensuring that bit patterns maintain a monotonic relationship with their numerical values. By replacing the sign-bit discontinuity of Two's Complement with this biased format, engineers can utilize simple, efficient unsigned comparators to perform rapid operations like Z-buffering in parallel processors.
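The monotonic property of excess encoding can be checked directly. The sketch below is a minimal Python illustration that assumes a bias of 2^(n-1) - 1 for an n-bit field, the convention used for IEEE 754 exponents; the function names are hypothetical.

```python
def excess_encode(value: int, bits: int) -> int:
    """Encode a signed integer as an unsigned excess (biased) code."""
    bias = (1 << (bits - 1)) - 1   # assumed bias: 2^(bits-1) - 1
    code = value + bias
    if not 0 <= code < (1 << bits):
        raise ValueError("value out of range for this field width")
    return code


def twos_complement_encode(value: int, bits: int) -> int:
    """Encode a signed integer as its two's complement bit pattern."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value out of range for this field width")
    return value & ((1 << bits) - 1)


if __name__ == "__main__":
    values = list(range(-3, 4))
    print([excess_encode(v, 3) for v in values])
    print([twos_complement_encode(v, 3) for v in values])
```

For 3-bit fields the excess codes of -3 through 3 rise steadily from 0 to 6, so an unsigned comparator orders them correctly. The two's complement patterns for the same values jump from 7 (for -1) down to 0 (for 0), which is the sign-bit discontinuity the lesson describes.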

8 Lesson 8

This lesson explores the computational challenges of non-Cartesian MRI reconstruction, where spiral trajectories require iterative solvers or gridding instead of standard Fast Fourier Transforms. Students will learn how to overcome these bottlenecks by leveraging GPU-based massive parallelism, specifically focusing on voxel-to-thread mapping to optimize reconstruction speed for time-sensitive clinical applications like Sodium MRI.

9 Lesson 9

This lesson explores the use of Direct Coulomb Summation (DCS) and GPU acceleration to generate electrostatic potential maps for molecular visualization. Students will learn to optimize rendering pipelines through techniques like loop unrolling and constant memory broadcasting to efficiently handle large-scale atomic data.
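Direct Coulomb Summation adds up, for every grid point, the contribution charge / distance of every atom. The following is a sequential Python sketch of one 2D slice under stated assumptions: physical constants and units are omitted, atoms are hypothetical `(x, y, z, charge)` tuples, and a grid point that coincides exactly with an atom is simply skipped. On a GPU, each grid point would typically map to one thread.

```python
import math


def dcs_potential_slice(atoms, nx, ny, spacing, z):
    """Electrostatic potential on an nx-by-ny slice at height z."""
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        y = j * spacing
        for i in range(nx):
            x = i * spacing
            total = 0.0
            for ax, ay, az, charge in atoms:
                dx, dy, dz = x - ax, y - ay, z - az
                r = math.sqrt(dx * dx + dy * dy + dz * dz)
                if r > 0.0:  # assumption: skip a point sitting on an atom
                    total += charge / r
            grid[j][i] = total
    return grid


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 1.0, 2.0)]  # one atom above the origin, charge 2
    print(dcs_potential_slice(atoms, nx=2, ny=1, spacing=1.0, z=0.0))
```

The inner loop reads every atom for every grid point, which is why the lesson's optimizations fit: the atom list is read-only and identical for all threads, making it a natural candidate for constant memory, and the terms `dy` and `dz` are shared by neighboring points along x, which loop unrolling can exploit.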

10 Lesson 10

This lesson explores the transition from sequential processing to parallel computing, emphasizing how computational thinking helps overcome the power wall and frequency limits. Students will learn to evaluate parallel algorithm performance, manage the trade-offs between numerical precision and execution speed, and apply problem decomposition to optimize distributed systems.

11 Lesson 11

This lesson introduces the OpenCL framework as a solution for managing heterogeneous computing environments, where a host CPU orchestrates tasks across diverse accelerators like GPUs and FPGAs. Students will learn to utilize the OpenCL platform layer for hardware discovery, understand the device model's hierarchy, and implement portable, efficient kernels that adapt to different architectural requirements.

12 Lesson 12

This lesson explores the evolution of GPU architecture from graphics-focused designs to the compute-first Fermi generation, which introduced unified memory hierarchies and IEEE 754-2008 compliance. Students will learn how these advancements, including hardware-managed caching and improved thread scheduling, enable complex scientific computing and general-purpose programming beyond traditional 2D grid tasks.
