Learn CUDA Programming
Jaegeun Han, Bharatkumar Sharma
Title Page
Copyright and Credits
Learn CUDA Programming
Dedication
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to CUDA Programming
The history of high-performance computing
Heterogeneous computing
Programming paradigm
Low latency versus higher throughput
Programming approaches to GPU
Technical requirements
Hello World from CUDA
Thread hierarchy
GPU architecture
Vector addition using CUDA
Experiment 1 – creating multiple blocks
Experiment 2 – creating multiple threads
Experiment 3 – combining blocks and threads
Why bother with threads and blocks?
Launching kernels in multiple dimensions
Error reporting in CUDA
Data type support in CUDA
Summary
CUDA Memory Management
Technical requirements
NVIDIA Visual Profiler
Global memory/device memory
Vector addition on global memory
Coalesced versus uncoalesced global memory access
Memory throughput analysis
Shared memory
Matrix transpose on shared memory
Bank conflicts and their effect on shared memory
Read-only data/cache
Computer vision – image scaling using texture memory
Registers in GPU
Pinned memory
Bandwidth test – pinned versus pageable
Unified memory
Understanding unified memory page allocation and transfer
Optimizing unified memory with warp per page
Optimizing unified memory using data prefetching
GPU memory evolution
Why do GPUs have caches?
Summary
CUDA Thread Programming
Technical requirements
CUDA threads, blocks, and the GPU
Exploiting a CUDA block and warp
Understanding CUDA occupancy
Setting NVCC to report GPU resource usage
The settings for Linux
Settings for Windows
Analyzing the optimal occupancy using the Occupancy Calculator
Occupancy tuning – bounding register usage
Getting the achieved occupancy from the profiler
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Writing performance measurement code
Performance comparison for the two reductions – global and shared memory
Identifying the application's performance limiter
Finding the performance limiter and optimization
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Maximizing memory bandwidth with grid-strided loops
Balancing the I/O throughput
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Explicit grouped threads operation and race condition avoidance
Dynamic active thread selection
Applying to the parallel reduction
Cooperative Groups to avoid deadlock
Loop unrolling in the CUDA kernel
Atomic operations
Low/mixed precision operations
Half-precision operation
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)
Measuring the performance
Summary
Kernel Execution Model and Optimization Strategies
Technical requirements
Kernel execution with CUDA streams
The usage of CUDA streams
Stream-level synchronization
Working with the default stream
Pipelining the GPU execution
Concept of GPU pipelining
Building a pipelining execution
The CUDA callback function
CUDA streams with priority
Priorities in CUDA
Stream execution with priorities
Kernel execution time estimation using CUDA events
Using CUDA events
Multiple stream estimation
CUDA dynamic parallelism
Understanding dynamic parallelism
Usage of dynamic parallelism
Recursion
Grid-level cooperative groups
Understanding grid-level cooperative groups
Usage of grid_group
CUDA kernel calls with OpenMP
OpenMP and CUDA calls
CUDA kernel calls with OpenMP
Multi-Process Service
Introduction to Message Passing Interface
Implementing an MPI-enabled application
Enabling MPS
Profiling an MPI application and understanding MPS operation
Kernel execution overhead comparison
Implementing three types of kernel executions
Comparison of three executions
Summary
CUDA Application Profiling and Debugging
Technical requirements
Profiling focused target ranges in GPU applications
Limiting the profiling target in code
Limiting the profiling target with time or GPU
Profiling with NVTX
Visual profiling against the remote machine
Debugging a CUDA application with CUDA error
Asserting local GPU values using CUDA assert
Debugging a CUDA application with Nsight Visual Studio Edition
Debugging a CUDA application with Nsight Eclipse Edition
Debugging a CUDA application with CUDA-GDB
Breakpoints of CUDA-GDB
Inspecting variables with CUDA-GDB
Listing kernel functions
Variables investigation
Runtime validation with CUDA-memcheck
Detecting memory out of bounds
Detecting other memory errors
Profiling GPU applications with Nsight Systems
Profiling a kernel with Nsight Compute
Profiling with the CLI
Profiling with the GUI
Performance analysis report
Baseline compare
Source view
Summary
Scalable Multi-GPU Programming
Technical requirements
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination
GPUDirect peer to peer
Single node – multi-GPU Gaussian elimination
Brief introduction to MPI
GPUDirect RDMA
CUDA-aware MPI
Multinode – multi-GPU Gaussian elimination
CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
Application 2 – using multiple streams to run kernels on multiple devices
Additional tricks
Benchmarking an existing system with an InfiniBand network card
NVIDIA Collective Communication Library (NCCL)
Collective communication acceleration using NCCL
Summary
Parallel Programming Patterns in CUDA
Technical requirements
Matrix multiplication optimization
Implementation of the tiling approach
Performance analysis of the tiling approach
Convolution
Convolution operation in CUDA
Optimization strategy
Filtering coefficients optimization using constant memory
Tiling input data using shared memory
Getting more performance
Prefix sum (scan)
Blelloch scan implementation
Building a global size scan
The pursuit of better performance
Other applications for the parallel prefix-sum operation
Compact and split
Implementing compact
Implementing split
N-body
Implementing an N-body simulation on GPU
Overview of an N-body simulation implementation
Histogram calculation
Compile and execution steps
Understanding a parallel histogram
Calculating a histogram with CUDA atomic functions
Quicksort in CUDA using dynamic parallelism
Quicksort and CUDA dynamic parallelism
Quicksort with CUDA
Dynamic parallelism guidelines and constraints
Radix sort
Two approaches
Approach 1 – warp-level primitives
Approach 2 – Thrust-based radix sort
Summary
Programming with Libraries and Other Languages
Linear algebra operation using cuBLAS
cuBLAS SGEMM operation
Multi-GPU operation
Mixed-precision operation using cuBLAS
GEMM with mixed precision
GEMM with TensorCore
cuRAND for parallel random number generation
cuRAND host API
cuRAND device API
cuRAND with mixed precision cuBLAS GEMM
cuFFT for Fast Fourier Transformation in GPU
Basic usage of cuFFT
cuFFT with mixed precision
cuFFT for multi-GPU
NPP for image and signal processing with GPU
Image processing with NPP
Signal processing with NPP
Applications of NPP
Writing GPU accelerated code in OpenCV
CUDA-enabled OpenCV installation
Implementing a CUDA-enabled blur filter
Enabling multi-stream processing
Writing Python code that works with CUDA
Numba – a high-performance Python compiler
Installing Numba
Using Numba with the @vectorize decorator
Using Numba with the @cuda.jit decorator
CuPy – GPU accelerated Python matrix library
Installing CuPy
Basic usage of CuPy
Implementing custom kernel functions
PyCUDA – Pythonic access to CUDA API
Installing PyCUDA
Matrix multiplication using PyCUDA
NVBLAS for zero coding acceleration in Octave and R
Configuration
Accelerating Octave's computation
Accelerating R's computation
CUDA acceleration in MATLAB
Summary
GPU Programming Using OpenACC
Technical requirements
Image merging on a GPU using OpenACC
OpenACC directives
Parallel and loop directives
Data directive
Applying the parallel, loop, and data directive to merge image code
Asynchronous programming in OpenACC
Structured data directive
Unstructured data directive
Asynchronous programming in OpenACC
Applying the unstructured data and async directives to merge image code
Additional important directives and clauses
Gang/vector/worker
Managed memory
Kernel directive
Collapse clause
Tile clause
CUDA interoperability
DevicePtr clause
Routine directive
Summary
Deep Learning Acceleration with CUDA
Technical requirements
Fully connected layer acceleration with cuBLAS
Neural network operations
Design of a neural network layer
Tensor and parameter containers
Implementing a fully connected layer
Implementing forward propagation
Implementing backward propagation
Layer termination
Activation layer with cuDNN
Layer configuration and initialization
Implementing layer operation
Implementing forward propagation
Implementing backward propagation
Softmax and loss functions in cuDNN/CUDA
Implementing the softmax layer
Implementing forward propagation
Implementing backward propagation
Implementing the loss function
MNIST dataloader
Managing and creating a model
Network training with the MNIST dataset
Convolutional neural networks with cuDNN
The convolution layer
Implementing forward propagation
Implementing backward propagation
Pooling layer with cuDNN
Implementing forward propagation
Implementing backward propagation
Network configuration
Mixed precision operations
Recurrent neural network optimization
Using the cuDNN LSTM operation
Implementing a virtual LSTM operation
Comparing the performance between cuDNN and SGEMM LSTM
Profiling deep learning frameworks
Profiling the PyTorch model
Profiling a TensorFlow model
Summary
Appendix
Useful nvidia-smi commands
Getting the GPU's information
Getting formatted information
Power management mode settings
Setting the GPU's clock speed
GPU device monitoring
Monitoring GPU utilization along with multiple processes
Getting GPU topology information
WDDM/TCC mode in Windows
Setting TCC/WDDM mode
Performance modeling
The Roofline model
Analyzing the Jacobi method
Exploring container-based development
NGC configuration for a host machine
Basic usage of the NGC container
Creating and saving a new container from the NGC container
Setting the default runtime as NVIDIA Docker
Another Book You May Enjoy
Leave a review - let other readers know what you think