CUDA C Best Practices Guide - Nvidia


CUDA C Best Practices Guide
Design Guide
DG-05603-001 v11.1 | September 2020

Table of Contents

Preface
    What Is This Document?
    Who Should Read This Guide?
    Assess, Parallelize, Optimize, Deploy
        Assess
        Parallelize
        Optimize
        Deploy
    Recommendations and Best Practices

Chapter 1. Assessing Your Application
Chapter 2. Heterogeneous Computing
    2.1. Differences between Host and Device
    2.2. What Runs on a CUDA-Enabled Device?
Chapter 3. Application Profiling
    3.1. Profile
        3.1.1. Creating the Profile
        3.1.2. Identifying Hotspots
        3.1.3. Understanding Scaling
            3.1.3.1. Strong Scaling and Amdahl's Law
            3.1.3.2. Weak Scaling and Gustafson's Law
            3.1.3.3. Applying Strong and Weak Scaling
Chapter 4. Parallelizing Your Application
Chapter 5. Getting Started
    5.1. Parallel Libraries
    5.2. Parallelizing Compilers
    5.3. Coding to Expose Parallelism
Chapter 6. Getting the Right Answer
    6.1. Verification
        6.1.1. Reference Comparison
        6.1.2. Unit Testing
    6.2. Debugging
    6.3. Numerical Accuracy and Precision
        6.3.1. Single vs. Double Precision
        6.3.2. Floating Point Math Is not Associative
        6.3.3. IEEE 754 Compliance
        6.3.4. x86 80-bit Computations
Chapter 7. Optimizing CUDA Applications
Chapter 8. Performance Metrics
    8.1. Timing
        8.1.1. Using CPU Timers
        8.1.2. Using CUDA GPU Timers
    8.2. Bandwidth
        8.2.1. Theoretical Bandwidth Calculation
        8.2.2. Effective Bandwidth Calculation
        8.2.3. Throughput Reported by Visual Profiler
Chapter 9. Memory Optimizations
    9.1. Data Transfer Between Host and Device
        9.1.1. Pinned Memory
        9.1.2. Asynchronous and Overlapping Transfers with Computation
        9.1.3. Zero Copy
        9.1.4. Unified Virtual Addressing
    9.2. Device Memory Spaces
        9.2.1. Coalesced Access to Global Memory
            9.2.1.1. A Simple Access Pattern
            9.2.1.2. A Sequential but Misaligned Access Pattern
            9.2.1.3. Effects of Misaligned Accesses
            9.2.1.4. Strided Accesses
        9.2.2. L2 Cache
            9.2.2.1. L2 Cache Access Window
            9.2.2.2. Tuning the Access Window Hit-Ratio
        9.2.3. Shared Memory
            9.2.3.1. Shared Memory and Memory Banks
            9.2.3.2. Shared Memory in Matrix Multiplication (C = AB)
            9.2.3.3. Shared Memory in Matrix Multiplication (C = AAᵀ)
            9.2.3.4. Asynchronous Copy from Global Memory to Shared Memory
        9.2.4. Local Memory
        9.2.5. Texture Memory
            9.2.5.1. Additional Texture Capabilities
        9.2.6. Constant Memory
        9.2.7. Registers
            9.2.7.1. Register Pressure
    9.3. Allocation
    9.4. NUMA Best Practices
Chapter 10. Execution Configuration Optimizations
    10.1. Occupancy
        10.1.1. Calculating Occupancy
    10.2. Hiding Register Dependencies
    10.3. Thread and Block Heuristics
    10.4. Effects of Shared Memory
    10.5. Concurrent Kernel Execution
    10.6. Multiple contexts
Chapter 11. Instruction Optimization
    11.1. Arithmetic Instructions
        11.1.1. Division Modulo Operations
        11.1.2. Loop Counters Signed vs. Unsigned
        11.1.3. Reciprocal Square Root
        11.1.4. Other Arithmetic Instructions
        11.1.5. Exponentiation With Small Fractional Arguments
        11.1.6. Math Libraries
        11.1.7. Precision-related Compiler Flags
    11.2. Memory Instructions
Chapter 12. Control Flow
    12.1. Branching and Divergence
    12.2. Branch Predication
Chapter 13. Deploying CUDA Applications
Chapter 14. Understanding the Programming Environment
    14.1. CUDA Compute Capability
    14.2. Additional Hardware Data
    14.3. Which Compute Capability Target
    14.4. CUDA Runtime
Chapter 15. CUDA Compatibility and Upgrades
    15.1. CUDA Runtime and Driver API Version
    15.2. Standard Upgrade Path
    15.3. Flexible Upgrade Path
    15.4. CUDA Compatibility Platform Package
    15.5. Extended nvidia-smi
Chapter 16. Preparing for Deployment
    16.1. Testing for CUDA Availability
    16.2. Error Handling
    16.3. Building for Maximum Compatibility
    16.4. Distributing the CUDA Runtime and Libraries
        16.4.1. CUDA Toolkit Library Redistribution
            16.4.1.1. Which Files to Redistribute
            16.4.1.2. Where to Install Redistributed CUDA Libraries
Chapter 17. Deployment Infrastructure Tools
    17.1. Nvidia-SMI
        17.1.1. Queryable state
        17.1.2. Modifiable state
    17.2. NVML
    17.3. Cluster Management Tools
    17.4. Compiler JIT Cache Management Tools
    17.5. CUDA_VISIBLE_DEVICES
Appendix A. Recommendations and Best Practices
    A.1. Overall Performance Optimization Strategies
Appendix B. nvcc Compiler Switches
    B.1. nvcc

List of Figures

Figure 1. Timeline comparison for copy and kernel execution
Figure 2. Memory spaces on a CUDA device
Figure 3. Coalesced access
Figure 4. Misaligned sequential addresses that fall within five 32-byte segments
Figure 5. Performance of offsetCopy kernel
Figure 6. Adjacent threads accessing memory with a stride of 2
Figure 7. Performance of strideCopy kernel
Figure 8. Mapping persistent data accesses to set-aside L2 in the sliding-window experiment
Figure 9. The performance of the sliding-window benchmark with fixed hit-ratio of 1.0
Figure 10. The performance of the sliding-window benchmark with tuned hit-ratio
Figure 11. Block-column matrix multiplied by block-row matrix
Figure 12. Computing a row of a tile
Figure 13. Comparing synchronous vs. asynchronous copy from global memory to shared memory
Figure 14. Comparing performance of synchronous vs. asynchronous copy from global memory to shared memory
Figure 15. Using the CUDA Occupancy Calculator to project GPU multiprocessor occupancy
Figure 16. Sample CUDA configuration data reported by deviceQuery
Figure 17. Compatibility of CUDA Versions
Figure 18. Standard Upgrade Path
Figure 19. Flexible Upgrade Path
Figure 20. CUDA Compatibility Platform Package

List of Tables

Table 1. Salient Features of Device Memory
Table 2. Performance Improvements Optimizing C = AB Matrix Multiply
Table 3. Performance Improvements Optimizing C = AAᵀ Matrix Multiplication
Table 4. Useful Features for tex1D(), tex2D(), and tex3D() Fetches
Table 5. Formulae for exponentiation by small fractions

Preface

What Is This Document?

This Best Practices Guide is a manual to help developers obtain the best performance from NVIDIA CUDA GPUs.

