Evolution of the NVIDIA GPU Architecture
Jason Lowden
Advanced Computer Architecture
November 7, 2012

Agenda
- Introduction of the NVIDIA GPU
- Graphics Pipeline
- GPU Terminology
- Architecture of a GPU
  - Computing Elements
  - Memory Types
- Fermi Architecture
- Kepler Architecture
- GPUs as a Computational Device
- CUDA Programming
- Performance Comparison
- Relation to SMT, Vector Processors, and DSPs
- Summary

NVIDIA GPU History
- First GPU released in 1999
  - Used for graphics processing (GeForce and Quadro)
- CUDA architecture released in 2006
  - Designed for use by industry and academia as a computing device
  - A move toward commodity parallel processing
- Tesla GPU series released in 2007
- Fermi architecture released in 2009
- Kepler architecture released in 2012

Graphics Pipeline

Terminology
- Thread – the smallest grain of the hierarchy of device computation
- Block – a group of threads
- Grid – a group of blocks
- Warp – a group of 32 threads that execute simultaneously on the device
- Kernel – the function whose launch creates a grid for GPU execution
(A small sketch of this hierarchy follows.)
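As a minimal sketch of the hierarchy (the kernel name whoAmI and the launch sizes are illustrative assumptions, not from the slides), each thread can derive a unique global index from its block and thread coordinates:

#include <cstdio>

// Each thread reports where it sits in the grid/block/thread hierarchy.
__global__ void whoAmI( void )
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf( "block %d, thread %d, global %d\n",
            blockIdx.x, threadIdx.x, globalId );
}

int main( void )
{
    // Launch a grid of 4 blocks, each holding 64 threads (two warps of 32).
    whoAmI<<<4, 64>>>();
    cudaDeviceSynchronize();   // wait for the asynchronous kernel to finish
    return 0;
}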

Architecture of a GPU
- Same components as a typical CPU; however, a GPU has
  - more computing elements
  - more types of memory
- Original GPUs had vertex and pixel shaders, built specifically for graphics
- Modern GPUs are slightly different
  - CUDA – Compute Unified Device Architecture

Computational Elements of a GPU
- Streaming Processor (SP) – the core of the design; the place where all of the computation takes place
- Streaming Multiprocessor (SM) – a group of streaming processors
  - In addition to the SPs, each SM also contains Special Function Units, Load/Store Units, instruction schedulers, and complex control logic

Streaming Multiprocessor Architecture

Types of GPU Memory
- Global – DRAM; slowest performance
- Texture – cached global memory; "bound" at runtime
- Constant – cached global memory
- Shared – local to a block of threads
(The sketch below shows how these map to CUDA qualifiers.)
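As a sketch of how three of these memory types appear in CUDA source (texture binding is omitted; the names smooth, coeffs, and the 256-thread block size are illustrative assumptions):

#define SIZE 256

__constant__ float coeffs[SIZE];        // constant: cached, read-only global memory

__global__ void smooth( const float* in, float* out )
{
    __shared__ float tile[SIZE];        // shared: local to one block of threads

    int id = threadIdx.x;
    tile[id] = in[id];                  // 'in' resides in global DRAM
    __syncthreads();                    // make the tile visible to the whole block

    out[id] = tile[id] * coeffs[id];    // combine shared and constant reads
}

The kernel assumes a launch of blocks of SIZE threads, e.g. smooth<<<1, SIZE>>>( dIn, dOut ).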

Architectural Memory Hierarchy

Fermi Architecture

Fermi Improvements
- Increased number of SPs per SM
- Unified request path for load/store instructions
- Implementation of a cache hierarchy
  - An L1 cache per SM, configurable against shared memory (a host-side sketch follows this list)
  - An L2 cache shared globally
- Register spilling
  - Occurs when the register requirements of a thread exceed what is available on the device
  - Previous generation: spill to DRAM (global memory)
  - Fermi: spill to the L1 cache
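As a sketch of selecting the configurable L1/shared split from the host (the kernel myKernel is a hypothetical stand-in; cudaFuncSetCacheConfig is the runtime call that expresses this preference):

#include <cuda_runtime.h>

__global__ void myKernel( float* data ) { /* hypothetical kernel body */ }

int main( void )
{
    // Favor a 48 KB L1 / 16 KB shared split for this kernel...
    cudaFuncSetCacheConfig( myKernel, cudaFuncCachePreferL1 );

    // ...or 16 KB L1 / 48 KB shared when the kernel leans on shared memory:
    // cudaFuncSetCacheConfig( myKernel, cudaFuncCachePreferShared );
    return 0;
}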

Summary

Kepler SM Overview
- Goal: improve GPU performance and power efficiency
  - Up to 3 times the performance per watt of Fermi
- Increased to 192 SPs per SM
- 32 Special Function Units
- Improved warp scheduling

Kepler SM Design

Warp Scheduler
- 4 warp schedulers per SM
- Each scheduler can issue up to 2 independent instructions from a warp when that warp is ready to issue

Kepler Memory Architecture
- Shared memory and L1 are still physically shared
  - New configuration available: 32 KB L1 / 32 KB shared (sketched after this list)
  - Shared memory bandwidth is doubled compared with Fermi
- Increased L2 size: double that of Fermi, at 1536 KB
- Introduction of a read-only data cache
  - In Fermi, this storage served only as the texture cache
  - 48 KB of storage
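As a sketch of using both Kepler additions (the kernel scale and its pointers are hypothetical; __ldg() requires compute capability 3.5, so compile with something like nvcc -arch=sm_35):

#include <cuda_runtime.h>

__global__ void scale( const float* __restrict__ in, float* out, float k )
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // __ldg() loads through the 48 KB read-only data cache; marking the
    // pointer const __restrict__ often lets the compiler route loads there too.
    out[id] = k * __ldg( &in[id] );
}

int main( void )
{
    // Request the new even split: 32 KB L1 / 32 KB shared memory.
    cudaFuncSetCacheConfig( scale, cudaFuncCachePreferEqual );
    /* ... allocate, copy, and launch scale<<<blocks, threads>>>( ... ) ... */
    return 0;
}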

Warp Shuffle Instructions
- In Fermi, data could only be exchanged between threads through shared memory
  - This resulted in additional synchronization time
- Kepler adds the shuffle functions (sketched below), which
  - exchange data between threads without using shared memory
  - handle the store-and-load operation as a single step
- Data can only be shared within the same warp
- In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction
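As a sketch of a warp-level sum that needs no shared memory (the function name warpSum is illustrative; this uses the Kepler-era __shfl_down intrinsic, which newer CUDA toolkits spell __shfl_down_sync with a lane mask):

__device__ float warpSum( float val )
{
    // Tree reduction across the 32 lanes of a warp: each step pulls a value
    // from the lane 'offset' positions higher, halving the span every time.
    for( int offset = 16; offset > 0; offset /= 2 )
        val += __shfl_down( val, offset );
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}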

Kepler Hardware Features
- Dynamic parallelism
  - Any kernel can launch more kernels from within itself (sketched after this list)
  - Takes additional load off of the CPU
- Hyper-Q
  - 32 hardware-managed work queues; Fermi had 1 queue
- Grid Management Unit (GMU)
  - Needed to manage the number of grids that are executed
  - Introduced to handle all of the grids that can be active at one time
- NVIDIA GPUDirect™
  - CUDA-enabled GPUs can interact without the need for CPU intervention
  - The GPU can interact directly with the NIC
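As a sketch of dynamic parallelism (both kernel names and the child's work are hypothetical; device-side launches require compute capability 3.5 and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true file.cu -lcudadevrt):

__global__ void childKernel( float* data, int n )
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if( id < n ) data[id] *= 2.0f;   // placeholder child work
}

__global__ void parentKernel( float* data, int n )
{
    if( blockIdx.x == 0 && threadIdx.x == 0 )
    {
        // One parent thread sizes and launches the child grid on the device,
        // with no round trip to the CPU.
        int blocks = ( n + 255 ) / 256;
        childKernel<<<blocks, 256>>>( data, n );
    }
}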

Comparison of Kepler and Fermi

Use for Computation
- Historically, GPUs were used for graphics to offload work from the CPU
- Current trend: combine the CPU and GPU on a single chip
- Because the work is massively parallel, GPUs are ideal thanks to their large number of processing cores
  - However, they are only ideal when there are few data dependencies
- Introduction of CUDA and the Tesla GPUs

CUDA Programming
- Extensions to the C language, with some C++ support
- Programming support
  - Windows: Visual Studio
  - Linux/Mac: Eclipse
- Programming paradigm in which each computation takes place on a separate thread
- Requires an NVIDIA GPU for acceleration
  - Simulators are used for research purposes

Example – Vector Addition

C:

for( int i = 0; i < SIZE; i++ ) {
    c[ i ] = a[ i ] + b[ i ];
}

CUDA:

__global__ void addVectors( float* a, float* b, float* c ) {
    int id = threadIdx.x;
    if( id < SIZE ) {
        c[ id ] = a[ id ] + b[ id ];
    }
}

Programming Requirements
- Explicit memory operations to allocate and copy data from the CPU to the GPU
  - Some exceptions do apply
- All kernels execute asynchronously of the CPU
- Explicit synchronization barriers between the processors
(A host-side sketch of these steps follows.)
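As a sketch of the host-side steps for the vector-addition kernel above (the wrapper runAddVectors is hypothetical; SIZE must match the value compiled into the kernel):

#include <cuda_runtime.h>
#define SIZE 1024

__global__ void addVectors( float* a, float* b, float* c );   // from the example above

void runAddVectors( float* a, float* b, float* c )   // host arrays
{
    float *dA, *dB, *dC;
    size_t bytes = SIZE * sizeof( float );

    cudaMalloc( &dA, bytes );                              // explicit device allocation
    cudaMalloc( &dB, bytes );
    cudaMalloc( &dC, bytes );
    cudaMemcpy( dA, a, bytes, cudaMemcpyHostToDevice );    // explicit copies in
    cudaMemcpy( dB, b, bytes, cudaMemcpyHostToDevice );

    addVectors<<<1, SIZE>>>( dA, dB, dC );                 // asynchronous kernel launch
    cudaDeviceSynchronize();                               // explicit barrier

    cudaMemcpy( c, dC, bytes, cudaMemcpyDeviceToHost );    // copy the result back
    cudaFree( dA );  cudaFree( dB );  cudaFree( dC );
}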

Synchronization and Performance
- Synchronization primitives to meet data dependencies
  - __syncthreads() – synchronizes all threads in a block
  - Atomic operations – depending on the compute capability and CUDA version, these are possible on global and shared memory
- Performance is dictated by memory operations and synchronization cost
  - Memory coalescence
  - Warp divergence
(A sketch combining both primitives follows.)
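As a sketch that combines both primitives (the kernel blockSum and its arguments are illustrative assumptions):

__global__ void blockSum( const int* values, int n, int* total )
{
    __shared__ int blockTotal;                 // one shared accumulator per block

    if( threadIdx.x == 0 ) blockTotal = 0;
    __syncthreads();                           // every thread sees the cleared value

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if( id < n )
        atomicAdd( &blockTotal, values[id] );  // atomic update in shared memory

    __syncthreads();                           // wait for all contributions
    if( threadIdx.x == 0 )
        atomicAdd( total, blockTotal );        // one atomic per block on global memory
}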

Performance Comparison

Relation to Other Architectures
- SMT
  - Many smaller cores, with less functionality, compute the results
  - Each core has a hardware context for a thread that can be switched out
- Vector processors
  - Compute in parallel results that a CPU could produce sequentially
  - Able to access large chunks of data from memory at a given time
  - Banks of shared memory, which can lead to bank conflicts
- Digital signal processors
  - As with DSP algorithms, many applications can use multiply-accumulate (MAC) elements; these are built into the GPU by design

Conclusions
- GPUs are massively parallel devices that can be used for general-purpose computing in addition to graphics processing
- As their cost continues to decrease, these devices become off-the-shelf components that can be used to build larger systems
- Beyond its compute capabilities, Kepler offers additional performance per watt, making for a more power-efficient design
- When used with other technologies, such as OpenCL, GPUs can serve in heterogeneous platforms

References
NVIDIA. Corporate timeline. [Online]. Available: http://www.nvidia.com/page/corporate_timeline.html
"Graphics pipeline," PCMag Encyclopedia. [Online]. Available: http://www.pcmag.com/encyclopedia_term/0,2542,t=graphics+pipeline&i=43933,00.asp
S. L. Alarcon, "CUDA Memories," unpublished.
NVIDIA. (2012, April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: ne/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
NVIDIA. (2009). NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi. [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
NVIDIA. (2012). NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110. [Online]. Available: Kepler-GK110-Architecture-Whitepaper.pdf
NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: http://www.geforce.com/Active/en_US/en f

