GPU Tutorial 1: Introduction To GPU Computing

Summary

This tutorial introduces the concept of GPU computation. CUDA is employed as a framework for this, but the principles map to any vendor's hardware. We provide an overview of GPU computation, its origins and development, before presenting both the CUDA hardware and software APIs.

New Concepts

GPU Computation, CUDA Hardware, CUDA Software

Introduction

In this pair of tutorials, we shall discuss in some depth the nature of GPU computation. This is not to be confused with rendering, which you have covered in the graphics module, but rather the exploitation of the GPU's vast floating-point throughput as a means of speeding up certain elements of our software.

This is an area of growing interest in video game development. Two of the three current generation consoles selected AMD SoC solutions with unified memory architectures, allowing the GPU and CPU to readily communicate and update data, in order to leverage the computational power of the GPU portion of the chip. Indeed, the fact that both of these hardware solutions featured octa-core set-ups backed up with multi-compute-unit graphical solutions strongly suggests that multi- and many-core computation will be a significant area of games-related research for years to come.

Our first tutorial shall discuss the origins of GPU computation, before introducing GPU architecture, focusing upon the specific hardware you shall be programming. Once we have introduced the nature of the hardware model, we shall discuss its strengths, limitations, and the philosophies which underpin the deployment of a problem to the GPU. Lastly, in this session, we shall discuss the variable types specific to CUDA programming, and how they map to the hardware, before moving on to implement some simple CUDA functions to test our understanding.

GPU Computation Overview

The concept of number-crunching on the GPU is almost as old as the GPU itself. Early solutions revolved around the idea of manipulating pixel data through shader language, as a means of performing simple floating-point calculations in a dummy graphics shell. Essentially, where in rendering we perform per-pixel operations in the context of colour space, early GPU computation used those colour components to conceal the numerical data which needed processing. In some cases, just to make this work, researchers had to force the GPU to render something (generally two triangles) to get the results out the other end.

Around 2004, researchers began taking this idea very seriously. A lot of problems, particularly simulation problems, have significant amounts of physical data to consider; in computing terms, physical points are often handled as three-element vectors. It is not difficult to see how this mapped conveniently to the colour variables in rendering. Similarly, many of the problems research focused upon were relatively straightforward mathematics and, where scale was a problem rather than complexity, it was believed that the GPU offered a cost-effective improvement to performance.

Windows Vista changed the playing field with DirectX 10's unified shader model. Prior to this, shader cores had very specific tasks and were largely incapable of performing any other task (different instruction sets for different shader types). With this move towards unified shaders came an industry sea-change in favour of more generally capable shaders all round: if the instruction sets needed to be generalised to cover vertex, pixel and geometry shader needs, why not generalise them as far as possible beyond that?

With the advent of CUDA, and later FireStream, researchers gained access to easily programmable APIs (relative to performing GPU computation using shader language) and ever-more-capable hardware. The issue then became one of identifying problems that the GPU could solve well, and deploying those solutions; similarly, avoiding deploying problems to the GPU which did not lend themselves to its strengths.

Now, there are several well-established APIs for GPU computation. We list a sample of these below, and categorise their more important features:

Table 1: GPU Computation APIs

Name                   Ease of Programming   Cross-Platform?   Performance (Guide)
CUDA                   High                  No (Hardware)     High
OpenCL                 Medium                Yes               Medium-High
DirectCompute          Medium                No (Software)     Low
C++ AMP                Highest               No (Software)*    Lowest
GLSL Compute Shaders   Lowest                Yes               Highest

* C++ AMP has received ongoing investigation from Intel (see: Shevlin Park) which suggested it could be made far quicker than current benchmarks suggest (and OS-agnostic) with compiler optimisations that redirect from DirectCompute to OpenCL/GLSL. If that work were ever made public, C++ AMP might have a claim to being both the most accessible and a genuinely cross-platform API, but three years on that seems unlikely.

In these tutorials we focus on CUDA, as it is the most straightforward API through which to implement GPU computation without completely abstracting the GPU hardware (C++ AMP is easier to write in, but does not require us to think about the machine we're deploying our code on; OpenCL is less accessible to the novice GPU programmer, though you're invited to explore that API in your own time). The principles discussed in this lecture series, however, map to all contemporary GPU computation APIs, as the issues faced in deploying code to the GPU do not change with vendor.

CUDA

Hardware

In this tutorial we outline the Kepler CUDA hardware architecture, which maps to the GTX 780Ti graphics processors present in most of the MSc machines. Some of you are using GTX 970 cards, which have a Maxwell-architecture chip in them - the principles do not change in the context of the tutorial, only cache ratios, and so on. Figure 1 illustrates an abstract overview of the Kepler architecture.

Figure 1: The Kepler Architecture

You can see from Figure 1 (credit: NVIDIA, Kepler Whitepaper) that the GPU is subdivided into several units (referred to in NVIDIA literature as streaming multiprocessors, or SMX). These units share the L2 cache and, through that, access to the VRAM (analogous to system memory when programming for the GPU). Figure 2 (credit: NVIDIA, Kepler Whitepaper) illustrates the layout of the SMX itself.

An SMX features 192 single-precision cores and 64 double-precision cores, along with 32 special function units (SFUs: units optimised for common mathematical functions). You will note also the memory architecture: 48KB of Read-Only Data Cache, and 64KB of memory labelled "Shared Memory/L1 Cache". This 64KB is a pool of memory that you can, through the CUDA API, control to favour one or the other (L1 Cache, or Shared Memory): 16KB L1 and 48KB Shared; 16KB Shared and 48KB L1; or 32KB of each. Shared Memory is a store for variables that can be accessed and updated by any core in the SMX, at any time. The L1 Cache pool is a shared cache pool which is used by every core in an SMX.

The instruction cache for a single SMX is used by all cores in that SMX (meaning that all cores will execute the same set of instructions). The Warp Scheduler handles the initiation of cores to execute their 'instance' of the instruction (the kernel instance, discussed later). If instruction sets branch significantly (if-then conditions which make their completion time varied), the warp scheduler will not be able to leverage maximum efficiency from the cores in the SMX.
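As a concrete illustration of that configurable Shared Memory/L1 split, the sketch below uses the CUDA runtime calls cudaDeviceSetCacheConfig and cudaFuncSetCacheConfig to request a preferred split; the kernel name is purely illustrative, and the driver treats the request as a hint.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, present only so the per-function variant has a target.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    // Ask for the 48KB Shared / 16KB L1 split for subsequent kernel launches.
    // The other options are cudaFuncCachePreferL1, cudaFuncCachePreferEqual
    // (32KB of each) and cudaFuncCachePreferNone.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    // The preference can also be expressed per kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    printf("Cache configuration hints set.\n");
    return 0;
}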

Figure 2: The Kepler Architecture

It should be obvious at this point that the architecture of the GPU is a very different beast to that of the CPU. The CPUs in your desktop have as much L1 cache per core as is allocated by default to all 192 single-precision cores in the SMX combined. They also enjoy a more versatile instruction cache, optimised for resolving cache misses more rapidly (not something the GPU can claim, regrettably). This makes sense, however, when we consider exactly what the GPU is intended to do: it executes shaders, which are themselves very simple functions (in terms of instructions if not theory), across all cores simultaneously. Its memory architecture is optimised towards that purpose. And if we are going to leverage this hardware to perform computationally intensive tasks for us, we need to keep that firmly in mind.

Software

CUDA is NVIDIA's hardware and software architecture; when we refer to CUDA in these tutorials, we are normally referring to the software API. In that context, CUDA is a C-styled language that permits the deployment of programs on the GPU. CUDA's syntax is relatively straightforward (and documented in the CUDA API).

You can integrate your CUDA functions with your existing C projects through the use of external functions (the extern compiler instruction). This enables you to add CUDA functionality to your codebase, rather than rewriting your codebase into a VS2012 CUDA project.

The CUDA programming model is built on the idea of grid execution: within the grid are a number of blocks; within a block are a number of threads. A thread is a single instance of a kernel. It accepts a set of variables, and performs a set of instructions using those variables. A thread has an ID within its thread block and grid; this is used to determine the thread's unique ID, which normally maps to the data element it is accessing. I.e., thread ID 103 accesses the 103rd element of the arrays that have been sent to the GPU.
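To make that index mapping concrete, here is a minimal kernel sketch in which each thread derives its unique ID from its block and thread indices and uses it to address one array element. The kernel name and parameters are illustrative; the indexing expression itself is the standard CUDA idiom.

// Minimal sketch of the thread-indexing idea described above.
// scaleArray is an illustrative name, not part of the CUDA API.
__global__ void scaleArray(float* data, int n, float factor) {
    // Unique global thread ID within the grid.
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads in the final block that fall past the array end.
    if (id < n) {
        data[id] *= factor;   // Thread 103 updates element 103, and so on.
    }
}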

A block is a set of concurrently executing threads. These threads cooperate with each other through barrier synchronisation and shared memory. A block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel. The grid reads inputs from global memory, writes results back out to global memory, and synchronises between multiple, dependent kernel calls.

You can consider initiating a kernel function as generating a grid, whose size is determined by the number of elements you have instructed the GPU to process. A constant in CUDA is stored in constant memory accessible by all threads. Arrays cannot be stored in constant memory. Shared memory is accessible to all threads in a block; arrays can be stored there. Similarly, read-only memory is accessible to all threads in a block.

Figure 3: Memory Hierarchy - Grids, Blocks, Threads

Figure 3 (credit: NVIDIA) summarises this graphically. It also helps illustrate the hierarchy of threads, blocks and grids. In this figure, multiple grids will be executed; communication between them, as indicated, can only occur via global memory.

Program Flow

A CUDA program requires the declaration of memory on the GPU (video memory). The size of this memory chunk is determined by the function you intend to execute, and is declared through cudaMalloc at the beginning of your program loop. As in C programming, that memory must be freed (using cudaFree) when your program loop ends.

When you call an externalised CUDA function, you will pass in array references to the variables you wish to be processed by the kernel. This data will be copied to the GPU memory using cudaMemcpy (of kind cudaMemcpyHostToDevice), before the kernel is executed. The kernel will execute on this data. On completion of the kernel's execution, you will copy (cudaMemcpyDeviceToHost) the results back to system memory.

This emphasises the role of the GPU as a batch-based number-cruncher. You send it a chunk of data from system memory, perform a parallelisable operation on that data, and it kicks updated information back to system memory (or feeds it forward into some other, GPU-related process, such as rendering).
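A minimal sketch of that host-side flow, reusing the scaleArray kernel from the earlier sketch, might look as follows. The cudaMalloc, cudaMemcpy and cudaFree calls are those named above; the wrapper function name and launch configuration are illustrative, and error checking is omitted for brevity.

#include <cuda_runtime.h>

// Kernel from the earlier sketch (illustrative name).
__global__ void scaleArray(float* data, int n, float factor) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) data[id] *= factor;
}

// Externalised entry point, callable from an existing C/C++ codebase.
extern "C" void scaleOnGPU(float* hostData, int n, float factor) {
    float* deviceData = nullptr;
    size_t bytes = n * sizeof(float);

    // Declare the memory we need on the GPU.
    cudaMalloc((void**)&deviceData, bytes);

    // Copy the input data from system memory to video memory.
    cudaMemcpy(deviceData, hostData, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, in blocks of 256 threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocks, threadsPerBlock>>>(deviceData, n, factor);

    // Copy the results back to system memory once the kernel completes.
    cudaMemcpy(hostData, deviceData, bytes, cudaMemcpyDeviceToHost);

    // Free the GPU memory when we are done with it.
    cudaFree(deviceData);
}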

Paradigms

When we consider a problem for deployment to the GPU, there are four factors we need to keep in mind:

- Memory footprint per instruction set execution. Our GPU has limited cache resources shared between a large number of cores. It is far more vulnerable to cache misses than any CPU architecture. If our problem has a large memory footprint per execution (such as heuristic path planning), we might need to restructure it to best fit the GPU programming model, or not deploy it to the GPU. Of course, a large memory footprint overall poses no issue - so long as the memory footprint per execution is small.

- Parallelisation. The GPU excels at solving embarrassingly parallel problems (problems with no communication between threads, and no required execution order). If we add communication between threads, our program will slow down. We should be mindful of this when selecting algorithms for deployment on the GPU.

- Host-Device Communication. The GPU can only act on variables we pass to it. If variables are stored in system memory, they cannot form part of the kernel's instructions; instead, they must be duplicated to the GPU.

- Overhead. Every CUDA call requires memcpy operations; these are costly, and can slow down our program significantly. Also, if we use our GPU for rendering as well as computation, we should try to avoid deploying draw instructions at the same time as we execute CUDA kernels. This triggers context switching, which can be a costly process in terms of frame-rate, as each context switch can cost around 10 microseconds.

In the context of game engineering, this fourth issue is of key importance - because, in a game, our GPU is meant to be rendering an attractive scene. If we're shunting work to it that distracts from that task, it must be for some meaningful reason - not simply because we want to use GPU computation. Normally the sorts of problems you would outsource to the GPU are those where an overall quality improvement makes the loss of GPU cycles acceptable - or a situation where a CPU solution creates such a bottleneck that outsourcing the task to the GPU actually increases frame-rate.

Implementation

Explore the sample software in the CUDA SDK, to understand the demarcation between tasks performed by the Host, tasks instigated by the Host but performed on the Device, and tasks instigated by the Device and performed by the Device.
