Testing GPU Memory Models


Testing GPU Memory Models

Daniel Poetzl

Joint work with Jade Alglave (UCL), Mark Batty (Cambridge), Alastair Donaldson (Imperial), Ganesh Gopalakrishnan (Utah), Tyler Sorensen (Utah), John Wickerson (Imperial)

Outline

1. Introduction
2. GPU Architectures
3. Weak Memory 101
4. Testing GPU Memory Models
5. Results

Introduction

Graphics Processing Units (GPUs)

- GPUs have traditionally been designed to accelerate graphics applications:
  - 3D games
  - Video processing
- General-purpose computing on GPUs (GPGPU) is becoming increasingly widespread:
  - Regular applications: weather forecasting, brute-force password cracking
  - Irregular applications: graph traversal
- Numerous papers are published each year that aim to accelerate traditional algorithms using GPUs

Graphics Processing Units (GPUs)

GPUs have found their way into many types of computer systems:

- Desktops and laptops
- Game consoles
- Mobile devices: iPhone 5S, Samsung Galaxy S
- Cars (video processing, infotainment systems, safety-critical uses!): the Audi self-driving car, the Tesla Motors Model S
- Supercomputers

GPUs in Supercomputers

- Green500 list of the most energy-efficient supercomputers
- All top ten places are occupied by GPU-accelerated systems (among them Wilkes, HA-PACS, and Piz Daint)
- For comparison:
  - Tianhe-2 (Guangzhou, 1st in the Top500): 1.9 GFLOPS/W
  - Stampede (Austin, 7th in the Top500): 1.1 GFLOPS/W

GPU Research

- The number of publications per year on GPU programming has grown continuously over the last years
- [Figure: number of publications per year on GPU programming, from 2006 onwards]

GPU Architectures

GPU Architectures

Nvidia's Maxwell (2014):

[Diagram: a Streaming Multiprocessor (SM) contains four processing blocks, each with a warp scheduler and processing elements PE1 ... PEn; pairs of processing blocks share an L1 cache, and the SM has a shared memory. The SMs (x5) sit above a common L2 cache and DRAM.]

Programming Model

CUDA

- CUDA is Nvidia's framework for general-purpose computing on GPUs
- Threads are hierarchically organized: a kernel consists of blocks, each block consists of warps, and each warp consists of 32 threads (T0 ... T31, T32 ... T63, T64 ... T95, T96 ... T127, ...)
- Different memory spaces: global, shared, local, constant, texture, parameter
- (A sketch of how this hierarchy is visible inside a kernel follows below.)
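As an illustration (not from the slides), a minimal CUDA kernel can recover its place in this hierarchy from the built-in index variables; the warp size of 32 is the value used on Nvidia GPUs:

    // Minimal sketch: where a thread sits in the kernel/block/warp hierarchy.
    __global__ void whoami(int *out) {
        int tid_in_block = threadIdx.x;                      // index within the block
        int global_tid   = blockIdx.x * blockDim.x + tid_in_block;
        int warp_id      = tid_in_block / 32;                // warp within the block
        int lane_id      = tid_in_block % 32;                // position within the warp
        out[global_tid]  = warp_id * 32 + lane_id;           // record one value per thread
    }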

Vector Addition

CPU implementation

- Summing two vectors of size N in C:

    void add(int *a, int *b, int *c) {
      for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
      }
    }

- Runtime: O(N)

Vector Addition

CUDA C implementation

- Summing two vectors of size N in CUDA C:

    __global__ void add(int *a, int *b, int *c) {
      int tid = blockIdx.x;
      if (tid < N)
        c[tid] = a[tid] + b[tid];
    }

- If the number of processing elements is greater than or equal to N, the runtime is O(1)
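For completeness, here is a hedged sketch of the host-side code that would launch this kernel with one block per element, matching the use of blockIdx.x above; the value of N and the lack of error handling are assumptions made for brevity:

    #include <cuda_runtime.h>
    #define N 1024

    int main(void) {
        int a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        int *dev_a, *dev_b, *dev_c;
        cudaMalloc(&dev_a, N * sizeof(int));
        cudaMalloc(&dev_b, N * sizeof(int));
        cudaMalloc(&dev_c, N * sizeof(int));
        cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        add<<<N, 1>>>(dev_a, dev_b, dev_c);    // N blocks of one thread each

        cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
        return 0;
    }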

CPUs vs. GPUs

- Key dimensions along which CPUs and GPUs differ:
  - Number of cores
  - Core complexity
  - Caches
  - Memory bandwidth
  - Cost of context switches
  - Explicit concurrency hierarchy
  - Different memory spaces

Weak Memory 101

Interleaved Execution

- A simple model of concurrency is Lamport's sequential consistency (SC), i.e. interleaved execution:

    // Init
    data = flag = 0

    // Producer
    data = 0x7f
    flag = 1

    // Consumer
    while (flag == 0) {}
    assert(data != 0)

- Example interleaving:

    data = 0x7f          (Producer)
    flag == 0 ?          (Consumer, loop again)
    flag = 1             (Producer)
    flag == 0 ?          (Consumer, exit loop)
    assert(data != 0)    (Consumer)

- The assertion is satisfied on all interleavings.
- However, multi- and manycore processors exhibit weak memory consistency, due to:
  - Out-of-order execution
  - Speculative execution
  - Caching
- The assertion can fail on those systems!
- The synchronization algorithms (Dekker, Peterson, ...) we've been taught in school do not work on multicore systems

Weak Memory Consistency

Caching

[Diagram sequence: Core 0 and Core 1, each with its own cache, connected to main memory, which initially holds data = 0 and flag = 0. The producer runs on Core 0 and the consumer on Core 1.]

- Core 0 executes data = 0x7f; the write lands in Core 0's cache (memory still holds data = 0, flag = 0).
- Core 0 executes flag = 1; this write also lands in Core 0's cache.
- The cache coherency protocol commits flag before data to main memory (memory now holds data = 0, flag = 1).
- Core 1 reads flag = 1 into its cache and exits the while loop.
- Core 1 then reads data = 0 from memory: the assertion fails.

Memory Barriers

- CPUs/GPUs provide memory barrier instructions to enforce ordering constraints on memory accesses
- Expensive: 100s of clock cycles
- Different types of barriers
- Fix for the example (on Nvidia GPUs; assuming the producer and consumer are in different blocks):

    // Producer
    data = 0x7f
    asm("membar.gl")
    flag = 1

    // Consumer
    while (flag == 0) {}
    asm("membar.gl")
    assert(data != 0)

- (A CUDA C sketch of this fix follows below.)
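A minimal CUDA C sketch of the same fix, not from the slides: __threadfence() is the CUDA intrinsic that compiles to a membar.gl fence, and the launch with one producer block and one consumer block is an assumption for illustration:

    __device__ volatile int data = 0;
    __device__ volatile int flag = 0;

    // Launch as e.g. mp_test<<<2, 1>>>(d_result): block 0 produces, block 1 consumes.
    __global__ void mp_test(int *result) {
        if (blockIdx.x == 0) {          // producer
            data = 0x7f;
            __threadfence();            // membar.gl: order data before flag
            flag = 1;
        } else {                        // consumer
            while (flag == 0) {}        // spin until the flag is set
            __threadfence();            // membar.gl: order the flag read before the data read
            *result = data;             // with the fences, this observes 0x7f
        }
    }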

Axiomatic Memory Models

- Executions are represented not as interleavings, but as execution graphs:

    [Execution graph for the producer/consumer example:
     data = 0x7f --po--> flag = 1 --rf--> read flag: 1 --po--> read data: 0 --fr--> data = 0x7f]

- An execution graph is acyclic if and only if it corresponds to an interleaving
- Axiomatic memory models give a set of formal rules defining which executions are possible on a certain architecture
- Full details: Herding Cats. Alglave et al. TOPLAS '14

Testing GPU Memory Models

Weak Memory Models

Which behaviors can be observed when threads concurrently access shared memory?

- As we've seen, we cannot expect sequential consistency (interleaved execution) on GPUs
- But what exactly can we expect?
  - Consult the manual: prose, ambiguous, little detail, sometimes plain wrong
  - We want a formal memory model!
- The formal memory model is based on:
  - Vendor documentation
  - Testing
  - Discussion with industry contacts

Test Framework

- We extended the diy and litmus tools to generate and run GPU litmus tests
- diy to generate tests:
  - Short assembly code snippets called litmus tests
  - Test generation based on an axiomatic modeling framework
- litmus to run tests:
  - Runs tests produced by diy many times
  - Adds additional code to create noise ("incantations") to make weak behaviors appear
- Pipeline: diy → litmus tests → litmus → CUDA code / OpenCL code

Test Generation

diy

- Executions are represented as directed graphs
- Non-SC executions have cycles:

    Rx1 --po--> Wy1 --rf--> Ry1 --po--> Wx1 --rf--> Rx1

- Which non-SC executions are possible on a certain chip?
- Key idea of diy:
  - Enumerate non-SC executions (i.e. cyclic execution graphs)
  - From each such graph, generate a test such that one of its executions is the execution from which it was generated

Test Generation

Example

- Execution graph:

    Rx1 --po--> Wy1 --rf--> Ry1 --po--> Wx1 --rf--> Rx1

- The slides build the generated litmus test one instruction per step; the final test is:

    P0          P1
    r1 = x      r2 = y
    y = 1       x = 1

    Final condition: r1 == 1 && r2 == 1
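To connect the generated test to actual GPU code, here is a hedged sketch of how this load-buffering test could look as a CUDA kernel with the two test threads in different blocks; in the real framework, litmus emits such harnesses (plus incantations) automatically, so the names below are illustrative:

    __global__ void lb_test(volatile int *x, volatile int *y, int *r1, int *r2) {
        if (blockIdx.x == 0) {     // P0
            *r1 = *x;              // r1 = x
            *y  = 1;               // y  = 1
        } else {                   // P1
            *r2 = *y;              // r2 = y
            *x  = 1;               // x  = 1
        }
        // The host then checks whether the weak outcome r1 == 1 && r2 == 1 occurred.
    }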

Running Tests

litmus

- Now that we can generate tests, we want to run them on the hardware!
- To make the weak behaviors appear, we need "incantations":
  - Put variables on different cache lines
  - Noise-maker threads that write random memory locations
  - Random launch parameters
  - Trigger bank conflicts
  - ...

Running Tests

Bank Conflicts

- Memory is divided into banks
- Banks are interleaved, not contiguous:

    Address   Bank
    0x0       0
    0x1       1
    0x2       2
    0x3       3
    0x4       0
    0x5       1
    0x6       2
    0x7       3

- Accesses to the same bank are serialized
- No bank conflict: threads 0-3 access addresses 0x0-0x3, which lie in four different banks
- Bank conflict: two threads access different addresses that map to the same bank, so their accesses are serialized
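As a small illustration (not from the slides) of the mapping in the table above, the bank of a word address is simply its index modulo the number of banks; the four-bank figure matches the slides, while Fermi and Kepler shared memory actually has 32 banks of 4-byte words:

    // Bank mapping sketch: word addresses are distributed cyclically over the banks.
    __host__ __device__ static inline int bank_of(unsigned word_addr, int num_banks) {
        return word_addr % num_banks;
    }
    // With 4 banks, bank_of(0x3, 4) == bank_of(0x7, 4) == 3: simultaneous accesses
    // to 0x3 and 0x7 by two threads would be serialized.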

Running Tests

Bank Conflicts

- Accesses to the same bank are serialized
- Accesses can be delayed due to a bank conflict:

    P0          P1
    x = 1       y = 1
    r3 = y      r4 = x

    Final condition: r3 == 1 && r4 == 0

- If x = 1 has a bank conflict, it may be delayed, leading to y = 1 being executed first
- The order in which accesses to the same bank are serialized is unspecified

Test Results

Read-Read Coherence (CoRR)

- Consider the following test, with P0 and P1 in different blocks, and initially x = 0:

    P0          P1
    x = 1       r1 = x
                r2 = x

    Final condition: r1 == 1 && r2 == 0

- Running this test 100,000 times with litmus on the GeForce GTX 660 yields the following histogram:

    Test CoRR Allowed
    Histogram (4 states)
    59875 : 1:r0=0; 1:r2=0;
      828 : 1:r0=1; 1:r2=0;
     2422 : 1:r0=0; 1:r2=1;
    36875 : 1:r0=1; 1:r2=1;

- The behavior is considered a bug:
  - It does not guarantee what is typically required by programming language standards (OpenCL, C++11)
  - OpenCL and C++11 require that there is a total order on all writes to a memory location (the coherence order)
  - No thread shall read values that contradict this order
- The bug occurred in all Nvidia chips of the Fermi and Kepler generations we tested
- Fixed in the new Maxwell architecture
- GPUs are fairly deterministic (compared to CPUs)
- By fixing the random test parameters, we can make the bug show up deterministically (on Fermi and Kepler GPUs):

    Test CoRR Allowed
    Histogram (1 state)
    100000 : 1:r0=1; 1:r2=0;
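For reference, a hedged CUDA C sketch of what the CoRR test looks like at the kernel level, with P0 and P1 in different blocks as on the slide; in the actual experiments the harness (including incantations) was generated by litmus:

    __global__ void corr_test(volatile int *x, int *r1, int *r2) {
        if (blockIdx.x == 0) {     // P0
            *x = 1;
        } else {                   // P1
            *r1 = *x;              // first read of x
            *r2 = *x;              // second read of x
        }
        // The buggy outcome is r1 == 1 && r2 == 0: the second read returns an
        // older value for x than the first read.
    }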

Compare-and-swap

Mutex idiom

- Consider the following test, with P0 and P1 in different blocks, and initially x = 0 and mutex = 1:

    P0                  P1
    x = 1               b = cas(&mutex, 0, 1)
    membar.gl           r = x
    mutex = 0

    Final condition: b == true && r == 0

- P0: write data to x, then unlock the mutex
- P1: attempt to lock the mutex; read x if successful
- Can P1 read a stale value from x when the CAS succeeds?
  - Yes! (on Fermi and Kepler)
  - CAS does not imply a memory fence on Nvidia GPUs
  - Several papers assume that it does and are thus wrong (among them the textbook CUDA by Example)
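Given that finding, a lock that needs the protected accesses to be ordered with respect to the CAS must add an explicit fence. A minimal sketch (an assumption of how one could write it, not the paper's code), using __threadfence() as the fence:

    // Lock/unlock sketch: 0 = unlocked, 1 = locked, matching the slide's convention.
    __device__ void mutex_lock(int *mutex) {
        while (atomicCAS(mutex, 0, 1) != 0) {}  // spin until we swap 0 -> 1
        __threadfence();   // CAS alone does not order the protected reads that follow
    }

    __device__ void mutex_unlock(int *mutex) {
        __threadfence();   // commit the protected writes before releasing the lock
        atomicExch(mutex, 0);
    }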

Summary

- diy to generate GPU litmus tests
- litmus to run GPU litmus tests
- Testing the hardware is a necessary first step towards building a formal memory model
- Documentation is insufficient: ambiguous, little detail, sometimes wrong
- Side effect: we find bugs in the hardware
- Test results serve as a basis for communication with industry contacts

Thank you!

