GPUs To The Left GPUs To The Right GPUs All Day GPUs All Night


GPUs to the left, GPUs to the right, GPUs all day, GPUs all night. Chris Rossbach, cs378h.

Outline for Today. Questions? Administrivia: impending (minor) schedule changes, FPGA readings, moved FPGA lab due date, Barnes-Hut status change, exam next week. Agenda: CUDA, CUDA performance, GPU parallel algorithms redux redux. Acknowledgements: ...opertrainingmaterials/presentations/cuda language/Introduction to CUDA C.pptx, http://www.seas.upenn.edu/ cis565/LECTURES/CUDA%20Tricks.pptx, http://www.cs.utexas.edu/

Schedule Stuff: midterm quiz questions posted soon.

Faux Quiz Questions: How is occupancy defined (in CUDA nomenclature)? What's the difference between a block scheduler (e.g. Giga-Thread Engine) and a warp scheduler? Modern CUDA supports UVM to eliminate the need for cudaMalloc and cudaMemcpy*. Under what conditions might you want to use or not use it, and why? What is control flow divergence? How does it impact performance? What is a bank conflict? What is work efficiency? What is the difference between a thread block scheduler and a warp scheduler? How are atomics implemented in modern GPU hardware? How is shared memory implemented by modern GPU hardware? Why is shared memory necessary if GPUs have an L1 cache? When will an L1 cache provide all the benefit of shared memory and when will it not? Is cudaDeviceSynchronize still necessary after copyback if I have just one CUDA stream?

Review: Blocks and Threads

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

Most kernels use both blockIdx.x and threadIdx.x to index an array with one element per thread (here, 8 threads/block): with M threads/block, the unique index per thread is int index = threadIdx.x + blockIdx.x * M. Update the kernel launch accordingly: add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N). Why have threads at all? Why not just blocks, or just threads? Unlike parallel blocks, threads within a block can communicate and synchronize. What if my array size N is not a multiple of M? That is why the launch rounds the grid size up and the kernel guards with if (index < n). [Figure: threadIdx.x runs 0 through 7 within each of blockIdx.x = 0 through 3.]

How many threads/blocks should I use?

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
        // Launch add() kernel on GPU
        add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Usually things are correct if grid dimensions * block dimensions >= input size; getting good performance is another matter.

Internals

    __host__ void vecAdd() {
        dim3 DimGrid(ceil(n / 256.0), 1, 1);
        dim3 DimBlock(256, 1, 1);
        addKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

    __global__ void addKernel(float *A_d, float *B_d, float *C_d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

The kernel's thread blocks (Blk 0 ... Blk N-1) are scheduled onto the GPU's multiprocessors (M0 ... Mk), which share device RAM. How are threads scheduled?

Kernel Launch. Commands from the host are issued through CUDA streams. Kernels in the same stream execute sequentially; kernels in different streams may execute concurrently. Streams are mapped to GPU HW queues by the kernel management unit (KMU); mapping multiple streams to the same queue serializes some kernels. Kernel launch distributes thread blocks to SMs. [Figure: host processor issuing streams to the Kernel Management Unit (device), HW queues, and kernel dispatch to SMs.]
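To make the stream behavior concrete, here is a minimal sketch (not from the slides) of issuing independent kernels into two streams; scaleKernel and launchInTwoStreams are made-up names, and whether the two launches actually overlap depends on available resources and on how the streams map to HW queues.

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *x, int n) {            // stand-in kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void launchInTwoStreams(float *d_a, float *d_b, int n) {  // d_a, d_b: device buffers
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        int threads = 256, blocks = (n + threads - 1) / threads;
        // Same stream: in-order. Different streams: may run concurrently,
        // subject to how the KMU maps streams onto hardware queues.
        scaleKernel<<<blocks, threads, 0, s0>>>(d_a, n);
        scaleKernel<<<blocks, threads, 0, s1>>>(d_b, n);
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }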

SIMD vs. SIMT. [Figure: Flynn taxonomy (instruction streams x data streams): SISD (single scalar thread, one register file); SIMD (synchronous operation, e.g. SSE/AVX); MISD; MIMD (multiple threads, e.g. pthreads). SIMT (e.g. PTX, HSA) runs loosely synchronized threads, each with its own register file.]

GPU Performance Metric: Occupancy. Occupancy = (#active warps) / (#maximum active warps). It measures how well concurrency/parallelism is utilized. Occupancy captures which resources can be dynamically shared and how to reason about the resource demands of a CUDA kernel, and it enables device-specific online tuning of kernel parameters. Shouldn't we just create as many threads as possible?

A Taco Bar. Where is the parallelism here?

GPU: a multi-lane Taco Bar. Where is the parallelism here?

GPU: a multi-lane Taco Bar ("1 taco, please"). Where is the parallelism here? There's none! This only works if you can keep every lane full at every step. Throughput performance goal: increase occupancy!



Hardware Resources Are Finite. [Figure: a kernel distributor and SM scheduler dispatch thread blocks to SMs over DRAM; each SM (streaming multiprocessor) contains warp contexts, SPs (stream processors), a register file, and L1/shared memory. Thread block control limits the #thread blocks; the register file and warp contexts limit the #threads; L1/shared memory limits the #thread blocks.] Occupancy = (#active warps) / (#maximum active warps). Limits on the numerator: registers/thread, shared memory/thread block, number of scheduling slots (blocks, warps). Limits on the denominator: memory bandwidth, scheduler slots. What is the performance impact of varying kernel resource demands?

Impact of Thread Block Size. Example, V100: max active warps/SM = 64 (limit: warp context); max active blocks/SM = 32 (limit: block control). With 512 threads/block, how many blocks can execute concurrently per SM? Max active warps * threads/warp = 64 * 32 = 2048 threads, so 4 blocks. With 128 threads/block? 16 blocks. Now consider the HW limit of 32 thread blocks/SM at 32 threads/block: blocks are maxed out, but max active threads = 32 * 32 = 1024, so occupancy = 0.5 (1024/2048). To maximize utilization, thread block size should balance the limit on active thread blocks against the limit on active warps.
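The same occupancy arithmetic can be asked of the runtime. A sketch (not from the slides) using the CUDA occupancy API on a trivial addKernel; the reported numbers depend on the device and on the kernel's register and shared-memory demands.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void addKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int sizes[] = {128, 256, 512, 1024};
        for (int s = 0; s < 4; ++s) {
            int blockSize = sizes[s];
            int blocksPerSM = 0;
            // How many blocks of this size can be resident per SM, given the
            // kernel's register and shared-memory usage?
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, addKernel, blockSize, 0);
            float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
            printf("block size %4d: %2d blocks/SM, occupancy %.2f\n",
                   blockSize, blocksPerSM, occupancy);
        }
        return 0;
    }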

Impact of #Registers Per Thread. Registers/thread can limit the number of active threads! V100: max 255 registers per thread, 64K registers per SM. Assume a kernel uses 32 registers/thread with a thread block size of 256: a thread block requires 8192 registers, for a maximum of 8 thread blocks per SM. That uses all 2048 thread slots (8 blocks * 256 threads/block), and 8192 regs/block * 8 blocks/SM = 64K registers: fully occupied! What is the impact of increasing the number of registers by 2? Recall that the granularity of management is a thread block: 34 regs/thread * 256 threads/block * 7 blocks/SM is about 60K registers, and 8 blocks would over-subscribe the register file. That is a loss of concurrency of 256 threads; occupancy drops to 0.875.
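Register usage is partly under programmer control. A sketch (not from the slides) showing the __launch_bounds__ qualifier, which asks the compiler to keep register pressure compatible with a target residency, and how to query a kernel's compiled resource usage; boundedKernel is a made-up example, and nvcc's -maxrregcount=N flag is a coarser per-file alternative.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Allow up to 256 threads/block and aim for at least 8 resident blocks/SM;
    // the compiler limits register usage (possibly spilling) to make that fit.
    __global__ void __launch_bounds__(256, 8) boundedKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        // Inspect what the compiler actually used.
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, boundedKernel);
        printf("registers/thread: %d, static shared mem/block: %zu bytes\n",
               attr.numRegs, attr.sharedSizeBytes);
        return 0;
    }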

Impact of Shared Memory. Shared memory is allocated per thread block, so it can limit the number of thread blocks executing concurrently per SM: shared mem/block * #blocks must fit within the total shared memory per SM. The gridDim and blockDim parameters impact the demand for shared memory, the number of thread slots, and the number of thread block slots.
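For concreteness, a sketch (not from the slides) of the two ways a kernel acquires shared memory; staticSmem and dynamicSmem are made-up names. Static allocations are fixed at compile time, while dynamic allocations are sized by the third launch-configuration parameter, so the same kernel can trade shared-memory demand against occupancy at launch time.

    #include <cuda_runtime.h>

    // Static shared memory: size fixed at compile time and charged to every
    // resident block of this kernel.
    __global__ void staticSmem(const float *in, float *out) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = tile[blockDim.x - 1 - threadIdx.x];   // e.g. reverse within the block
    }

    // Dynamic shared memory: size chosen per launch via the third <<<>>> argument.
    __global__ void dynamicSmem(const float *in, float *out) {
        extern __shared__ float tile[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // Launch sketch: dynamicSmem<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out);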

Balance. [Figure: the tradeoff space spanned by #threads/block, shared memory/thread block, #thread blocks, and #registers/thread.] Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization; the right point is device-specific. Goal: increase occupancy until one resource or the other is saturated.

Parallel Memory Accesses. Coalesced main memory access (16/32x faster): the HW combines multiple memory accesses from a warp into a single coalesced access. Bank-conflict-free shared memory access (16/32x): no alignment or contiguity requirements. CC 1.3: 16 different banks per half warp, or all threads accessing the same word. CC 2.x/3.0: 32 different banks, with 1-word broadcast each. [Source: CUDA Optimization Tutorial]
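A sketch (not from the slides) contrasting an access pattern the hardware can coalesce with one it cannot; the kernel names are made up, and exact transaction counts depend on compute capability.

    #include <cuda_runtime.h>

    // Coalesced: consecutive threads of a warp read consecutive words, so the
    // warp's 32 loads collapse into a few wide memory transactions.
    __global__ void coalescedCopy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads read words `stride` apart, so the warp's
    // loads scatter across memory segments and split into many transactions.
    __global__ void stridedCopy(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }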

Parallel Memory Architecture. In a parallel machine, many threads access memory, so memory is divided into banks; this is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to the same bank result in a bank conflict, and conflicting accesses are serialized. [Figure: banks 0 through 15.]

Coalesced Main Memory Accesses. [Figures from NVIDIA: a single coalesced access; one and two coalesced accesses.]

Bank Addressing Examples: no bank conflicts with linear addressing, stride 1 (thread i maps to bank i), and no bank conflicts with a random 1:1 permutation. [Figure: threads 0 through 15 mapped to banks 0 through 15 in both patterns.]

Bank Addressing Examples: 2-way bank conflicts with linear addressing, stride 2; 8-way bank conflicts with linear addressing, stride 8. [Figure: thread-to-bank mappings for the two strides.]

Linear Addressing. Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks; there are 16 banks on G80, so s must be odd. [Figure: thread-to-bank mappings for s = 1 and s = 3.]
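When a kernel must walk shared memory with a bad stride, the usual workaround is to pad the array. A sketch (not from the slides) of the classic padded tile transpose; it assumes a square matrix whose width is a multiple of TILE, and the bank counts refer to 32-bank hardware rather than the 16-bank G80 above.

    #include <cuda_runtime.h>

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out, int width) {
        // Padding the row length to TILE + 1 shifts each row into a different
        // bank, so the strided reads below hit 32 banks instead of one.
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced load
        __syncthreads();
        int tx = blockIdx.y * TILE + threadIdx.x;               // transposed block origin
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // coalesced store
    }

    // Launch sketch:
    // transposeTile<<<dim3(width/TILE, width/TILE), dim3(TILE, TILE)>>>(d_in, d_out, width);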

Layered abstractions. [Figure: applications sit on a user/programmer-visible interface; the OS provides process, file, and pipe abstractions over vendor drivers; the HW layer includes CPU, I/O devices, and NIC.] There is a 1:1 correspondence between OS-level and user-level abstractions, and diverse HW support is enabled by a HAL.

GPUs: proprietary interfaces, and only one OS-level abstraction! The user/programmer-visible interface is provided by the GPU runtime. There is no kernel-facing API, OS resource management is limited, and composability is poor.

No OS support means no isolation. [Figure: GPU benchmark throughput (higher is better) with no CPU load vs. high CPU load; image convolution in CUDA on Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230.] CPU and GPU schedulers are not integrated, and other pathologies are abundant.

Composition: Gestural Interface. Raw camera images flow through a pipeline to produce "hand" events: capture camera images, xform (geometric transformation of the noisy point cloud), filter (noise filtering), and detect gestures. This requires OS mediation, has high data rates, and has abundant data parallelism, so use GPUs!

What We'd Like To Do:

    # capture | xform | filter | detect &

with capture and detect on the CPU and xform and filter on the GPU. Modular design gives flexibility and reuse. Utilize heterogeneous hardware: data-parallel components on the GPU, sequential components on the CPU, using OS-provided tools: processes and pipes.

GPU Execution Model. GPUs cannot run an OS: they have a different ISA, and their memories have different coherence guarantees (disjoint, or requiring fence instructions). The host CPU must "manage" GPU execution: program inputs are explicitly transferred/bound at runtime, device buffers are pre-allocated, and user-mode apps must implement this themselves. [Figure: copy inputs from main memory to GPU memory, send commands from CPU to GPU, copy outputs back.]

Data migration. [Figure: in the capture | xform | filter | detect pipeline, each write()/read() between stages goes through the OS executive, forcing a copy from the GPU after one stage and a copy back to the GPU before the next stage can run.]

Device-centric APIs considered harmful.

    Matrix gemm(Matrix A, Matrix B) {
        copyToGPU(A);
        copyToGPU(B);
        invokeGPU();
        Matrix C = new Matrix();
        copyFromGPU(C);
        return C;
    }

What happens if I want the following? Matrix D = A x B x C

Composed matrix multiplication:

    Matrix gemm(Matrix A, Matrix B) {
        copyToGPU(A);
        copyToGPU(B);
        invokeGPU();
        Matrix C = new Matrix();
        copyFromGPU(C);
        return C;
    }

    Matrix AxBxC(Matrix A, B, C) {
        Matrix AxB = gemm(A, B);
        Matrix AxBxC = gemm(AxB, C);
        return AxBxC;
    }

AxB is copied from GPU memory, only to be copied right back!

What if I have many GPUs?

    Matrix gemm(Matrix A, Matrix B) {
        copyToGPU(A);
        copyToGPU(B);
        invokeGPU();
        Matrix C = new Matrix();
        copyFromGPU(C);
        return C;
    }

What if I have many GPUs? Add a device parameter:

    Matrix gemm(GPU dev, Matrix A, Matrix B) {
        copyToGPU(dev, A);
        copyToGPU(dev, B);
        invokeGPU(dev);
        Matrix C = new Matrix();
        copyFromGPU(dev, C);
        return C;
    }

What happens if I want the following? Matrix D = A x B x C

Composition with many GPUs:

    Matrix gemm(GPU dev, Matrix A, Matrix B) {
        copyToGPU(A);
        copyToGPU(B);
        invokeGPU();
        Matrix C = new Matrix();
        copyFromGPU(C);
        return C;
    }

    Matrix AxBxC(Matrix A, B, C) {
        Matrix AxB = gemm(?, A, B);
        Matrix AxBxC = gemm(?, AxB, C);
        return AxBxC;
    }

Composition with many GPUs. Rats, now I can only use one GPU (gemm as before):

    Matrix AxBxC(GPU dev, Matrix A, B, C) {
        Matrix AxB = gemm(dev, A, B);
        Matrix AxBxC = gemm(dev, AxB, C);
        return AxBxC;
    }

Composition with many GPUs. This will never be manageable for many GPUs: the programmer ends up implementing scheduling using static device assignments.

    Matrix AxBxC(GPU devA, GPU devB, Matrix A, B, C) {
        Matrix AxB = gemm(devA, A, B);
        Matrix AxBxC = gemm(devB, AxB, C);
        return AxBxC;
    }

Why don't we have this problem with CPUs?

Dataflow: a better abstraction. [Figure: Matrix A and Matrix B feed a gemm node producing Matrix C, which feeds another gemm node.] Nodes are computation, edges are communication. This expresses parallelism explicitly, requires only a minimal specification of data movement (the runtime does it), makes asynchrony a runtime concern rather than a programmer concern, and requires no specification of the compute-device mapping: like threads!
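To see what such a runtime buys you, here is a sketch (not from the slides, and not the slides' gemm API) of composing two multiplies by hand while keeping the intermediate in GPU memory; gemmNaive and chainMultiplyDevice are made-up names, and square n x n matrices are assumed. A dataflow runtime would make this data-movement decision automatically from the graph.

    #include <cuda_runtime.h>

    __global__ void gemmNaive(const float *A, const float *B, float *C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
            C[row * n + col] = acc;
        }
    }

    // D = A * B * C entirely on the device: the intermediate never visits the host.
    void chainMultiplyDevice(const float *dA, const float *dB, const float *dC,
                             float *dD, int n) {
        float *dAB;
        cudaMalloc(&dAB, sizeof(float) * n * n);
        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        gemmNaive<<<grid, block>>>(dA, dB, dAB, n);   // A*B stays in GPU memory
        gemmNaive<<<grid, block>>>(dAB, dC, dD, n);   // consumed in place
        cudaDeviceSynchronize();
        cudaFree(dAB);
    }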

Advanced topics: Prefix-Sum. in: 3 1 7 0 4 1 6 3; out: 0 3 4 11 11 14 16 22.

Trivial Sequential Implementation:

    void scan(int* in, int* out, int n) {
        out[0] = 0;
        for (int i = 1; i < n; i++)
            out[i] = in[i-1] + out[i-1];
    }

Parallel Scan:

    for d = 1 to log2(n):
        for all k in parallel:
            if k >= 2^(d-1):
                x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
            else:
                x[out][k] = x[in][k]

Complexity: O(n log2 n).

A work-efficient parallel scan. Goal: a parallel scan that is O(n) instead of O(n log2 n). Solution: balanced trees. Build a binary tree and sweep it to and from the root. A binary tree with n leaves has d = log2 n levels, and level d has 2^d nodes. One add is performed per node, so a single traversal of the tree does O(n) adds.

O(n) unsegmented scan.

Reduce / up-sweep:

    for d = 0 to log2(n) - 1:
        for all k = 0 to n - 1 by 2^(d+1) in parallel:
            x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-sweep:

    x[n - 1] = 0
    for d = log2(n) - 1 down to 0:
        for all k = 0 to n - 1 by 2^(d+1) in parallel:
            t = x[k + 2^d - 1]
            x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
            x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
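The up-sweep/down-sweep maps naturally onto a single thread block using shared memory. A sketch closely following the well-known GPU Gems formulation; it assumes n is a power of two small enough for one block (n/2 threads) and omits the bank-conflict padding a production version would add.

    #include <cuda_runtime.h>

    __global__ void blellochScan(const int *in, int *out, int n) {
        extern __shared__ int temp[];
        int tid = threadIdx.x;
        temp[2 * tid]     = in[2 * tid];
        temp[2 * tid + 1] = in[2 * tid + 1];

        // Up-sweep (reduce): build partial sums in place.
        int offset = 1;
        for (int d = n >> 1; d > 0; d >>= 1) {
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset <<= 1;
        }

        // Down-sweep: clear the root, then push partial sums back down.
        if (tid == 0) temp[n - 1] = 0;
        for (int d = 1; d < n; d <<= 1) {
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                int t = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        __syncthreads();
        out[2 * tid]     = temp[2 * tid];
        out[2 * tid + 1] = temp[2 * tid + 1];
    }

    // Launch sketch: blellochScan<<<1, n / 2, n * sizeof(int)>>>(d_in, d_out, n);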

Tree analogy. [Figure: the array contents after each up-sweep step (building partial sums such as (x0+x1), (x0+...+x3), (x4+x5), (x0+...+x7)) and each down-sweep step, ending with the exclusive prefix sums 0, x0, (x0+x1), ..., (x0+...+x6).]

O(n) Segmented Scan: Up-Sweep

Down-Sweep

Features of segmented scan: 3 times slower than unsegmented scan, but useful for building a broad variety of applications which are not possible with unsegmented scan.

Primitives built on scan. Enumerate: enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]; an exclusive scan of the input vector. Distribute (copy): distribute([a b c][d e]) = [a a a][d d]; an inclusive scan of the input vector. Split and split-and-segment: split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
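As a concrete instance of "enumerate is an exclusive scan of the flags", a sketch using Thrust (a library not mentioned on the slide) on the slide's example input.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    int main() {
        // Flags for [t f f t f t t] from the slide.
        int h_flags[] = {1, 0, 0, 1, 0, 1, 1};
        thrust::device_vector<int> flags(h_flags, h_flags + 7);
        thrust::device_vector<int> indices(7);
        // Exclusive scan: each true element gets its index among the true elements.
        thrust::exclusive_scan(flags.begin(), flags.end(), indices.begin());
        // indices now holds [0 1 1 1 2 2 3], matching enumerate([t f f t f t t]).
        return 0;
    }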

Applications: quicksort, sparse matrix-vector multiply, tridiagonal matrix solvers and fluid simulation, radix sort, stream compaction, summed-area tables.

Quicksort

Sparse Matrix-Vector Multiplication

Stream Compaction. Definition: extracts the "interesting" elements from an array of elements and places them contiguously in a new array. Uses: collision detection, sparse matrix compression. [Figure: an example input array and its compacted output.]

Stream compaction, step by step. [Figure: input elements 0 through 4; we want to preserve the gray elements.] Set a '1' in each gray input, scan the flags, then scatter the gray inputs to the output using the scan result as the scatter address.
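A sketch (not from the slides) of those three steps as CUDA kernels, with an exclusive scan of the flags in between (any scan will do, e.g. the work-efficient kernel above or a library call); the predicate here, keeping even values, is just an example.

    #include <cuda_runtime.h>

    // Step 1: set a 1 for every element we want to keep.
    __global__ void flagKernel(const int *in, int *flags, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) flags[i] = (in[i] % 2 == 0) ? 1 : 0;     // example predicate
    }

    // Step 3: the exclusive scan of the flags gives each kept element its slot.
    __global__ void scatterKernel(const int *in, const int *flags,
                                  const int *indices, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i]) out[indices[i]] = in[i];
    }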

Radix Sort Using Scan. For each bit, from least significant to most significant: let b be the current sort-key bit; e inserts a 1 for all false sort keys (b = 0); f is the exclusive scan of the 1s; Total Falses = e[n-1] + f[n-1]; t = index - f + Total Falses; d = b ? t : f; scatter the input using d as the scatter address. [Figure: a worked 8-element example.]
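A host-side reference (a sketch, not from the slides) for one split pass of the scan-based radix sort, spelling out the e, f, t, and d quantities; a GPU version would replace the sequential scan with the parallel scan above and run one pass per bit.

    #include <vector>

    void splitByBit(const std::vector<unsigned> &in, std::vector<unsigned> &out, int bit) {
        int n = (int)in.size();
        std::vector<int> e(n), f(n);
        for (int i = 0; i < n; ++i)
            e[i] = ((in[i] >> bit) & 1) ? 0 : 1;        // e: 1 for each false (bit == 0) key
        f[0] = 0;                                       // f: exclusive scan of e
        for (int i = 1; i < n; ++i)
            f[i] = f[i - 1] + e[i - 1];
        int totalFalses = e[n - 1] + f[n - 1];
        out.resize(n);
        for (int i = 0; i < n; ++i) {
            int b = (in[i] >> bit) & 1;
            int t = i - f[i] + totalFalses;             // destination if the key is true
            int d = b ? t : f[i];                       // final scatter address
            out[d] = in[i];
        }
    }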

Specialized Libraries. CUDPP: CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix sum ("scan"), parallel sort, and parallel reduction.

CUDPP (DLL):

    CUDPPResult cudppSparseMatrixVectorMultiply(CUDPPHandle sparseMatrixHandle,
                                                void *d_y,
                                                const void *d_x)

Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.

Example CUDPP scan configuration and invocation:

    CUDPPScanConfig config;
    config.direction = CUDPP_SCAN_FORWARD;
    config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
    config.op = CUDPP_ADD;
    config.datatype = CUDPP_FLOAT;
    config.maxNumElements = numElements;
    config.maxNumRows = 1;
    config.rowPitch = 0;
    cudppInitializeScan(&config);
    cudppScan(d_odata, d_idata, numElements, &config);

CUFFT: below 8192 elements it is slower than FFTW; at 8192 elements and above it gives a 5x speedup over threaded FFTW and a 10x speedup over serial FFTW.

CUBLAS: CUDA Basic Linear Algebra Subroutines. SAXPY, conjugate gradient, linear solvers; used, for example, in 3D reconstruction of planetary nebulae. echReport.pdf


Le genou de Lucy. Odile Jacob. 1999. Coppens Y. Pré-textes. L’homme préhistorique en morceaux. Eds Odile Jacob. 2011. Costentin J., Delaveau P. Café, thé, chocolat, les bons effets sur le cerveau et pour le corps. Editions Odile Jacob. 2010. 3 Crawford M., Marsh D. The driving force : food in human evolution and the future.