COMP 605: Introduction to Parallel Computing, Lecture: CUDA Shared Memory


COMP 605: Introduction to Parallel Computing
Lecture: CUDA Shared Memory
Mary Thomas
Department of Computer Science
Computational Science Research Center (CSRC)
San Diego State University (SDSU)
Posted: 04/25/17; Last Update: 04/25/17

Table of Contents
1. CUDA Shared Memory & Synchronization (K&H Ch5, S&K Ch5)
   - Shared Memory Model example: dot product (S&K Ch5)

CUDA SHMEM & Synchronization (S&K Ch5.3, K&H Ch5)

The CUDA Memory Model
- The kernel is executed by a batch of threads.
- Threads are organized into a grid of thread blocks.
- Each thread has its own registers; no other thread can access them. The kernel uses registers to store private thread data.
- Shared memory: allocated to thread blocks; promotes thread cooperation.
- Global memory: host and threads can read and write.
- Constant and texture memory: threads can only read (the host initializes them).
- Threads in the same block can share memory; this requires synchronization, which is essentially communication.
- Example: dot product:
  (x1, x2, x3, x4) · (y1, y2, y3, y4) = x1 y1 + x2 y2 + x3 y3 + x4 y4
Source: NVIDIA
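For reference, the same operation written as a plain serial loop; this is the baseline that the CUDA kernel later in this lecture parallelizes across threads and blocks (a minimal C sketch, not taken from the slides):

    // serial dot product: the baseline the GPU version parallelizes
    float dot_serial( const float *x, const float *y, int n ) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];   // multiply one pair of components and accumulate
        return sum;
    }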

Programmer View of CUDA Memories
Each thread can:
- Read/write per-thread registers (~1 cycle)
- Read/write per-block shared memory (~5 cycles)
- Read/write per-grid global memory (~500 cycles)
- Read only per-grid constant memory (~5 cycles with caching)
Source: NVIDIA

CUDA Variable Type Qualifiers
- __device__ is optional when used with __shared__ or __constant__.
- Automatic variables without any qualifier reside in a register,
  except per-thread arrays, which reside in global memory.
- All threads have access to global memory.
Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
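A small sketch of how these qualifiers look in source; the declarations and names are illustrative (hypothetical), not from the slides:

    __constant__ float coeff[16];          // per-grid constant memory: read-only in kernels
    __device__   float dev_scratch[256];   // per-grid global (device) memory: read/write

    __global__ void qualifier_demo( float *out ) {   // launch with 128 threads per block
        __shared__ float tile[128];        // per-block shared memory: one copy per block launched
        float t = coeff[threadIdx.x % 16]; // automatic scalar without a qualifier: a register
        float per_thread[8];               // per-thread array: resides in global memory, not registers
        for (int i = 0; i < 8; i++)
            per_thread[i] = t + i;
        tile[threadIdx.x] = per_thread[7];
        dev_scratch[threadIdx.x % 256] = t;
        __syncthreads();                   // make the whole tile visible to the block
        out[blockIdx.x * blockDim.x + threadIdx.x] = tile[0];
    }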

A Common Programming Strategy
- Global memory resides in device memory (DRAM).
- Perform computation on the device by tiling the input data to take advantage of fast shared memory (a skeleton of the pattern is sketched below):
  - Partition the data into subsets that fit into shared memory.
  - Handle each data subset with one thread block:
    - Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.
    - Perform the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element.
    - Copy the results from shared memory back to global memory.
Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
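A minimal skeleton of that pattern for a 1D input processed in block-sized tiles; the kernel, the TILE constant, and the smoothing computation are illustrative assumptions, not from the slides:

    #define TILE 256

    // launch with blockDim.x == TILE; neighboring threads reuse the same tile entries
    __global__ void tile_smooth( const float *in, float *out, int n ) {
        __shared__ float tile[TILE];                 // the per-block subset of the input
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // 1) load the subset from global memory into shared memory
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();                             // the whole tile must be loaded before use

        // 2) compute from shared memory; each tile entry is read by several threads
        int left  = (threadIdx.x == 0)        ? 0        : threadIdx.x - 1;
        int right = (threadIdx.x == TILE - 1) ? TILE - 1 : threadIdx.x + 1;
        float result = ( tile[left] + tile[threadIdx.x] + tile[right] ) / 3.0f;

        // 3) copy the result back to global memory
        if (gid < n) out[gid] = result;
    }

(For simplicity the neighbor indices are clamped to the tile, so values at tile boundaries are approximate.)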

CUDA Shared Memory
- For a __shared__ variable, the compiler creates a copy of the variable for each block launched.
- Low latency: the variable lives on the GPU, not in off-chip DRAM.
- Shared memory is most effective when used on a per-block basis.
- All threads in a block have access to the same shared memory, so synchronization is required to avoid race conditions.

Invocation Example

    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
        __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
        __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
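The rest of the kernel body is not reproduced in the transcription; a sketch of how a tiled matrix-multiply kernel of this shape typically continues, following the tiling strategy above (loop structure and variable names are assumptions, not the slide's exact code):

    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
        __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
        __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float pValue = 0.0f;

        // march the two tiles across one row of M and one column of N
        for (int t = 0; t < Width / TILE_WIDTH; t++) {
            // each thread loads one element of each tile from global memory
            subTileM[threadIdx.y][threadIdx.x] = M[row * Width + t * TILE_WIDTH + threadIdx.x];
            subTileN[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * Width + col];
            __syncthreads();                       // wait until both tiles are loaded

            for (int k = 0; k < TILE_WIDTH; k++)   // partial dot product from shared memory
                pValue += subTileM[threadIdx.y][k] * subTileN[k][threadIdx.x];
            __syncthreads();                       // wait before overwriting the tiles
        }
        P[row * Width + col] = pValue;             // assumes Width is a multiple of TILE_WIDTH
    }

Such a kernel would be launched with dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH) and dim3 dimBlock(TILE_WIDTH, TILE_WIDTH), assuming Width is a multiple of TILE_WIDTH.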

Shared Memory Model example: dot product (S&K Ch5)
Shared Memory Model: dot.cu (S&K Ch5)

Vector Dot Product
- X · Y = |X| |Y| cos(theta)
- Geometric interpretation: the dot product is the length of the projection of vector X onto vector Y.
- Componentwise: X · Y = sum_{i=1..n} A_i B_i = A_1 B_1 + A_2 B_2 + ... + A_n B_n

Shared Memory Model: dot.cu (S&K Ch5)
The dot product is a good example for shared memory and synchronization:
- Each thread multiplies a pair of vector elements (line 10) and repeats for its chunk of work (line 11).
- Each thread stores its local sum into its shared-memory cache entry (line 14).
- Synchronize the threads in this block (line 17).
- Reduction (lines 22-27): in each step, each active thread sums 2 entries.
- Store the block's result into the global array (line 29).

     1  __global__ void dot( float *a, float *b, float *c ) {
     2      // buffer of shared memory - store the running sums
     3      __shared__ float cache[threadsPerBlock];
     4      int tid = threadIdx.x + blockIdx.x * blockDim.x;
     5      int cacheIndex = threadIdx.x;
     6
     7      // each thread computes a running sum of products
     8      float temp = 0;
     9      while (tid < N) {
    10          temp += a[tid] * b[tid];
    11          tid += blockDim.x * gridDim.x;
    12      }
    13      // set the cache values in the shared buffer
    14      cache[cacheIndex] = temp;
    15
    16      // synchronize threads in this BLOCK
    17      __syncthreads();
    18
    19      // for reductions, threadsPerBlock must be a power of 2
    20      // because of the following code
    21      int i = blockDim.x / 2;
    22      while (i != 0) {
    23          if (cacheIndex < i)
    24              cache[cacheIndex] += cache[cacheIndex + i];
    25          __syncthreads();
    26          i /= 2;
    27      }
    28      if (cacheIndex == 0)
    29          c[blockIdx.x] = cache[0];
    30  }

Reduction Operation

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // only need one thread to write to global memory
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];

Source: NVIDIA
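As a concrete trace, suppose blockDim.x = 8 and cache[0..7] hold the eight per-thread sums. With i = 4, threads 0-3 add cache[4..7] into cache[0..3]; with i = 2, threads 0-1 fold the remaining four partial sums into cache[0] and cache[1]; with i = 1, thread 0 adds cache[1] into cache[0], which then holds the total for the block. The __syncthreads() inside the loop guarantees every addition in one step finishes before the next step reads its results.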

Shared Memory Model: dot.cu (S&K Ch5)
- Allocate host memory (lines 11-13).
- Allocate device memory (lines 15-18).
- Populate the host arrays (lines 22-24).
- Copy from host to device global memory (lines 27, 29): only the input arrays a and b are needed.

     1  #include "./common/book.h"
     2  #define imin(a,b) (a<b?a:b)
     3  const int N = 33 * 1024;
     4  const int threadsPerBlock = 256;
     5  const int blocksPerGrid =
     6              imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );
     7
     8  int main( void ) {
     9      float   *a, *b, c, *partial_c;
    10      float   *dev_a, *dev_b, *dev_partial_c;
    11      a = (float*)malloc( N*sizeof(float) );       // allocate memory on the cpu side
    12      b = (float*)malloc( N*sizeof(float) );
    13      partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );
    14      // allocate the memory on the GPU
    15      HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N*sizeof(float) ) );
    16      HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N*sizeof(float) ) );
    17      HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
    18                                blocksPerGrid*sizeof(float) ) );
    19
    20
    21      // fill in the host memory with data
    22      for (int i=0; i<N; i++) {
    23          a[i] = i;
    24          b[i] = i*2;
    25      }
    26      // copy the arrays 'a' and 'b' to the GPU
    27      HANDLE_ERROR( cudaMemcpy( dev_a, a, N*sizeof(float),
    28                                cudaMemcpyHostToDevice ) );
    29      HANDLE_ERROR( cudaMemcpy( dev_b, b, N*sizeof(float),
    30                                cudaMemcpyHostToDevice ) );

Shared Memory Model: dot.cu (S&K Ch5)
- The host launches the kernel (line 1).
- The number of threads and blocks depends on N; blocksPerGrid is the smallest number of blocks of threadsPerBlock threads needed to cover N, capped at 32:
  const int blocksPerGrid = imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );

     1      dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b,
     2                                              dev_partial_c );
     3
     4      // copy the array 'c' back from the GPU to the CPU
     5      HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
     6                                blocksPerGrid*sizeof(float),
     7                                cudaMemcpyDeviceToHost ) );
     8
     9      // finish up on the CPU side
    10      c = 0;
    11      for (int i=0; i<blocksPerGrid; i++) {
    12          c += partial_c[i];
    13      }
    14
    15      #define sum_squares(x)  (x*(x+1)*(2*x+1)/6)
    16      printf( "Does GPU value %.6g = %.6g?\n", c,
    17              2 * sum_squares( (float)(N - 1) ) );
    18
    19      // free memory on the gpu side
    20      HANDLE_ERROR( cudaFree( dev_a ) );
    21      HANDLE_ERROR( cudaFree( dev_b ) );
    22      HANDLE_ERROR( cudaFree( dev_partial_c ) );
    23
    24      // free memory on the cpu side
    25      free( a );
    26      free( b );
    27      free( partial_c );
    28  }
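To build and run it, something like the following should work, assuming the CUDA toolkit is installed and the common/book.h header from the CUDA by Example sources sits next to dot.cu as in the include above:

    nvcc -o dot dot.cu
    ./dot

The program prints whether the GPU result matches the closed-form value 2 * sum_squares(N - 1).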

Shared Memory and Threading (2D matrix example)
- Each SM in Maxwell has 64KB of shared memory (48KB max per block).
- Shared memory size is implementation dependent!
- For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
- Can potentially have up to 32 thread blocks actively executing.
- This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
- The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing 8 thread blocks to be active at the same time.
- Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16.
- The 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS.
Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
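Because the shared memory size is implementation dependent, it can be queried at run time; a minimal sketch using the CUDA runtime device-attribute API (assuming a reasonably recent toolkit):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main( void ) {
        int dev = 0, smemPerBlock = 0, smemPerSM = 0;
        cudaGetDevice( &dev );
        cudaDeviceGetAttribute( &smemPerBlock,
                                cudaDevAttrMaxSharedMemoryPerBlock, dev );
        cudaDeviceGetAttribute( &smemPerSM,
                                cudaDevAttrMaxSharedMemoryPerMultiprocessor, dev );
        printf( "shared memory: %d bytes per block, %d bytes per SM\n",
                smemPerBlock, smemPerSM );
        return 0;
    }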

