COMP 605: Introduction to Parallel Computing, Lecture: CUDA Thread Parallelism


COMP 605: Introduction to Parallel Computing
Lecture: CUDA Thread Parallelism
Mary Thomas
Department of Computer Science
Computational Science Research Center (CSRC)
San Diego State University (SDSU)
Posted: 04/25/17   Last Update: 04/25/17

Table of Contents (slide 2/17)

1. CUDA Thread Parallelism (S&K, Ch5)
   - Thread Parallelism

Recall: Defining GPU Threads and Blocks (slide 3/17)

Looking at device: Nvidia Tesla C1060
- Kernels run on GPU threads.
- Grid: organized as a 2D array of blocks.
  - Maximum sizes of each dimension [gridDim.x x gridDim.y x gridDim.z]: (65,536 x 65,536 x 1) blocks.
- Block: 3D collection of threads [blockDim.x x blockDim.y x blockDim.z].
  - Max threads per block: 512 on the C1060 (1024 on newer devices such as the GTX 480).
  - Max thread dimensions: (512, 512, 64).
- Threads composing a thread block must:
  - execute the same kernel
  - share data (they are issued to the same core).
- Warp: a group of 32 threads; the minimum size of data processed in SIMD fashion by a CUDA multiprocessor.

Source: ecture-and-Feature-Preview

Thread Parallelism (slide 4/17)

- We split the blocks into threads.
- Threads can communicate with each other.
- You can share information between blocks (using global memory and atomics, for example), but there is no global synchronization.
- Threads within a block can be synchronized using __syncthreads() (see the sketch below).
- Block parallelism: call the kernel with N blocks, 1 thread per block:
    add<<<N,1>>>( dev_a, dev_b, dev_c );
  N blocks x 1 thread/block = N parallel threads
- Thread parallelism: call the kernel with 1 block, N threads per block:
    add<<<1,N>>>( dev_a, dev_b, dev_c );
  1 block x N threads/block = N parallel threads
- Ultimately, we combine both models.
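The slide only names __syncthreads() as the block-level barrier. Here is a minimal sketch (not from the lecture; the kernel name reverse_in_block and the shared array tile are illustrative) of why the barrier is needed: each thread writes one element into shared memory, waits at __syncthreads(), and only then reads a value written by a different thread of the same block.

    // Sketch: __syncthreads() as a barrier among the threads of one block.
    #include <stdio.h>

    #define N 8

    __global__ void reverse_in_block( int *d_in, int *d_out ) {
        __shared__ int tile[N];          // shared by all threads of this block
        int tid = threadIdx.x;
        tile[tid] = d_in[tid];           // each thread writes one element
        __syncthreads();                 // barrier: wait until all writes are done
        d_out[tid] = tile[N - 1 - tid];  // safe to read what other threads wrote
    }

    int main( void ) {
        int h_in[N], h_out[N];
        int *d_in, *d_out;
        for (int i = 0; i < N; i++) h_in[i] = i;
        cudaMalloc( (void**)&d_in,  N * sizeof(int) );
        cudaMalloc( (void**)&d_out, N * sizeof(int) );
        cudaMemcpy( d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice );
        reverse_in_block<<<1,N>>>( d_in, d_out );    // 1 block, N threads
        cudaMemcpy( h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i = 0; i < N; i++) printf( "%d ", h_out[i] );
        printf( "\n" );
        cudaFree( d_in ); cudaFree( d_out );
        return 0;
    }

Without the barrier, a thread could read tile[N - 1 - tid] before the owning thread has written it.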

add_loop.cu, using 1 block and N threads (slide 5/17)

#include "./common/book.h"
#define N 10

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );

    /* call kernel with 1 block, N threads per block */
    add<<<1,N>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i = 0; i < N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}

Thread Parallelism: using a dynamic number of threads (slide 6/17)

/* Thread Parallelism: Using dynamic number of threads
 * Modified By: Mary Thomas (mthomas@mail.sdsu.edu)
 * Based on:    CUDA SDK code add_loop_gpu.cu
 */
#include <stdio.h>
//#define N 65535*10

__device__ int d_Nthds;

__global__ void checkDeviceThdCount(int *t) {
    *t = d_Nthds;
}

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;      // thread index within the single block
    if (tid < d_Nthds) {
        c[tid] = a[tid] + b[tid];
    }
}

int main( int argc, char** argv ) {
    if (argc != 2) {
        printf("Usage Error: %s N \n", argv[0]);
        return 1;
    }
    int h_N = atoi(argv[1]);
    int a[h_N], b[h_N], c[h_N];
    int *dev_a, *dev_b, *dev_c;
    int i;
    int *d_N, d_Ntmp;
    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // set #threads to device variable d_Nthds
    cudaMemcpyToSymbol(d_Nthds, &h_N, sizeof(int), 0, cudaMemcpyHostToDevice);
    cudaMalloc( (void**)&d_N, sizeof(int) );
    checkDeviceThdCount<<<1,1>>>(d_N);
    cudaMemcpy( &d_Ntmp, d_N, sizeof(int), cudaMemcpyDeviceToHost );
    cudaThreadSynchronize();

    // fill the arrays 'a' and 'b' on the CPU
    for (i = 0; i < h_N; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * (i + 1);
    }

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, h_N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, h_N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, h_N * sizeof(int) );

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, h_N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, h_N * sizeof(int), cudaMemcpyHostToDevice );

    /* call kernel with 1 block, h_N threads per block */
    add<<<1,h_N>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, h_N * sizeof(int), cudaMemcpyDeviceToHost );

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    // calculate elapsed time (in milliseconds)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Nthreads %d, Telapsed(msec) %26.16f\n", h_N, time);
    return 0;
}

Thread Parallelism: compiling, submitting, and output (slide 7/17)

[mthomas@tuckoo:cuda/add_loop] nvcc -arch=sm_20 -o add_loop_gpu add_loop_gpu.cu
[mthomas@tuckoo:cuda/add_loop] qsub -v T=10 ...
[mthomas@tuckoo:cuda/add_loop] cat addloop.o8709
running add_loop_gpu using 10 threads
There are 2 CUDA devices.
CUDA Device #0
  Device Name: GeForce GTX 480
  Maximum threads per block: 1024
  Maximum dimensions of block: blockDim[0,1,2] = [ 1024 1024 64 ]
CUDA Device #1
  Device Name: GeForce GTX 480
  Maximum threads per block: 1024
  Maximum dimensions of block: blockDim[0,1,2] = [ 1024 1024 64 ]
h_N = 10, d_N = 1048576, d_Ntmp = 10
1 1 2
2 4 6
3 9 12
4 16 20
5 25 30
6 36 42
7 49 56
8 64 72
9 81 90
10 100 110
Arr1[0]: 1 1 2
Arr1[5]: 6 36 42
Arr1[9]: 10 100 110
Arr2[0]: 1 1 2
Arr2[5]: 6 36 42
Arr2[9]: 10 100 110
GPU Nthreads 10, Telap(msec) 0.4095999896526337

Thread Parallelism: add_loop_blocks.cu (output) (slide 8/17)

[mthomas@tuckoo] cat addloopser.o* | grep Telap
serial: Nthreads 10000,     Telapsed(millisec) 184.0
serial: Nthreads 1000000,   Telapsed(millisec) 15143.0
serial: Nthreads 100000000, Telapsed(millisec) 181107.0

[mthomas@tuckoo] cat addloopgpu.o* | grep Telap
GPU Nthreads 10000,      Telap(msec) 1.1845120191574097
GPU Nthreads 1000000,    Telap(msec) 11.1852159500122070
GPU Nthreads 100000000,  Telap(msec) 661.7844238281250000
GPU Nthreads 1410065408, Telap(msec) 4061.6052246093750000

Loss of scaling when the number of threads exceeds the max threads per block (1024).

Performance comparison of serial vs. GPU runtimes (slide 9/17)

Note: for small N, the GPU performance degrades after about 10^3, but then improves for very large N.

What happens when the #threads requested is larger than #blocks x #threads/block? (slide 10/17)

- You cannot exceed maxThreadsPerBlock.
  - Use a device query to find out the max (1024 on tuckoo); see the sketch below.
- You will lose parallel efficiency.
- tuckoo.sdsu.edu (Spring 2014):
  - Max threads per block: 512 or 1024
  - Max thread dimensions: (512, 512, 64) or (1024, 1024, 64)
  - Max grid dimensions: (65535, 65535, 1)
- For large N, we need a 2D combination of threads and blocks.
- Thread rank: convert the 2D [block, thread] space to a 1D indexing scheme:
    tid = threadIdx.x + blockIdx.x * blockDim.x;
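The course uses its enum_gpu program for this query (output on the next slide); the following is a rough self-contained sketch, not the original source, using the standard CUDA runtime calls cudaGetDeviceCount() and cudaGetDeviceProperties() to report the limits quoted above.

    // Sketch: query each device's block/grid limits via the CUDA runtime API.
    #include <stdio.h>

    int main( void ) {
        int count = 0;
        cudaGetDeviceCount( &count );
        for (int dev = 0; dev < count; dev++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties( &prop, dev );
            printf( "Device %d: %s\n", dev, prop.name );
            printf( "  Max threads per block: %d\n", prop.maxThreadsPerBlock );
            printf( "  Max thread dimensions: (%d, %d, %d)\n",
                    prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2] );
            printf( "  Max grid dimensions:   (%d, %d, %d)\n",
                    prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );
        }
        return 0;
    }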

Determining what the block & thread dimensions are on the device (slide 11/17)

[mthomas@tuckoo:cuda/enum] cat ...
-----------------------------------------
node9  has 2  GTX 480 gpu cards (1.6GB dev ram ea.)
node8  has 2  C2075   gpu cards (  6GB dev ram ea.)
node7  has 2  C1060   gpu cards (  4GB dev ram ea.)
node11 has 1  K40     gpu card  ()
[snip]

[mthomas@tuckoo:cuda/enum] cat enum_gpu.bat
#!/bin/sh
###PBS -l nodes=node9:ppn=1
#PBS -l nodes=node7:ppn=1
#PBS -N enum_gpu
#PBS -j oe
#PBS -q batch
cd $PBS_O_WORKDIR
./enum_gpu

-----------------------------------------
NODE 7: C1060
Name: GeForce GT 240
--- MP Information for device 2 ---
Multiprocessor count: 12
Shared mem per mp: 16384
Registers per mp: 16384
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512, 512, 64)
Max grid dimensions: (65535, 65535, 1)

-----------------------------------------
NODE 9: GTX 480
Name: GeForce GTX 480
--- MP Information for device 1 ---
Multiprocessor count: 15
Shared mem per mp: 49152
Registers per mp: 32768
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (65535, 65535, 65535)

Translating thread row & column locations into unique thread IDs (slide 12/17)

Threads represent columns, blocks represent rows (blockDim.x = 4):

    tid = threadIdx.x + blockIdx.x * blockDim.x

               Th0   Th1   Th2   Th3
    Block 0      0     1     2     3
    Block 1      4     5     6     7
    Block 2      8     9    10    11
    Block 3     12    13    14    15

For example, thread 0 of each block maps to 0 + 0*4 = 0, 0 + 1*4 = 4, 0 + 2*4 = 8, 0 + 3*4 = 12.

Map to a vector of thread IDs, Elem(B,TID) -> Vector index: e.g. [3,0] -> [12], [3,1] -> [13], [3,2] -> [14], [3,3] -> [15].
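A small sketch (not from the slides; the kernel name write_rank is illustrative) that reproduces this table: each thread stores its blockIdx.x and threadIdx.x at the slot selected by the rank formula, so the printed array shows how the 2D [block, thread] space collapses onto the 1D indices 0..15.

    #include <stdio.h>

    #define BLOCKS  4
    #define THREADS 4

    __global__ void write_rank( int *block_of, int *thread_of ) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;   // thread rank formula
        block_of[tid]  = blockIdx.x;
        thread_of[tid] = threadIdx.x;
    }

    int main( void ) {
        const int n = BLOCKS * THREADS;
        int h_b[n], h_t[n];
        int *d_b, *d_t;
        cudaMalloc( (void**)&d_b, n * sizeof(int) );
        cudaMalloc( (void**)&d_t, n * sizeof(int) );
        write_rank<<<BLOCKS,THREADS>>>( d_b, d_t );
        cudaMemcpy( h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost );
        cudaMemcpy( h_t, d_t, n * sizeof(int), cudaMemcpyDeviceToHost );
        for (int i = 0; i < n; i++)     // e.g. prints "tid 12 <- block 3, thread 0"
            printf( "tid %2d <- block %d, thread %d\n", i, h_b[i], h_t[i] );
        cudaFree( d_b ); cudaFree( d_t );
        return 0;
    }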

Combining blocks and threads (slide 13/17)

- GPU hardware limits the number of blocks per grid and the number of threads per block.
- Larger problems require the use of both grids and blocks.
- We need to control the number of threads, since the per-block thread limit is the smaller one.
- Fix the number of threads per block and distribute chunks of the work across the blocks:
    add<<<128,128>>>( dev_a, dev_b, dev_c );
    add<<<h_N,h_N>>>( dev_a, dev_b, dev_c );
    add<<<ceil(h_N/128),128>>>( dev_a, dev_b, dev_c );
    add<<<(h_N+127)/128,128>>>( dev_a, dev_b, dev_c );
- If maxTh is the maximum number of threads per block:
    add<<<(h_N+(maxTh-1))/maxTh, maxTh>>>( dev_a, dev_b, dev_c );
- Compute the thread index as:
    tid = threadIdx.x + blockIdx.x * blockDim.x;

CODE: add_loop_gpu.cu for a large number of threads (slide 14/17)

/* CODE: add_loop_gpu for large # threads. */
#include <stdio.h>
//#define N 65535*10

__device__ int d_Nthds;

__global__ void checkDeviceThdCount(int *t) {
    *t = d_Nthds;
}

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < d_Nthds)
        c[tid] = a[tid] + b[tid];
}

int main( int argc, char** argv ) {
    /* get #threads from the command line */
    int h_N = atoi(argv[1]);
    int a[h_N], b[h_N], c[h_N];
    int *dev_a, *dev_b, *dev_c;
    int i, j, k, *d_N, d_Ntmp;
    float time;
    cudaEvent_t start, stop;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // set the number of threads to device: d_Nthds
    cudaMemcpyToSymbol(d_Nthds, &h_N, sizeof(int), 0, cudaMemcpyHostToDevice);

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, h_N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, h_N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, h_N * sizeof(int) );

    // fill the arrays 'a' and 'b' on the CPU
    for (i = 0; i < h_N; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * (i + 1);
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, h_N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, h_N * sizeof(int), cudaMemcpyHostToDevice );

    // launch: fixed 128 threads/block, enough blocks to cover h_N
    //add<<<128,128>>>( dev_a, dev_b, dev_c );
    //add<<<h_N,h_N>>>( dev_a, dev_b, dev_c );
    //add<<<ceil(h_N/128),128>>>( dev_a, dev_b, dev_c );
    add<<<(h_N+127)/128,128>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, h_N * sizeof(int), cudaMemcpyDeviceToHost );

    // free the memory allocated on the GPU
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );

    // compute elapsed time (in milliseconds)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("GPU Nthreads %d, Telap(msec) %26.16f\n", h_N, time);
    return 0;
}

Generalized kernel launch parameters <<<dimGrid, dimBlock>>> (slide 15/17)

- Distribute threads by thread blocks.
- The kernel launch passes #blocks and #threads; these can be 3-dimensional objects of type dim3 (a C type).
- To distribute h_N threads using the maximum number of threads per block:
    int threadsPerBlock = maxthds;
    int blocksPerGrid = imin( 32, (N + threadsPerBlock - 1) / threadsPerBlock );
    add<<<blocksPerGrid, threadsPerBlock>>>( dev_a, dev_b, dev_c );
  OR, using the dim3 object:
    dim3 dimBlock(threadsPerBlock, 1, 1);
    dim3 dimGrid(blocksPerGrid, 1, 1);
    add<<<dimGrid, dimBlock>>>( dev_a, dev_b, dev_c );
- Calling a kernel for a 2D m x n matrix M[m][n], where m <= maxthds and n <= maxblocks:
    dim3 dimGrid(n, 1, 1);
    dim3 dimBlock(m, 1, 1);
    add<<<dimGrid, dimBlock>>>( dev_a, dev_b, dev_c );
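As a rough end-to-end sketch of the launch above (assumptions: imin is defined here as a plain macro since it comes from the S&K book helpers, maxthds is hard-coded to 1024, and N is fixed at compile time; none of this is the lecture's exact code):

    #include <stdio.h>

    #define imin(a,b) ((a) < (b) ? (a) : (b))
    #define N 10000

    __global__ void add( int *a, int *b, int *c ) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

    int main( void ) {
        int maxthds = 1024;   // maxThreadsPerBlock on tuckoo's GTX 480
        int threadsPerBlock = maxthds;
        int blocksPerGrid = imin( 32, (N + threadsPerBlock - 1) / threadsPerBlock );

        // allocate dev_a, dev_b, dev_c (input copies omitted; see slide 5)
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, N * sizeof(int) );
        cudaMalloc( (void**)&dev_b, N * sizeof(int) );
        cudaMalloc( (void**)&dev_c, N * sizeof(int) );

        dim3 dimBlock(threadsPerBlock, 1, 1);
        dim3 dimGrid(blocksPerGrid, 1, 1);
        add<<<dimGrid, dimBlock>>>( dev_a, dev_b, dev_c );
        cudaDeviceSynchronize();

        cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
        return 0;
    }

With N = 10000 and 1024 threads per block, blocksPerGrid evaluates to 10, so the grid covers all N elements and the imin cap of 32 blocks is not reached.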

Mapping threads to multidimensional data (slide 16/17)

Example: covering a 76 x 62 picture with 16 x 16 blocks
  m = 76 horizontal (x) pixels, n = 62 vertical (y) pixels

__global__ void kernel( unsigned char *ptr, int ticks ) {
    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    . . .
}

dim3 dimBlock(16, 16, 1);
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
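Because 16 x 16 blocks round up past the 76 x 62 picture (5 x 4 blocks launch 80 x 64 threads), the extra threads must be masked off before they write. A short sketch of that boundary guard (kernel and parameter names d_in, d_out, width, height are illustrative; this is not the lecture's pictureKernel):

    // Sketch: the grid over-covers the image, so guard every access.
    __global__ void scale_pixels( unsigned char *d_in, unsigned char *d_out,
                                  int width, int height ) {
        int x = threadIdx.x + blockIdx.x * blockDim.x;   // pixel column
        int y = threadIdx.y + blockIdx.y * blockDim.y;   // pixel row
        if (x < width && y < height) {                   // boundary guard
            int offset = x + y * width;                  // row-major 1D index
            d_out[offset] = 255 - d_in[offset];          // some per-pixel work
        }
    }

    // launch: one 16x16 block per tile, grid rounded up to cover the image
    // dim3 dimBlock(16, 16, 1);
    // dim3 dimGrid((width + 15) / 16, (height + 15) / 16, 1);
    // scale_pixels<<<dimGrid, dimBlock>>>(d_in, d_out, width, height);

With the guard in place, the linear offset can use the true image width rather than blockDim.x * gridDim.x, since no out-of-range pixel is ever touched.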

Row-major layout for a 2D C array (slide 17/17)

- The pixel data will have a dynamic number of pixels.
- CUDA does not allow run-time allocation of a 2D matrix.
  - Not allowed by the version of ANSI C used by CUDA (according to Kirk & Hwu), though this may have changed by now.
- We need to linearize the array, in row-major order, into a vector, which can be dynamic.
- 1D array, where Element[row][col] is element [row * width + col].
- Thread mapping:
    int x = threadIdx.x + blockIdx.x * blockDim.x;
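A brief sketch of the linearization (illustrative names; assuming a width x height image of floats): the same row * width + col rule is applied on the host when filling the flat array and in the kernel when indexing it.

    // Sketch: a dynamically sized "2D" array stored as one flat, row-major vector.
    #include <stdlib.h>

    __global__ void add_one( float *d_img, int width, int height ) {
        int col = threadIdx.x + blockIdx.x * blockDim.x;
        int row = threadIdx.y + blockIdx.y * blockDim.y;
        if (col < width && row < height)
            d_img[row * width + col] += 1.0f;        // row-major rule on the GPU
    }

    int main( void ) {
        int width = 76, height = 62;                 // sizes known only at run time
        float *h_img = (float*)malloc( width * height * sizeof(float) );
        for (int row = 0; row < height; row++)
            for (int col = 0; col < width; col++)
                h_img[row * width + col] = 0.0f;     // row-major rule on the host

        float *d_img;
        cudaMalloc( (void**)&d_img, width * height * sizeof(float) );
        cudaMemcpy( d_img, h_img, width * height * sizeof(float), cudaMemcpyHostToDevice );

        dim3 dimBlock(16, 16, 1);
        dim3 dimGrid((width + 15) / 16, (height + 15) / 16, 1);
        add_one<<<dimGrid, dimBlock>>>( d_img, width, height );

        cudaMemcpy( h_img, d_img, width * height * sizeof(float), cudaMemcpyDeviceToHost );
        cudaFree( d_img );
        free( h_img );
        return 0;
    }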

