PROGRAMMING MULTI-GPU NODES


Steve Abbott & Jeff Larkin, November 2018

AGENDA
Summit Node Overview
Multi-GPU Programming Models
Multi-GPU Programming with OpenACC and CUDA
GPUDirect, CUDA Aware MPI, and CUDA IPC
Spectrum MPI & Jsrun Tips and Tricks

SUMMIT NODE OVERVIEW

SUMMIT NODE
(2) IBM POWER9 + (6) NVIDIA VOLTA V100
[Node diagram: each POWER9 socket has 256 GB DDR4 at 135 GB/s and 21 usable cores (4 hardware threads each); the two sockets are linked at 64 GB/s. Each socket connects to 3 V100 GPUs over NVLink2 (50 GB/s links); each GPU has 16 GB HBM2 at 900 GB/s.]

UNDER THE HOOD
Summit has fat nodes!
Many connections
Many devices
Many stacks

MULTI-GPU PROGRAMMING MODELS

MULTI-GPU PROGRAMMING MODELS
Single Thread, Multiple GPUs: a single thread will change devices as needed to send data and kernels to different GPUs.
Multiple Threads, Multiple GPUs: using OpenMP, Pthreads, or similar, each thread can manage its own GPU.
Multiple Ranks, Single GPU: each rank acts as if there's just one GPU, but multiple ranks per node use all GPUs.
Multiple Ranks, Multiple GPUs: each rank manages multiple GPUs, with multiple ranks per node. Gets complicated quickly!

MULTI-GPU PROGRAMMING MODELS
Trade-offs Between Approaches
Single Thread, Multiple GPUs: conceptually simple; requires additional loops; the CPU can become a bottleneck; the remaining CPU cores are often underutilized.
Multiple Threads, Multiple GPUs: conceptually very simple; set and forget the device numbers; relies on an external threading API; can see improved utilization; watch affinity.
Multiple Ranks, Single GPU: little to no code changes required; re-uses the existing domain decomposition; probably already using MPI; watch affinity.
Multiple Ranks, Multiple GPUs: easily share data between peer devices; coordinating between GPUs is extremely tricky.

MULTI-DEVICE CUDA
CUDA by default exposes all devices, numbered 0 to (N-1). If the devices are not all the same, it will reorder the "best" to device 0.
Each device has its own pool of streams.
If you do nothing, all work will go to device 0.
The developer must change the current device explicitly.
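As a minimal, hedged sketch of this single-thread pattern (the kernel name scale_kernel and the per-device buffer array d_data are illustrative, not from the slides), one host thread walks the devices, switching the current device before each launch:

  #include <cuda_runtime.h>

  // Illustrative kernel: double every element of a buffer.
  __global__ void scale_kernel(float *data, size_t n)
  {
      size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
  }

  // One host thread drives every visible device.
  void launch_on_all_devices(float **d_data, size_t n)
  {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      for (int dev = 0; dev < ndev; ++dev) {
          cudaSetDevice(dev);                                      // subsequent calls target this device
          scale_kernel<<<(n + 255) / 256, 256>>>(d_data[dev], n);  // launch is asynchronous
      }
      for (int dev = 0; dev < ndev; ++dev) {
          cudaSetDevice(dev);
          cudaDeviceSynchronize();                                 // wait for each device to finish
      }
  }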

MULTI-DEVICE OPENACC
OpenACC presents devices numbered 0 to (N-1) for each device type available.
The order of the devices comes from the runtime, almost certainly the same as CUDA.
By default all data and work go to the current device.
Developers must change the current device, and maybe the current device type, using an API.
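A comparable hedged sketch with OpenACC (array names illustrative): acc_set_device_num() switches the current device before the work is queued, and again before the wait.

  #include <openacc.h>

  // One host thread drives every NVIDIA device.
  void scale_on_all_devices(float **data, size_t n)
  {
      int ndev = acc_get_num_devices(acc_device_nvidia);
      for (int dev = 0; dev < ndev; ++dev) {
          acc_set_device_num(dev, acc_device_nvidia);   // subsequent directives target this device
          float *p = data[dev];
          #pragma acc parallel loop copy(p[0:n]) async
          for (size_t i = 0; i < n; ++i)
              p[i] *= 2.0f;
      }
      for (int dev = 0; dev < ndev; ++dev) {
          acc_set_device_num(dev, acc_device_nvidia);   // the wait is issued per device
          #pragma acc wait
      }
  }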

MULTI-DEVICE OPENMP
OpenMP numbers devices 0 to (N-1) for ALL devices on the machine, including the host.
The order is determined by the runtime, but devices of the same type are contiguous.
To change the device for data and compute, a clause is added to the directives.
Device API routines take a device number argument.
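A hedged OpenMP 4.5 sketch (array name illustrative): the device() clause selects the target, and nowait plus taskwait lets the devices run concurrently.

  #include <omp.h>

  // Split one array across all non-host devices using the device() clause.
  void scale_on_all_devices(float *a, size_t n)
  {
      int ndev = omp_get_num_devices();                    // non-host devices only
      if (ndev == 0) return;
      size_t chunk = (n + ndev - 1) / ndev;
      for (int dev = 0; dev < ndev; ++dev) {
          size_t lo = dev * chunk;
          if (lo >= n) break;
          size_t hi = (lo + chunk < n) ? lo + chunk : n;
          #pragma omp target teams distribute parallel for \
                  map(tofrom: a[lo:hi-lo]) device(dev) nowait
          for (size_t i = lo; i < hi; ++i)
              a[i] *= 2.0f;
      }
      #pragma omp taskwait                                 // wait for all deferred target regions
  }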

MULTI-GPU PROGRAMMING WITH OPENACC AND CUDA

MULTI-GPU W/ CUDA AND OPENACC
The CUDA and OpenACC approaches are sufficiently similar that I will demonstrate using OpenACC.
Decoder Ring:
  OpenACC                   CUDA
  acc_get_device_type()     N/A
  acc_set_device_type()     N/A
  acc_set_device_num()      cudaSetDevice()
  acc_get_device_num()      cudaGetDevice()
  acc_get_num_devices()     cudaGetDeviceCount()

Multi-Device Pipeline
A Case Study
We'll use a simple image filter to demonstrate these techniques. No inter-GPU communication is required.
Pipelining: breaking a large operation into smaller parts so that independent operations can overlap.
Since each part is independent, they can easily be run on different devices. We will extend the filter to run on more than one device.

Pipelining in a Nutshell
[Diagram: two independent operations, each an H2D copy, a kernel, and a D2H copy, shown first serialized and then overlapping copying and computation.]
NOTE: In real applications, your boxes will not be so evenly sized.

Multi-device Pipelining in a Nutshell
[Diagram: the H2D copy, kernel, and D2H copy blocks distributed across Device 0 and Device 1, overlapping on both devices.]

Pipelined Code

#pragma acc data create(imgData[w*h*ch], out[w*h*ch])
{
  for (long blocky = 0; blocky < nblocks; blocky++)
  {
    // For data copies we need to include the ghost zones for the filter
    long starty = MAX(0, blocky * blocksize - filtersize/2);
    long endy   = MIN(h, starty + blocksize + filtersize/2);
    #pragma acc update device(imgData[starty*step:(endy-starty)*step]) async(blocky%3)
    starty = blocky * blocksize;
    endy   = starty + blocksize;
    #pragma acc parallel loop collapse(2) gang vector async(blocky%3)
    for (long y = starty; y < endy; y++) {
      for (long x = 0; x < w; x++) {
        // filter code omitted
        out[y * step + x * ch]     = 255 - (scale * blue);
        out[y * step + x * ch + 1] = 255 - (scale * green);
        out[y * step + x * ch + 2] = 255 - (scale * red);
      }
    }
    #pragma acc update self(out[starty*step:blocksize*step]) async(blocky%3)
  }
  #pragma acc wait
}

Cycle between 3 async queues by blocks.

Pipelined Code (continued)
The update directives and parallel loops cycle between 3 async queues by block; the #pragma acc wait after the loop waits for all blocks to complete before the data region ends.

NVPROF Timeline of Pipeline
[Screenshot: nvprof/NVVP timeline of the pipelined code.]

Extending to Multiple Devices
Create 1 OpenMP thread on the CPU per device. This is not strictly necessary, but it simplifies the code.
Within each thread, set the device number.
Divide the blocks as evenly as possible among the CPU threads.

Multi-GPU Pipelined Code (OpenMP)

#pragma omp parallel num_threads(acc_get_num_devices(acc_device_default))
{
  acc_set_device_num(omp_get_thread_num(), acc_device_default);
  int queue = 1;
  #pragma acc data create(imgData[w*h*ch], out[w*h*ch])
  {
    #pragma omp for schedule(static)
    for (long blocky = 0; blocky < nblocks; blocky++) {
      // For data copies we need to include the ghost zones for the filter
      long starty = MAX(0, blocky * blocksize - filtersize/2);
      long endy   = MIN(h, starty + blocksize + filtersize/2);
      #pragma acc update device(imgData[starty*step:(endy-starty)*step]) async(queue)
      starty = blocky * blocksize;
      endy   = starty + blocksize;
      #pragma acc parallel loop collapse(2) gang vector async(queue)
      for (long y = starty; y < endy; y++) {
        for (long x = 0; x < w; x++) {
          // filter code removed for space
        }
      }
      #pragma acc update self(out[starty*step:blocksize*step]) async(queue)
      queue = (queue % 3) + 1;
    }
    #pragma acc wait
  }
}

Spawn 1 thread per device. Set the device number per-thread. Divide the work among threads. Wait for each device in its thread.

Multi-GPU Pipelined Performance
[Chart: speed-up over the single-device pipelined version for 2, 4, and 8 devices; data labels of 2.66X and 2.92X are visible, and an annotation marks where the configuration crosses the quad boundary.]
Source: PGI 17.3, NVIDIA Tesla P100 (DGX-1)

OpenACC with MPI
Domain decomposition is performed using MPI ranks.
Each rank should set its own device:
  Maybe with acc_set_device_num
  Maybe handled by an environment variable (CUDA_VISIBLE_DEVICES)
GPU affinity can be handled by standard MPI task placement.
Multiple MPI ranks per GPU (using MPS) can work in place of OpenACC work queues/CUDA streams.

Setting a Device by Local Rank

// This is not portable to other MPI libraries
char *comm_local_rank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
int local_rank = atoi(comm_local_rank);
char *comm_local_size = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
int local_size = atoi(comm_local_size);

int num_devices = acc_get_num_devices(acc_device_nvidia);
#pragma acc set device_num(local_rank % num_devices) \
            device_type(acc_device_nvidia)

Determine a unique ID for each rank on the same node, then use that unique ID to select a device per rank.
You may also try MPI_Comm_split_type() with MPI_COMM_TYPE_SHARED or OMPI_COMM_TYPE_SOCKET.
In the end, you need to understand how jsrun/mpirun is placing your ranks.
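For a library-agnostic alternative, here is a hedged sketch of the MPI_Comm_split_type() approach mentioned above (error checking omitted):

  #include <mpi.h>
  #include <openacc.h>

  // Derive a node-local rank portably and bind one device per rank.
  void bind_device_by_local_rank(void)
  {
      MPI_Comm local_comm;
      int local_rank, num_devices;

      // Group the ranks that share a node into one communicator.
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &local_comm);
      MPI_Comm_rank(local_comm, &local_rank);

      num_devices = acc_get_num_devices(acc_device_nvidia);
      acc_set_device_num(local_rank % num_devices, acc_device_nvidia);

      MPI_Comm_free(&local_comm);
  }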

MPI Image Filter (pseudocode)

if (rank == 0) read_image();
// Distribute the image to all ranks
MPI_Scatterv(image);
MPI_Barrier();        // Ensures all ranks line up for timing
omp_get_wtime();
blur_filter();        // Contains the OpenACC filter
MPI_Barrier();        // Ensures all ranks complete before timing
omp_get_wtime();
// Receive final parts from all ranks
MPI_Gatherv(out);
if (rank == 0) write_image();

Launch with good GPU/process affinity:
jsrun -n 6 -a 1 -c 1 -g 1 .

Decompose the image across processes (ranks). There's a variety of ways to do MPI decomposition; this is what I used for this particular example.

Multi-GPU Pipelined Performance (MPI)
[Chart: speed-up relative to one device. 1 device: 1.00X; 2 devices: 1.53X; 4 devices: 2.83X; 8 devices: 4.89X. An annotation marks where the run crosses the quad boundary.]
Source: PGI 17.3, NVIDIA Tesla P100 (DGX-1), communication excluded

Multi-GPU Pipelined Performance (MPI)
[Chart, continued to 16 devices: 8.51X, with an annotation where the run crosses the node boundary.]
Source: PGI 17.3, NVIDIA Tesla P100 (DGX-1), communication excluded

MULTI-DEVICE CUDA
Same Pattern, Different API

// One OpenMP thread per device:
#pragma omp parallel
{
  cudaSetDevice(idx);
  #pragma omp for
  for (int b = 0; b < nblocks; b++)
  {
    cudaMemcpyAsync(..., streams[b%3]);
    blur_kernel<<<griddim, blockdim, 0, streams[b%3]>>>(...);
    cudaMemcpyAsync(..., streams[b%3]);
  }
  cudaDeviceSynchronize();
}

// One MPI rank per device:
MPI_Comm_rank(local_comm, &local_rank);
cudaSetDevice(local_rank);
for (int b = 0; b < nblocks; b++)
{
  cudaMemcpyAsync(..., streams[b%3]);
  blur_kernel<<<griddim, blockdim, 0, streams[b%3]>>>(...);
  cudaMemcpyAsync(..., streams[b%3]);
}
cudaDeviceSynchronize();

MULTI-DEVICE OPENMP 4.5
Same Pattern, Different API

// One OpenMP host thread per device:
#pragma omp parallel num_threads(num_dev)
{
  #pragma omp for
  for (int b = 0; b < nblocks; b++)
  {
    #pragma omp target update map(to: ...) device(dev) depend(inout: A) nowait
    #pragma omp target teams distribute parallel for simd device(dev) depend(inout: A)
    for (...) { ... }
    #pragma omp target update map(from: ...) device(dev) depend(inout: A) nowait
  }
  #pragma omp taskwait
}

// One MPI rank per device:
MPI_Comm_rank(local_comm, &local_rank);
int dev = local_rank;
for (int b = 0; b < nblocks; b++)
{
  #pragma omp target update map(to: ...) device(dev) depend(inout: A) nowait
  #pragma omp target teams distribute parallel for simd device(dev) depend(inout: A)
  for (...) { ... }
  #pragma omp target update map(from: ...) device(dev) depend(inout: A) nowait
}
#pragma omp taskwait

Multi-GPU Approaches
Choosing an approach
Single-Threaded, Multiple-GPUs – Requires additional loops to manage devices; likely undesirable.
Multi-Threaded, Multiple-GPUs – Very convenient: set and forget the device. Could possibly conflict with existing threading.
Multiple-Ranks, Single-GPU each – Probably the simplest if you already have MPI; the decomposition is already done. Must get your MPI placement correct.
Multiple-Ranks, Multiple-GPUs – Can allow all GPUs to share common data structures. Only do this if you absolutely need to; it is difficult to get right.

GPUDIRECT, CUDA AWARE MPI, & CUDA IPC

NVIDIA GPUDIRECT
Accelerated Communication with Network & Storage Devices
[Diagram: GPU 1, GPU 2, chipset, and InfiniBand adapter.]

NVIDIA GPUDIRECT
Peer to Peer over NVLink
[Diagram: GPU 1 and GPU 2 transferring data directly over NVLink.]

NVIDIA GPUDIRECT
Support for RDMA
[Diagram: GPU 1, GPU 2, chipset, and InfiniBand adapter; the network adapter accesses GPU memory directly.]

CUDA AWARE MPI FOR ON AND OFF NODE TRANSFERS

REGULAR MPI GPU TO REMOTE GPU

// MPI Rank 0: stage through the host, then send
cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI Rank 1: receive into the host, then stage to the GPU
MPI_Recv(r_buf_h, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

REGULAR MPI GPU TO REMOTE GPU
[Timeline: memcpy D->H, MPI_Sendrecv, and memcpy H->D run one after the other.]

MPI GPU TO REMOTE GPU
without GPUDirect

// MPI Rank 0: pass the device buffer straight to MPI
MPI_Send(s_buf_d, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI Rank 1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

MPI GPU TO REMOTE GPU
without GPUDirect

With OpenACC, the explicit device pointers (s_buf_d, r_buf_d) are replaced by a host_data region that exposes the device addresses of the managed arrays:

// MPI Rank 0
#pragma acc host_data use_device(s_buf, r_buf)
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI Rank 1
#pragma acc host_data use_device(s_buf, r_buf)
MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

MPI GPU TO REMOTE GPU
without GPUDirect
[Timeline: the exchange appears as a single MPI_Sendrecv; the MPI library handles the device-host staging internally.]

MPI GPU TO REMOTE GPU
Support for RDMA

// MPI Rank 0: pass the device buffer straight to MPI
MPI_Send(s_buf_d, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI Rank 1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

MPI GPU TO REMOTE GPU
Support for RDMA

// OpenACC: expose the device addresses to MPI with host_data

// MPI Rank 0
#pragma acc host_data use_device(s_buf, r_buf)
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI Rank 1
#pragma acc host_data use_device(s_buf, r_buf)
MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

MPI GPU TO REMOTE GPU
Support for RDMA

// OpenMP: the equivalent is use_device_ptr on a target data region

// MPI Rank 0
#pragma omp target data use_device_ptr(s_buf, r_buf)
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI Rank 1
#pragma omp target data use_device_ptr(s_buf, r_buf)
MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
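Pulled together into one hedged sketch (buffer names illustrative), a CUDA-aware exchange from OpenACC looks like this; host_data hands the device addresses of s_buf and r_buf straight to MPI:

  #include <mpi.h>

  // Exchange device-resident buffers with a peer rank through CUDA-aware MPI.
  void exchange(double *s_buf, double *r_buf, int n, int peer, int tag)
  {
      #pragma acc data copyin(s_buf[0:n]) copyout(r_buf[0:n])
      {
          #pragma acc host_data use_device(s_buf, r_buf)
          {
              // With a CUDA-aware MPI (e.g. --smpiargs "-gpu" on Summit), these are device pointers.
              MPI_Sendrecv(s_buf, n, MPI_DOUBLE, peer, tag,
                           r_buf, n, MPI_DOUBLE, peer, tag,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          }
      }
  }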

MPI GPU TO REMOTE GPU
Support for RDMA
[Timeline: a single MPI_Sendrecv with no separate host staging copies.]

ADVANCED ON-NODE COMMUNICATION

SINGLE THREADED MULTI GPU PROGRAMMING

while (l2_norm > tol && iter < iter_max) {
  for (int dev_id = 0; dev_id < num_devices; dev_id++) {
    const int top    = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
    const int bottom = (dev_id + 1) % num_devices;
    cudaSetDevice(dev_id);
    cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(real));
    jacobi_kernel<<<dim_grid, dim_block>>>(a_new[dev_id], a[dev_id], l2_norm_d[dev_id],
                                           iy_start[dev_id], iy_end[dev_id], nx);
    cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real), cudaMemcpyDeviceToHost);
    cudaMemcpyAsync(a_new[top] + (iy_end[top]*nx),
                    a_new[dev_id] + iy_start[dev_id]*nx, nx*sizeof(real), ...);
    cudaMemcpyAsync(a_new[bottom],
                    a_new[dev_id] + (iy_end[dev_id]-1)*nx, nx*sizeof(real), ...);
  }
  l2_norm = 0.0;
  for (int dev_id = 0; dev_id < num_devices; dev_id++) {
    cudaSetDevice(dev_id);
    cudaDeviceSynchronize();
    l2_norm += *(l2_norm_h[dev_id]);
  }
  l2_norm = std::sqrt(l2_norm);
  for (int dev_id = 0; dev_id < num_devices; dev_id++)
    std::swap(a_new[dev_id], a[dev_id]);
  iter++;
}

GPUDIRECT P2P
Enable P2P

for (int dev_id = 0; dev_id < num_devices; dev_id++) {
  cudaSetDevice(dev_id);
  const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
  int canAccessPeer = 0;
  cudaDeviceCanAccessPeer(&canAccessPeer, dev_id, top);
  if (canAccessPeer)
    cudaDeviceEnablePeerAccess(top, 0);
  const int bottom = (dev_id + 1) % num_devices;
  if (top != bottom) {
    cudaDeviceCanAccessPeer(&canAccessPeer, dev_id, bottom);
    if (canAccessPeer)
      cudaDeviceEnablePeerAccess(bottom, 0);
  }
}
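Once peer access is enabled, a cudaMemcpyAsync between pointers that live on different devices takes the direct path; cudaMemcpyPeerAsync makes the devices explicit. A hedged sketch, reusing the Jacobi variable names above (the stream argument is assumed):

  // Explicit peer-to-peer copy of one halo row after peer access is enabled.
  cudaMemcpyPeerAsync(a_new[top] + iy_end[top]*nx,          top,     // dst pointer, dst device
                      a_new[dev_id] + iy_start[dev_id]*nx,  dev_id,  // src pointer, src device
                      nx * sizeof(real), stream);                    // bytes, stream (assumed)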

EXAMPLE JACOBI
Top/Bottom Halo

cudaMemcpyAsync(a_new[top] + (iy_end[top]*nx),
                a_new[dev_id] + iy_start[dev_id]*nx, nx*sizeof(real), ...);


EXAMPLE JACOBI
Top/Bottom Halo

// (1) push this device's first computed row into the top neighbor's halo
cudaMemcpyAsync(a_new[top] + (iy_end[top]*nx),
                a_new[dev_id] + iy_start[dev_id]*nx, nx*sizeof(real), ...);

// (2) push this device's last computed row into the bottom neighbor's halo
cudaMemcpyAsync(a_new[bottom],
                a_new[dev_id] + (iy_end[dev_id]-1)*nx, nx*sizeof(real), ...);

MULTIPLE PROCESS, SINGLE GPU W/O MPI!

while (l2_norm > tol && iter < iter_max) {
  const int top    = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
  const int bottom = (dev_id + 1) % num_devices;
  cudaSetDevice(dev_id);
  cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(real));
  jacobi_kernel<<<dim_grid, dim_block>>>(a_new[dev_id], a[dev_id], l2_norm_d[dev_id],
                                         iy_start[dev_id], iy_end[dev_id], nx);
  cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real), cudaMemcpyDeviceToHost);
  cudaMemcpyAsync(a_new[top] + (iy_end[top]*nx),
                  a_new[dev_id] + iy_start[dev_id]*nx, nx*sizeof(real), ...);
  cudaMemcpyAsync(a_new[bottom],
                  a_new[dev_id] + (iy_end[dev_id]-1)*nx, nx*sizeof(real), ...);
  l2_norm = 0.0;
  for (int dev_id = 0; dev_id < num_devices; dev_id++) {
    l2_norm += *(l2_norm_h[dev_id]);
  }
  l2_norm = std::sqrt(l2_norm);
  std::swap(a_new[dev_id], a[dev_id]);
  iter++;
}

GPUDIRECT P2P
Enable CUDA Inter-Process Communication (IPC)!

cudaSetDevice(dev_id);

// Allocate and fill my device buffer
cudaMalloc((void **) &myBuf, nbytes);
cudaMemcpy((void *) myBuf, (void *) buf, nbytes, cudaMemcpyHostToDevice);

// Get my IPC handle
cudaIpcMemHandle_t myIpc;
cudaIpcGetMemHandle(&myIpc, myBuf);

GPUDIRECT P2P
Enable CUDA Inter-Process Communication (IPC)!
[Diagram builds: the owning process allocates myBuf and obtains its IPC handle; the handle is shared with the other processes on the node (Process 1, Process 2, Process 3), which can then map the same device allocation.]
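The slides show how the handle is created; how it reaches a sibling process is up to the application. A hedged sketch that swaps handles with a neighbor rank over MPI (names illustrative):

  #include <mpi.h>
  #include <cuda_runtime.h>

  // Exchange IPC handles between two ranks on the same node, then map the peer's buffer.
  void *map_peer_buffer(void *myBuf, int peer_rank)
  {
      cudaIpcMemHandle_t myIpc, peerIpc;
      void *peerBuf = NULL;

      cudaIpcGetMemHandle(&myIpc, myBuf);                   // handle for my device allocation

      // Swap handles with the peer; both ranks must share the node.
      MPI_Sendrecv(&myIpc,  sizeof(myIpc),  MPI_BYTE, peer_rank, 0,
                   &peerIpc, sizeof(peerIpc), MPI_BYTE, peer_rank, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      // Map the peer's allocation into this process's address space.
      cudaIpcOpenMemHandle(&peerBuf, peerIpc, cudaIpcMemLazyEnablePeerAccess);
      return peerBuf;                                       // close later with cudaIpcCloseMemHandle()
  }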

EXAMPLE JACOBI
Top/Bottom Halo

// Open their IPC handle onto a pointer
cudaIpcOpenMemHandle((void **) &a_new[top], topIpc, cudaIpcMemLazyEnablePeerAccess);
cudaCheckError();

cudaMemcpyAsync(a_new[top] + (iy_end[top]*nx),
                a_new[dev_id] + iy_start[dev_id]*nx, nx*sizeof(real), ...);


EXAMPLE JACOBI
Top/Bottom Halo

// (1) open the top neighbor's handle, then push my first computed row into its halo
cudaIpcOpenMemHandle((void **) &a_new[top], topIpc, cudaIpcMemLazyEnablePeerAccess);
cudaCheckError();
cudaMemcpyAsync(a_new[top] + (iy_end[top]*nx),
                a_new[dev_id] + iy_start[dev_id]*nx, nx*sizeof(real), ...);

// (2) open the bottom neighbor's handle, then push my last computed row into its halo
cudaIpcOpenMemHandle((void **) &a_new[bottom], bottomIpc, cudaIpcMemLazyEnablePeerAccess);
cudaCheckError();
cudaMemcpyAsync(a_new[bottom],
                a_new[dev_id] + (iy_end[dev_id]-1)*nx, nx*sizeof(real), ...);

GPU TO GPU COMMUNICATION
CUDA aware MPI
  Functionally portable
  OpenACC/OpenMP interoperable
  Performance may vary with on/off node, socket, and HW support for GPUDirect
  WARNING: Unified memory support varies wildly between implementations!
Single-process, multi-GPU
  Enable peer access for straightforward on-node transfers
Multi-process, single-GPU
  Pass CUDA IPC handles for on-node copies
Combine for more flexibility/complexity!

SPECTRUM MPI & JSRUN TIPS AND TRICKS

UNDER THE HOOD
Summit has fat nodes!
Many connections
Many devices
Many stacks

ESSENTIAL TOOLS

OLCF JSRUN VISUALIZER
For (most of) your layout needs.

JSRUN/SMPI GPU OPTIONS
To enable CUDA aware MPI, use jsrun --smpiargs "-gpu"
To run GPU code without MPI, use jsrun --smpiargs "off"

PROFILING MPI CUDA APPLICATIONS
Using nvprof and NVVP

New since CUDA 9: embed the MPI rank in the output filename, process name, and context name (OpenMPI):

jsrun <args> nvprof --output-profile profile.%q{OMPI_COMM_WORLD_RANK} \
  --process-name "rank %q{OMPI_COMM_WORLD_RANK}" \
  --context-name "rank %q{OMPI_COMM_WORLD_RANK}" \
  --annotate-mpi openmpi

Alternatives:
  Only save the textual output (--log-file)
  MVAPICH2: MV2_COMM_WORLD_RANK and --annotate-mpi mpich
  Collect data from all processes that run on a node (--profile-all-processes)

PROFILING MPI CUDA APPLICATIONS
Using nvprof and NVVP
[Screenshot: the resulting per-rank profiles loaded into NVVP.]

PROFILING NVLINK USAGE
Using nvprof and NVVP

Run nvprof multiple times to collect metrics:

jsrun <args> nvprof --output-profile profile.<metric>.%q{OMPI_COMM_WORLD_RANK} \
  --aggregate-mode off --event-collection-mode continuous \
  --metrics <metric> -f

Use --query-metrics and --query-events for the full list of metrics (-m) or events (-e).
Combine with an MPI-annotated timeline file for the full picture.

PROFILING NVLINK USAGE
Using nvprof and NVVP
[Screenshots: NVLink metric views in NVVP, shown across three slides.]

EXAMPLES

SIMPLE MPI PING-PONG CODE

start = MPI_Wtime();
for (i = 0; i < NLOOPS; i++) {
  send_func(cubuf, buf, nbytes, 1, 1000 + i);
  recv_func(cubuf, buf, nbytes, 1, 2000 + i);
}
stop = MPI_Wtime();

void
stagedSend(void *cubuf, void *hostbuf, size_t nbytes, int dest, int tag)
{
  cudaMemcpy(hostbuf, cubuf, nbytes, cudaMemcpyDeviceToHost); cudaCheckError();
  MPI_Send(hostbuf, nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
}

void
nakedSend(void *cubuf, void *hostbuf, size_t nbytes, int dest, int tag)
{
  MPI_Send(cubuf, nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
}
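As a usage note built from the variables already in the timed loop above (and assuming <stdio.h>), the observed bandwidth can be reported at the end; the factor of 2.0 accounts for one send plus one receive per iteration:

  double seconds   = stop - start;
  double gigabytes = 2.0 * (double) nbytes * NLOOPS / 1e9;
  printf("ping-pong bandwidth: %.2f GB/s\n", gigabytes / seconds);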

ON SOCKET TRANSFERS
jsrun -n 1 -c 2 -g 2 -a 2 -d packed -b packed:1 [--smpiargs "-gpu"]

WHAT DOES DATA MOVEMENT LOOK LIKE?
NVLinks provide alternate paths.
[Diagram: three on-node paths between GPUs: staged through the host; CUDA Aware MPI with CUDA IPC; CUDA Aware MPI without CUDA IPC.]

ON SOCKET TRANSFERS
jsrun -n 1 -c 2 -g 2 -a 2 -d packed -b packed:1 [--smpiargs "-gpu"]

OFF SOCKET, ON NODE TRANSFERS
export CUDA_VISIBLE_DEVICES=0,3
jsrun -n 1 -c 42 -g 6 -a 2 -d packed -b packed:21 [--smpiargs "-gpu"]

OFF NODE TRANSFERS
jsrun -n 2 -c 42 -g 6 -a 1 -d packed -b packed:42 [--smpiargs "-gpu"]

KNOWN ISSUES
Things to watch out for

No CUDA IPC across resource sets:
  [1]Error opening IPC Memhandle from peer:0, invalid argument
One WAR: set PAMI_DISABLE_IPC=1
One (more complicated) WAR: bsub -step_cgroup n and swizzle CUDA_VISIBLE_DEVICES ([0,1,2] & [1,0,2] & [2,1,0])
Avoid CUDA Managed Memory or MPI Derived Types in GPU sends!

CLOSING SUMMARY

MULTI-GPU APPROACHES
Choosing an approach
Single-Threaded, Multiple-GPUs – Requires additional loops to manage devices; likely undesirable.
Multi-Threaded, Multiple-GPUs – Very convenient: set and forget the device. Could possibly conflict with existing threading.
Multiple-Ranks, Single-GPU each – Probably the simplest if you already have MPI; the decomposition is already done. Must get your MPI placement correct.
Multiple-Ranks, Multiple-GPUs – Can allow all GPUs to share common data structures. Only do this if you absolutely need to; it is difficult to get right.

GPU TO GPU COMMUNICATION
CUDA aware MPI
  Functionally portable
  OpenACC/OpenMP interoperable
  Performance may vary with on/off node, socket, and HW support for GPUDirect
  WARNING: Unified memory support varies wildly between implementations!
Single-process, multi-GPU
  Enable peer access for straightforward on-node transfers
Multi-process, single-GPU
  Pass CUDA IPC handles for on-node copies
Combine for more flexibility/complexity!

ESSENTIAL TOOLS AND TRICKS
Pick your on-node layout with the OLCF jsrun visualizer.
Select MPI/GPU interaction with jsrun --smpiargs: "-gpu" for CUDA aware MPI, "off" for pure GPU code without MPI.
Profile MPI and NVLinks with nvprof.
Good performance will require experimentation!

