Introduction To GPU Computing


INTRODUCTION TO GPU COMPUTING

ADD GPUs: ACCELERATE SCIENCE APPLICATIONS
[Diagram] CPU + GPU

ACCELERATED COMPUTING IS GROWING RAPIDLY
11x GPU Developers: from 45,000 (2012) to 615,000 (2017)
485 Applications Accelerated
Available Everywhere: 730M CUDA-Enabled GPUs
2,200 Universities Teaching CUDA
[Charts: developer and application growth, 2011-2017]

SMALL CHANGES, BIG SPEED-UP
[Diagram] Application code: the compute-intensive functions (about 5% of the code) run on the GPU; the rest of the sequential code runs on the CPU

3 WAYS TO ACCELERATE APPLICATIONS
Libraries ("Drop-in" Acceleration) | Compiler Directives (Easily Accelerate Applications) | Programming Languages (Maximum Flexibility)


LIBRARIES: EASY, HIGH-QUALITY ACCELERATION
EASE OF USE: Using libraries enables GPU acceleration without in-depth knowledge of GPU programming
"DROP-IN": Many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes
QUALITY: Libraries offer high-quality implementations of functions encountered in a broad range of applications
PERFORMANCE: NVIDIA libraries are tuned by experts

GPU-ACCELERATED LIBRARIES
"Drop-in" Acceleration for Your Applications
DEEP LEARNING: cuDNN, TensorRT, DeepStream SDK
LINEAR ALGEBRA: cuBLAS, cuSPARSE, cuSOLVER
SIGNAL, IMAGE & VIDEO: cuFFT, NVIDIA NPP, CODEC SDK
PARALLEL ALGORITHMS: nvGRAPH, NCCL, cuRAND
MATH: CUDA Math library

3 STEPS TO CUDA-ACCELERATED APPLICATION
Step 1: Substitute library calls with equivalent CUDA library calls: saxpy() becomes cublasSaxpy()
Step 2: Manage data locality; with CUDA: cudaMalloc(), cudaMemcpy(), etc.; with cuBLAS: cublasAlloc(), cublasSetVector(), etc.
Step 3: Rebuild and link the CUDA-accelerated library: gcc myobj.o -l cublas

DROP-IN ACCELERATION (STEP 1)
int N = 1<<20;

// Perform SAXPY on 1M elements: y[] = a*x[] + y[]
saxpy(N, 2.0, x, 1, y, 1);

DROP-IN ACCELERATION (STEP 1)
int N = 1<<20;

// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);    // Add "cublas" prefix and use device variables

DROP-IN ACCELERATION (STEP 2)
int N = 1<<20;

cublasInit();                            // Initialize cuBLAS

// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

cublasShutdown();                        // Shut down cuBLAS

DROP-IN ACCELERATION (STEP 3)
int N = 1<<20;

cublasInit();
cublasAlloc(N, sizeof(float), (void**)&d_x);    // Allocate device vectors
cublasAlloc(N, sizeof(float), (void**)&d_y);

// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

cublasFree(d_x);                                // Deallocate device vectors
cublasFree(d_y);
cublasShutdown();

DROP-IN ACCELERATION (STEP 4)
int N = 1<<20;

cublasInit();
cublasAlloc(N, sizeof(float), (void**)&d_x);
cublasAlloc(N, sizeof(float), (void**)&d_y);

cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);    // Transfer data to GPU
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);    // Read data back from GPU

cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();

ACCELERATING OCTAVE
Scientific Programming Language
Mathematics-oriented syntax
Drop-in compatible with many MATLAB scripts
Built-in plotting and visualization tools
Runs on GNU/Linux, macOS, BSD, and Windows
Free Software
Source: http://www.gnu.org/software/octave/

NVBLAS
Drop-in GPU Acceleration
Routine | Types   | Operation
gemm    | S,D,C,Z | Multiplication of 2 matrices
syrk    | S,D,C,Z | Symmetric rank-k update
herk    | C,Z     | Hermitian rank-k update
syr2k   | S,D,C,Z | Symmetric rank-2k update
her2k   | C,Z     | Hermitian rank-2k update
trsm    | S,D,C,Z | Triangular solve, multiple right-hand sides
trmm    | S,D,C,Z | Triangular matrix-matrix multiply
symm    | S,D,C,Z | Symmetric matrix-matrix multiply
hemm    | C,Z     | Hermitian matrix-matrix multiply
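To make the "drop-in" idea concrete, here is a minimal C sketch (not from the original slides) of an ordinary BLAS call that NVBLAS is designed to intercept at run time; the matrix size and setup are illustrative assumptions, and in practice the program is run with libnvblas.so preloaded and an nvblas.conf that names a fallback CPU BLAS.

/* Hypothetical sketch: plain host-side SGEMM, no CUDA code in the application. */
#include <stdlib.h>

/* Fortran-style BLAS symbol provided by any standard BLAS (and intercepted by NVBLAS). */
extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

int main(void) {
    const int n = 4096;                       /* illustrative size */
    const float alpha = 1.0f, beta = 0.0f;
    float *A = calloc((size_t)n * n, sizeof(float));
    float *B = calloc((size_t)n * n, sizeof(float));
    float *C = calloc((size_t)n * n, sizeof(float));

    /* C = alpha*A*B + beta*C; with NVBLAS preloaded this is offloaded to the GPU,
       otherwise the CPU BLAS named in nvblas.conf handles it. */
    sgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    free(A); free(B); free(C);
    return 0;
}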

PERFORMANCE COMPARISON
CPU (OpenBLAS) vs GPU (NVBLAS)
Dell C4130, 128 GB, 36-core E5-2697 v4 @ 2.30GHz, 4x NVIDIA Tesla P100-SXM2 NVLink
[Charts] SGEMM and DGEMM throughput (GFLOPS) for N = 2048, 4096, 8192: CPU vs GPU with the NVBLAS library

3 WAYS TO ACCELERATE APPLICATIONS
Libraries ("Drop-in" Acceleration) | Compiler Directives (Easily Accelerate Applications) | Programming Languages (Maximum Flexibility)

OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and GPUs for HPC.

Add Simple Compiler Directive:
main()
{
  <serial code>
  #pragma acc kernels
  {
    <parallel code>
  }
}

TOP HPC APPS ADOPTING OPENACC
OpenACC: Performance Portability and Ease of Programming
ANSYS Fluent (R18.0 Radiation Solver), VASP, Gaussian, ORB5; 3 of the Top 10 HPC applications, 5 ORNL CAAR codes, 5 CSCS codes
CPU: (Haswell EP) Intel Xeon E5-2695 v3 @ 2.30GHz, 2 sockets, 28 cores; GPU: Tesla K80 12+12 GB, Driver 346.46

[Application speedups reported with OpenACC]
CFD: 12x speedup in 1 week
Medical imaging: 10x faster kernels, 2x faster app
40 days to 2 hours; 3x speedup
NekCEM (computational electromagnetics): 2.5x speedup, 60% less energy
Astrophysics: 40x speedup, 3x energy efficiency
4x speedup with a single CPU/GPU code
4.4x speedup with 4 weeks of effort

2 BASIC STEPS TO GET STARTED
Step 1: Annotate source code with directives:
!$acc data copy(util1,util2,util3) copyin(ip,scp2,scp2i)
!$acc parallel loop
...
!$acc end parallel
!$acc end data
Step 2: Compile & run:
pgf90 -ta=nvidia -Minfo=accel file.f

OpenACC DIRECTIVES EXAMPLE
!$acc data copy(A,Anew)                          ! Copy arrays into GPU memory within data region
iter = 0
do while ( err > tol .and. iter < iter_max )
  iter = iter + 1
  err = 0._fp_kind
!$acc kernels                                    ! Parallelize code inside region
  do j = 1,m
    do i = 1,n
      Anew(i,j) = .25_fp_kind * ( A(i+1,j) + A(i-1,j) + &
                                  A(i,j-1) + A(i,j+1) )
      err = max( err, Anew(i,j) - A(i,j) )
    end do
  end do
!$acc end kernels                                ! Close off parallel region
  IF (mod(iter,100) == 0 .or. iter == 1) print *, iter, err
  A = Anew
end do
!$acc end data                                   ! Close off data region, copy data back

HETEROGENEOUS ARCHITECTURES
Unified Memory
[Diagram] GPU 0, GPU 1 and GPU 2 (each with its own memory) and the CPU (with system memory) share a single unified address space
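As a concrete illustration of the unified-memory model sketched above (not from the original slides), the following minimal CUDA C sketch allocates one managed buffer that both the CPU and the GPU access through the same pointer; the kernel, sizes and launch configuration are illustrative assumptions.

// Minimal unified-memory sketch (illustrative).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;                       // device writes through the managed pointer
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // one pointer, visible to CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // host initializes the same allocation

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                       // wait before the host reads results

    printf("data[0] = %f\n", data[0]);             // prints 2.000000
    cudaFree(data);
    return 0;
}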

OPENACC FOR EVERYONE
New PGI Community Edition Now Available: FREE
PROGRAMMING MODELS: OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran compilers and tools
PLATFORMS: x86, OpenPOWER, NVIDIA GPU
UPDATES: 1-2 times a year (Community Edition); 6-9 times a year (other editions)
SUPPORT: user forums (Community Edition); PGI support (other editions)

RESOURCES
FREE compiler, success stories, guides, tutorials, videos, courses, code samples, talks, books, specification, teaching materials, Slack & StackOverflow
Success stories: https://www.openacc.org/success-stories
Resources: https://www.openacc.org/resources
Free Compiler: https://www.pgroup.com/products/community.htm

CUDA PROGRAMMING LANGUAGES

GPU PROGRAMMING LANGUAGES
Numerical analytics: MATLAB, Mathematica, LabVIEW, Octave
Fortran: CUDA Fortran, OpenACC
C, C++: CUDA C/C++, OpenACC
Python: CUDA Python, PyCUDA, Numba, Pyculib
C#: Altimesh Hybridizer, Alea GPU
Other: R, Julia

CUDA C
Standard C Code:
void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_serial(4096*256, 2.0, x, y);

Parallel C Code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_parallel<<<4096,256>>>(n, 2.0, x, y);

http://developer.nvidia.com/cuda-toolkit

CUDA C++: DEVELOP GENERIC PARALLEL CODE
CUDA C++ features enable sophisticated and flexible applications and middleware:
Class hierarchies, __device__ methods, templates, operator overloading, functors (function objects), device-side new/delete, and more
http://developer.nvidia.com/cuda-toolkit

template <typename T>
struct Functor {
  __device__ Functor(T _a) : a(_a) {}
  __device__ T operator()(T x) { return a*x; }
  T a;
};

template <typename T, typename Oper>
__global__ void kernel(T *output, int n) {
  Oper op(3.7);
  output = new T[n];                              // dynamic allocation
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    output[i] = op(i);                            // apply functor
}

CUDA FORTRAN
Program the GPU using Fortran, a key language for HPC
Simple language extensions: kernel functions, thread/block IDs, device & data management, parallel loop directives
Familiar syntax: use allocate/deallocate; copy CPU-to-GPU with assignment (=)
http://developer.nvidia.com/cuda-fortran

module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) y(i) = a*x(i) + y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real :: y(2**20)
  real, device :: x_d(2**20), y_d(2**20)
  x_d = 1.0; y_d = 2.0
  call saxpy<<<4096,256>>>(2**20, 3.0, x_d, y_d)
  y = y_d
  write(*,*) 'max error = ', maxval(abs(y-5.0))
end program main

PYTHON
Numba: a just-in-time compiler for Python functions (open source!)
Numba runs inside the standard Python interpreter
Can compile for GPU or CPU
Includes Pyculib

import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def Add(a, b):
    return a + b

# Initialize arrays
N = 100000
A = np.ones(N, dtype=np.float32)
B = np.ones(A.shape, dtype=A.dtype)
C = np.empty_like(A, dtype=A.dtype)

# Add arrays on GPU
C = Add(A, B)

PYTHON - PYCULIB
Python interface to CUDA libraries: cuBLAS (dense linear algebra), cuFFT (Fast Fourier Transform), and cuRAND (random number generation)
The code below generates 100,000 uniformly distributed random numbers on the GPU using the "XORWOW" pseudorandom number generator

import numpy as np
from pyculib import rand as curand

prng = curand.PRNG(rndtype=curand.PRNG.XORWOW)
rand = np.empty(100000)
prng.uniform(rand)
print(rand[:10])

JULIA
Up-and-coming scientific language
A cross between Python and Matlab
Interpreted (like Python) or compiled (like C/Fortran)
New approach to multi-processing/multi-node, or use MPI
Easy to combine with other languages
Works with Jupyter Notebooks!

JULIA - SIMPLE EXAMPLE
Simple matrix multiplication example with 64-bit integers (Int64)
Can also do elementwise multiplication (just like Matlab): A .* B

julia> A = [1 2 ; 3 4]
2x2 Array{Int64,2}:
 1  2
 3  4

julia> B = [10 11 ; 12 13]
2x2 Array{Int64,2}:
 10  11
 12  13

julia> A * B
2x2 Array{Int64,2}:
 34  37
 78  85

JULIA - GPU EXAMPLE
Options: JuliaGPU (github), native CUDA (new), GPUArrays
Simple native GPU example:

using CUDAdrv, CUDAnative

function kernel_vadd(a, b, c)
    # from CUDAnative: (implicit) CuDeviceArray type, and thread/block intrinsics
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

# generate some data
len = 512
a = rand(Int, len)
b = rand(Int, len)

# allocate & upload on the GPU
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)

# execute and fetch results
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)    # from CUDAnative.jl
c = Array(d_c)

using Base.Test
@test c == a + b

destroy(ctx)

JULIA - GPU EXAMPLE
GPUArrays example: convolution

using GPUArrays, Colors, FileIO, ImageFiltering
using CLArrays
using GPUArrays: synchronize_threads
import GPUArrays: LocalMemory

img = load(...)          # image file (path garbled in the original)
a = CLArray(img)
out = similar(a)
k = ...                  # convolution kernel (garbled in the original)
c = similar(img)

convolution!(a, out, k)
Array(out)

outc = similar(img)
copy!(outc, out)

R
Very popular statistics language
Used heavily in Machine Learning
gpuR package

R - GPUR
gpuR package
Simple integer addition of two vectors with 1,000 values

A <- seq.int(from = 0, to = 999)
B <- seq.int(from = 1000, to = 1)
gpuA <- gpuVector(A)
gpuB <- gpuVector(B)

C <- A + B
gpuC <- gpuA + gpuB

all(C == gpuC)

MATLAB
Native support for most operations/functions
The next speaker will cover this

GET STARTED TODAY
These languages are supported on all CUDA-capable GPUs. You might already have a CUDA-capable GPU in your laptop or desktop PC!
CUDA C/C++, CUDA Python (http://developer.nvidia.com/how-to-cuda-python), Thrust C++ Template Library, and CUDA OpenCL support

THANK YOU
developer.nvidia.com

SIX WAYS TO SAXPY
Programming Languages for GPU Computing

SINGLE PRECISION ALPHA X PLUS Y (SAXPY)
Part of the Basic Linear Algebra Subroutines (BLAS) library
z = αx + y, where x, y, z are vectors and α is a scalar
GPU SAXPY in multiple languages and libraries: a menagerie* of possibilities, not a tutorial
*technically, a program chrestomathy: http://en.wikipedia.org/wiki/Chrestomathy

OpenACC COMPILER DIRECTIVES
Parallel C Code:
void saxpy(int n, float a, float *x, float *y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);

Parallel Fortran Code:
subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i = 1,n
    y(i) = a*x(i) + y(i)
  enddo
!$acc end kernels
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)

http://developer.nvidia.com/openacc or http://openacc.org

cuBLAS LIBRARY
Serial BLAS Code:
int N = 1<<20;
...
// Use your choice of BLAS library

// Perform SAXPY on 1M elements
blas_saxpy(N, 2.0, x, 1, y, 1);

Parallel cuBLAS Code:
int N = 1<<20;
...
cublasInit();
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

// Perform SAXPY on 1M elements
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasShutdown();

You can also call cuBLAS from Fortran, C++, Python, and other languages
http://developer.nvidia.com/cublas

CUDA C
Standard C:
void saxpy(int n, float a,
           float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

int N = 1<<20;

// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);

Parallel C (CUDA):
__global__
void saxpy(int n, float a,
           float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int N = 1<<20;
cudaMemcpy(d_x, x, N, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N, cudaMemcpyHostToDevice);

// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);

cudaMemcpy(y, d_y, N, cudaMemcpyDeviceToHost);

http://developer.nvidia.com/cuda-toolkit

THRUST C++ TEMPLATE LIBRARY
Serial C++ Code with STL and Boost:
int N = 1<<20;
std::vector<float> x(N), y(N);
...

// Perform SAXPY on 1M elements
std::transform(x.begin(), x.end(),
               y.begin(), y.begin(),
               2.0f * _1 + _2);

www.boost.org/libs/lambda

Parallel C++ Code:
int N = 1<<20;
thrust::host_vector<float> x(N), y(N);
...
thrust::device_vector<float> d_x = x;
thrust::device_vector<float> d_y = y;

// Perform SAXPY on 1M elements
thrust::transform(d_x.begin(), d_x.end(),
                  d_y.begin(), d_y.begin(),
                  2.0f * _1 + _2);

http://thrust.github.com

CUDA FORTRAN
Standard Fortran:
module mymodule
contains
  subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    do i = 1,n
      y(i) = a*x(i) + y(i)
    enddo
  end subroutine saxpy
end module mymodule

program main
  use mymodule
  real :: x(2**20), y(2**20)
  x = 1.0; y = 2.0
  ! Perform SAXPY on 1M elements
  call saxpy(2**20, 2.0, x, y)
end program main

Parallel Fortran (CUDA Fortran):
module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) y(i) = a*x(i) + y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real, device :: x_d(2**20), y_d(2**20)
  x_d = 1.0; y_d = 2.0
  ! Perform SAXPY on 1M elements
  call saxpy<<<4096,256>>>(2**20, 2.0, x_d, y_d)
end program main

http://developer.nvidia.com/cuda-fortran

PYTHON
Standard Python:
import numpy as np

def saxpy(a, x, y):
    return [a * xi + yi
            for xi, yi in zip(x, y)]

x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)

cpu_result = saxpy(2.0, x, y)

http://numpy.scipy.org

Numba Parallel Python:
import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def saxpy(a, x, y):
    return a * x + y

N = 1048576

# Initialize arrays
A = np.ones(N, dtype=np.float32)
B = np.ones(A.shape, dtype=A.dtype)
C = np.empty_like(A, dtype=A.dtype)

# Run SAXPY on GPU
C = saxpy(2.0, A, B)

https://numba.pydata.org

ENABLING ENDLESS WAYS TO SAXPY
CUDA C, C++, Fortran feed the LLVM compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs
CUDA compiler contributed to open-source LLVM
New language support: build front-ends for Java, Python, R, DSLs
New processor support: target other processors like ARM, FPGA, GPUs, x86

GPU-ACCELERATED LIBRARIES

cuBLAS
Dense Linear Algebra on GPUs
Up to 5x faster DeepBench SGEMM than CPU
Complete BLAS library plus extensions:
Supports all 152 standard routines for single, double, complex, and double complex
Supports half-precision (FP16) and integer (INT8) matrix multiplication operations
Batched routines for higher performance on small problem sizes
Host- and device-callable interface
XT interface supports distributed computations across multiple GPUs
https://developer.nvidia.com/cublas
Benchmark config: CUDA 8 (cuBLAS 8.0.88); Driver 375.66; P100 (PCIe, 16GB, base clocks), ECC off; host system: dual Intel Xeon Broadwell E5-2690 v4 with Ubuntu 14.04.5 and 256GB DDR4 memory; MKL 2017.3, Compiler v17.0.4; FP32 input, output and compute; CPU system: dual Intel Xeon Broadwell E5-2699 v4 (Turbo enabled) with Ubuntu 14.04.5 and 256GB DDR4 memory
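For reference (not from the original slides), a minimal host-side sketch of a cuBLAS matrix multiply using the modern handle-based (v2) API rather than the legacy cublasInit() interface shown earlier; the matrix size and the assumption that the operands already live in device memory are illustrative.

// Hedged sketch: single-precision GEMM with the cuBLAS v2 API.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_example(const float *d_A, const float *d_B, float *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; all matrices are n x n, column-major,
    // and already resident in device memory.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n,
                        d_B, n,
                &beta,  d_C, n);

    cublasDestroy(handle);
}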

cuFFT
Complete Fast Fourier Transforms Library
2x faster image & signal processing than CUDA 8
Complete multi-dimensional FFT library:
"Drop-in" replacement for the CPU FFTW library
Real and complex, single- and double-precision data types
Includes 1D, 2D and 3D batched transforms
Support for half-precision (FP16) data types
Supports flexible input and output data layouts
XT interface now supports up to 8 GPUs
[Chart] Speedup vs. CUDA 8 for 1D, 2D and 3D transforms across data sizes (up to about 2.5x)
Benchmark config: V100 and CUDA 9 (r384); Intel Xeon Broadwell, dual socket, E5-2698 v4 @ 2.6GHz, 3.5GHz Turbo, Ubuntu 14.04.5 x86_64 with 128GB system memory; P100 and CUDA 8 (r361); for cuBLAS CUDA 8 (r361): Intel Xeon Haswell, single socket, 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo, CentOS 7.2 x86-64 with 128GB system memory
https://developer.nvidia.com/cufft
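For reference (not from the original slides), a minimal sketch of a 1D complex-to-complex transform with cuFFT; the signal length, in-place execution and omitted data initialization are illustrative assumptions.

// Hedged sketch: 1M-point single-precision C2C FFT on the GPU.
#include <cufft.h>
#include <cuda_runtime.h>

void fft_example(void) {
    const int n = 1 << 20;                       // transform length (illustrative)
    cufftComplex *d_signal;
    cudaMalloc((void**)&d_signal, n * sizeof(cufftComplex));
    // ... fill d_signal with input data, e.g. cudaMemcpy from the host ...

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);         // one 1D transform of length n

    cufftExecC2C(plan, d_signal, d_signal,       // forward transform, in place
                 CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_signal);
}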

NPP
NVIDIA Performance Primitives Library
GPU-accelerated building blocks for image, video processing & computer vision
Over 2,500 image, signal processing and computer vision routines:
Color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, building blocks, image segmentation, median filter, BGR/YUV conversion, 3D LUT ...
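For reference (not from the original slides), a minimal sketch of one NPP signal-processing primitive: element-wise addition of two float buffers already resident on the GPU; the buffer length and error handling are illustrative.

// Hedged sketch: element-wise addition with an NPP signal primitive.
#include <npps.h>
#include <cuda_runtime.h>

void npp_add_example(const Npp32f *d_a, const Npp32f *d_b, Npp32f *d_sum, int n) {
    // d_a, d_b and d_sum are device pointers holding at least n floats each;
    // nppsAdd_32f computes d_sum[i] = d_a[i] + d_b[i] on the GPU.
    NppStatus status = nppsAdd_32f(d_a, d_b, d_sum, n);
    if (status != NPP_SUCCESS) {
        // illustrative: real code would report or propagate the error
    }
}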

