Languages, APIs And Development Tools For GPU Computing - Nvidia


Languages, APIs and Development Tools for GPU Computing
Will Ramey, Sr. Product Manager for GPU Computing
San Jose Convention Center, CA | September 20–23, 2010

“GPU Computing”: using all processors in the system for the things they are best at doing
— Evolution of CPUs makes them good at sequential, serial tasks
— Evolution of GPUs makes them good at parallel processing

GPU Computing Ecosystem
Languages & APIs (CUDA C, Fortran, DirectX, etc.) · Libraries · Mathematical packages · Integrated development environments (Parallel Nsight for MS Visual Studio) · Tools & partners · Consultants, training & certification · Research & education
All major platforms
NVIDIA Confidential

CUDA - NVIDIA's Architecture for GPU Computing

Broad adoption:
— Over 250M CUDA-enabled GPUs installed
— Over 650K CUDA Toolkit downloads in the last 2 years
— Over 100K developers; running in production since 2008
— 350 universities teaching GPU computing on the CUDA architecture
— Windows, Linux and Mac OS platforms supported; GPU computing applications span HPC to consumer

On the NVIDIA GPU with the CUDA parallel computing architecture:
— CUDA C/C++: SDK, libraries, Visual Profiler and debugger
— OpenCL: commercial OpenCL conformant driver, publicly available across all CUDA architecture GPUs; SDK, Visual Profiler
— DirectCompute: Microsoft API for GPU computing; supports all CUDA architecture GPUs (DX10 and DX11)
— Fortran: PGI Accelerator, PGI CUDA Fortran
— Python, Java, .NET: PyCUDA, GPU.NET, jCUDA

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

GPU Computing Software Stack
1. Your GPU computing application
2. Application Acceleration Engines (AXEs): middleware, modules & plug-ins
3. Foundation libraries: low-level functional libraries
4. Development environment: languages, device APIs, compilers, debuggers, profilers, etc.
5. CUDA Architecture

Languages & APIs
© NVIDIA Corporation 2010

Many Different Approaches
— Application-level integration
— High-level, implicit parallel languages
— Abstraction layers & API wrappers
— High-level, explicit language integration
— Low-level device APIs

GPUs for MathWorks Parallel Computing Toolbox and Distributed Computing Server
— Workstation: MATLAB Parallel Computing Toolbox (PCT)
— Compute cluster: MATLAB Distributed Computing Server (MDCS)
— PCT enables high performance through parallel computing on workstations
— MDCS allows a MATLAB PCT application to be submitted and run on a compute cluster
— NVIDIA GPU acceleration now available for both

MATLAB Performance with Tesla
[Chart: relative execution speed for the Black-Scholes demo, compared to a single-core CPU baseline, at input sizes 256K, 1,024K, 4,096K and 16,384K. Configurations: single-core CPU, quad-core CPU, single-core CPU + Tesla C1060, quad-core CPU + Tesla C1060.]
Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single-precision operations
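For reference, the computation behind this demo is the closed-form Black-Scholes price, evaluated independently for every element of a large input array, which is why it parallelizes so well. A minimal plain-Python sketch of the formula (an illustration only, not the MATLAB demo code):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(s, k, t, r, sigma):
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    return s * norm_cdf(d1) - k * exp(-r * t) * norm_cdf(d2)

# One element of the benchmark's input array: at-the-money call,
# spot 100, strike 100, 1 year, 5% rate, 20% volatility.
price = black_scholes_call(100.0, 100.0, 1.0, 0.05, 0.2)  # ~10.45
```

A GPU version simply maps this per-element function over the whole input array at once.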

PGI Accelerator Compilers

    SUBROUTINE SAXPY (A,X,Y,N)
    INTEGER N
    REAL A,X(N),Y(N)
!$ACC REGION
    DO I = 1, N
      X(I) = A*X(I) + Y(I)
    ENDDO
!$ACC END REGION
    END

The compiler auto-generates the GPU kernel code and host x64 assembly (with calls such as pgi_cu_init, pgi_cu_alloc, pgi_cu_upload, pgi_cu_call and pgi_cu_download), then links everything into a unified a.out. No change to existing makefiles, scripts, IDEs, programming environment, etc.

PyCUDA / PyOpenCL
Slide courtesy of Andreas Klöckner, Brown University
http://mathema.tician.de/software/pycuda

CUDA C: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
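The parallel kernel above can be read as a loop over (block, thread) pairs. A plain-Python sketch of the same index arithmetic, showing how blockIdx/blockDim/threadIdx flatten into one element index (an illustration of the mapping, not GPU code):

```python
def saxpy_reference(n, a, x, y, threads_per_block=256):
    """CPU sketch of the CUDA saxpy_parallel kernel: every (block, thread)
    pair computes one flattened index i, guarded against running past n."""
    nblocks = (n + threads_per_block - 1) // threads_per_block
    out = list(y)
    for block_idx in range(nblocks):                  # blockIdx.x
        for thread_idx in range(threads_per_block):   # threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:                # same bounds check as the kernel
                out[i] = a * x[i] + out[i]
    return out

result = saxpy_reference(3, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# → [12.0, 24.0, 36.0]
```

On the GPU the two loops disappear: each (block, thread) pair runs as its own hardware thread.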

GPU.NET: write GPU kernels in C#, F#, VB.NET, etc.
— Exposes a minimal API accessible from any .NET-based language: learn a new API instead of a new language
— JIT compilation; dynamic language support
— Don't rewrite your existing code: just give it a “touch-up”

OpenCL
— Cross-vendor open standard, managed by the Khronos Group: http://www.khronos.org/opencl
— Low-level API for device management and launching kernels
  — Close-to-the-metal programming interface
  — JIT compilation of kernel programs
— C-based language for compute kernels
  — Kernels must be optimized for each processor architecture
NVIDIA released the first OpenCL conformant driver for Windows and Linux to thousands of developers in June 2009.

DirectCompute
— Microsoft standard for all GPU vendors
  — Released with DirectX 11 / Windows 7
  — Runs on all 100M CUDA-enabled DirectX 10 class GPUs and later
— Low-level API for device management and launching kernels
  — Good integration with DirectX 10 and 11
— Defines HLSL-based language for compute shaders
  — Kernels must be optimized for each processor architecture

Languages & APIs for GPU Computing

Approach                     Examples
Application Integration      MATLAB, Mathematica, LabVIEW
Implicit Parallel Languages  PGI Accelerator, HMPP
Abstraction Layer/Wrapper    PyCUDA, CUDA.NET, jCUDA
Language Integration         CUDA C/C++, PGI CUDA Fortran
Low-level Device API         CUDA C/C++, DirectCompute, OpenCL

Development Tools

Parallel Nsight for Visual Studio
Integrated development for CPU and GPU: build, debug, profile

Windows GPU Development for 2010
NVIDIA Parallel Nsight 1.5, plus: nvcc, cuda-gdb, cuda-memcheck, Visual Profiler, cudaprof, FX Composer, Shader Debugger, PerfHUD, ShaderPerf, Platform Analyzer

4 Flexible GPU Development Configurations
— Desktop, single machine, single NVIDIA GPU: Analyzer, Graphics Inspector
— Desktop, single machine, dual NVIDIA GPUs: Analyzer, Graphics Inspector, Compute Debugger
— Networked, two machines connected over TCP/IP: Analyzer, Graphics Inspector, Compute Debugger, Graphics Debugger
— Workstation SLI, SLI Multi-OS workstation with two Quadro GPUs: Analyzer, Graphics Inspector, Compute Debugger, Graphics Debugger

NVIDIA cuda-gdb: parallel source debugging
— CUDA debugging integrated into GDB on Linux
— Supported on 32-bit and 64-bit systems
— Seamlessly debug both the host/CPU and device/GPU code
— Set breakpoints on any source line or symbol name
— Access and print all CUDA memory allocations: local, global, constant and shared variables
— Included in the CUDA Toolkit

Allinea DDT Debugger: latest news from Allinea
— CUDA SDK 3.0 with DDT 2.6, released June 2010
— Fermi and Tesla support
— cuda-memcheck support for memory errors
— Combined MPI and CUDA support
— Stop-on-kernel-launch feature
— Kernel thread control, evaluation and breakpoints
— Identify thread counts, ranges and CPU/GPU threads easily
— SDK 3.1 in beta with DDT 2.6.1
— SDK 3.2 coming soon: multiple GPU device support

TotalView Debugger: latest from TotalView (in beta)
— Debugging of applications running on the GPU device
— Full visibility of both Linux threads and GPU device threads; device threads shown as part of the parent Unix process
— Correctly handles all the differences between the CPU and GPU
— Fully represents the hierarchical memory: display data at any level (registers, local, block, global or host memory), making it clear where data resides with type qualification
— Thread and block coordinates: built-in runtime variables display threads in a warp, block and thread dimensions and indexes, shown on the interface in the status bar, thread tab and stack frame
— Device thread control: warps advance synchronously
— Handles CUDA function inlining: step into or over inlined functions
— Reports memory access errors (CUDA memcheck)
— Can be used with MPI

NVIDIA Visual Profiler
— Analyze GPU hardware performance signals, kernel occupancy, instruction throughput, and more
— Highly configurable tables and graphical views
— Save/load profiler sessions or export to CSV for later analysis
— Compare results visually across multiple sessions to see improvements
— Windows, Linux and Mac OS X; OpenCL support on Windows and Linux
— Included in the CUDA Toolkit

GPU Computing SDK
Hundreds of code samples for CUDA C, DirectCompute and OpenCL: finance, oil & gas, video/image processing, 3D volume rendering, particle simulations, fluid simulations, math functions

Application Design Patterns
© 2009 NVIDIA Corporation

Trivial Application
Design rules:
— Serial task processing on CPU
— Data-parallel processing on GPU
— Copy input data to GPU
— Perform parallel processing
— Copy results back
Follow guidance in the CUDA C Best Practices Guide.
Stack: the application uses the C runtime on the CPU and the CUDA C Runtime to reach the GPU and GPU memory. The CUDA C Runtime could be substituted with other methods of accessing the GPU, e.g. CUDA Fortran or the driver API.

Basic Application
“Trivial Application” plus:
— Maximize overlap of data transfers and computation
— Minimize communication required between processors
— Use one CPU thread to manage each GPU
Multi-GPU notebook, desktop, workstation and cluster node configurations are increasingly common.
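The “one CPU thread to manage each GPU” rule can be sketched with ordinary host threads. In this hedged illustration the per-device kernel work is simulated on the CPU, and manage_gpu is a hypothetical name; a real application would bind each thread to a CUDA device context and launch kernels from it:

```python
import threading

def manage_gpu(device_id, chunk, results):
    """Stand-in for per-GPU work: in a real app this thread would select
    CUDA device device_id, copy its chunk over, launch kernels, and copy
    results back. Here the 'kernel' is a simulated scale-by-2 on the CPU."""
    results[device_id] = [2.0 * v for v in chunk]

data = [[1.0, 2.0], [3.0, 4.0]]   # one chunk per (hypothetical) GPU
results = {}
threads = [threading.Thread(target=manage_gpu, args=(i, chunk, results))
           for i, chunk in enumerate(data)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # both "GPUs" have finished their chunks
```

Splitting the input across devices this way is what lets transfers and computation on different GPUs overlap.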

Graphics Application
“Basic Application” plus:
— Use graphics interop to avoid unnecessary copies
— In multi-GPU systems, put buffers to be displayed in the memory of the GPU attached to the display
Stack: the application uses the CUDA C Runtime and OpenGL / Direct3D to reach the GPU and GPU memory.

Basic Library
“Basic Application” plus:
— Avoid unnecessary memory transfers
— Use data already in GPU memory
— Create and leave data in GPU memory
These rules apply to plug-ins as well.

Application with Plug-ins
“Basic Application” plus:
— A plug-in manager allows the application and plug-ins to (re)use the same GPU memory
— Multi-GPU aware
— Follow “Basic Library” rules for the plug-ins

Database Application
— Minimize network communication: move analysis “upstream” to stored procedures
— Treat each stored procedure like a “Basic Application”
— The app server could also be a “Basic Application”; the client application is also a “Basic Application”
Typical workloads: data mining, business intelligence, etc.
Stack: the client application or application server connects to the database engine, whose stored procedures use the CUDA C Runtime to reach the GPU and GPU memory.

Multi-GPU Cluster Application
“Basic Application” plus:
— Use shared memory, pthreads, OpenMP, etc. for intra-node communication
— Use MPI to communicate between nodes (MPI over Ethernet, InfiniBand, etc.)
Each cluster node runs the same stack: the application and C runtime on the CPU with CPU memory, and the CUDA C Runtime driving multiple GPUs, each with its own GPU memory.

Libraries

CUFFT 3.2: Improved Radix-3, -5, -7
[Charts: GFLOPS vs. log3(size) for radix-3 transforms, single precision and double precision (ECC off), comparing CUFFT 3.2 and 3.1 on an NVIDIA Tesla C2070 GPU against MKL 10.2.3.029 on a quad-core Intel Core i7 (Nehalem). Radix-5, -7 and mixed-radix improvements not shown.]

CUBLAS Performance
— Up to 2x average speedup over CUBLAS 3.1
— Less variation in performance for different dimensions vs. 3.1
[Chart: speedup vs. MKL for matrix dimensions 1024–7168 (NxN); average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}; CUBLAS 3.2 & 3.1 on NVIDIA Tesla C2050 GPU vs. MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem).]

CULA (LAPACK for Heterogeneous Systems)
GPU-accelerated linear algebra:
— “CULAPACK” library: dense linear algebra, C/C++ & FORTRAN, 150 routines
— MATLAB interface: 15 functions, up to 10x speedup
— Developed in partnership with NVIDIA
— Supercomputer speeds: performance 7x of Intel's MKL LAPACK

CULA Performance: Supercomputing Speeds
[Chart: relative speed of many CULA functions compared to Intel's MKL 10.2. Benchmarks were obtained comparing an NVIDIA Tesla C2050 (Fermi) and an Intel Core i7 860. More at www.culatools.com]

Sparse Matrix Performance: CPU vs. GPU
Multiplication of a sparse matrix by multiple vectors
[Chart: speedup (axis up to 35x), averaged across S, D, C, Z, for “non-transposed” and “transposed” cases; CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU vs. MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem).]
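The operation being benchmarked is sparse matrix-vector multiplication. A minimal plain-Python reference of the CSR-format product that libraries such as CUSPARSE accelerate (an illustration of the data layout, not CUSPARSE's implementation):

```python
def csr_matvec(values, col_indices, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form:
    values/col_indices hold the nonzeros row by row, and
    row_ptr[i]:row_ptr[i+1] delimits row i's slice of both arrays."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_indices[k]]
    return y

# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
values, cols, row_ptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
y = csr_matvec(values, cols, row_ptr, [1.0, 1.0, 1.0])  # → [3.0, 3.0]
```

Each output row is independent, which is what a GPU exploits; multiplying by multiple vectors, as in the benchmark, reuses the same matrix traversal for every vector.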

RNG Performance: CPU vs. GPU
Generating 100K Sobol' samples
[Chart: speedup (axis up to 25x) for uniform and normal distributions, single and double precision; CURAND 3.2 on NVIDIA Tesla C2050 GPU vs. MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem).]
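RNG libraries of this kind expose both uniform and normal generators, and normal variates are commonly produced by transforming uniforms. A plain-Python sketch of the Box-Muller transform, one standard uniform-to-normal technique (an illustration only, not necessarily what CURAND uses internally):

```python
import math
import random

def box_muller(u1, u2):
    """Map two independent uniforms on (0,1] to two independent
    standard normal samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(12345)
samples = []
for _ in range(50_000):              # 100K normal samples total
    u1 = 1.0 - rng.random()          # shift [0,1) to (0,1] to avoid log(0)
    z0, z1 = box_muller(u1, rng.random())
    samples.extend((z0, z1))

mean = sum(samples) / len(samples)                    # should be near 0
var = sum(z * z for z in samples) / len(samples)      # should be near 1
```

Like SAXPY, every pair of samples is independent of every other pair, so the transform maps directly onto GPU threads.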

NAG GPU Library
— Monte Carlo related: L'Ecuyer, Sobol RNGs; distributions, Brownian bridge
— Coming soon: Mersenne Twister RNG; optimization, PDEs
— Seeking input from the community
— For up-to-date information: www.nag.com/numeric/gpus

NVIDIA Performance Primitives (NPP)
— Similar to Intel IPP, focused on image and video processing
— 6x–10x average speedup vs. IPP, across 2800 performance tests
— Now available with the CUDA Toolkit: www.nvidia.com/npp
[Chart: relative aggregate speed of NPP on GeForce 9800 GTX and GeForce GTX 285 vs. IPP on Core2 Duo (1 and 2 threads) and Nehalem (1 and 8 threads).]

OpenVIDIA
— Open source, supported by NVIDIA
— Computer Vision Workbench (CVWB): GPU imaging & computer vision
— Demonstrates most commonly used image processing primitives on CUDA
— Demos, code & net

More Open Source Projects
— Thrust: library of parallel algorithms with a high-level, STL-like interface
— OpenCurrent: C++ library for solving PDEs over regular grids: http://code.google.com/p/opencurrent
— 200 projects on Google Code & SourceForge: search for CUDA, OpenCL, GPGPU

NVIDIA Application Acceleration Engines (AXEs)
— OptiX, ray tracing engine: programmable GPU ray tracing pipeline that greatly accelerates general ray tracing tasks; supports programmable surfaces and custom ray data. (Shown: OptiX shader example.)
— SceniX, scene management engine: high-performance OpenGL scene graph built around CgFX for maximum interactive quality; provides ready access to new GPU capabilities & engines. (Shown: Autodesk Showcase customer example.)
— CompleX, scene scaling engine: distributed GPU rendering for keeping complex scenes interactive as they exceed frame buffer limits; direct support for SceniX, OpenSceneGraph, and more. (Shown: 15GB Visible Human model from N.I.H.)

NVIDIA PhysX: The World's Most Deployed Physics API
— Major PhysX site licensees
— Integrated in major game engines: UE3, Diesel, Gamebryo, Unity 3D, Vision, Hero, Instinct, BigWorld, Trinigy
— Cross-platform support
— Middleware & tool integration: SpeedTree, Natural Motion, Fork Particles, Emotion FX, Max, Maya, XSI

Cluster & Grid Management

GPU Management & Monitoring
NVIDIA Systems Management Interface (nvidia-smi)
— All GPUs: list of GPUs, product ID, GPU utilization, PCI address to device enumeration
— Server products: exclusive-use mode, ECC error count & location (Fermi only), GPU temperature, unit fan speeds, PSU voltage/current, LED state, serial number, firmware version
Use CUDA_VISIBLE_DEVICES to assign GPUs to a process.
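The CUDA_VISIBLE_DEVICES mechanism works by restricting which physical GPUs the CUDA runtime in a process can see; the visible devices are renumbered from 0 inside that process. A small sketch of setting it for a child process (the device IDs here are hypothetical, and no GPU is touched):

```python
import os
import subprocess
import sys

# Expose only physical GPUs 1 and 3 to the child process; inside it,
# the CUDA runtime would enumerate them as devices 0 and 1.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="1,3")
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())  # → 1,3
```

The variable must be set before the CUDA runtime initializes in the target process, which is why a job scheduler typically injects it into each task's environment rather than setting it after launch.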

Bright Cluster Manager: Most Advanced Cluster Management Solution for GPU Clusters
Includes:
— NVIDIA CUDA, OpenCL libraries and GPU drivers
— Automatic sampling of all available NVIDIA GPU metrics
— Flexible graphing of GPU metrics against time
— Visualization of GPU metrics in Rackview
— Powerful cluster automation: setting alerts, alarms and actions when GPU metrics exceed set thresholds
— Health-checking framework based on GPU metrics
— Support for all Tesla GPU cards and GPU Computing Systems, including the most recent “Fermi” models

Symphony Architecture and GPU
[Architecture diagram: client applications (C++, C#, Java, .NET and COM/Excel spreadsheet APIs) submit work through the Symphony service director and repository service on the management hosts; GPU-aware service instances run on x64 compute hosts with GPU support (dual quad-core CPUs, GPU 1 / GPU 2), each host running a session manager and service instance manager with the CUDA libraries; EGO provides the resource-aware orchestration layer.]
Copyright 2010 Platform Computing Corporation. All Rights Reserved.

Selecting GPGPU Nodes

Developer Resources

NVIDIA Developer Resources: http://developer.nvidia.com

Development tools:
— CUDA Toolkit: complete GPU computing development kit
— cuda-gdb: GPU hardware debugging
— Visual Profiler: GPU hardware profiler for CUDA C and OpenCL
— Parallel Nsight: integrated development environment for Visual Studio
— NVPerfKit: OpenGL & D3D performance tools
— FX Composer: shader authoring IDE

SDKs and code samples:
— GPU Computing SDK: CUDA C, OpenCL, DirectCompute code samples and documentation
— Graphics SDK: DirectX & OpenGL code samples
— PhysX SDK: complete game physics solution
— OpenAutomate: SDK for test automation

Video libraries:
— Video decode acceleration: NVCUVID / DXVA / Win7 MFT
— Video encode acceleration: NVCUVENC / Win7 MFT
— Post processing: noise reduction / de-interlace / polyphase scaling / color processing

Engines & libraries:
— Math libraries: CUFFT, CUBLAS, CUSPARSE, CURAND, NPP
— Image libraries: performance primitives for imaging
— App Acceleration Engines: optimized software modules for GPU acceleration
— Shader library: shader and post processing
— Optimization guides: best practices for GPU computing and graphics development

10 published books: 4 in Japanese, 3 in English, 2 in Chinese, 1 in Russian

Google Scholar

GPU Computing Research & Education
NV Research: http://research.nvidia.com

World-class research, leadership and teaching: University of Cambridge, Harvard University, University of Utah, University of Tennessee, University of Maryland, University of Illinois at Urbana-Champaign, Tsinghua University, Tokyo Institute of Technology, Chinese Academy of Sciences, National Taiwan University

Proven research vision: launched June 1st with 5 premiere centers and more in review.

Quality GPGPU teaching: launched June 1st with 7 premiere centers and more in review: Johns Hopkins University (USA), Nanyang University (Singapore), Technical University of Ostrava (Czech Republic), CSIRO (Australia), SINTEF (Norway), McMaster University (Canada), Potsdam (USA), UNC-Charlotte (USA), Cal Poly San Luis Obispo (USA), ITESM (Mexico), Czech Technical University in Prague (Czech Republic), Qingdao University (China)

Premier academic partners: exclusive events, latest HW, discounts; teaching kits, discounts, training; academic partnerships / fellowships supporting hundreds of researchers around the globe every year.

Education: 350 universities

Thank You!

