CSC266 Introduction to Parallel Computing Using GPUs


CSC266 Introduction to Parallel Computing using GPUs
Introduction to Accelerators
Sreepathi Pai
October 11, 2017
URCS

Outline
- Introduction to Accelerators
- GPU Architectures
- GPU Programming Models

Accelerators
- Single-core processors
- Multi-core processors
- What if these aren't enough?
- Accelerators, specifically GPUs:
  - what they are
  - when you should use them

Timeline
- 1980s: Geometry Engines
- 1990s: Consumer GPUs; out-of-order superscalars
- 2000s: General-purpose GPUs; multicore CPUs; Cell BE (PlayStation 3); lots of specialized accelerators in phones

The Graphics Processing Unit (1980s)
- SGI Geometry Engine
  - Implemented the geometry pipeline
  - Hardwired logic
- Embarrassingly parallel: O(pixels)
- Large number of logic elements
- High memory bandwidth
- [Figure from Kaufman et al. (2009)]

GPU 2.0 (circa 2004)
- Like CPUs, GPUs benefited from Moore's Law
- Evolved from fixed-function hardwired logic to flexible, programmable ALUs
- Around 2004, GPUs were programmable "enough" to do some non-graphics computations
  - Severely limited by the graphics programming model (shader programming)
- In 2006, GPUs became "fully" programmable
  - GPGPU: General-Purpose GPU
  - NVIDIA released the "CUDA" language to write non-graphics programs that run on GPUs

FLOPS/s
[Figure: peak FLOPS/s of CPUs vs. GPUs over time, from the NVIDIA CUDA C Programming Guide]

Memory Bandwidth
[Figure: memory bandwidth of CPUs vs. GPUs over time, from the NVIDIA CUDA C Programming Guide]

GPGPU Today
- GPUs are widely deployed as accelerators
  - Intel paper: "10x vs 100x Myth"
- GPUs have been so successful that other accelerators are dead
  - Sony/IBM Cell BE
  - Clearspeed RSX
- Kepler K40 GPUs from NVIDIA have a peak performance of 4 TFlops
  - The CM-5, the #1 system in 1993, was 60 GFlops (Linpack)
  - ASCI White (#1 in 2001) was 4.9 TFlops (Linpack)
- [Pictures of Titan and Tianhe-1A from the Top500 website]

Accelerator Programming Models
- CPUs have always depended on co-processors
  - I/O co-processors to handle slow I/O
  - Math co-processors to speed up computation
  - H.264 co-processors to play video (phones)
  - DSPs to handle audio (phones)
- Many have been transparent
  - Drop in the co-processor and everything speeds up
- Others used a function-based model
  - Call a function and it is sped up (e.g. "decode video")
- The GPU is not a transparent accelerator for general-purpose computations
  - Only graphics code is sped up transparently
  - Code must be rewritten to target GPUs

Using a GPU
- You must retarget code for the GPU
  - Rewrite, recompile, translate, etc.

Outline
- Introduction to Accelerators
- GPU Architectures
- GPU Programming Models

The Two (Three?) Kinds of GPUs
- Type 1: Discrete GPUs
  - More computational power
  - More memory bandwidth
  - Separate memory
- [Image: NVIDIA discrete GPU]

The Two (Three?) Kinds of GPUs #2
- Type 2: Integrated GPUs
  - Share memory with the processor
  - Share bandwidth with the processor
  - Consume less power
  - Can participate in cache coherence
- [Image: Intel integrated GPU]

The NVIDIA Kepler
[Figure: Kepler GK110 block diagram, from the NVIDIA Kepler GK110 Whitepaper]

Using a Discrete GPU
- You must retarget code for the GPU
  - Rewrite, recompile, translate, etc.
- The working set must fit in GPU RAM
- You must copy data to/from GPU RAM
  - "You": programmer, compiler, runtime, OS, etc.
  - Some recent hardware can do this for you (it's slow)

NVIDIA Kepler SMX (i.e. CPU core equivalent)

NVIDIA Kepler SMX Details
- 2-wide in-order
- 4-wide SMT
- 2048 threads per core (64 warps)
- 15 cores
- Each thread runs the same code (hence SIMT)
- 65536 32-bit registers (256 KB)
  - A thread can use up to 255 of these
  - Partitioned among threads (not shared!)
- 192 ALUs
- 64 double-precision units
- 32 load/store units
- 32 special functional units
- 64 KB L1/shared cache
  - The shared cache is a software-managed cache (see the sketch below)
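The "software-managed" part is the key programming difference: the hardware does not decide what stays close to the ALUs, the programmer does. A minimal sketch (not from the slides; the kernel name and tile size are illustrative) of staging data through shared memory, assuming a launch with 256 threads per block and an input of at least one full block:

    __global__ void tile_sum(const float *in, float *out) {
        __shared__ float tile[256];        // lives in the SMX's software-managed cache
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];         // each thread explicitly stages one element
        __syncthreads();                   // wait until the whole tile is loaded

        float sum = 0.0f;                  // every thread can now reuse the whole
        for (int j = 0; j < 256; j++)      // tile at near-register latency
            sum += tile[j];
        out[i] = sum;
    }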

CPU vs GPU

Parameter               CPU                          GPU
Clock speed             1 GHz                        700 MHz
RAM                     GB to TB                     12 GB (max)
Memory B/W              60 GB/s                      300 GB/s
Peak FP                 1 TFlop                      1 TFlop
Concurrent threads      O(10)                        O(1000) [O(10000)]
LLC cache size          100 MB (L3) [eDRAM],         2 MB (L2)
                        O(10 MB) [traditional]
Cache size per thread   O(1 MB)                      O(10 bytes)
Software-managed cache  None                         48 KB/SMX
Type                    OOO superscalar              2-way in-order superscalar

Using a GPU
- You must retarget code for the GPU
  - Rewrite, recompile, translate, etc.
- The working set must fit in GPU RAM
- You must copy data to/from GPU RAM (pattern sketched below)
  - "You": programmer, compiler, runtime, OS, etc.
  - Some recent hardware can do this for you
- Data accesses should be streaming
  - Or use the scratchpad as a user-managed cache
- Lots of parallelism preferred (throughput, not latency)
- SIMD-style parallelism best suited
- High arithmetic intensity (FLOPs/byte) preferred
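What the copy requirement looks like in practice: a sketch of the standard CUDA Runtime allocate/copy/compute/copy-back sequence. The names (h_a, d_a, kernel, n, blocks, threads) are illustrative, not from the slides:

    size_t bytes = n * sizeof(float);
    float *d_a;                                           // "d_" = device (GPU RAM)
    cudaMalloc(&d_a, bytes);                              // allocate in GPU RAM
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // copy working set in
    kernel<<<blocks, threads>>>(d_a, n);                  // compute on the GPU
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(d_a);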

Showcase GPU Applications
- Image processing
- Graphics rendering
- Matrix multiply
- FFT

See "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU" by V.W. Lee et al. for more examples and a comparison of CPU and GPU.

Outline
- Introduction to Accelerators
- GPU Architectures
- GPU Programming Models

Hierarchy of GPU Programming Models

Model                  GPU                        CPU Equivalent
Vectorizing compiler   PGI CUDA Fortran           gcc, icc, etc.
"Drop-in" libraries    cuBLAS                     ATLAS
Directive-driven       OpenACC, OpenMP-to-CUDA    OpenMP
High-level languages   pyCUDA                     python
Mid-level languages    OpenCL, CUDA               pthreads + C/C++
Low-level languages    PTX, Shader                Assembly
Bare-metal             SASS                       Machine code

"Drop-in" Libraries
- "Drop-in" replacements for popular CPU libraries; examples from NVIDIA:
  - CUBLAS/NVBLAS for BLAS (e.g. ATLAS)
  - CUFFT for FFTW
  - MAGMA for LAPACK and BLAS
- These libraries may still expect you to manage data transfers manually (see the sketch below)
- Libraries may support multiple accelerators (GPU + CPU + Xeon Phi)
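A sketch of what "drop-in, but you still manage transfers" means, using the cuBLAS SAXPY routine (y = alpha*x + y). It assumes n, d_x, and d_y exist and that the two arrays were already placed in GPU memory with cudaMalloc/cudaMemcpy as in the earlier sketch:

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);
    float alpha = 2.0f;
    // d_x and d_y must already live in GPU RAM; the library will not copy for you
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y, on the GPU
    cublasDestroy(handle);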

GPU Libraries
- NVIDIA Thrust
  - Like the C++ STL, but executes on the GPU (see the sketch below)
- Modern GPU
  - At first glance: high-performance library routines for sorting, searching, reductions, etc.
  - A deeper look: specific "hard" problems tackled in a different style
- NVIDIA CUB
  - Low-level primitives for use in CUDA kernels
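To show the STL-like flavor of Thrust, a small self-contained sketch (the data values are illustrative); the sort and the reduction both execute as GPU kernels:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    int main() {
        thrust::host_vector<int> h(1000, 1);             // host data
        thrust::device_vector<int> d = h;                // copies to GPU RAM
        thrust::sort(d.begin(), d.end());                // GPU sort
        int total = thrust::reduce(d.begin(), d.end());  // GPU parallel reduction
        return total == 1000 ? 0 : 1;
    }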

Directive-Driven Programming
- OpenACC, a new standard for "offloading" parallel work to an accelerator
  - Currently supported only by the PGI Accelerator compiler
  - gcc 5.0 support is ongoing
- OpenMPC, a research compiler, can compile OpenMP code + extra directives to CUDA
- OpenMP 4.0 also supports offload to accelerators
  - Not for GPUs yet

    #include <stdio.h>
    #define N 1000000   /* N is not defined on the slide; any large count works */

    int main(void) {
        double pi = 0.0f;
        long i;
        #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi = %16.15f\n", pi / N);
        return 0;
    }

Python-based Tools (pyCUDA)

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    multiply_them(drv.Out(dest), drv.In(a), drv.In(b),
                  block=(400, 1, 1), grid=(1, 1))

    print(dest - a*b)

OpenCL
- C99-based dialect for programming heterogeneous systems
  - Originally based on CUDA
  - nomenclature is different (see the comparison below)
- Supported by more than GPUs
  - Xeon Phi, FPGAs, CPUs, etc.
- Source code is portable (somewhat)
  - Performance may not be!
- Poorly supported by NVIDIA
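To make the nomenclature difference concrete, here is a vector-add kernel in CUDA with the corresponding OpenCL C spellings noted in comments (a sketch, not from the slides):

    __global__ void vadd(const float *a, const float *b, float *c)
    {                                                   // OpenCL: __kernel void vadd(__global const float *a, ...)
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: int i = get_global_id(0);
        c[i] = a[i] + b[i];                             // kernel body is identical
    }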

CUDA
- "Compute Unified Device Architecture"
- First language to allow general-purpose programming for GPUs
  - preceded by shader languages
- Promoted by NVIDIA for their GPUs
- Not supported by any other accelerator
  - though commercial CUDA-to-x86/64 compilers exist
- We will focus on CUDA programs

CUDA Architecture
- From 10,000 feet, CUDA is like pthreads
- CUDA language: a C dialect
  - Host code (CPU) and GPU code in the same file (see the sketch below)
  - Special language extensions for GPU code
- CUDA Runtime API
  - Manages the runtime GPU environment
  - Allocation of memory, data transfers, synchronization with the GPU, etc.
  - Usually invoked by host code
- CUDA Device API
  - The lower-level API that the CUDA Runtime API is built upon
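A minimal complete sketch of the single-file model: a GPU kernel written with the language extensions, launched and serviced by host code through the Runtime API (the kernel and the sizes are illustrative):

    #include <stdio.h>

    __global__ void scale(float *x, float s) {          // GPU code: __global__ extension
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] = x[i] * s;
    }

    int main(void) {                                    // host code, same file
        const int n = 1024;
        float h[1024];
        for (int i = 0; i < n; i++) h[i] = (float)i;

        float *d;
        cudaMalloc(&d, n * sizeof(float));              // Runtime API: allocation
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<n / 256, 256>>>(d, 2.0f);               // launch syntax: another extension
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);

        printf("%f\n", h[1]);                           // prints 2.000000
        return 0;
    }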

CUDA Limitations
- No standard library for GPU functions
- No parallel data structures
- No synchronization primitives (mutexes, semaphores, queues, etc.)
  - you can roll your own
  - only atomic*() functions are provided (see the sketch below)
- Toolchain not as mature as the CPU toolchain
  - Felt intensely in performance debugging
  - It's only been a decade :)
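What "roll your own with atomic*()" looks like: a sketch of a histogram kernel where concurrent increments from many threads are made safe with atomicAdd (the kernel is illustrative and assumes every data[i] is a valid bin index):

    __global__ void histogram(const int *data, int n, int *bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1);  // one indivisible read-modify-write;
                                           // a plain bins[data[i]]++ would lose updates
    }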

Conclusions
- GPUs are very interesting parallel machines
- They're not going away
  - Xeon Phi might pose a formidable challenge
- They're here and now
  - Your laptop probably already contains one
  - Your phone definitely has one

