"collective" Software Primitives - NVIDIA


CUB: “collective” software primitives

Duane Merrill, NVIDIA Research

What is CUB?

1. A design model for “collective” primitives: how to make reusable SIMT software constructs
2. A library of collective primitives: block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
3. A library of global primitives: device-reduce, device-sort, device-scan, etc., constructed from collective primitives; these demonstrate performance and performance-portability (see the sketch below)
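As a concrete taste of the device-wide layer (item 3 above), here is a minimal sketch of CUB's documented two-phase calling convention for cub::DeviceReduce::Sum; the wrapper function and buffer names are illustrative, not from the deck:

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Minimal sketch: device-wide reduction with cub::DeviceReduce::Sum.
// The function name and buffers are illustrative.
void SumExample(const int *d_in, int *d_out, int num_items)
{
    // Phase 1: pass NULL temp storage to query how many bytes CUB needs
    void   *d_temp_storage    = nullptr;
    size_t  temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Phase 2: allocate the scratch space and run the reduction
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}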

Software reuse

Software reuse

Abstraction & composability are fundamental:
- Reducing redundant programmer effort saves time, energy, and money, and reduces buggy software
- Encapsulating complexity empowers ordinary programmers and insulates applications from the underlying hardware

Software reuse empowers a durable programming model.


“Collective” primitives

Parallel programming is hard

Cooperative parallel programming is hard

- Parallel decomposition and grain sizing
- Synchronization
- Deadlock, livelock, and data races
- Plurality of state
- Plurality of flow control (divergence, etc.)
- Bookkeeping control structures
- Memory access conflicts, coalescing, etc.
- Occupancy constraints from SMEM, RF, etc.
- Algorithm selection and instruction scheduling
- Special hardware functionality, instructions, etc.


CUDA today

[Diagram: an application thread invokes a CUDA stub, which launches a grid of threadblocks]

CUDA today

“Collective primitives” are the missing layer in today's CUDA software stack.

[Diagram: the same stack, with each threadblock composed from a collective such as BlockSort]

What do these have in common?

[Figure: four example workloads - parallel sparse graph traversal, parallel SpMV, parallel radix sort, parallel BWT compression]

What do these have in common?

All four are built on block-wide prefix-scan, as shown in the sketch below:
- Parallel sparse graph traversal: queue management
- Parallel radix sort: partitioning
- Parallel SpMV: segmented reduction
- Parallel BWT compression: recurrence solving
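Why prefix-scan keeps appearing: an exclusive prefix sum over per-thread selection flags yields each thread's scatter offset, which is the heart of queue management and partitioning. A minimal sketch, assuming a hypothetical kernel launched with 128 threads and one item per thread:

#include <cub/cub.cuh>

// Minimal sketch (hypothetical kernel): scatter selected items compactly
// using an exclusive prefix sum of selection flags as output offsets.
__global__ void CompactKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockScan<int, 128> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int item = d_in[threadIdx.x];
    int flag = (item > 0) ? 1 : 0;   // illustrative predicate: keep positive items

    // Each thread's exclusive sum of flags is its scatter offset
    int offset;
    BlockScanT(temp_storage).ExclusiveSum(flag, offset);

    if (flag)
        d_out[offset] = item;
}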

Examples of parallel scan data flow

16 threads contributing 4 items each:
- Work-efficient Brent-Kung hybrid (~130 binary ops)
- Depth-efficient Kogge-Stone hybrid (~170 binary ops)

[Figure: data-flow diagrams for the two scan strategies]
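For flavor, the Kogge-Stone pattern at warp scope can be written directly with CUDA's __shfl_up_sync intrinsic; this sketch is illustrative and not CUB's internal code (it assumes a 32-lane warp and an inclusive sum):

// Minimal Kogge-Stone inclusive warp scan sketch (not CUB's internal code).
// Each round, every lane adds the value held by the lane `delta` below it.
__device__ int InclusiveWarpSum(int value)
{
    const unsigned FULL_MASK = 0xffffffffu;
    for (int delta = 1; delta < 32; delta *= 2)
    {
        int neighbor = __shfl_up_sync(FULL_MASK, value, delta);
        if ((threadIdx.x & 31) >= delta)   // lanes below `delta` keep their value
            value += neighbor;
    }
    return value;
}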

CUDA today

Kernel programming is complicated.

[Diagram: an application thread invokes a CUDA stub, which launches a grid of threadblocks]


Collective design & usage

Collective design criteria

Components are easily nested & sequenced.

[Diagram: a BlockSort composed from nested BlockExchange and WarpScan collectives, sequenced within a threadblock]

Collective design criteria

Flexible interfaces that scale (& tune) to different block sizes, thread-granularities, etc.

[Diagram: the same BlockSort composition instantiated for different threadblock shapes]

Collective interface design

Parameter fields separated by concerns:

1. Static specialization interface: params dictate storage layout and unrolling of algorithmic steps; allows data placement in fast registers
2. Reflected shared resource types: reflection enables compile-time allocation and tuning
3. Collective construction interface: optional params concerning inter-thread communication; orthogonal to function behavior
4. Operational function interface: method-specific inputs/outputs

__global__ void ExampleKernel()
{
    // (1) Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // (2) Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items, blocked across 128 threads
    int items[4];

    // (3, 4) Compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}

Collective primitive design

Simplified block-wide prefix sum:

template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

private:
    // Per-thread reference to the block's shared storage
    TempStorage &temp_storage;

public:
    // Constructor
    __device__ BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Inclusive prefix sum operation (each thread contributes its own data item)
    __device__ T Sum(T thread_data)
    {
        int tid = threadIdx.x;
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid >= i)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
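A minimal usage sketch for the simplified class above; the kernel name and the 128-thread launch are illustrative assumptions:

// Hypothetical kernel exercising the simplified BlockScan above
// (launch with exactly 128 threads per block)
__global__ void SimpleScanKernel(const int *d_in, int *d_out)
{
    typedef BlockScan<int, 128> BlockScanT;
    __shared__ BlockScanT::TempStorage scan_storage;

    // Each thread contributes one item; Sum() returns its inclusive prefix sum
    int thread_data = d_in[threadIdx.x];
    d_out[threadIdx.x] = BlockScanT(scan_storage).Sum(thread_data);
}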

Sequencing CUB primitives

Using cub::BlockLoad and cub::BlockScan:

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128>     BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
}
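The union is the key trick here: the load and scan phases reuse the same shared memory, which is safe because __syncthreads() retires one phase before the next begins. A hedged continuation of the same pattern, adding a cub::BlockStore stage in the deck-era template style (the kernel name and the store stage are my assumptions, not slide content):

// Hypothetical extension of the slide's kernel: write results back with
// cub::BlockStore, folded into the same shared-memory union.
__global__ void ExampleKernelWithStore(int *d_in, int *d_out)
{
    typedef cub::BlockLoad<int*, 128, 4>  BlockLoadT;
    typedef cub::BlockScan<int, 128>      BlockScanT;
    typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

    // One union serves all three phases; __syncthreads() separates the uses
    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    __syncthreads();

    BlockStoreT(temp_storage.store).Store(d_out, items);
}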

Nested composition of CUB primitives

[Diagram series: cub::BlockHistogram (specialized for BLOCK_HISTO_SORT) is composed from nested collectives, including cub::BlockExchange and cub::BlockDiscontinuity]

Block-wide and warp-wide CUB primitives

- cub::BlockLoad & cub::BlockStore
- cub::BlockRadixSort
- cub::WarpReduce & cub::BlockReduce
- cub::WarpScan & cub::BlockScan
- cub::BlockHistogram

...and more at the CUB project on GitHub: http://nvlabs.github.com/cub
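To show the warp-wide flavor of the same interface pattern, a minimal sketch with cub::WarpReduce; the 128-thread launch (four warps) and buffer names are illustrative:

#include <cub/cub.cuh>

// Minimal sketch: per-warp sums with cub::WarpReduce
// (launch with exactly 128 threads per block, i.e. 4 warps)
__global__ void WarpSumKernel(const int *d_in, int *d_warp_sums)
{
    typedef cub::WarpReduce<int> WarpReduceT;

    // One TempStorage per warp in the block
    __shared__ typename WarpReduceT::TempStorage temp_storage[4];

    int warp_id = threadIdx.x / 32;
    int sum = WarpReduceT(temp_storage[warp_id]).Sum(d_in[threadIdx.x]);

    // Only lane 0 of each warp holds the valid aggregate
    if (threadIdx.x % 32 == 0)
        d_warp_sums[blockIdx.x * 4 + warp_id] = sum;
}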

Tuning with flexible collectives

Example: radix sorting throughput (initial GT200 effort, 2011)

[Chart: millions of 32-bit keys/s across NVIDIA GTX580, GTX480, Tesla C2050, GTX280, and 9800 GTX [1]; Intel MIC Knight's Ferry [4]; Intel Core i7 Nehalem 3.2GHz [2]; AMD Radeon HD 6970 [3]]

[1] Merrill. Back40 GPU Primitives (2012)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)

Radix sorting throughput (current)

[Chart: millions of 32-bit keys/s across the same platforms as the previous slide]

[1] Merrill. Back40, CUB GPU Primitives (2013)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)

Fine-tuning primitives

Tiled prefix sum. Data is striped across threads for memory accesses, then blocked across threads for scanning:

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <
    int BLOCK_THREADS,
    int ITEMS_PER_THREAD,
    cub::BlockLoadAlgorithm LOAD_ALGO,
    cub::BlockScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load a tile of data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).InclusiveSum(thread_data, thread_data);
}

[Figure: scan data flow tiled from warpscans]
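The payoff of exposing these template parameters is that tuning reduces to re-instantiating the same kernel. A hedged sketch of two alternative specializations of ScanTilesKernel above; the enumerators are real CUB values, while the particular pairings and launch shapes are illustrative:

// Two illustrative specializations of ScanTilesKernel (launch shapes assumed)
void LaunchVariants(int *d_in, int *d_out, int num_tiles)
{
    // Variant A: 128 threads x 4 items, warp-striped loads, raking scan
    ScanTilesKernel<128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::BLOCK_SCAN_RAKING>
        <<<num_tiles, 128>>>(d_in, d_out);

    // Variant B: 256 threads x 2 items, vectorized loads, warpscan-based scan
    ScanTilesKernel<256, 2, cub::BLOCK_LOAD_VECTORIZE, cub::BLOCK_SCAN_WARP_SCANS>
        <<<num_tiles, 256>>>(d_in, d_out);
}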

CUB: device-wide performance-portability

vs. Thrust and NPP across the last three major NVIDIA arch families (Tesla, Fermi, Kepler)

[Charts: CUB vs. Thrust v1.7.1 throughput on Tesla C1060, C2050, and K20C for global reduction, global prefix scan, global radix sort, global histogram, and global partition-if]

Summary

Summary: benefits of using CUB primitives

- Simplicity of composition: kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey)
- High performance: CUB uses the best known algorithms, abstractions, strategies, and techniques
- Performance portability: CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)
- Simplicity of tuning: CUB adapts to various grain sizes (threads per block, items per thread, etc.) and provides alternative algorithms
- Robustness and durability: CUB supports arbitrary data types and block sizes

Questions?

Please visit the CUB project on GitHub: http://nvlabs.github.com/cub

Duane Merrill (dumerrill@nvidia.com)

[Backmatter figure: device-wide scan decomposed into per-partition scans; a scan of the partition aggregates p0..pP-1 produces the prefix fed into each partition's scan of x items into y items]

[Backmatter figure: single-pass scan with decoupled look-back; each tile publishes a status flag (X = no info yet, A = aggregate available, P = inclusive prefix available) alongside its aggregate and inclusive prefix, e.g.:

  Tile:              0     1     2    ...  P-1
  Status flag:       P     A     P    ...  X
  Aggregate:         256   256   256  ...  -
  Inclusive prefix:  256   -     768  ...  -
]
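For readers reconstructing the look-back scheme from this figure, here is a minimal sketch of the per-tile descriptor it implies; the type and field names are invented for illustration and are not CUB's actual implementation:

// Hypothetical per-tile descriptor for decoupled look-back (names invented
// for illustration; CUB's real implementation differs in detail).
enum TileStatus
{
    TILE_INVALID,    // X: no information published yet
    TILE_AGGREGATE,  // A: this tile's aggregate is available
    TILE_PREFIX      // P: this tile's inclusive prefix is available
};

struct TileDescriptor
{
    TileStatus status;
    int        aggregate;         // sum of this tile's items
    int        inclusive_prefix;  // sum of all items through this tile
};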

1. A design model for "collective" primitives How to make reusable SIMT software constructs 2. A library of collective primitives Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc. 3. A library of global primitives Device-reduce, device-sort, device-scan, etc. Constructed from collective primitives

Related Documents:

NVIDIA virtual GPU products deliver a GPU Experience to every Virtual Desktop. Server. Hypervisor. Apps and VMs. NVIDIA Graphics Drivers. NVIDIA Virtual GPU. NVIDIA Tesla GPU. NVIDIA virtualization software. CPU Only VDI. With NVIDIA Virtu

Virtual GPU Software Client Licensing DU-07757-001 _v13.0 3 NVIDIA vGPU Software Deployment Required NVIDIA vGPU Software License Enforcement C-series NVIDIA vGPU vCS or vWS Software See Note (2). Q-series NVIDIA vGPU vWS Software See Note (3). GPU pass through for workstation or professional 3D graphics vWS Software

Red Hat Enterprise Linux with KVM 7.2 through 7.4 All NVIDIA GPUs that support NVIDIA vGPU software are supported. 2.3. Guest OS Support NVIDIA vGPU software supports several Windows releases and Linux distributions as a guest OS using GPU pass-through. Use only a guest OS release that is listed as supported by NVIDIA vGPU software with your virtualization software. To be listed as supported .

The NVIDIA IRAY system consists of the IRAY software suite, NVIDIA drivers, and NVIDIA hardware. It works with most major, widely accessible rendering programs including 3DS Max, Cinema 4D, MAYA, and Rhinoceros. FIGURE 7: NVIDIA IRAY LANDSCAPE (Source: NVIDIA) Each IRAY plugin is designed to enable artists and designers to render images easily

Veriton P330 F2 product summary Designed for users demanding an excellent combination of performance and expandability, the Veriton P330 F3 is a best-of-class choice for both computing and rendering capabilities Intel Xeon E5 processors 8 DIMMs DDR3 ECC memory NVIDIA Quadro400 NVIDIA Quadro600 NVIDIA Quadro2000 NVIDIA Quadro4000 NVIDIA QuardroK5000

NVIDIA PhysX technology—allows advanced physics effects to be simulated and rendered on the GPU. NVIDIA 3D Vision Ready— GeForce GPU support for NVIDIA 3D Vision, bringing a fully immersive stereoscopic 3D experience to the PC. NVIDIA 3D Vision Surround Ready—scale games across 3 panels by leveraging

all 250 W or the rack is power constrained, the board power can be set to a lower level. nvidia-smi nvidia-smi is an in-band monitoring tool provided with the NVIDIA driver and can be used to set the maximum power consumption with driver running in persistence mode. An example command to enable Max-Q is shown (power limit 180 W): nvidia-smi -pm 1

D 341CS ASTM standards viscosity temperature charts for liquid petroleum D 412 Ringcutter, vacuum holding plate, ring tension test fixture (5 drawings) D 422 Air-jet dispersion cup for grain-size analysis of soil (1 drawing) D 429 Specimen holding fixture-adhesion of vulcanized rubber to metal (2 drawings) D 610A SSPC-VIS2/Colored Visual Examples D 623 Anvils for Goodrich flexometer (2 .