"collective" Software Primitives - NVIDIA


CUB: “collective” software primitives

Duane Merrill, NVIDIA Research

What is CUB?

1. A design model for “collective” primitives: how to make reusable SIMT software constructs
2. A library of collective primitives: block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
3. A library of global primitives: device-reduce, device-sort, device-scan, etc., constructed from collective primitives; these demonstrate performance and performance-portability (see the sketch below)
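As a concrete taste of the device-wide layer (item 3 above), here is a minimal sketch of CUB's documented two-phase calling convention for cub::DeviceReduce::Sum; the wrapper function and buffer names are illustrative, not from the deck:

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Minimal sketch: device-wide reduction with cub::DeviceReduce::Sum.
// The function name and buffers are illustrative.
void SumExample(const int *d_in, int *d_out, int num_items)
{
    // Phase 1: pass NULL temp storage to query how many bytes CUB needs
    void   *d_temp_storage    = nullptr;
    size_t  temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Phase 2: allocate the scratch space and run the reduction
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}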

Software reuse

Software reuse

Abstraction & composability are fundamental:
- Reducing redundant programmer effort saves time, energy, and money, and reduces buggy software
- Encapsulating complexity empowers ordinary programmers and insulates applications from the underlying hardware

Software reuse empowers a durable programming model.


“Collective” primitives

Parallel programming is hard

Cooperative parallel programming is hard

- Parallel decomposition and grain sizing
- Synchronization
- Deadlock, livelock, and data races
- Plurality of state
- Plurality of flow control (divergence, etc.)
- Bookkeeping control structures
- Memory access conflicts, coalescing, etc.
- Occupancy constraints from SMEM, RF, etc.
- Algorithm selection and instruction scheduling
- Special hardware functionality, instructions, etc.


CUDA today

[Diagram: an application thread invokes a CUDA stub, which launches a grid of threadblocks]

CUDA today

“Collective primitives” are the missing layer in today's CUDA software stack.

[Diagram: the same stack, with each threadblock composed from a collective such as BlockSort]

What do these have in common?

[Figure: four example workloads - parallel sparse graph traversal, parallel SpMV, parallel radix sort, parallel BWT compression]

What do these have in common?

All four are built on block-wide prefix-scan, as shown in the sketch below:
- Parallel sparse graph traversal: queue management
- Parallel radix sort: partitioning
- Parallel SpMV: segmented reduction
- Parallel BWT compression: recurrence solving
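Why prefix-scan keeps appearing: an exclusive prefix sum over per-thread selection flags yields each thread's scatter offset, which is the heart of queue management and partitioning. A minimal sketch, assuming a hypothetical kernel launched with 128 threads and one item per thread:

#include <cub/cub.cuh>

// Minimal sketch (hypothetical kernel): scatter selected items compactly
// using an exclusive prefix sum of selection flags as output offsets.
__global__ void CompactKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockScan<int, 128> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int item = d_in[threadIdx.x];
    int flag = (item > 0) ? 1 : 0;   // illustrative predicate: keep positive items

    // Each thread's exclusive sum of flags is its scatter offset
    int offset;
    BlockScanT(temp_storage).ExclusiveSum(flag, offset);

    if (flag)
        d_out[offset] = item;
}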

Examples of parallel scan data flow

16 threads contributing 4 items each:
- Work-efficient Brent-Kung hybrid (~130 binary ops)
- Depth-efficient Kogge-Stone hybrid (~170 binary ops)

[Figure: data-flow diagrams for the two scan strategies]
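For flavor, the Kogge-Stone pattern at warp scope can be written directly with CUDA's __shfl_up_sync intrinsic; this sketch is illustrative and not CUB's internal code (it assumes a 32-lane warp and an inclusive sum):

// Minimal Kogge-Stone inclusive warp scan sketch (not CUB's internal code).
// Each round, every lane adds the value held by the lane `delta` below it.
__device__ int InclusiveWarpSum(int value)
{
    const unsigned FULL_MASK = 0xffffffffu;
    for (int delta = 1; delta < 32; delta *= 2)
    {
        int neighbor = __shfl_up_sync(FULL_MASK, value, delta);
        if ((threadIdx.x & 31) >= delta)   // lanes below `delta` keep their value
            value += neighbor;
    }
    return value;
}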

CUDA today

Kernel programming is complicated.

[Diagram: an application thread invokes a CUDA stub, which launches a grid of threadblocks]


Collective design & usage

Collective design criteria

Components are easily nested & sequenced.

[Diagram: a BlockSort composed from nested BlockExchange and WarpScan collectives, sequenced within a threadblock]

Collective design criteria

Flexible interfaces that scale (& tune) to different block sizes, thread-granularities, etc.

[Diagram: the same BlockSort composition instantiated for different threadblock shapes]

Collective interface design

Parameter fields separated by concerns:

1. Static specialization interface: params dictate storage layout and unrolling of algorithmic steps; allows data placement in fast registers
2. Reflected shared resource types: reflection enables compile-time allocation and tuning
3. Collective construction interface: optional params concerning inter-thread communication; orthogonal to function behavior
4. Operational function interface: method-specific inputs/outputs

__global__ void ExampleKernel()
{
    // (1) Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // (2) Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items, blocked across 128 threads
    int items[4];

    // (3, 4) Compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}

Collective primitive design

Simplified block-wide prefix sum:

template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

private:
    // Per-thread reference to the block's shared storage
    TempStorage &temp_storage;

public:
    // Constructor
    __device__ BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Inclusive prefix sum operation (each thread contributes its own data item)
    __device__ T Sum(T thread_data)
    {
        int tid = threadIdx.x;
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid >= i)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
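A minimal usage sketch for the simplified class above; the kernel name and the 128-thread launch are illustrative assumptions:

// Hypothetical kernel exercising the simplified BlockScan above
// (launch with exactly 128 threads per block)
__global__ void SimpleScanKernel(const int *d_in, int *d_out)
{
    typedef BlockScan<int, 128> BlockScanT;
    __shared__ BlockScanT::TempStorage scan_storage;

    // Each thread contributes one item; Sum() returns its inclusive prefix sum
    int thread_data = d_in[threadIdx.x];
    d_out[threadIdx.x] = BlockScanT(scan_storage).Sum(thread_data);
}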

Sequencing CUB primitives

Using cub::BlockLoad and cub::BlockScan:

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128>     BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
}
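The union is the key trick here: the load and scan phases reuse the same shared memory, which is safe because __syncthreads() retires one phase before the next begins. A hedged continuation of the same pattern, adding a cub::BlockStore stage in the deck-era template style (the kernel name and the store stage are my assumptions, not slide content):

// Hypothetical extension of the slide's kernel: write results back with
// cub::BlockStore, folded into the same shared-memory union.
__global__ void ExampleKernelWithStore(int *d_in, int *d_out)
{
    typedef cub::BlockLoad<int*, 128, 4>  BlockLoadT;
    typedef cub::BlockScan<int, 128>      BlockScanT;
    typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

    // One union serves all three phases; __syncthreads() separates the uses
    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    __syncthreads();

    BlockStoreT(temp_storage.store).Store(d_out, items);
}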

Nested composition of CUB primitives

[Diagram series: cub::BlockHistogram (specialized for BLOCK_HISTO_SORT) is composed from nested collectives, including cub::BlockExchange and cub::BlockDiscontinuity]

Block-wide and warp-wide CUB primitives

- cub::BlockLoad & cub::BlockStore
- cub::BlockRadixSort
- cub::WarpReduce & cub::BlockReduce
- cub::WarpScan & cub::BlockScan
- cub::BlockHistogram

...and more at the CUB project on GitHub: http://nvlabs.github.com/cub
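To show the warp-wide flavor of the same interface pattern, a minimal sketch with cub::WarpReduce; the 128-thread launch (four warps) and buffer names are illustrative:

#include <cub/cub.cuh>

// Minimal sketch: per-warp sums with cub::WarpReduce
// (launch with exactly 128 threads per block, i.e. 4 warps)
__global__ void WarpSumKernel(const int *d_in, int *d_warp_sums)
{
    typedef cub::WarpReduce<int> WarpReduceT;

    // One TempStorage per warp in the block
    __shared__ typename WarpReduceT::TempStorage temp_storage[4];

    int warp_id = threadIdx.x / 32;
    int sum = WarpReduceT(temp_storage[warp_id]).Sum(d_in[threadIdx.x]);

    // Only lane 0 of each warp holds the valid aggregate
    if (threadIdx.x % 32 == 0)
        d_warp_sums[blockIdx.x * 4 + warp_id] = sum;
}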

Tuning with flexible collectives

Example: radix sorting throughput (initial GT200 effort, 2011)

[Chart: millions of 32-bit keys/s across NVIDIA GTX580, GTX480, Tesla C2050, GTX280, and 9800 GTX [1]; Intel MIC Knight's Ferry [4]; Intel Core i7 Nehalem 3.2GHz [2]; AMD Radeon HD 6970 [3]]

[1] Merrill. Back40 GPU Primitives (2012)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)

Radix sorting throughput (current)

[Chart: millions of 32-bit keys/s across the same platforms as the previous slide]

[1] Merrill. Back40, CUB GPU Primitives (2013)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)

Fine-tuning primitives

Tiled prefix sum. Data is striped across threads for memory accesses, then blocked across threads for scanning:

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <
    int BLOCK_THREADS,
    int ITEMS_PER_THREAD,
    cub::BlockLoadAlgorithm LOAD_ALGO,
    cub::BlockScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load a tile of data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).InclusiveSum(thread_data, thread_data);
}

[Figure: scan data flow tiled from warpscans]
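The payoff of exposing these template parameters is that tuning reduces to re-instantiating the same kernel. A hedged sketch of two alternative specializations of ScanTilesKernel above; the enumerators are real CUB values, while the particular pairings and launch shapes are illustrative:

// Two illustrative specializations of ScanTilesKernel (launch shapes assumed)
void LaunchVariants(int *d_in, int *d_out, int num_tiles)
{
    // Variant A: 128 threads x 4 items, warp-striped loads, raking scan
    ScanTilesKernel<128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::BLOCK_SCAN_RAKING>
        <<<num_tiles, 128>>>(d_in, d_out);

    // Variant B: 256 threads x 2 items, vectorized loads, warpscan-based scan
    ScanTilesKernel<256, 2, cub::BLOCK_LOAD_VECTORIZE, cub::BLOCK_SCAN_WARP_SCANS>
        <<<num_tiles, 256>>>(d_in, d_out);
}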

CUB: device-wide performance-portability

vs. Thrust and NPP across the last three major NVIDIA arch families (Tesla, Fermi, Kepler)

[Charts: CUB vs. Thrust v1.7.1 throughput on Tesla C1060, C2050, and K20C for global reduction, global prefix scan, global radix sort, global histogram, and global partition-if]

Summary

Summary: benefits of using CUB primitives

- Simplicity of composition: kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey)
- High performance: CUB uses the best known algorithms, abstractions, strategies, and techniques
- Performance portability: CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)
- Simplicity of tuning: CUB adapts to various grain sizes (threads per block, items per thread, etc.) and provides alternative algorithms
- Robustness and durability: CUB supports arbitrary data types and block sizes

Questions?

Please visit the CUB project on GitHub: http://nvlabs.github.com/cub

Duane Merrill (dumerrill@nvidia.com)

[Backmatter figure: device-wide scan decomposed into per-partition scans; a scan of the partition aggregates p0..pP-1 produces the prefix fed into each partition's scan of x items into y items]

[Backmatter figure: single-pass scan with decoupled look-back; each tile publishes a status flag (X = no info yet, A = aggregate available, P = inclusive prefix available) alongside its aggregate and inclusive prefix, e.g.:

  Tile:              0     1     2    ...  P-1
  Status flag:       P     A     P    ...  X
  Aggregate:         256   256   256  ...  -
  Inclusive prefix:  256   -     768  ...  -
]
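For readers reconstructing the look-back scheme from this figure, here is a minimal sketch of the per-tile descriptor it implies; the type and field names are invented for illustration and are not CUB's actual implementation:

// Hypothetical per-tile descriptor for decoupled look-back (names invented
// for illustration; CUB's real implementation differs in detail).
enum TileStatus
{
    TILE_INVALID,    // X: no information published yet
    TILE_AGGREGATE,  // A: this tile's aggregate is available
    TILE_PREFIX      // P: this tile's inclusive prefix is available
};

struct TileDescriptor
{
    TileStatus status;
    int        aggregate;         // sum of this tile's items
    int        inclusive_prefix;  // sum of all items through this tile
};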

1. A design model for "collective" primitives How to make reusable SIMT software constructs 2. A library of collective primitives Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc. 3. A library of global primitives Device-reduce, device-sort, device-scan, etc. Constructed from collective primitives

Related Documents:

NVIDIA virtual GPU products deliver a GPU Experience to every Virtual Desktop. Server. Hypervisor. Apps and VMs. NVIDIA Graphics Drivers. NVIDIA Virtual GPU. NVIDIA Tesla GPU. NVIDIA virtualization software. CPU Only VDI. With NVIDIA Virtu

Virtual GPU Software Client Licensing DU-07757-001 _v13.0 3 NVIDIA vGPU Software Deployment Required NVIDIA vGPU Software License Enforcement C-series NVIDIA vGPU vCS or vWS Software See Note (2). Q-series NVIDIA vGPU vWS Software See Note (3). GPU pass through for workstation or professional 3D graphics vWS Software

Red Hat Enterprise Linux with KVM 7.2 through 7.4 All NVIDIA GPUs that support NVIDIA vGPU software are supported. 2.3. Guest OS Support NVIDIA vGPU software supports several Windows releases and Linux distributions as a guest OS using GPU pass-through. Use only a guest OS release that is listed as supported by NVIDIA vGPU software with your virtualization software. To be listed as supported .

The NVIDIA IRAY system consists of the IRAY software suite, NVIDIA drivers, and NVIDIA hardware. It works with most major, widely accessible rendering programs including 3DS Max, Cinema 4D, MAYA, and Rhinoceros. FIGURE 7: NVIDIA IRAY LANDSCAPE (Source: NVIDIA) Each IRAY plugin is designed to enable artists and designers to render images easily

Veriton P330 F2 product summary Designed for users demanding an excellent combination of performance and expandability, the Veriton P330 F3 is a best-of-class choice for both computing and rendering capabilities Intel Xeon E5 processors 8 DIMMs DDR3 ECC memory NVIDIA Quadro400 NVIDIA Quadro600 NVIDIA Quadro2000 NVIDIA Quadro4000 NVIDIA QuardroK5000

NVIDIA PhysX technology—allows advanced physics effects to be simulated and rendered on the GPU. NVIDIA 3D Vision Ready— GeForce GPU support for NVIDIA 3D Vision, bringing a fully immersive stereoscopic 3D experience to the PC. NVIDIA 3D Vision Surround Ready—scale games across 3 panels by leveraging

all 250 W or the rack is power constrained, the board power can be set to a lower level. nvidia-smi nvidia-smi is an in-band monitoring tool provided with the NVIDIA driver and can be used to set the maximum power consumption with driver running in persistence mode. An example command to enable Max-Q is shown (power limit 180 W): nvidia-smi -pm 1

D 341CS ASTM standards viscosity temperature charts for liquid petroleum D 412 Ringcutter, vacuum holding plate, ring tension test fixture (5 drawings) D 422 Air-jet dispersion cup for grain-size analysis of soil (1 drawing) D 429 Specimen holding fixture-adhesion of vulcanized rubber to metal (2 drawings) D 610A SSPC-VIS2/Colored Visual Examples D 623 Anvils for Goodrich flexometer (2 .