THE GPU COMPUTING ERA


GPU computing is at a tipping point, becoming more widely used in demanding consumer applications and high-performance computing. This article describes the rapid evolution of GPU architectures—from graphics processors to massively parallel many-core multiprocessors, recent developments in GPU computing architectures, and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.

John Nickolls and William J. Dally, NVIDIA

As we enter the era of GPU computing, demanding applications with substantial parallelism increasingly use the massively parallel computing capabilities of GPUs to achieve superior performance and efficiency. Today GPU computing enables applications that we previously thought infeasible because of long execution times.

With the GPU's rapid evolution from a configurable graphics processor to a programmable parallel processor, the ubiquitous GPU in every PC, laptop, desktop, and workstation is a many-core multithreaded multiprocessor that excels at both graphics and computing applications. Today's GPUs use hundreds of parallel processor cores executing tens of thousands of parallel threads to rapidly solve large problems having substantial inherent parallelism. They're now the most pervasive massively parallel processing platform ever available, as well as the most cost-effective.

Using NVIDIA GPUs as examples, this article describes the evolution of GPU computing and its parallel computing model, the enabling architecture and software developments, how computing applications use CPU+GPU coprocessing, example application performance speedups, and trends in GPU computing.

GPU computing's evolution

Why have GPUs evolved to have large numbers of parallel threads and many cores? The driving force continues to be the real-time graphics performance needed to render complex, high-resolution 3D scenes at interactive frame rates for games. Rendering high-definition graphics scenes is a problem with tremendous inherent parallelism. A graphics programmer writes a single-thread program that draws one pixel, and the GPU runs multiple instances of this thread in parallel—drawing multiple pixels in parallel. Graphics programs, written in shading languages such as Cg or High-Level Shading Language (HLSL), thus scale transparently over a wide range of thread and processor parallelism. Similarly, GPU computing programs—written in C or C++ with the CUDA parallel computing model [1,2], or using a parallel computing API inspired by CUDA such as DirectCompute [3] or OpenCL [4]—scale transparently over a wide range of parallelism. Software scalability, too, has enabled GPUs to rapidly increase their parallelism and performance with increasing transistor density.

GPU technology development

The demand for faster and higher-definition graphics continues to drive the development of increasingly parallel GPUs.

Table 1. NVIDIA GPU technology development.

Date  Product           Transistors  CUDA cores  Technology
1997  RIVA 128          3 million    —           3D graphics accelerator
1999  GeForce 256       25 million   —           First GPU, programmed with DX7 and OpenGL
2001  GeForce 3         60 million   —           First programmable shader GPU, programmed with DX8 and OpenGL
2002  GeForce FX        125 million  —           32-bit floating-point (FP) programmable GPU with Cg programs, DX9, and OpenGL
2004  GeForce 6800      222 million  —           32-bit FP programmable scalable GPU, GPGPU Cg programs, DX9, and OpenGL
2006  GeForce 8800      681 million  128         First unified graphics and computing GPU, programmed in C with CUDA
2007  Tesla T8, C870    681 million  128         First GPU computing system programmed in C with CUDA
2008  GeForce GTX 280   1.4 billion  240         Unified graphics and computing GPU, IEEE FP, CUDA C, OpenCL, and DirectCompute
2008  Tesla T10, S1070  1.4 billion  240         GPU computing clusters, 64-bit IEEE FP, 4-Gbyte memory, CUDA C, and OpenCL
2009  Fermi             3.0 billion  512         GPU computing architecture, IEEE 754-2008 FP, 64-bit unified addressing, caching, ECC memory, CUDA C, C++, OpenCL, and DirectCompute

Table 1 lists significant milestones in NVIDIA GPU technology development that drove the evolution of unified graphics and computing GPUs. GPU transistor counts increased exponentially, doubling roughly every 18 months with increasing semiconductor density. Since their 2006 introduction, CUDA parallel computing cores per GPU have also doubled nearly every 18 months.

In the early 1990s, there were no GPUs. Video graphics array (VGA) controllers generated 2D graphics displays for PCs to accelerate graphical user interfaces. In 1997, NVIDIA released the RIVA 128 3D single-chip graphics accelerator for games and 3D visualization applications, programmed with Microsoft Direct3D and OpenGL. Evolving to modern GPUs involved adding programmability incrementally—from fixed-function pipelines to microcoded processors, configurable processors, programmable processors, and scalable parallel processors.

Early GPUs

The first GPU was the GeForce 256, a single-chip 3D real-time graphics processor introduced in 1999 that included nearly every feature of high-end workstation 3D graphics pipelines of that era. It contained a configurable 32-bit floating-point vertex transform and lighting processor and a configurable integer pixel-fragment pipeline, programmed with OpenGL and Microsoft DirectX 7 (DX7) APIs.

GPUs first used floating-point arithmetic to calculate 3D geometry and vertices, then applied it to pixel lighting and color values to handle high-dynamic-range scenes and to simplify programming. They implemented accurate floating-point rounding to eliminate frame-varying artifacts on moving polygon edges that would otherwise sparkle at real-time frame rates.

As programmable shaders emerged, GPUs became more flexible and programmable. In 2001, the GeForce 3 introduced the first programmable vertex processor that executed vertex shader programs, along with a configurable 32-bit floating-point pixel-fragment pipeline, programmed with OpenGL and DX8. The ATI Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL.

The GeForce FX and GeForce 6800 [5] featured programmable 32-bit floating-point pixel-fragment processors and vertex processors, programmed with Cg programs, DX9, and OpenGL. These processors were highly multithreaded, creating a thread and executing a thread program for each vertex and pixel fragment. The GeForce 6800 scalable processor core architecture facilitated multiple GPU implementations with different numbers of processor cores.

Developing the Cg language [6] for programming GPUs provided a scalable parallel programming model for the programmable floating-point vertex and pixel-fragment processors of the GeForce FX, GeForce 6800, and subsequent GPUs. A Cg program resembles a C program for a single thread that draws a single vertex or single pixel. The multithreaded GPU created independent threads that executed a shader program to draw every vertex and pixel fragment.

In addition to rendering real-time graphics, programmers also used Cg to compute physical simulations and other general-purpose GPU (GPGPU) computations. Early GPGPU computing programs achieved high performance, but were difficult to write because programmers had to express nongraphics computations with a graphics API such as OpenGL.

Unified computing and graphics GPUs

The GeForce 8800, introduced in 2006, featured the first unified graphics and computing GPU architecture [7,8] programmable in C with the CUDA parallel computing model, in addition to using DX10 and OpenGL. Its unified streaming processor cores executed vertex, geometry, and pixel shader threads for DX10 graphics programs, and also executed computing threads for CUDA C programs. Hardware multithreading enabled the GeForce 8800 to efficiently execute up to 12,288 threads concurrently in 128 processor cores. NVIDIA deployed the scalable architecture in a family of GeForce GPUs with different numbers of processor cores for each market segment.

The GeForce 8800 was the first GPU to use scalar thread processors rather than vector processors, matching standard scalar languages like C and eliminating the need to manage vector registers and program vector operations. It added instructions to support C and other general-purpose languages, including integer arithmetic, IEEE 754 floating-point arithmetic, and load/store memory access instructions with byte addressing. It provided hardware and instructions to support parallel computation, communication, and synchronization—including thread arrays, shared memory, and fast barrier synchronization.

GPU computing systems

At first, users built personal supercomputers by adding multiple GPU cards to PCs and workstations, and assembled clusters of GPU computing nodes. In 2007, responding to demand for GPU computing systems, NVIDIA introduced the Tesla C870, D870, and S870 GPU card, deskside, and rack-mount GPU computing systems containing one, two, and four T8 GPUs. The T8 GPU was based on the GeForce 8800 GPU, configured for parallel computing. The second-generation Tesla C1060 and S1070 GPU computing systems, introduced in 2008, used the T10 GPU, based on the GPU in the GeForce GTX 280. The T10 featured 240 processor cores, a 1-teraflop-per-second peak single-precision floating-point rate, IEEE 754-2008 double-precision 64-bit floating-point arithmetic, and 4-Gbyte DRAM memory.
Today there are Tesla S1070 systems with thousands of GPUs widely deployed in high-performance computing systems in production and research.

NVIDIA introduced the third-generation Fermi GPU computing architecture in 2009 [9]. Based on user experience with prior generations, it addressed several key areas to make GPU computing more broadly applicable. Fermi implemented IEEE 754-2008 and significantly increased double-precision performance. It added error-correcting code (ECC) memory protection for large-scale GPU computing, 64-bit unified addressing, a cached memory hierarchy, and instructions for C, C++, Fortran, OpenCL, and DirectCompute.

GPU computing ecosystem

The GPU computing ecosystem is expanding rapidly, enabled by the deployment of more than 180 million CUDA-capable GPUs. Researchers and developers have enthusiastically adopted CUDA and GPU computing for a diverse range of applications [10], publishing hundreds of technical papers, writing parallel programming textbooks [11], and teaching CUDA programming at more than 300 universities. The CUDA Zone (see http://www.nvidia.com/object/cuda_home_new.html) lists more than 1,000 links to GPU computing applications, programs, and technical papers. The 2009 GPU Technology Conference (see http://www.nvidia.com/object/research_summit_posters.html) published 91 research posters.

Library and tools developers are making GPU development more productive. GPU computing languages include CUDA C, CUDA C++, Portland Group (PGI) CUDA Fortran, DirectCompute, and OpenCL. GPU mathematics packages include MathWorks Matlab, Wolfram Mathematica, National Instruments LabVIEW, SciComp SciFinance, and PyCUDA. NVIDIA developed the Parallel Nsight GPU development environment, debugger, and analyzer integrated with Microsoft Visual Studio. GPU libraries include C++ productivity libraries, dense linear algebra, sparse linear algebra, FFTs, video and image processing, and data-parallel primitives. Computer system manufacturers are developing integrated CPU+GPU coprocessing systems in rack-mount server and cluster configurations.

CUDA scalable parallel architecture

CUDA is a hardware and software coprocessing architecture for parallel computing that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, OpenCL, DirectCompute, and other languages. Because most languages were designed for one sequential thread, CUDA preserves this model and extends it with a minimalist set of abstractions for expressing parallelism. This lets the programmer focus on the important issues of parallelism—how to design efficient parallel algorithms—using a familiar language.

By design, CUDA enables the development of highly scalable parallel programs that can run across tens of thousands of concurrent threads and hundreds of processor cores. A compiled CUDA program executes on any size GPU, automatically using more parallelism on GPUs with more processor cores and threads.

Figure 1. The CUDA hierarchy of threads, thread blocks, and grids of blocks, with corresponding memory spaces: per-thread private local, per-block shared, and per-application global memory spaces.

A CUDA program is organized into a host program, consisting of one or more sequential threads running on a host CPU, and one or more parallel kernels suitable for execution on a parallel computing GPU. A kernel executes a sequential program on a set of lightweight parallel threads. As Figure 1 shows, the programmer or compiler organizes these threads into a grid of thread blocks. The threads comprising a thread block can synchronize with each other via barriers and communicate via a high-speed, per-block shared memory.

Threads from different blocks in the same grid can coordinate via atomic operations in a global memory space shared by all threads. Sequentially dependent kernel grids can synchronize via global barriers and coordinate via global shared memory. CUDA requires that thread blocks be independent, which provides scalability to GPUs with different numbers of processor cores and threads.
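To make the hierarchy concrete, the sketch below shows one way these mechanisms combine in practice: each thread block reduces a slice of an array in per-block shared memory, synchronizing with barriers, and one thread per block then folds the partial sum into a global total with an atomic add. This is our own illustration rather than code from the article; the kernel name, the 256-thread block size, and the use of atomicAdd (which requires a Fermi-class GPU for 32-bit floating-point operands) are assumptions.

#define BLOCK 256

// Each thread block cooperatively sums BLOCK elements in shared memory,
// then contributes its partial result to the global total with an atomic add.
__global__ void block_sum(const float *x, float *total, unsigned int n)
{
    __shared__ float partial[BLOCK];               // per-block shared memory

    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;

    partial[tid] = (i < n) ? x[i] : 0.0f;          // one element per thread
    __syncthreads();                               // barrier: all loads visible

    // Tree reduction within the thread block (blockDim.x must be a power of 2).
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Cross-block coordination through an atomic operation on global memory.
    if (tid == 0)
        atomicAdd(total, partial[0]);              // float atomicAdd: compute capability 2.0+
}

void block_sum_sample(const float *x, float *total, unsigned int n)
{
    // total must point to a device float cleared to zero beforehand,
    // for example with cudaMemset(total, 0, sizeof(float)).
    block_sum<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(x, total, n);
}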

Thread blocks implement coarse-grained scalable data parallelism, while the lightweight threads comprising each thread block provide fine-grained data parallelism. Thread blocks executing different kernels implement coarse-grained task parallelism. Threads executing different paths implement fine-grained thread-level parallelism. Details of the CUDA programming model are available in the programming guide [2].

Figure 2 shows some basic features of parallel programming with CUDA. It contains sequential and parallel implementations of the SAXPY routine defined by the basic linear algebra subroutines (BLAS) library. Given scalar a and vectors x and y containing n floating-point numbers, it performs the update y = ax + y. The serial implementation is a simple loop that computes one element of y per iteration. The parallel kernel executes each of these independent iterations in parallel, assigning a separate thread to compute each element of y. The __global__ modifier indicates that the procedure is a kernel entry point, and the extended function-call syntax saxpy<<<B, T>>>(...) launches the kernel saxpy() in parallel across B blocks of T threads each. Each thread determines which element it should process from its integer thread block index blockIdx.x, its thread index within its block threadIdx.x, and the total number of threads per block blockDim.x.

void saxpy(uint n, float a,
           float *x, float *y)
{
    uint i;
    for (i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void serial_sample()
{
    // Call serial SAXPY function
    saxpy(n, 2.0, x, y);
}

(a)

__global__ void saxpy(uint n, float a,
                      float *x, float *y)
{
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

void parallel_sample()
{
    // Launch parallel SAXPY kernel
    // using n/256 blocks of 256 threads each
    saxpy<<<ceil(n/256), 256>>>(n, 2.0, x, y);
}

(b)

Figure 2. Serial (a) and parallel CUDA (b) SAXPY kernels computing y = ax + y.

This example demonstrates a common parallelization pattern, where we can transform a serial loop with independent iterations to execute in parallel across many threads. In the CUDA paradigm, the programmer writes a scalar program—the parallel saxpy() kernel—that specifies the behavior of a single thread of the kernel. This lets CUDA leverage the standard C language with only a few small additions, such as built-in thread and block index variables. The SAXPY kernel is also a simple example of data parallelism, where parallel threads each produce assigned result data elements.
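Figure 2 omits the host-side setup that a complete program needs. The sketch below is one plausible way to drive the parallel saxpy() kernel with the CUDA runtime API: allocate device arrays, copy the inputs to the GPU, launch the kernel, and copy the result back. The problem size, initial values, and use of (n + 255)/256 to round the block count up are our own choices; only the kernel and launch syntax come from Figure 2.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef unsigned int uint;

__global__ void saxpy(uint n, float a, float *x, float *y)
{
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

int main()
{
    const uint n = 1 << 20;                      // example problem size (assumption)
    size_t bytes = n * sizeof(float);

    // Host arrays.
    float *hx = (float*)malloc(bytes);
    float *hy = (float*)malloc(bytes);
    for (uint i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device arrays.
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch ceil(n/256) blocks of 256 threads each, as in Figure 2(b).
    uint blocks = (n + 255) / 256;
    saxpy<<<blocks, 256>>>(n, 2.0f, dx, dy);

    // Copy the result back; cudaMemcpy waits for the kernel to finish.
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                // expect 2.0*1.0 + 2.0 = 4.0

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}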

GPU computing architecture

To address different market segments, GPU architectures scale the number of processor cores and memories to implement different products for each segment while using the same scalable architecture and software. NVIDIA's scalable GPU computing architecture varies the number of streaming multiprocessors to scale computing performance, and varies the number of DRAM memories to scale memory bandwidth and capacity.

Each multithreaded streaming multiprocessor provides sufficient threads, processor cores, and shared memory to execute one or more CUDA thread blocks. The parallel processor cores within a streaming multiprocessor execute instructions for parallel threads. Multiple streaming multiprocessors provide coarse-grained scalable data and task parallelism to execute multiple coarse-grained thread blocks (possibly running different kernels) in parallel. Multithreading and parallel-pipelined processor cores within each streaming multiprocessor implement fine-grained data and thread-level parallelism to execute hundreds of fine-grained threads in parallel. Application programs using the CUDA model thus scale transparently to small and large GPUs with different numbers of streaming multiprocessors and processor cores.

Fermi computing architecture

To illustrate GPU computing architecture, Figure 3 shows the third-generation Fermi computing architecture configured with 16 streaming multiprocessors, each with 32 CUDA processor cores, for a total of 512 cores. The GigaThread work scheduler distributes CUDA thread blocks to streaming multiprocessors with available capacity, dynamically balancing the computing workload across the GPU, and running multiple kernel tasks in parallel when appropriate. The multithreaded streaming multiprocessors schedule and execute CUDA thread blocks and individual threads. Each streaming multiprocessor executes up to 1,536 concurrent threads to help cover long-latency loads from DRAM memory. As each thread block completes executing its kernel program and releases its streaming multiprocessor resources, the work scheduler assigns a new thread block to that streaming multiprocessor.

Figure 3. Fermi GPU computing architecture with 512 CUDA processor cores organized as 16 streaming multiprocessors (SMs) sharing a common second-level (L2) cache, six 64-bit DRAM interfaces, and a host interface with the host CPU, system memory, and I/O devices. Each streaming multiprocessor has 32 CUDA cores.

The PCIe host interface connects the GPU and its DRAM memory with the host CPU and system memory. CPU+GPU coprocessing and data transfers use the bidirectional PCIe interface. The streaming multiprocessor threads access system memory via the PCIe interface, and CPU threads access GPU DRAM memory via PCIe.

The GPU architecture balances its parallel computing power with parallel DRAM memory controllers designed for high memory bandwidth. The Fermi GPU in Figure 3 has six high-speed GDDR5 DRAM interfaces, each 64 bits wide. Its 40-bit addresses handle up to 1 Tbyte of address space for GPU DRAM and CPU system memory for large-scale computing.
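Because the same compiled program runs on any GPU in the family, a host program can discover at run time how much parallelism the present device offers. The sketch below is our own illustration using the standard cudaGetDeviceProperties call and cudaDeviceProp fields; a CUDA program does not need this information to run correctly, since grids are sized by the problem rather than by the machine.

#include <cstdio>
#include <cuda_runtime.h>

// Query how much parallelism the present GPU offers. The same compiled
// CUDA program runs on any of these configurations; the work scheduler
// simply has more or fewer streaming multiprocessors to fill.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // properties of device 0

    printf("GPU: %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Global memory:             %zu Mbytes\n",
           prop.totalGlobalMem >> 20);
    return 0;
}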

Cached memory hierarchy

Fermi introduces a parallel cached memory hierarchy for load, store, and atomic memory accesses by general applications. Each streaming multiprocessor has a first-level (L1) data cache, and the streaming multiprocessors share a common 768-Kbyte unified second-level (L2) cache. The L2 cache connects with six 64-bit DRAM interfaces and the PCIe interface, which connects with the host CPU, system memory, and PCIe devices. It caches DRAM memory locations and system memory pages accessed via the PCIe interface. The unified L2 cache services load, store, atomic, and texture instruction requests from the streaming multiprocessors and requests from their L1 caches, and fills the streaming multiprocessor instruction caches and uniform data caches.

Fermi implements a 40-bit physical address space that accesses GPU DRAM, CPU system memory, and PCIe device addresses. It provides a 40-bit virtual address space to each application context and maps it to the physical address space with translation lookaside buffers and page tables.

Efficient multithreading

The streaming multiprocessor implements zero-overhead multithreading and thread scheduling for up to 1,536 concurrent threads. To efficiently manage and execute this many individual threads, the multiprocessor employs the single-instruction multiple-thread (SIMT) architecture introduced in the first unified computing GPU [7,8]. The SIMT instruction logic creates, manages, schedules, and executes concurrent threads in groups of 32 parallel threads called warps. A CUDA thread block comprises one or more warps.

Each Fermi streaming multiprocessor has two warp schedulers and two dispatch units that each select a warp and issue an instruction from the warp to 16 CUDA cores, 16 load/store units, or four special function units (SFUs). Because warps execute independently, the streaming multiprocessor can issue two warp instructions to appropriate sets of CUDA cores, load/store units, and SFUs.

To support C, C++, and standard single-thread programming languages, each streaming multiprocessor thread is independent, having its own private registers, condition codes and predicates, private per-thread memory and stack frame, instruction address, and thread execution state. The SIMT instructions control the execution of an individual thread, including arithmetic, memory access, and branching and control-flow instructions. For efficiency, the SIMT multiprocessor issues an instruction to a warp of 32 independent parallel threads.

The streaming multiprocessor realizes full efficiency and performance when all threads of a warp take the same execution path. If threads of a warp diverge at a data-dependent conditional branch, execution serializes for each branch path taken, and when all paths complete, the threads converge to the same execution path. The Fermi streaming multiprocessor extends the flexibility of SIMT independent thread control flow with indirect branch and function-call instructions, and trap handling for exceptions and debuggers.
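A small kernel makes the warp behavior above concrete. In the sketch below, which is our own illustration rather than code from the article, the first branch diverges within a warp because even- and odd-numbered threads take different paths, so the hardware serializes the two paths and then reconverges; the second branch is uniform across each 32-thread warp, so no serialization occurs. The kernel name and launch parameters are assumptions.

__global__ void divergence_demo(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent branch: threads within the same 32-thread warp take
    // different paths, so the SIMT hardware runs the two paths one
    // after the other and then reconverges.
    if (i % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = -1.0f;

    // Warp-uniform branch: with a block size that is a multiple of 32,
    // (i / 32) is constant across a warp, so every thread in the warp
    // makes the same decision and no serialization occurs.
    if ((i / 32) % 2 == 0)
        out[i] *= 2.0f;
}

void divergence_sample(float *out, int n)
{
    // Launch with 256-thread blocks (a multiple of the 32-thread warp size).
    divergence_demo<<<(n + 255) / 256, 256>>>(out, n);
}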
ECC memory

Fermi introduces ECC memory protection to enhance data integrity in large-scale GPU computing systems. Fermi ECC corrects single-bit errors and detects double-bit errors in the DRAM memory, GPU L2 cache, L1 caches, and streaming multiprocessor registers. The ECC lets us integrate thousands of GPUs in a system while maintaining a high mean time between failures (MTBF) for high-performance computing and supercomputing systems.

Streaming multiprocessor

The Fermi streaming multiprocessor introduces several architectural features that deliver higher performance, improve its programmability, and broaden its applicability. As Figure 4 shows, the streaming multiprocessor execution units include 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). It has a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers and instruction dispatch units.

Figure 4. The Fermi streaming multiprocessor has 32 CUDA processor cores, 16 load/store units, four special function units, a 64-Kbyte configurable shared memory/L1 cache, 128-Kbyte register file, instruction cache, and two multithreaded warp schedulers and instruction dispatch units.

Thread instructions

Parallel thread execution (PTX) instructions describe the execution of a single thread in a parallel CUDA program. The PTX instructions focus on scalar (rather than vector) operations to match standard scalar programming languages. Fermi implements the PTX 2.0 instruction set architecture (ISA), which targets C, C++, Fortran, OpenCL, and DirectCompute programs. Instructions include:

- 32-bit and 64-bit integer, addressing, and floating-point arithmetic;
- load, store, and atomic memory access;
- texture and multidimensional surface access;
- individual thread flow control with predicated instructions, branching, function calls, and indirect function calls for C++ virtual functions; and
- parallel barrier synchronization.
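The flow-control instructions listed above include both direct and indirect function calls. The sketch below is our own illustration, assuming a Fermi-class GPU (compute capability 2.0 or later, compiled for an sm_20 or newer target): it calls one device function directly and another through a device function pointer, which the compiler lowers to an indirect call.

__device__ float twice(float x)  { return 2.0f * x; }
__device__ float square(float x) { return x * x; }

// Apply one of two transformations to each element. The call through 'op'
// is an indirect function call, supported by the Fermi PTX 2.0 ISA.
__global__ void apply(float *data, int n, int use_square)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float (*op)(float) = use_square ? square : twice;   // device function pointer
    data[i] = op(data[i]);                               // indirect call
    data[i] = twice(data[i]);                            // direct call
}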

CUDA cores

Each pipelined CUDA core executes a scalar floating-point or integer instruction per clock for a thread. With 32 cores, the streaming multiprocessor can execute up to 32 arithmetic thread instructions per clock. The integer unit implements 32-bit precision for scalar integer operations, including 32-bit multiply and multiply-add operations, and efficiently supports 64-bit integer operations. The Fermi integer unit adds bit-field insert and extract, bit reverse, and population count.

IEEE 754-2008 floating-point arithmetic

The Fermi CUDA core floating-point unit implements the IEEE 754-2008 floating-point arithmetic standard for 32-bit single-precision and 64-bit double-precision results, including fused multiply-add (FMA) instructions. FMA computes D = A * B + C with no loss of precision by retaining full precision in the intermediate product and addition, then rounding the final sum to form the result. Using FMA enables fast division and square-root operations with exactly rounded results.

Fermi raises the throughput of 64-bit double-precision operations to half that of single-precision operations, a dramatic improvement over the T10 GPU. This performance level enables broader deployment of GPUs in high-performance computing. The floating-point instructions handle subnormal numbers at full speed in hardware, allowing small values to retain partial precision rather than flushing them to zero or calculating subnormal values in multicycle software exception handlers as most CPUs do.

The SFUs execute 32-bit floating-point instructions for fast approximations of reciprocal, reciprocal square root, sin, cos, exp, and log functions. The approximations are precise to better than 22 mantissa bits.

Unified memory addressing and access

The streaming multiprocessor load/store units execute load, store, and atomic memory access instructions. A warp of 32 active threads presents 32 individual byte addresses, and the instruction accesses each memory address. The load/store units coalesce 32 individual thread accesses into a minimal number of memory block accesses.

Fermi implements a unified thread address space that accesses the three separate parallel memory spaces of Figure 1: per-thread local, per-block shared, and global memory spaces. A unified load/store instruction can access any of the three memory spaces, steering the access to the correct memory, which enables general C and C++ pointer access anywhere. Fermi provides a terabyte 40-bit unified byte address space, and the load/store ISA supports 64-bit byte addressing for future growth. The ISA also provides 32-bit addressing instructions when the program can limit its accesses to the lower 4 Gbytes of address space.

Configurable shared memory and L1 cache

On-chip shared memory provides low-latency, high-bandwidth access to data shared by cooperating threads in the same CUDA thread block. Fast shared memory significantly boosts the performance of many applications having predictable regular addressing patterns.
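As a closing sketch tying together two of the features above, the program below evaluates a small polynomial with fmaf(), the single-rounding fused multiply-add from the CUDA math library, and asks the runtime to configure the 64-Kbyte on-chip SRAM to favor L1 cache for this kernel because it uses no shared memory. The kernel, the cudaFuncSetCacheConfig pairing, and the problem size are our own assumptions; the article itself does not show this code.

#include <cstdio>
#include <cuda_runtime.h>

// Evaluate y = 3x^2 + 2x + 1 per element with fused multiply-adds.
// Each fmaf() keeps full precision in the intermediate product and
// rounds only once, matching the IEEE 754-2008 FMA behavior above.
__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fmaf(fmaf(3.0f, x[i], 2.0f), x[i], 1.0f);
}

int main()
{
    const int n = 1024;
    float hx[1024], hy[1024];
    for (int i = 0; i < n; ++i) hx[i] = (float)i;

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    // The kernel uses no shared memory, so request that the configurable
    // 64-Kbyte SRAM be split to favor L1 cache for this kernel.
    cudaFuncSetCacheConfig(poly, cudaFuncCachePreferL1);

    poly<<<(n + 255) / 256, 256>>>(dx, dy, n);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[2] = %f\n", hy[2]);                // 3*4 + 2*2 + 1 = 17
    cudaFree(dx); cudaFree(dy);
    return 0;
}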
