GPU Computing in Medical Physics: A Review


Guillem Pratx and Lei Xing
Department of Radiation Oncology, Stanford University School of Medicine, 875 Blake Wilbur Drive, Stanford, California 94305

(Received 29 November 2010; revised 21 March 2011; accepted for publication 25 March 2011; published 9 May 2011)

The graphics processing unit (GPU) has emerged as a competitive platform for computing massively parallel problems. Many computing applications in medical physics can be formulated as data-parallel tasks that exploit the capabilities of the GPU for reducing processing times. The authors review the basic principles of GPU computing as well as the main performance optimization techniques, and survey existing applications in three areas of medical physics, namely image reconstruction, dose calculation and treatment plan optimization, and image processing. © 2011 American Association of Physicists in Medicine. [DOI: 10.1118/1.3578605]

Key words: graphics processing units, high-performance computing, image segmentation, dose calculation, image processing

I. INTRODUCTION

Parallel processing has become the standard for high-performance computing. Over the last thirty years, general-purpose, single-core processors have enjoyed a doubling of their performance every 18 months, a feat made possible by superscalar pipelining, increasing instruction-level parallelism, and higher clock frequency. Recently, however, the progress of single-core processor performance has slowed due to excessive power dissipation at GHz clock rates and diminishing returns in instruction-level parallelism. Hence, application developers, in particular in the medical physics community, can no longer count on Moore's law to make complex algorithms computationally feasible. Instead, they are increasingly shifting their algorithms to parallel computing architectures for practical processing times.

With the increased sophistication of medical imaging and treatment machines, the amount of data processed in medical physics is exploding; processing time is now limiting the deployment of advanced technologies. This trend has been driven by many factors, such as the shift from 3-D to 4-D in imaging and treatment planning, the improvement of spatial resolution in medical imaging, the shift to cone-beam geometries in x-ray CT, the increasing sophistication of MRI pulse sequences, and the growing complexity of treatment planning algorithms. Yet typical medical physics datasets comprise a large number of similar elements, such as voxels in tomographic imaging, beamlets in intensity-modulated radiation therapy (IMRT) optimization, k-space samples in MRI, projective measurements in x-ray CT, and coincidence events in PET. The processing of such datasets can often be accelerated by distributing the computation over many parallel threads.

Originally designed for accelerating the production of computer graphics, the graphics processing unit (GPU) has emerged as a versatile platform for running massively parallel computation. Graphics hardware presents clear advantages for processing the type of datasets encountered in medical physics: high memory bandwidth, high computation throughput, support for floating-point arithmetic, the lowest price per unit of computation, and a programming interface accessible to the nonexpert.
These features have raised tremendous enthusiasm in many disciplines, such as linear algebra, differential equations, databases, ray tracing, data mining, computational biophysics, molecular dynamics, fluid dynamics, seismic imaging, game physics, and dynamic programming [1-4].

In medical physics, the ability to perform general-purpose computation on the GPU was first demonstrated in 1994, when a research group at SGI implemented image reconstruction on an Onyx workstation using the RealityEngine2 [5]. Despite this pioneering work, it took almost 10 yr for GPU computing to become mainstream as a topic of research (Fig. 1). There were several reasons for this slow start. Throughout the 1990s, researchers were blessed with the doubling of single-core processor performance every 18 months. As a result, a single-core processor in 2004 could perform image reconstruction 100 times faster than in 1994, and as fast as SGI's 1994 graphics-hardware implementation [5]. However, the performance of recent single-core processors suggests that the doubling period might now be 5 yr [6]. As a result, vendors have switched to multicore architectures to keep improving the performance of their CPUs, a shift that has given researchers a strong incentive to consider parallelizing their computations.

Around the same time, the programmable GPU was introduced. Unlike previous graphics processors, which were limited to running a fixed-function pipeline with 8-bit integer arithmetic, these new GPUs could run custom programs (called shaders) in parallel, with floating-point precision. The shift away from single-core processors and the increasing programmability of the GPU created favorable conditions for the emergence of GPU computing.

GPUs now offer a compelling alternative to computer clusters for running large, distributed applications. With the introduction of compute-oriented GPU interfaces, shared memory, and support for double-precision arithmetic, the range of computational applications that can run on the GPU has vastly increased. By off-loading the data-parallel part of the computation onto GPUs, the number of physical computers within a computer cluster can be greatly reduced. Besides reducing cost, smaller computer clusters also require less maintenance, space, power, and cooling. These are important factors to consider in medical physics given that the computing resources are typically located on-site, inside the hospital.

FIG. 1. Number of publications relating to the use of GPUs in medical physics, per year. Data were obtained by searching PubMed using the terms "GPU," "graphics processing unit," and "graphics hardware" and excluding irrelevant citations.

II. OVERVIEW OF GPU COMPUTING

II.A. Evolution of the GPU

Over the years, the GPU has evolved from a highly specialized pixel processor to a versatile and highly programmable architecture that can perform a wide range of data-parallel operations. The hardware of early 3-D acceleration cards (such as the 3Dfx Voodoo) was devoted to processing pixel and texture data. These cards offered no parallel processing capabilities, but freed the CPU from the computationally demanding task of filling polygons with texture and color. A few years later, the task of transforming the geometry was also moved from the CPU to the GPU, one of the first steps toward the modern graphics pipeline.

Because the processing of vertices and pixels is inherently parallel, the number of dedicated processing units increased rapidly, allowing commodity PCs to render ever more complex 3-D scenes in tens of milliseconds. Since 1997, the number of compute cores in GPU processors has doubled roughly every 1.4 yr (Fig. 2). Over the same period, GPU cores have become increasingly sophisticated and versatile, enriching their instruction set with a wide variety of control-flow mechanisms, support for double-precision floating-point arithmetic, built-in mathematical functions, a shared-memory model for interthread communication, atomic operations, and so forth. In order to sustain the increased computation throughput, the GPU memory bandwidth has doubled every 1.7 yr, and recent GPUs can achieve a peak memory bandwidth of 408 GB/s (Fig. 2).

With more computing cores, the peak performance of GPUs, measured in billion floating-point operations per second (GFLOPS), has been steadily increasing (Fig. 3). In addition, the performance gap between GPU and CPU has been widening, due to a performance doubling rate of 1.5 yr for CPUs versus 1 yr for GPUs (Fig. 3).

FIG. 2. Number of computing cores and memory bandwidth for high-end NVIDIA GPUs as a function of year (data from vendor specifications).

FIG. 3. Computing performance, measured in billion single-precision floating-point operations per second (GFLOPS), for CPUs and GPUs. GPUs: (A) NVIDIA GeForce FX 5800, (B) FX 5950 Ultra, (C) 6800 Ultra, (D) 7800 GTX, (E) Quadro FX 4500, (F) GeForce 7900 GTX, (G) 8800 GTX, (H) Tesla C1060, and (I) AMD Radeon HD 5870. CPUs: (A) Athlon 64 3200+, (B) Pentium IV 560, (C) Pentium D 960, (D) 950, (E) Athlon 64 X2 5000+, (F) Core 2 Duo E6700, (G) Core 2 Quad Q6600, (H) Athlon 64 FX-74, (I) Core 2 Quad QX6700, (J) Intel Core i7 965 XE, and (K) Core i7 980X Extreme (data from vendors).

The faster progress of the GPU's performance can be attributed to the highly scalable nature of its architecture. For multicore/multi-CPU systems, the number of threads physically residing in the hardware can be no greater than twice the number of physical cores (with hyperthreading). As a result, advanced PCs can run at most 100 threads simultaneously. In contrast, current GPU hardware can host up to 30,000 concurrent threads. Whereas switching between CPU threads is costly because the operating system physically loads the thread execution context from RAM, switching between GPU threads incurs no overhead because the threads reside on the GPU for their entire lifetime. A further difference is that the GPU processing pipeline is for the most part based on a feed-forward, single-instruction multiple-data (SIMD) architecture, which removes the need for advanced data controls. In comparison, multicore/multi-CPU pipelines require complex control logic to avoid data hazards.

II.B. The graphics pipeline

In the early days of GPU computing, the GPU could only be programmed through a graphics rendering interface. In these pioneering implementations, computation was reformulated as a rendering task and programmed through the graphics pipeline. While new compute-specific interfaces have made these techniques obsolete, a basic understanding of the graphics pipeline is still useful for writing efficient GPU code.

Graphics applications (such as video games) use the GPU to perform the calculations necessary to render complex 3-D scenes in tens of milliseconds. Typically, 3-D scenes are represented by triangular meshes filled with color or textures. Textures are 2-D color images, stored in GPU memory, designed to increase the perceived complexity of a 3-D scene. The graphics pipeline decomposes graphics computation into a sequence of stages that exposes both task parallelism and data parallelism (Fig. 4). Task parallelism is achieved when different tasks are performed simultaneously at different stages of the pipeline. Data parallelism is achieved when the same task is performed simultaneously on different data. The computational efficiency is further improved by implementing each stage of the graphics pipeline using custom rather than general-purpose hardware.

FIG. 4. The graphics pipeline. The boxes shaded in light red correspond to stages of the pipeline that can be programmed by the user.

Within a graphics application, the GPU operates as a stream processor. In the graphics pipeline, a stream of vertices (representing triangular meshes) is read from the host's main memory and processed in parallel by vertex shaders (Fig. 4). Typical vertex processing tasks include projecting the geometry onto the image plane of the virtual camera, computing the surface normal vectors, and generating 2-D texture coordinates for each vertex.

After having been processed, vertices are assembled into triangles to undergo rasterization (Fig. 4). Rasterization, implemented in hardware, determines which pixels are covered by a triangle and, for each of these pixels, generates a fragment. Fragments are small data structures that contain all the information needed to update a pixel in the framebuffer, including pixel coordinates, depth, color, and texture coordinates. Fragments inherit their properties from the vertices of the triangles from which they originate, wherein properties are bilinearly interpolated within the triangle area by dedicated GPU hardware.

The stream of fragments is processed in parallel by fragment shaders (Fig. 4).
In a typical graphics application, this programmable stage of the pipeline uses the fragment data to compute the final color and transparency of the pixel. Fragment shaders can fetch textures, calculate lighting effects, determine occlusions, and define transparency. After having been processed, the stream of fragments is written to the framebuffer according to predefined raster operations, such as additive blending.

All the stages of the graphics pipeline are implemented on the GPU using dedicated hardware. In a unified shader model, the system allocates computing cores to vertex and fragment shading based on the relative intensity of each task. Early GPU computing work focused on exploiting the high computational throughput of the fragment shading stage, which is easily accessible. For instance, a popular technique for processing 2-D arrays of data consisted in rendering a rectangle into the framebuffer with multiple data arrays mapped as textures and custom fragment shaders enabled.

Graphics applications and video games favor throughput over latency because, above a certain threshold, the human visual system is less sensitive to the frame rate than to the level of detail of a 3-D scene. As a result, GPU implementations of the graphics pipeline are optimized for throughput rather than latency. Any given triangle might take hundreds to thousands of clock cycles to be rendered, but, at any given time, tens of thousands of vertices and fragments are in flight in the pipeline. Most medical physics applications are similar to video games in the sense that high throughput is considerably more important than low latency.

In graphics mode, the GPU is interfaced through a graphics API such as OpenGL or DirectX. The API provides functions for defining and rendering 3-D scenes. For instance, a 3-D triangular mesh is defined as an array of vertices and rendered by streaming the vertices to the GPU. Arrays of data can be moved to and from video memory as textures. Custom shading programs can be written using a high-level shading language such as Cg, GLSL, or HLSL, and loaded on the GPU at run time. Early GPU computing programs written in such a framework achieved impressive accelerations but suffered from several drawbacks: the code is difficult to develop and maintain because the computation is defined in terms of graphics concepts such as vertices, texture coordinates, and fragments; performance is compromised by the lack of access to all the capabilities of the GPU (most notably shared memory and scattered writes); and code portability is limited by the hardware-specific nature of some graphics extensions.
II.C. GPU computing model

Compute-oriented APIs expose the massively parallel architecture of the GPU to the developer in a C-like programming paradigm. Commonly used APIs include NVIDIA CUDA, Microsoft DirectCompute, and OpenCL. For cohesion, this review focuses on CUDA, currently the most popular GPU computing API, but the concepts it presents are readily applied to other APIs. CUDA provides a set of extensions to the C language that allow the programmer to access computing resources on the GPU, such as video memory, shading units, and texture units, directly, without having to program the graphics pipeline [7]. From a hardware perspective, a CUDA-enabled graphics card comprises SGRAM memory and the GPU chip itself: a collection of streaming multiprocessors (MPs) and on-chip memory.
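To make these hardware resources concrete, the short sketch below (our illustration, not code from the review) queries them through the CUDA runtime API; the use of device index 0 is an assumption, since a system may host several GPUs.

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    // Query the properties of the first CUDA-capable device (index 0 assumed).
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA-capable device found\n");
        return 1;
    }
    printf("Device:                  %s\n", prop.name);
    printf("Streaming MPs:           %d\n", prop.multiProcessorCount);
    printf("Global memory:           %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    return 0;
}
```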
In the CUDA paradigm, a parallel task is executed by launching a multithreaded program called a kernel. The computation of a kernel is distributed to many threads, which are grouped into a grid of blocks (Fig. 5). Physically, the members of a thread block run on the same MP for their entire lifetime, communicate information through fast shared memory, and synchronize their execution by issuing barrier instructions. Threads belonging to different blocks are required to execute independently of one another, in arbitrary order. Within one thread block, threads are further divided into groups of 32 called warps. Each warp executes in a SIMD fashion, with the MP broadcasting the same instruction to all its cores repeatedly until the entire warp is processed. When one warp stalls, for instance because of a memory operation, the MP can hide this latency by quickly switching to a ready warp.

FIG. 5. GPU thread and memory hierarchy. Threads are organized as a grid of thread blocks. Threads within a block are executed on the same MP and have access to on-chip private registers (R) and shared memory. Additional global and local memories (LM) are available off-chip to supplement limited on-chip resources.

Even though each MP runs as a SIMD device, the CUDA programming model allows threads within a warp to follow different branches of a kernel. Such diverging threads are not executed in parallel but sequentially. Therefore, CUDA developers can safely write kernels that include if statements or variable-bound for loops without taking into account the SIMD behavior of the GPU at the warp level. As we will see in Sec. II.D, thread divergence substantially reduces performance and should be avoided.
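The kernel abstraction can be illustrated with a minimal sketch (ours, under stated assumptions, not an example from the review): each thread scales one voxel of an image, and the host launches enough blocks of 256 threads, an arbitrary but common choice, to cover the whole array.

```c
#include <cuda_runtime.h>

// Kernel: each thread processes exactly one voxel (data parallelism).
__global__ void scaleVoxels(float *voxels, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n)              // guard: the grid may be larger than the array
        voxels[i] *= gain;  // uniform operation: no warp divergence
}

int main(void) {
    const int n = 512 * 512;  // e.g., one image slice (hypothetical size)
    float *d_voxels;
    cudaMalloc(&d_voxels, n * sizeof(float));
    // ... fill d_voxels with cudaMemcpy(..., cudaMemcpyHostToDevice) ...

    // Launch a grid of thread blocks, 256 threads per block.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleVoxels<<<blocks, threadsPerBlock>>>(d_voxels, 2.0f, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    cudaFree(d_voxels);
    return 0;
}
```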
The organization of the GPU's memory mirrors the hierarchy of the threads (Fig. 5). Global memory is randomly accessible for reading and writing by all threads in the application. Shared memory provides storage reserved for the members of a thread block. Local memory is allocated to threads for storing their private data. Last, private registers are divided among all the threads residing on the MP. Registers and shared memory, located on the GPU chip, have much lower latency than local and global memories, which are implemented in SGRAM. However, global memory can store several gigabytes of data, far more than shared memory or registers. Furthermore, while shared memory and registers only hold data temporarily, data stored in global memory persist beyond the lifetime of the kernels.
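The performance role of on-chip memory can be sketched with a block-wise reduction (our hedged example, not from the review): each block accumulates its portion of an array in fast shared memory, synchronizing with barrier instructions, and commits a single partial sum to slow global memory.

```c
// Each block computes one partial sum in on-chip shared memory.
// The block size of 256 threads (a power of two) is an assumption.
#define BLOCK 256

__global__ void partialSums(const float *in, float *blockSums, int n) {
    __shared__ float cache[BLOCK];  // fast, per-block on-chip storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // barrier: wait until every thread has loaded its value

    // Tree reduction within the block, halving the active threads each step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();  // barrier between reduction steps
    }

    if (threadIdx.x == 0)  // one thread per block writes to global memory
        blockSums[blockIdx.x] = cache[0];
}
```

The per-block results would then be combined by a second kernel or on the host; keeping the intermediate traffic in registers and shared memory is what exploits their much lower latency.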
Two other types of memory are available on the GPU, called texture and constant memory. Both of these memories are read-only and cached for fast access. In addition, texture memory fetches are serviced by dedicated hardware units that can perform linear filtering and address calculations. In devices of compute capability 2.0 and greater, global memory operations are serviced by two levels of cache, namely a per-MP L1 cache and a unified L2 cache.

Unlike previous graphics APIs, CUDA threads can write multiple data to arbitrary memory locations. Such scattered writes are useful for many algorithms, yet conflicts can arise when multiple threads attempt to write to the same memory location simultaneously. In such a case, only one of the writes is guaranteed to succeed. To safely write data to a common memory location, threads must use an atomic operation; for instance, an atomic add operation accumulates its operand into a given memory location. The GPU processes conflicting atomic writes in a serial manner to avoid data write hazards.
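The contrast between a hazardous scattered write and its atomic counterpart can be sketched with a simple histogram kernel (our hypothetical example; binning values into shared output locations is the pattern at issue):

```c
// Many threads bin values into one histogram in global memory. Several
// threads may hit the same bin at once, so a plain "hist[bin] += 1" is a
// read-modify-write race in which only one update is guaranteed to survive.
__global__ void histogram(const float *values, int *hist,
                          int n, float binWidth, int nBins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int bin = (int)(values[i] / binWidth);
    if (bin >= 0 && bin < nBins) {
        // hist[bin] += 1;          // UNSAFE: conflicting scattered writes
        atomicAdd(&hist[bin], 1);   // safe: conflicts are serialized
    }
}
```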
An issue important in medical physics is the reliability of memory operations. Errors introduced while reading or writing memory can have harmful consequences for dose calculation or image reconstruction. The memory of consumer-grade GPUs is optimized for speed because the occasional bit flip has little consequence in a video game. Professional and high-performance computing GPUs are designed for a higher level of reliability, which they achieve using specially chosen hardware operated at a lower clock rate. For further protection against unavoidable cosmic radiation and other sources of error, some of the more recent GPUs store redundant error-correcting codes (ECCs).
