Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization


Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization

Chao Peng, Peng Mi, Yong Cao
Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24060
Email: chaopeng@vt.edu, mipeng@vt.edu, yongcao@vt.edu

Abstract—Rendering massive 3D models has been recognized as a challenging task. Due to the limited size of GPU memory, a massive model containing hundreds of millions of primitives cannot fit into most modern GPUs. By applying parallel level-of-detail (LOD), as proposed in [1], only a portion of the primitives, instead of the whole set, needs to be streamed to the GPU. However, the low bandwidth of CPU-GPU communication is still the major bottleneck that prevents users from achieving high-performance rendering of massive 3D models on a single-GPU system. This paper explores a device-level parallel design that distributes the workloads of both GPU out-of-core and LOD processing in a multi-GPU multi-display system. Our multi-GPU out-of-core takes advantage of a load-balancing method and integrates seamlessly with the parallel LOD algorithm. By using frame-to-frame coherence, the overhead of data transfer is significantly reduced on each GPU. Our experiments show a highly interactive visualization of the "Boeing 777" airplane model, which consists of over 332 million triangles and over 223 million vertices.

I. INTRODUCTION

Recently, GPU hardware has been praised not only for its dramatically increased computational power but also for its programmability for general-purpose computation. GPU architectures allow a large number of threads to be launched simultaneously to support high-performance applications. However, because of the enormous number of renderable primitives in a massive model, the power and memory of a single GPU may not be sufficient to visualize them at a decent rendering rate.
In research domains such as Computer-Aided Design (CAD), mechanical visualization and virtual reality, researchers develop very complex 3D models that may consist of millions, or even hundreds of millions, of polygon primitives and consume several gigabytes of storage. Since these gigabyte-sized models cannot fit into most commodity GPUs, the polygon primitives have to be transferred from the CPU host before rendering each frame. Although parallel LOD algorithms [1], [2], [3] have been proposed to reduce the amount of GPU-resident primitives, the bandwidth of CPU-GPU communication is still too low to transfer the data efficiently, and it is usually the major performance bottleneck in massive model rendering.

To address this issue, multi-GPU systems have caught the attention of researchers, since they can provide more computational power and memory. With rapid hardware development, a motherboard is commonly configured with multiple PCIe slots that enable the installation of two or more graphics cards. This configuration has high potential to increase performance by distributing workloads across GPUs. Meanwhile, with additional GPU display ports, multiple display monitors can be connected, so that the rich information embedded in massive data can be visualized at the much higher resolutions it deserves. However, it is not trivial to transplant a parallel approach from a single-GPU to a multi-GPU system. One major reason is the lack of both programming models and well-established inter-GPU communication for multi-GPU systems. Although major GPU suppliers, such as NVIDIA and AMD, support multiple GPUs through Scalable Link Interface (SLI) and CrossFire, respectively, these technologies are primarily designed for gaming and lack the functionality needed for general programming and software implementation. Also, when SLI or CrossFire is enabled, the GPUs behave as one hardware entity, and per-GPU execution is not allowed.
Another reason is the workload-balancing issue between GPUs: an imbalanced workload distribution hurts performance.

Main contributions. In this paper, we present a device-level parallel approach for real-time massive model rendering in a multi-GPU multi-display system. Our contributions include the following two features:

1) Parallel GPU out-of-core. We employ device-level parallelism for efficient data fetching from CPU main memory to multiple GPU devices. Our parallel out-of-core is seamlessly integrated with LOD processing and frame-to-frame coherence schemes.

2) Balanced data distribution for parallel rendering. We propose a load-balancing algorithm to dynamically and evenly distribute workloads to each GPU, so that both performance and memory usage reach an optimal standard.

System configuration. While the performance of multicore CPUs and GPUs scales along Moore's law, it has become feasible to build clusters of heterogeneous multi-GPU architectures for high-performance graphics. However, before deploying an efficient algorithm on a cluster system, it is essential to make a single node perform optimally. Upon this

demand, we use a single-node machine, a standard PC platform configured with two GPU devices. Each GPU is connected to a dedicated PCI-Express slot and is able to execute its local instructions and perform rendering tasks.

Input 3D model. CAD datasets are one of the major digital representations of man-made designs. In our approach, we target the visualization of the "Boeing 777" CAD model, which organizes rich geometric information in a huge number of loosely connected objects. This airplane model contains over 332 million triangles and over 223 million vertices, which consume more than 6 gigabytes of memory.

Organization. Section II briefly reviews related work. In Section III, we describe the current state-of-the-art technologies in massive model rendering. In Section IV, we state the problems in high-performance rendering and give an overview of our approach. In Section V, we discuss the methods of CUDA-OpenGL interoperability for a multi-GPU architecture. In Section VI, we present our load-balancing algorithm for GPU data distribution. Section VII discusses the methods of GPU-GPU communication and synchronization. We evaluate our approach in Section VIII. Finally, we conclude our work in Section IX.

II. RELATED WORK

In the past, researchers have dedicated their efforts to massive model rendering using many different acceleration data structures; we review some of them in Section II-A. Since multi-GPU systems have become a new trend in High-Performance Computing (HPC), we cover previous works that concentrated on device- and cluster-level parallel designs in Section II-B.

A. Massive Model Visualization

Interactive rendering of massive 3D models has been an active research domain. To address performance issues, mesh simplification is commonly used to reduce the rendering workload. The basic idea of mesh simplification is to simplify a complex model until it becomes manageable by renderers.
Some previous approaches, such as Progressive Meshes ([4], [5]), Quadric Error Metrics ([6], [7]) and adaptive LOD [8], simplify meshes based on a sequence of edge-collapsing modifications. In order to handle large-scale models, out-of-core methods have been proposed. Cignoni et al. [9] presented a geometry-based multi-resolution structure, known as the tetrahedra hierarchy, to pre-compute static LODs. Yoon et al. [10] presented a clustered hierarchy of progressive meshes (CHPM) as their out-of-core structure for view-dependent simplification and rendering.

More recently, GPUs have been used to simplify complex models. Hu et al. [3] implemented a view-dependent LOD algorithm on the GPU without fully considering the vertex dependencies. Derzapf et al. [11] used a compact data structure for Progressive Meshes that requires less GPU memory, so that run-time parallel processing can be optimized; later, their work was extended to parallel out-of-core LOD in [12]. In order to render gigabyte-scale 3D models in parallel, Peng et al. [1] presented a parallel approach that successfully removes the data dependencies for efficient run-time processing. By utilizing temporal coherence between frames, large amounts of data can be streamed and defragmented efficiently.

B. Multi-GPU Approaches

With the rapid development of hardware, many computing systems are built with GPU clusters for high-performance computation. Eilemann [13] summarized and analyzed existing approaches targeting parallel rendering designs with multiple or clustered GPUs. One popular software package, Equalizer, has been commonly used in the multi-GPU rendering community. As introduced in [14], Equalizer is a scalable parallel rendering framework suitable for large data visualization and OpenGL-based applications.
It utilizes a flexible compound tree structure to support its rendering and image compositing strategy; however, some issues remain unsolved and affect parallel performance, for example, the load balancing between GPUs or cluster nodes for data distribution and replication.

Because of the importance of load balancing, Fogal et al. [15] discussed it by considering data-transfer overhead and the balance among rendering, compositing and the viability of a GPU cluster in a distributed-memory environment. Erol et al. [16] concentrated on cross-segment load balancing within the Equalizer framework; their work proposed a dynamic task partition strategy for the best usage of the available shared graphics resources. Another approach, presented in [17], targeted dynamic load balancing: by utilizing frame-to-frame coherence, it redistributed data based on historical frame rates. If one GPU had a higher frame rate in previous frames, it would be allocated a larger workload; otherwise, its workload would be reduced accordingly. To achieve higher resolutions in a visualization, tiled display systems have been widely used. As discussed in [18], the whole picture is projected onto multiple display nodes with proportional viewports. Besides describing the system setting, the authors also provided solutions for multi-display synchronization.

III. DESCRIPTION OF THE CURRENT STATE-OF-THE-ART

Among standard real-time visualization techniques, mesh simplification is an acceleration technique that reduces the complexity of 3D models for fast rendering. Traditional algorithms represent data in hierarchical structures, such as the multi-resolution structure of static LODs [9] and the clustered vertex hierarchy [10].
To build them, a bottom-up node-merging process is used, and inter-dependencies are introduced between levels of the hierarchies.

Peng and Cao [1] proposed a dependency-free approach that makes the simplification of massive models suitable for the GPU. Our parallel LOD implementation extends Peng and Cao's work, so we give more details in this section. In that work, edge-collapsing information is encoded in an array structure that is generated by collapsing edges iteratively. At each iteration,

two vertices of an edge are merged, and the corresponding triangles are eliminated. To ensure a faithful look of the low-poly object, each edge can be chosen based on the rule that, when collapsed, visual changes are minimal (e.g., the rule introduced in [19]). Each element in the array corresponds to a vertex, and its value is the index of the target vertex that it merges into.

According to the order of edge-collapsing operations, the storage of vertices and triangles is re-arranged. Basically, the first vertex removed during iterative edge-collapsing is re-stored at the last position in the set of vertex data, and the last removed vertex is re-stored at the first position. The same re-arranging strategy is applied to the triangle data as well. As a result, the storage order of the re-arranged data reflects the levels of detail of the model. Consequently, if a coarse version of the model is needed, a small number of contiguous vertices and triangles is sufficient, selected starting from the first element of each set.

At run-time, based on LOD selection criteria, such as those used in [20], [21], [22], [23], only a portion of the data is activated to generate the simplified version of the original model as the alternative for rendering. Using GPU parallel architectures, each selected triangle is assigned to a GPU thread and reshaped: each of the triangle's three vertex indices is replaced with an appropriate target vertex by walking backward through the array of edge-collapsing information.

The growth of GPU memory has not caught up with the capacity of CPU main memory; most of today's GPUs cannot hold data requiring several gigabytes of storage. Thus, at each rendering frame, the selected portion of the data has to be streamed to the GPU to perform parallel computation and rasterization.

IV. PROBLEM STATEMENTS AND OVERVIEW

In this section, we describe our research problems by identifying performance bottlenecks and load balancing issues. We also give an overview of our parallel design.
A. Performance Bottlenecks

CPU-GPU data streaming is unavoidable in large-scale data visualization. Although the size of the renderable data can be reduced using simplification algorithms, to preserve a decent level of visual fidelity the simplified data is usually still too large to be streamed efficiently. Thus, the size of the to-be-streamed data becomes the major issue that prevents the achievement of highly interactive rates.

Brute-force streaming of the selected data to the GPU is very time-consuming. A common way to reduce the time spent on data streaming is to utilize frame-to-frame coherence, so that only the frame-different data is identified and needs to be streamed. By combining this streamed data with the data already on the GPU from the previous frame, we can assemble the new data for the currently rendered frame. Standard mechanisms of graphics programming, such as OpenGL's Vertex Buffer Objects (VBOs), take the graphics driver's hints about primitive usage patterns to increase rasterization performance. To use VBOs, the data have to be organized in the same order as they are originally stored. However, in general, the frame-different data will be stored in a memory block separate from the already existing data. To combine them according to the original data layout, the data in both blocks have to be shuffled to the correct positions in the memory reserved for the current frame. This process is generally known as "defragmentation". Unfortunately, defragmentation on the GPU is slow, because it involves many non-coalesced global memory accesses. Defragmentation time scales with the size of the data residing on the GPU, so it becomes a significant factor in overall rendering performance, especially when the data size is massive. Using multi-GPU systems, the data can be distributed.
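As a minimal illustration of the frame-to-frame coherence scheme above, the following sketch computes the frame difference to stream and reassembles the buffer in original storage order. The object IDs and block contents are hypothetical stand-ins for contiguous vertex/triangle blocks; this is a CPU-side sketch, not the paper's GPU implementation.

```python
# Sketch of frame-to-frame coherent streaming plus "defragmentation":
# only the frame-different data is streamed, and the combined buffer is
# re-ordered to match the original storage order. Object IDs are
# hypothetical stand-ins for contiguous vertex/triangle blocks.

def plan_streaming(prev_resident, curr_needed):
    """Return the sets of object IDs to stream in and to evict."""
    to_stream = curr_needed - prev_resident  # frame-different data only
    to_evict = prev_resident - curr_needed
    return to_stream, to_evict

def defragment(curr_needed, storage):
    """Assemble the current frame's buffer in original storage order,
    mixing reused resident blocks with newly streamed ones."""
    return [storage[obj_id] for obj_id in sorted(curr_needed)]

if __name__ == "__main__":
    storage = {i: f"block{i}" for i in range(6)}  # CPU-side source data
    prev, curr = {0, 1, 2, 5}, {1, 2, 3, 5}
    stream, evict = plan_streaming(prev, curr)
    print(stream, evict)              # {3} {0}
    print(defragment(curr, storage))  # blocks 1, 2, 3, 5 in order
```

The point of the re-ordering step is exactly the defragmentation cost discussed above: the streamed block for object 3 must land between the resident blocks for objects 2 and 5.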
Each GPU then processes a smaller amount of data, which reduces the defragmentation overhead.

B. Load Balancing Issues between GPUs

Using multiple GPUs is a trend for increasing computational power and memory capacity. However, balancing workloads and resource utilization between GPUs has not been satisfactorily addressed. Load balancing problems in massive model visualization center on how to distribute the renderable data to the GPUs. An imbalanced distribution causes underutilization of available GPU resources and wastes memory, so that both performance and visual quality may decrease.

Commercial solutions, such as NVIDIA's SLI technology, have been introduced to balance geometry workloads between two GPUs. SLI is a bridge that spans two GPUs to send data directly within a master-slave configuration. For example, the master GPU sends half of the rendering work to the slave GPU; the slave GPU then sends its output image back to the master for compositing. However, SLI is not suitable for the problems we want to solve in this paper. SLI does not incorporate out-of-core processing or NVIDIA CUDA development; it requires all data to fit in GPU memory. Also, its master-slave configuration is designed for single-display applications, not multi-display applications, because the latter connect each GPU to a display monitor, which breaks the master-slave concept. Therefore, it is essential to have a better load balancer for GPU out-of-core and CUDA-programmed parallel LOD.

C. Overview of Our Approach

In today's PC systems, multiple GPUs are becoming a standard configuration for all levels of users. The goal of our work is to design a parallel system that provides software support for multi-GPU rendering. Each GPU is driven by one CPU core. In our implementation, each CPU core executes an instance of the program and feeds data from CPU main memory to the GPU it is associated with. The input 3D model and other necessary

run-time control parameters are shared among processes by employing a method of Inter-Process Communication (IPC). We illustrate the overview of our approach in Figure 1. Each GPU is connected to a display monitor to show the rendered geometries. To balance the data between GPUs, our load-balancing algorithm automatically calculates the optimal partition. Direct inter-communication between GPUs is established in order to exchange framebuffers as necessary for the final display.

Fig. 1. The overview of our approach (per-frame pipeline: CPU-GPU streaming for occluders, culling of hidden objects, LOD selection, CPU-GPU streaming for visible objects, LOD model generation and LOD model rendering; the vertex and triangle data, AABBs and collapsing information reside in CPU and GPU memory).

V. PARALLELIZATION OF CUDA-OPENGL INTEROPERABILITY ON MULTIPLE GPUS

When asking a GPU to perform both general-purpose and graphics computations, interoperation between NVIDIA CUDA and OpenGL is desired. In our application, CUDA is suitable for running the calculations of data defragmentation and parallel triangle-level simplification, while OpenGL is better suited to rendering the simplified models. CUDA-OpenGL interoperability is provided by the CUDA SDK, but for multi-GPU implementations, interoperation among multiple CUDA and OpenGL contexts becomes an issue. In general, the possible solutions include: (1) a single CPU thread; (2) multiple CPU threads (one GPU controlled by one CPU thread); (3) multiple processes (one GPU controlled by one process). We discuss the details in the following paragraphs.

A. A Single CPU Thread

Since the release of CUDA v4.0, multi-GPU programming can be performed in a single CPU thread. By calling cudaSetDevice(), CUDA kernel executions and contexts are switched between GPUs. But switching OpenGL contexts between GPUs in a single CPU thread is not allowed; for example, cudaGLSetGLDevice() can be called only once at the start of the program. This makes a single-CPU-thread implementation incompatible with our needs.

B. Multiple CPU Threads

Creating multiple CPU threads allows one CPU thread to bind one GPU, so that each GPU has not only its own contexts but also its own host thread. A thread is able to maintain its local storage and control the device it is assigned. The OpenGL context associated with a CPU thread is able to interoperate with the CUDA context of the same thread. However, the problem is that OpenGL is not a thread-safe graphics API, since it is asynchronous by nature. The GL calls within different threads cannot be executed in the order they are issued. When the driver schedules these calls, OpenGL contexts would switch frequently, which is particularly time-consuming and would significantly decrease overall performance.

C. Multiple Processes

To eliminate the overhead of switching contexts, a multi-process strategy is a solution for interoperating CUDA and OpenGL on multiple GPUs. Each process communicates with one GPU and maintains its own memory space and private run-time resources. The GL calls within a process are executed in order, unaffected by the other processes. Thus, in our implementation, we use this multi-process strategy together with Inter-Process Communication (IPC) for synchronization of shared properties (such as camera viewpoints).

VI. DATA DISTRIBUTION AT RUN-TIME

In our system, two GPUs are installed in a single computer node. Each GPU drives a display monitor and visualizes half of the frame, containing the data appearing in its window. Unfortunately, this simple strategy usually results in poor performance if graphical primitives are not uniformly distributed over the GPUs.

A worse issue caused by imbalanced distribution concerns memory usage. The "Boeing 777" model used in our system has over 700 thousand individual objects and contains more than 6 gigabytes of vertices and triangles. Since most GPUs have much less memory than that, the primitive count allocated to each object is budgeted by ensuring the sum of all primitive counts is constrained within a given maximum. Of course, more GPUs mean more available memory, and consequently a higher potential to increase the levels of detail by adding more primitives to objects. However, an imbalanced distribution may overburden a GPU's memory capacity. In an extreme case, one GPU may receive all of the selected data, exceeding its maximal memory size, while the other one idles without any.

Our load balancer uses a dynamic partitioning procedure that recursively splits the space of the view frustum. The balancer harmonizes its execution time with the partition quality within an efficient parallel implementation on the GPU. In the following paragraphs, we first introduce the fundamental method of view frustum partitioning; then we propose our dynamic load-balancing algorithm.

A. The Fundamentals of View Frustum Partitioning for Data Distribution

From a viewpoint, only the objects inside the view frustum (represented with six planes defined by the camera) are visible to the renderer. We pre-calculate a tight Axis-Aligned Bounding Box (AABB) for each object. AABBs are used to determine the visibility of objects by testing them against the view frustum. If an object is outside, it will be assigned

with the lowest level of detail (e.g., zero vertices and triangles); otherwise, it will be allocated a cut from the budget of the overall primitive count through the process of LOD selection. After that, we distribute the primitive data to the GPUs. As shown in Figure 2, the view frustum is divided into sub-frustums, each of which is associated with a GPU. For each visible object, we identify the GPU it belongs to by testing its AABB against the sub-frustums. If an object is not in the sub-frustum of a GPU, the detail level of the object is set to zero for this GPU. At the rasterization stage, the perspective and projection transformations take the full size of the view frustum, but for displaying the contents on the screen, the viewport only needs to be set to the half of the framebuffer that contains rendered objects.

Fig. 2. The view frustum partitioning. The indices 0-5 represent the objects' bounding boxes; a-d stand for the desired primitive counts resulting from LOD selection.

B. Parallel Dynamic Load-Balancing Algorithm

Obviously, the fundamental approach described in Section VI-A leads to load-balancing problems, since it always distributes the data by partitioning the frustum statically and evenly. The static partitioning method distributes the workloads according to the dimension proportions of the windows launched by the GPUs, rather than by balancing the computational cost of data processing. To achieve optimal rendering performance, we present a parallel dynamic load-balancing algorithm. Given a specific viewpoint, the screen is dynamically split by balancing the number of polygon primitives operated on by each GPU. The rendered images are exchanged between GPUs via inter-process communication to adjust the image projection. For example, as illustrated in Figure 3, the number of triangles is balanced between GPU0 and GPU1.
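The per-object distribution rule of Section VI-A can be sketched as follows. For brevity, the six-plane AABB/frustum test is reduced to an overlap test on normalized screen-x intervals, and the object extents and budgets are hypothetical; the actual system tests 3D AABBs against sub-frustum planes.

```python
# Sketch of the Section VI-A distribution rule: an object keeps its
# LOD budget only on the GPU whose sub-frustum it overlaps; elsewhere
# its detail level is set to zero. The AABB-vs-frustum test is
# simplified here to a 1D overlap test on normalized screen-x ranges.

def distribute(aabbs_x, comp_level, split=0.5):
    """aabbs_x: per-object (xmin, xmax) in [0, 1] screen space.
    Returns (left, right) complexity lists, one per GPU."""
    left, right = [], []
    for (xmin, xmax), c in zip(aabbs_x, comp_level):
        left.append(c if xmin < split else 0)    # overlaps left sub-frustum
        right.append(c if xmax >= split else 0)  # overlaps right sub-frustum
    return left, right

if __name__ == "__main__":
    boxes = [(0.1, 0.2), (0.6, 0.9), (0.4, 0.6)]  # hypothetical AABBs
    budgets = [10, 20, 30]                        # from LOD selection
    print(distribute(boxes, budgets))  # ([10, 0, 30], [0, 20, 30])
```

Note that an object straddling the split (the third one here) keeps its budget on both GPUs, since its AABB intersects both sub-frustums.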
GPU0 renders a larger part of the screen, so it transfers an image portion to GPU1 to ensure viewport correctness.

Fig. 3. The dynamic load-balancing algorithm. The whole screen is split by balancing the number of primitives distributed between GPUs. In this example, GPU0 transfers a portion of the image to GPU1.

Our method is illustrated in Algorithm 1. The goal of the algorithm is to calculate where the view frustum should be split, so that the amount of data can be balanced. In the algorithm, the returned value, split, ranges between (0,1); it gives the position at which the view frustum is split on the near plane. The algorithm is executed in a per-process manner. vf represents the view frustum associated with the camera, and id is the process index. Here, we use either the vertex count or the triangle count to represent the complexity (the level of detail) of each object. In the algorithm, the list of object complexities is represented as compLevel, where the ith object's complexity is denoted compLevel[i].

In the initialization, we set split to 0.5, indicating that the view frustum is divided evenly. Then, we iteratively search for the optimal split value. At each iteration, the sub-frustum of each process is updated; we denote the left sub-frustum as subFl. The objects' AABBs are tested against subFl to re-generate compLevel for the GPU. In Algorithm 1, compLevell represents the list of object complexities for the GPU rendering the left part of the screen, and compLevelr is for the GPU rendering the right part (refer to Lines 8-16 in Algorithm 1). For efficiency, we employ a CUDA implementation to compute compLevell and compLevelr in parallel. In Line 19, the ratio is used to decide whether the split value is satisfactory by comparing it against the threshold value. Ideally, setting threshold to 0.5 will cut the data evenly for distribution.
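A runnable sketch of this split search follows. The sub-frustum update and the parallel (CUDA) reductions of the paper are replaced by plain Python over hypothetical 1D screen-space object extents, so this illustrates only the bisection logic, not the actual implementation.

```python
# Sketch of the dynamic load balancer's split search: bisect the split
# position until the primitive sums on the two sides fall within the
# threshold ratio band [threshold, 1/threshold]. Object extents are
# hypothetical 1D screen-space positions standing in for AABB tests.

def balance_split(xmins, comp_level, threshold=0.5, max_iter=32):
    split, increment = 0.5, 0.5
    for _ in range(max_iter):
        # classify each object's complexity against the left sub-frustum
        sum_l = sum(c for x, c in zip(xmins, comp_level) if x < split)
        sum_r = sum(comp_level) - sum_l
        ratio = sum_l / max(sum_r, 1e-9)
        increment *= 0.5
        if ratio > 1.0 / threshold:   # left side too heavy: move split left
            split -= increment
        elif ratio < threshold:       # right side too heavy: move split right
            split += increment
        else:                         # balanced within the threshold band
            break
    return split

if __name__ == "__main__":
    xmins = [0.0, 0.2, 0.6, 0.9]    # hypothetical object positions
    budgets = [10, 10, 10, 10]
    print(balance_split(xmins, budgets))  # 0.5 (already balanced)
```

With discrete objects an exact balance is not always reachable, which is why the iteration count and threshold trade execution time against partition quality, as discussed above.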
However, in most cases this requires too many iterations, which would slow down performance. An appropriate threshold, weighing the execution time of the load balancer against the distribution proportion, is better learned in practice. After that, each GPU renders its own objects as classified by the split value. As shown in Figure 3, the GPU receiving the larger portion of the view frustum sends the "unwanted" portion of the output frame to the other GPU.

VII. SYNCHRONIZATION AND INTER-PROCESS COMMUNICATION (IPC)

In CUDA programming environments, GPUs cannot interact with each other directly. The only way of inter-GPU communication is going through the controls of inter-CPU

Algorithm 1 Computing the Screen Splitting Value
LoadBalancing(in id, vf, threshold, compLevel; out split, compLevel)
1:  split ← 0.5
2:  increment ← 0.5
3:  subFl ← vf
4:  while true do
5:    // update the left sub-frustum
6:    UpdateSubFrustum(subFl, split, vf)
7:    // balance the amount of data
8:    for the ith object's AABB, in parallel do
9:      if AABBi inside subFl then
10:       compLevell[i] ← compLevel[i]
11:       compLevelr[i] ← 0
12:     else
13:       compLevelr[i] ← compLevel[i]
14:       compLevell[i] ← 0
15:     end if
16:   end for
17:   suml ← the sum of the elements in compLevell, in parallel
18:   sumr ← the sum of the elements in compLevelr, in parallel
19:   ratio ← suml / sumr
20:   increment ← increment × 0.5
21:   if ratio > 1/threshold then
22:     split ← split − increment
23:   else if ratio < threshold then
24:     split ← split + increment
25:   else
26:     compLevel ← compLevelid
27:     return
28:   end if
29: end while

communication, where each GPU is controlled by the process of a CPU core. We use the Message-Passing Interface (MPI) in our system. MPI is a specification that moves data among processes through cooperative operations on each. MPI is portable, hardware-optimized and widely used in High-Performance Computing (HPC). Recently, researchers and developers have demonstrated the efficiency of multi-GPU applications using MPI communication to facilitate GPU data movement, as in [24], [25], [26]. In this section, we discuss the MPI-based methods of synchronization and IPC used in our multi-GPU multi-display system.

Synchronization and communication are necessary to unite the GPUs, so that they act as co-processors to coordinate the rendering tasks. In Figure 4, we show the requirements from the pipeline of our system, including: (1) only one process is allowed to control the camera at a time, and the camera values need to be shared between processes at run-time; (2) an inter-GPU communication scheme is needed to exchange framebuffers between GPUs.

Fig. 4. The principle of the synchronizations between GPUs (per-GPU pipeline: viewpoint/view frustum, LOD selection, computing the split value, GPU out-of-core, rendering, and displaying the image to the window).

Camera movement sync. When updating the camera viewpoint, only one process is active to respond to mouse/keyboard callback events. To pass the camera values efficiently, we use shared memory that may be accessed by all processes simultaneously without any redundant copies. We select one process as the root, and the others send their activation status (e.g., whether the mouse is over the window) to the root. The root identifies which process is activated and broadcasts the index of this process to all others. Then, after the active process finishes the
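The root-mediated camera synchronization described above can be sketched as follows. MPI's gather/broadcast collectives and the shared memory segment are modeled here as plain Python values, and the ranks, activation statuses and camera tuple are hypothetical; the paper's system uses real MPI processes.

```python
# Sketch of the camera synchronization protocol: the root gathers each
# process's activation status, elects the active process and broadcasts
# its rank; only the active process then publishes its camera into the
# shared segment, which every process reads. MPI collectives and shared
# memory are modeled as plain Python values (hypothetical data).

def elect_active(statuses):
    """Root step: pick the rank whose window currently has the mouse."""
    for rank, active in enumerate(statuses):
        if active:
            return rank
    return 0  # default to the root if no window is active

def sync_camera(shared, rank, active_rank, local_camera):
    """Only the active process writes; everyone reads the shared value."""
    if rank == active_rank:
        shared["camera"] = local_camera
    return shared["camera"]

if __name__ == "__main__":
    shared = {"camera": (0.0, 0.0, 0.0)}      # stand-in shared segment
    active = elect_active([False, True])      # rank 1's window is active
    cam = sync_camera(shared, 1, active, (1.0, 2.0, 3.0))
    print(active, cam)  # 1 (1.0, 2.0, 3.0)
```

The single-writer rule mirrors requirement (1) above: exactly one process controls the camera at a time, while all processes render from the same shared viewpoint.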

