
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 9, No. 10, 2018

Topology-Aware Mapping Techniques for Heterogeneous HPC Systems: A Systematic Survey

Saad B. Alotaibi, Dr. Fathy Alboraei
Faculty of Computing and Information Technology
King Abdulaziz University
Jeddah, Saudi Arabia

Abstract—Modern high-performance computing (HPC) platforms consist of heterogeneous computing devices connected through complex hierarchical networks. As these platforms move towards the Exascale era, both the number of nodes and the number of cores within a node keep increasing. As a consequence, communication costs and data movement are increasing as well. Efficient topology-aware process mapping has therefore become vital for managing data locality and for improving system performance and energy consumption. It decreases the communication cost of the processes by matching the application virtual topology (exploited by the system for assigning the processes to the physical processors) to the target underlying hardware architecture, called the physical topology. It also addresses the locality problem, which is one of the most challenging issues faced by current parallel applications. In this survey paper, we study various topology-aware mapping techniques and algorithms.

Keywords—Virtual topology; physical topology; topology-aware mapping; parallel applications; communication pattern

I. INTRODUCTION

Good topology-aware process mapping plays a key role in improving both the performance and the energy consumption of parallel applications in high-performance computing (HPC), given the increasingly hierarchical, heterogeneous and complex nature of current and future HPC platforms. The term "heterogeneous" refers to non-symmetry in one or several aspects of a system. Heterogeneity can emerge from hardware heterogeneity (CPUs, GPUs, FPGAs), software heterogeneity (compilers, operating systems, libraries, etc.) and the complexity of the network topology [1]. HPC applications therefore need to adapt to heterogeneous platforms in order to execute optimally.

Topology-aware process mapping improves parallel application execution by decreasing the communication cost of the processes. It does so by matching the application virtual topology (exploited by the system for assigning the processes to the physical processors) to the target underlying hardware architecture, called the physical topology. One of the advantages of topology-aware mapping is the decreased cost of communication, obtained by matching the application data to processors that are physically close to one another.

To perform topology-aware process mapping, it is necessary to choose a parallel programming model that supports it. Some parallel programming models provide mechanisms that help the application exploit the underlying hardware to improve communication and locality. Such mechanisms also help virtual topology management to reorganize the processes according to the target underlying hardware architecture. The most important parallel programming model in this respect is the Message Passing Interface (MPI), which is the standard among parallel programming models.

As discussed above, we propose three main steps for efficient topology-aware process mapping:

1) Develop a virtual topology by gathering the application communication pattern.
2) Develop a physical topology by modeling the underlying hardware architecture.
3) Develop a clever algorithm or technique that matches the numbers of the computing elements with the process ranks of the application.

The architecture in “Fig. 1” illustrates these steps.

Topology mapping is of two types: static and dynamic. In the static approach, the mapping is done prior to execution. In the dynamic approach, the mapping happens at runtime (processes are remapped to another processor or core during the run) [2].

Fig. 1. High-Level Architecture of Topology-Aware Process Mapping.
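The three steps listed in the introduction can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a method from any of the surveyed papers: the function names, the (node, socket) core labels and the distance values 1, 2 and 10 are all hypothetical, and step 3 uses an exhaustive search that is only feasible for a handful of processes.

```python
from itertools import permutations


def virtual_topology(messages, n_procs):
    """Step 1: build a symmetric communication matrix from (src, dst, bytes) records."""
    comm = [[0] * n_procs for _ in range(n_procs)]
    for src, dst, nbytes in messages:
        comm[src][dst] += nbytes
        comm[dst][src] += nbytes
    return comm


def physical_topology(core_location):
    """Step 2: build a core-to-core distance matrix.

    core_location[i] is a hypothetical (node, socket) label for core i.
    The distance values are illustrative assumptions, not measurements.
    """
    n = len(core_location)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (node_i, socket_i), (node_j, socket_j) = core_location[i], core_location[j]
            if node_i != node_j:
                dist[i][j] = 10   # traffic crosses the network
            elif socket_i != socket_j:
                dist[i][j] = 2    # same node, different socket
            else:
                dist[i][j] = 1    # same socket
    return dist


def mapping_cost(comm, dist, placement):
    """Sum of bytes times distance over all process pairs; placement[rank] = core."""
    n = len(comm)
    return sum(comm[i][j] * dist[placement[i]][placement[j]]
               for i in range(n) for j in range(i + 1, n))


def best_mapping(comm, dist):
    """Step 3: exhaustive search over all placements (tiny inputs only)."""
    return min(permutations(range(len(comm))),
               key=lambda placement: mapping_cost(comm, dist, placement))


# Four processes on two nodes with two cores each.
comm = virtual_topology([(0, 2, 100), (1, 3, 100), (0, 1, 1)], n_procs=4)
dist = physical_topology([(0, 0), (0, 0), (1, 0), (1, 0)])
placement = best_mapping(comm, dist)
print(placement, mapping_cost(comm, dist, placement))
```

In this toy case the search co-locates ranks 0 and 2 on one node and ranks 1 and 3 on the other, because those pairs exchange most of the data. Practical tools replace the exhaustive search with heuristics or graph partitioning, since the number of placements grows factorially.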

This paper is organized as follows: Section II gives the definitions of the topologies with examples, Section III covers previous related work, Section IV discusses the definition of the problem, and Section V concludes the paper.

II. TOPOLOGIES DEFINITIONS

A. Virtual Topology

The term virtual topology denotes the dependencies among the software processing entities. These dependencies may be data exchanged between processes or accesses to memory by application threads. In other words, the virtual topology refers to the application communication pattern [2]. Virtual topologies come in several types, such as graph topologies and Cartesian topologies. An example of a virtual topology is shown in “Fig. 2”.

Fig. 2. Virtual Topology Example, (0,0) is a Coordinate and 0 is a Rank Id.

B. Physical Topology

Modern machines are increasingly complex: they include multiple processors, multi-core processors (socket packages), simultaneous multithreading, NUMA nodes, shared caches, multiple GPUs, NICs, etc. The underlying hardware, known as the physical topology, includes the NUMA memory nodes, cores, simultaneous multithreading, sockets and shared caches [3]. The application needs to understand the target underlying hardware for optimum execution. An example of the underlying hardware architecture is shown in “Fig. 3”.

Fig. 3. High-Level Architecture of the Target Machine.

Information on the target machine can be gathered using a topology discovery mechanism, as shown in “Fig. 4”.

Fig. 4. Hardware Topology Information.

Given this information, the hardware affinity within the physical topology is expressed as the physical topology distance [4], shown in “Fig. 5”.

Fig. 5. Physical Topology Distance, d distance, N node and s switch.

C. Parallel Programming Model

The main parallel programming models for high-performance computing are OpenMP (used for shared-memory architectures) and MPI (used for distributed-memory systems). Several other models and libraries are in use today, such as OpenCL (Open Computing Language, used for heterogeneous parallel computing), OpenCV (a computer vision library aimed at real-time applications) and OpenACC (a programming standard intended to simplify parallel programming of heterogeneous CPU/GPU systems) [5] [6].

In addition, hybrid parallel programming models can be built for a specific task to take advantage of both shared and distributed memory. Table I shows the parallel programming models as well as the systems that implement them [6].
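As a small illustration of the Cartesian virtual topology described in Section II-A, the following sketch converts between rank ids and grid coordinates using row-major numbering, which is the ordering MPI uses for Cartesian topologies. The helper names are hypothetical; they are written in plain Python and are not calls into any MPI library.

```python
def cart_coords(rank, dims):
    """Return the grid coordinates of a rank in a row-major Cartesian topology."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def cart_rank(coords, dims):
    """Return the rank id that sits at the given grid coordinates."""
    rank = 0
    for coord, extent in zip(coords, dims):
        rank = rank * extent + coord
    return rank


def cart_neighbors(rank, dims, periodic=False):
    """Return the ranks one step away along each dimension."""
    coords = cart_coords(rank, dims)
    neighbors = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            moved = coords[axis] + step
            if periodic:
                moved %= extent          # wrap around, as in a torus
            elif not 0 <= moved < extent:
                continue                 # no neighbor beyond the grid edge
            shifted = list(coords)
            shifted[axis] = moved
            neighbors.append(cart_rank(shifted, dims))
    return neighbors


# A 2 x 3 grid of six ranks: rank 0 sits at (0, 0), rank 4 at (1, 1).
print(cart_coords(0, (2, 3)), cart_coords(4, (2, 3)))
print(cart_neighbors(4, (2, 3)))
```

The neighbor lists are exactly the communication pattern of a stencil-style application, which is the information a mapping algorithm consumes as the virtual topology.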

TABLE I. PARALLEL PROGRAMMING MODELS AND THEIR IMPLEMENTED SYSTEMS

Programming Model                                                Example Programming Systems
---------------------------------------------------------------  ----------------------------------------------
Shared memory
Dynamic scheduling, nested bulk-synchronous                      OpenMP, TBB, Cilk
Dynamic scheduling, general synchronization                      pthreads, OpenMP, TBB, Cilk
Distributed memory
Bulk-synchronous                                                 BSP, MPI with collectives/barriers, X10 with clocks
Static scheduling, two-sided communication                       MPI point-to-point
Static scheduling, one-sided communication                       MPI RDMA, SHMEM, UPC, Fortran
Hybrid scheduling (static across nodes, dynamic within nodes)    MPI + OpenMP, DPLASMA
Local view of data and control                                   MPI, Fortran
Local view of control, global view of data                       UPC, Global Arrays
Global view of data and control                                  OpenMP, Chapel
Coprocessor/accelerator, separate memory                         OpenCL, OpenACC, CUDA
Domain-specific languages and libraries                          PETSc, Liszt, TCE

D. Parallel Computing Systems

Modern engineering and science applications require a massive amount of computing because they deal with very complex problems. Addressing these problems requires powerful computing systems such as parallel computers. Parallel computing performs numerous calculations and executes many processes simultaneously. To put it differently, large problems can often be divided into smaller ones, which are then solved at the same time [7].

On the negative side, programmers face many challenges with parallel systems, such as the complex hierarchy of the hardware, minimizing the memory used by applications, reducing communication, and data locality. The high-level architecture of parallel computing is shown in “Fig. 6”.

Fig. 6. High-Level Architecture of Parallel Computing.

III. BACKGROUND AND RELATED WORK

Modern high-performance computing (HPC) platforms consist of heterogeneous computing devices connected through complex hierarchical networks. To execute data-parallel Exascale applications efficiently on such platforms, we need to balance the load of the processors and minimize the communication cost. The first is achieved by partitioning the data among the processors while taking their speed into account. The second can be optimized by decreasing the communication volume, that is, by mapping the application data to processors that are physically close to one another. The topology information serves as the guide for improving communication on hierarchical heterogeneous platforms.

As we move towards Exascale, topology-aware process mapping is becoming an important approach for improving the performance and reducing the power consumption of Exascale applications. Accordingly, researchers in this area have proposed many techniques and approaches for finding efficient topology-aware process mappings. Each focuses on a different aspect of building an efficient process-to-processor mapping, and most propose their own mapping approach.

We briefly summarize the previous studies on the topology-aware process mapping problem. Emmanuel et al. [7] proposed techniques for clusters of NUMA nodes that reduce communication costs. The techniques gather the application communication pattern and the details of the target machine hardware, and then compute a reordering of the application process ranks. The new ranks are then used to reduce the application communication costs. These techniques are based on the TreeMatch algorithm, which supports both resource binding (producing computing unit numbers) and rank reordering (producing new MPI ranks). The algorithm design is as follows:
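The original algorithm listing is not available in this text. As a rough stand-in, the sketch below shows the general idea behind rank reordering: processes that exchange the most data are grouped so that each group can share a node. It is a simple greedy heuristic written for illustration and is not the TreeMatch algorithm; the function names are hypothetical, and it assumes that consecutive new ranks are placed on the same node and that every node has the same number of cores.

```python
def group_by_affinity(comm, group_size):
    """Greedily group processes that communicate heavily with each other.

    comm is a symmetric matrix of exchanged bytes; group_size is the assumed
    number of cores per node.
    """
    n = len(comm)
    if n % group_size != 0:
        raise ValueError("number of processes must be a multiple of group_size")
    unassigned = set(range(n))
    groups = []
    while unassigned:
        # Seed the group with the process that has the most remaining traffic.
        seed = max(sorted(unassigned),
                   key=lambda p: sum(comm[p][q] for q in unassigned))
        unassigned.remove(seed)
        group = [seed]
        while len(group) < group_size:
            # Add the process that talks most to the current group members.
            partner = max(sorted(unassigned),
                          key=lambda p: sum(comm[p][g] for g in group))
            unassigned.remove(partner)
            group.append(partner)
        groups.append(group)
    return groups


def reorder_ranks(groups):
    """Map each old rank to a new rank so that group members become consecutive."""
    new_rank = {}
    for group in groups:
        for old in group:
            new_rank[old] = len(new_rank)
    return new_rank


# Ranks 0 and 2 exchange 100 bytes, as do ranks 1 and 3; ranks 0 and 1 exchange 1 byte.
comm = [
    [0, 1, 100, 0],
    [1, 0, 0, 100],
    [100, 0, 0, 0],
    [0, 100, 0, 0],
]
groups = group_by_affinity(comm, group_size=2)
print(groups, reorder_ranks(groups))
```

Sorting the candidates before calling `max` makes ties resolve to the lowest rank, so the result is deterministic. A tree-based algorithm would repeat the grouping at every level of the hardware hierarchy (cores, sockets, nodes), whereas this sketch handles a single level.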

The work by Guillaume et al. [8] modified the MPICH2 implementation of the MPI_Dist_graph_create function to reorder the MPI process ranks. The objective is to create a mapping between the hardware topology and the application communication pattern. The modification is achieved through two different methods: core binding and rank reordering.

Balaji et al. [9] state that varying the mapping of an application on large-scale systems is an important factor affecting overall performance. The authors highlighted the impact of mapping on application performance on the IBM Blue Gene/Q systems, whose network topology is a 5D torus.

Francois et al. [3] observed that the numbers of cores, memory nodes, and shared caches are increasing, making the hardware topology very complex. High-performance computing applications must therefore adapt their placement carefully to the target underlying hardware. To that end, they proposed the hardware locality (HWLOC) tool, which gathers information on the physical topology, including caches, processors, and memory nodes, and makes it visible to applications as well as runtime systems. This tool is used by the most important parallel programming models, such as OpenMP and MPI.

Joshua et al. [10] proposed a Locality-Aware Mapping algorithm to distribute the processes of a parallel application across the processing resources of a high-performance computing system. The algorithm handles both heterogeneous and homogeneous hardware systems. They implemented it in Open MPI.

Bhatele et al. [11] proposed various heuristics based on the hop-bytes metric for mapping irregular communication graphs onto mesh topologies. Their heuristics try to place communicating processes close to one another.

Mercier et al. [12] built a topology-aware mapping based on the Scotch library. They represented the virtual topology (the application communication pattern) and the physical topology as complete weighted graphs.

Rashti et al. [13] extracted the network and intra-node topologies using InfiniBand tools and the HWLOC library, respectively. From these they built an undirected graph whose edges represent the communication performance between cores depending on their distances. The mapping itself is then computed by the Scotch library.

Ito et al. [14] proposed a similar mapping technique, but used the bandwidth between nodes measured at execution time to assign the edge weights in the physical topology graph. This technique was also implemented with the Scotch library.

Chung et al. [15] proposed an efficient technique based on hierarchical mapping, which partitions the physical topology graph and the process topology graph into numerous supernodes. The first mapping step assigns process topology supernodes to the equivalent peers in the physical topology graph.

Cyril Bordage et al. [16] proposed the Netloc tool for collecting the physical topology, integrated with the Scotch partitioner for computing topology-aware MPI process placement. Their experiments were based on a fat-tree machine.

K. B. Manwade et al. [17] proposed a novel technique known as “ClustMap” for mapping the application and system topologies.

Abhinav Bhatele et al. [18] constructed an automatic mapping framework that helps the developer automate the mapping of the application communication pattern onto the physical topology. In addition, the framework can analyze the process topology to find regular patterns and then identify the dimensions of the application communication graph.

Jingjin Wu et al. [19] proposed a hierarchical task mapping strategy that performs both inter-node and intra-node mapping. They considered supercomputers with torus and fat-tree network topologies and provided two mapping algorithms. The first handles both inter-node and intra-node mapping. The second partitions the compute nodes according to their affinity.

Torsten Hoefler et al. [2] presented a new heuristic based on graph similarity and showed its utility by mapping virtual topologies onto real physical topologies. Their mapping strategies support heterogeneous networks and try to reduce congestion on fat-tree, torus, and PERCS network topologies for irregular communication patterns.

Subramoni et al. [20] proposed efficient topology mapping on InfiniBand networks, in which the InfiniBand network topology is detected using the neighbor-joining algorithm.

Deveci et al. [21] considered machines with sparse node allocations and applied a geometric partitioning algorithm to both processors and tasks to find an appropriate mapping.

Agarwal et al. [22] proposed a greedy heuristic that uses estimation functions to evaluate the effects of mapping decisions.

Mohammad et al. [23] used the network/node architecture and graph embedding modules to map the application communication topology onto the physical topology of multi-core clusters with multi-level networks. As a result, they reported improvements in application communication performance as well as execution time. These results were obtained with micro-benchmarks.

IV. DISCUSSIONS

Aggregated computing power is recognized as a recent phenomenon for data-intensive tasks in the 21st century. High-performance computing is able to handle simulation modeling as well as to support standard workstations.
By carrying out many computing operations within a reasonable amount of time, high-performance computing is able to counter performance challenges related to limited data sources. This is achieved using high-end specialized hardware that incorporates several units which pool their computing power. Additionally, the units use parallelization to distribute data and operations across the various levels. This is due to the large amount of data movement and the lack of patterns for placing applications onto the hardware processing elements. In short, when studying process placement, we must focus on the system hierarchy of high-performance computing (HPC) platforms, because this hierarchy keeps growing: nodes contain multiple levels of memory (non-volatile memory, faster but smaller MCDRAM for KNL, standard DRAM, etc.) and are composed of multicore processors. Moreover, the network that connects these nodes has a very complex topology [24]. Effective process placement is therefore not an easy task. Topology mapping, or process placement, has a critical role in parallel application performance, and processes must be mapped onto processors carefully. The goal of every successful mapping algorithm is to reduce communication costs by placing the processes that communicate most close to each other. Algorithmically, the mapping process involves two questions: the first is how the machine computes the communication cost of messages, and the second is how the application describes the affinity of the computing elements. The affinity of the computing entities is very important because processes are mapped onto processors that are close to each other.

... extra overhead and/or the power consumption. We will focus on the mapping between nodes (inter-node) and the mapping within a node (intra-node) to achieve the best performance we can.
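The communication cost that a mapping algorithm tries to reduce can be made concrete with the hop-bytes metric mentioned in Section III: each message contributes its size multiplied by the number of network hops it travels. The sketch below evaluates this metric for two placements on a small 2D mesh. The mesh layout, the traffic figures and the function names are assumptions chosen for illustration only.

```python
def mesh_hops(node_a, node_b, width):
    """Manhattan distance between two nodes of a 2D mesh numbered row by row."""
    ax, ay = node_a % width, node_a // width
    bx, by = node_b % width, node_b // width
    return abs(ax - bx) + abs(ay - by)


def hop_bytes(traffic, placement, width):
    """Total hop-bytes: bytes sent by each pair times the hops between their nodes.

    traffic maps (src_rank, dst_rank) to bytes; placement maps rank to mesh node.
    """
    return sum(nbytes * mesh_hops(placement[src], placement[dst], width)
               for (src, dst), nbytes in traffic.items())


# 2 x 2 mesh with nodes 0 and 1 in the first row and nodes 2 and 3 in the second.
traffic = {(0, 1): 100, (2, 3): 100, (0, 3): 10}
scattered = {0: 0, 1: 3, 2: 1, 3: 2}   # heavy pairs end up on opposite corners
compact = {0: 0, 1: 1, 2: 2, 3: 3}     # heavy pairs sit on adjacent nodes
print(hop_bytes(traffic, scattered, width=2))
print(hop_bytes(traffic, compact, width=2))
```

The compact placement yields the lower hop-bytes value because the two heavily communicating pairs are one hop apart, which is the effect topology-aware mapping aims for. Hop-bytes ignores contention on shared links, so it is only one possible cost model.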
Given that, we have proposed an efficient new technique based on a hybrid parallel programming model (a tri-model) for mapping the virtual topology onto the physical topology, optimizing data locality management in order to increase performance and reduce power consumption in HPC systems. The approach optimizes the inter-node mapping by taking into account the inter-node communication pattern and the network topology. It also optimizes the intra-node mapping by taking into account the physical topology of the node and the corresponding intra-node communication pattern. During mapping, we will consider load balancing within nodes, as the nodes will be heterogeneous.

Lastly, topology-aware process mapping is an active research field. Both the application communication pattern (virtual topology) and the underlying hardware details (physical topology) are not difficult to extract, th

