

Collective Framework and Performance Optimization to Open MPI for Cray XT 5 platforms
Cray Users Group 2011
Managed by UT-Battelle for the Department of Energy
Graham OpenMPI SC08

Collectives are Critical for HPC Application Performance
- A large percentage of application execution time is spent in the global synchronization operations (collectives)
- Moving towards exascale systems (million processor cores), the time spent in collectives only increases
- Performance and scalability of HPC applications require efficient and scalable collective operations

Weaknesses in the current Open MPI implementation
Open MPI lacks support for:
- Customized collective implementation for arbitrary communication hierarchies
- Concurrent progress of collectives on different communication hierarchies
- Nonblocking collectives
- Taking advantage of capabilities of recent network interfaces (for example, offload capabilities)
- Efficient point-to-point message protocol for Cray XT platforms

Cheetah: A Framework for Scalable Hierarchical Collectives
Goals of the framework
- Provide building blocks for implementing collectives for arbitrary communication hierarchy
- Support collectives tailored to the communication hierarchy
- Support both blocking and nonblocking collectives efficiently
- Enable building collectives customized for the hardware architecture

Cheetah Framework: Design principles
- A collective operation is split into collective primitives over different communication hierarchies
- Collective primitives over the different hierarchies are allowed to progress concurrently
- Decouple the topology of a collective operation from the implementation, enabling the reusability of primitives
- Design decisions are driven by nonblocking collective design; blocking collectives are a special case of nonblocking ones
- Use Open MPI component architecture

Cheetah is Implemented as a Part of Open MPI
[Figure: component diagram distinguishing Cheetah components from Open MPI components. Recoverable labels include PTPCOLL, IBOFFLOAD and BASEMUMA.]

Cheetah Components and Their Functions
- Base Collectives (BCOL) – Implements basic collective primitives
- Subgrouping (SBGP) – Provides rules for grouping the processes
- Multilevel (ML) – Coordinates collective primitive execution, manages data and control buffers, and maps MPI semantics to BCOL primitives
- Schedule – Defines the collective primitives that are part of a collective operation
- Progress Engine – Responsible for starting, progressing and completing the collective primitives
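The interplay of Schedule and Progress Engine can be illustrated with a toy model. This is a minimal sketch, not Cheetah's implementation: the names `primitive`, `Request` and `ProgressEngine` are hypothetical, primitives are modelled as Python generators, and one `progress()` call is assumed to advance every in-flight collective by one step.

```python
from collections import deque


def primitive(name, steps, log):
    """Hypothetical collective primitive needing `steps` progress calls."""
    for step in range(steps):
        log.append((name, step))  # stand-in for real communication work
        yield                     # return control to the progress engine


class Request:
    """Handle for one in-flight collective operation."""

    def __init__(self, schedule):
        self.pending = deque(schedule)  # primitives run in schedule order
        self.done = not self.pending    # an empty schedule completes at once


class ProgressEngine:
    """Starts, progresses and completes scheduled primitives."""

    def __init__(self):
        self.active = []

    def start(self, schedule):
        """Nonblocking start: register the schedule and return a handle."""
        request = Request(schedule)
        if not request.done:
            self.active.append(request)
        return request

    def progress(self):
        """Advance every active collective by one step."""
        for request in list(self.active):
            try:
                next(request.pending[0])
            except StopIteration:
                request.pending.popleft()  # primitive finished, move on
            if not request.pending:
                request.done = True
                self.active.remove(request)

    def wait(self, request):
        """Blocking completion: progress until the request is done."""
        while not request.done:
            self.progress()
```

In this model a blocking collective is simply `start` followed immediately by `wait`, which mirrors the design principle that blocking collectives are a special case of nonblocking ones. Collectives started together advance in an interleaved fashion.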

BCOL Component – Base collective primitives
- Provides collective primitives that are optimized for certain communication hierarchies
  – BASESMUMA: Shared memory
  – P2P: SeaStar 2, Ethernet, InfiniBand
  – IBNET: ConnectX-2
- A collective operation is implemented as a combination of these primitives
  – Example: an n-level Barrier can be a combination of Fanin (first n-1 levels), Barrier (nth level) and Fanout (first n-1 levels)
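The n-level Barrier example can be written down as a schedule of (primitive, level) pairs. This is a sketch under assumptions: the function name `barrier_schedule` is hypothetical, levels are numbered from 1 (innermost) to n (outermost), and the Fanout is assumed to run from the top of the hierarchy back down, which the slide does not state explicitly.

```python
def barrier_schedule(n_levels):
    """Return the ordered primitives of an n-level hierarchical barrier."""
    if n_levels < 1:
        raise ValueError("a barrier needs at least one level")
    # Fanin over the first n-1 levels, innermost level first.
    fanin = [("fanin", level) for level in range(1, n_levels)]
    # A full barrier only at the top (nth) level.
    top = [("barrier", n_levels)]
    # Fanout over the first n-1 levels, assumed to run in reverse order.
    fanout = [("fanout", level) for level in range(n_levels - 1, 0, -1)]
    return fanin + top + fanout


for step in barrier_schedule(3):
    print(step)
```

With a single level the schedule degenerates to one Barrier primitive, so the flat case needs no special handling.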

SBGP Component – Group the Processes Based on the Communication Hierarchy
[Figure: two nodes (Node 1, Node 2), each with CPU sockets containing allocated and unallocated cores. Labels: P2P Subgroup, UMA Subgroup, UMA Group Leader, Socket Subgroups, Socket Group Leader.]
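The grouping shown in the figure can be approximated as follows. This is an illustrative sketch only: `build_subgroups` and the `placement` mapping are hypothetical, and choosing the lowest rank as group leader is an assumption, not a rule taken from the slides.

```python
from collections import defaultdict


def build_subgroups(placement):
    """Group ranks by socket, then socket leaders by node, then node leaders.

    placement: dict mapping rank -> (node, socket).
    Leader choice (lowest rank in the group) is an assumption.
    """
    # Socket subgroups: all ranks that share one CPU socket.
    sockets = defaultdict(list)
    for rank in sorted(placement):
        sockets[placement[rank]].append(rank)

    # UMA subgroups: the socket group leaders within one node.
    uma = defaultdict(list)
    for (node, _socket), ranks in sorted(sockets.items()):
        uma[node].append(min(ranks))

    # P2P subgroup: one UMA group leader per node.
    p2p = [min(leaders) for _node, leaders in sorted(uma.items())]
    return dict(sockets), dict(uma), p2p


placement = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1),
             4: (1, 0), 5: (1, 0), 6: (1, 1), 7: (1, 1)}
sockets, uma, p2p = build_subgroups(placement)
print(sockets, uma, p2p)
```

Only the ranks in the P2P subgroup communicate over the network in this model; all other ranks talk to their leaders within a socket or node.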

Open MPI portals BTL optimization
[Figure: message path between a Sender MPI Process and a Receiver MPI Process. Labels: MPI Message, Open MPI Message, Portals Message, Ack (marked X).]
- Portals acknowledgment is not required for Cray XT 5 platforms as they use the Basic End-to-End Reliability (BEER) protocol for message transfer

Experimental Setup
- Hardware: Jaguar
  – 18,688 Compute Nodes
  – 2.6 GHz AMD Opteron (Istanbul)
  – SeaStar 2 Routers connected in a 3D torus topology
- Benchmarks:
  – Point-to-Point: OSU Latency and Bandwidth
  – Collectives: Broadcast in a tight loop, Barrier in a tight loop

1 Byte Open MPI P2P Latency is 15% better than Cray MPI
[Figure: "OMPI vs CRAY portals latency". Latency (usec) versus message size (bytes) for OMPI with portals optimization, OMPI without portals optimization, and Cray MPI.]

Open MPI and Cray MPI bandwidth saturate at 2 Gb/s
[Figure: "OMPI vs CRAY portals bandwidth". Bandwidth (Mb/s) versus message size (bytes) for OMPI with portals optimization, OMPI without portals optimization, and Cray MPI.]

Hierarchical Collective Algorithms

Flat Barrier Algorithm
[Figure: processes 1 to 4 across Host 1 and Host 2, shown at Step 1 and Step 2 with inter-host communication.]

Hierarchical Barrier Algorithm
[Figure: processes 1 to 4 across Host 1 and Host 2, shown at Step 1, Step 2 and Step 3 with inter-host communication.]
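One way to compare the two barrier variants is to count inter-host messages. The sketch below assumes a recursive-doubling exchange among all ranks for the flat barrier and among host leaders only for the hierarchical one, block placement of ranks on hosts, and power-of-two counts. None of this is stated on the slides, and the function names are illustrative.

```python
def host_of(rank, ranks_per_host):
    """Block placement: consecutive ranks share a host (assumption)."""
    return rank // ranks_per_host


def flat_inter_host_messages(hosts, ranks_per_host):
    """Inter-host messages in a recursive-doubling barrier over all ranks."""
    total_ranks = hosts * ranks_per_host
    count = 0
    distance = 1
    while distance < total_ranks:
        for rank in range(total_ranks):
            partner = rank ^ distance  # exchange partner in this round
            if host_of(rank, ranks_per_host) != host_of(partner, ranks_per_host):
                count += 1             # this rank sends across hosts
        distance *= 2
    return count


def hierarchical_inter_host_messages(hosts):
    """Inter-host messages when only one leader per host takes part."""
    count = 0
    distance = 1
    while distance < hosts:
        count += hosts                 # every leader sends once per round
        distance *= 2
    return count


print(flat_inter_host_messages(4, 4), hierarchical_inter_host_messages(4))
```

Under these assumptions the hierarchical variant sends fewer messages across the network by a factor equal to the number of ranks per host, at the cost of the extra intra-host fan-in and fan-out steps.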

Cheetah’s Barrier Collective Outperforms the Cray MPI Barrier by 10%
[Figure: latency (microsec.) versus number of MPI processes for Cheetah and Cray MPI.]

Data Flow in a Hierarchical Broadcast Algorithm
[Figure: data flow across two nodes. Labels: SNODE 1, SNODE 2, Source of the Broadcast.]

Hierarchical Broadcast Algorithms
- Knownroot Hierarchical Broadcast
  – the suboperations are ordered based on the source of data
  – the suboperations are concurrently started after the execution of the suboperation with the source of broadcast
  – uses k-nomial tree for data distribution
- N-ary Hierarchical Broadcast
  – same as Knownroot algorithm but uses N-ary tree for data distribution
- Sequential Hierarchical Broadcast
  – the suboperations are ordered sequentially
  – there is no concurrent execution
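The two data distribution trees named above differ only in how each rank picks its children. The sketch below uses the common textbook constructions with the root at rank 0. The function names are hypothetical and this is not taken from Cheetah's code. For another root, ranks would first be renumbered relative to it.

```python
def knomial_children(rank, size, k):
    """Children of `rank` in a k-nomial tree rooted at rank 0."""
    children = []
    distance = 1
    while distance < size:
        if rank % (distance * k) != 0:
            break  # this rank receives at this distance, so it stops sending
        for i in range(1, k):
            child = rank + i * distance
            if child < size:
                children.append(child)
        distance *= k
    return children


def nary_children(rank, size, n):
    """Children of `rank` in an N-ary tree rooted at rank 0."""
    first = n * rank + 1
    return [child for child in range(first, first + n) if child < size]


def tree_depths(children_of):
    """Walk a tree from rank 0 and return the depth of every reached rank."""
    depth = {0: 0}
    frontier = [0]
    while frontier:
        next_frontier = []
        for rank in frontier:
            for child in children_of(rank):
                depth[child] = depth[rank] + 1
                next_frontier.append(child)
        frontier = next_frontier
    return depth


size = 8
print(tree_depths(lambda r: knomial_children(r, size, 2)))
print(tree_depths(lambda r: nary_children(r, size, 2)))
```

In the k-nomial tree the root keeps sending in every round, so ranks that receive early forward the data to large subtrees. In the N-ary tree every rank has at most N children regardless of when it receives the data.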

Cheetah’s Broadcast Collective Outperforms the Cray MPI Broadcast by 10% (8 Byte)
[Figure: latency (microsec.) versus number of MPI processes for Cray MPI, Cheetah three level known k-nomial, Cheetah three level known n-ary, and Cheetah three level sequential bcast.]

Cheetah’s Broadcast Collective Outperforms the Cray MPI Broadcast by 92% (4 KB)
[Figure: latency (microsec.) versus number of MPI processes for Cray MPI, Cheetah three-level known k-nomial, Cheetah three-level known NB n-ary, Cheetah three-level known NB k-nomial, and Cheetah sequential bcast.]

Cheetah’s Broadcast Collective Outperforms the Cray MPI Broadcast by 9% (4 MB)
[Figure: latency (usec) versus number of MPI processes for Cray MPI, Cheetah three level known k-nomial, Cheetah three level known n-ary, and Cheetah three level sequential bcast.]

Summary
- Cheetah’s Broadcast is 92% better than Cray MPI’s Broadcast (4 KB messages)
- Cheetah’s Barrier outperforms Cray MPI’s Barrier by 10%
- Open MPI point-to-point message latency is 15% better than Cray MPI’s (1 byte message)
- The key to the performance and scalability of the collective operations:
  – Concurrent execution of sub-operations
  – Scalable resource usage techniques
  – Asynchronous semantics and progress
  – Customized collective primitives for each communication hierarchy

Acknowledgements
- US Department of Energy ASCR FASTOS program
- National Center For Computational Sciences, ORNL

