Lecture 26: The Future of High-Performance Computing


Lecture 26: The Future of High-Performance Computing
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2017

Comparing Two Large-Scale Systems
- Oak Ridge Titan: monolithic supercomputer (3rd fastest in the world), designed for compute-intensive applications
- Google data center: servers to support millions of customers, designed for data collection, storage, and analysis

Computing Landscape (data intensity vs. computational intensity)
- Internet-scale computing (Google data center): web search, mapping/directions, language translation, video streaming
- Cloud services
- Traditional supercomputing (Oak Ridge Titan): modeling & simulation-driven science & engineering
- Personal computing

Supercomputing Landscape
- Oak Ridge Titan sits at the high-computational-intensity end of the landscape: modeling & simulation-driven science & engineering

Supercomputer Applications
- Science, industrial products, public health
- Simulation-based modeling: system structure, initial conditions, transition behavior
  - Discretize time and space
  - Run the simulation to see what happens
- Requirements: the model accurately reflects the actual system; the simulation faithfully captures the model

Titan Hardware
- 18,688 nodes connected by a local network
- Each node: AMD 16-core processor, NVIDIA graphics processing unit (GPU), 38 GB DRAM, no disk drive
- Overall: 7 MW, ~$200M

Titan Node Structure: CPU
- CPU: 16 cores sharing a common DRAM memory
- Supports multithreaded programming
- 0.16 x 10^12 floating-point operations per second (FLOPS) peak performance

Titan Node Structure: GPU
- Kepler GPU: 14 multiprocessors, each with 12 groups of 16 stream processors (14 x 12 x 16 = 2,688)
- Single-Instruction, Multiple-Data (SIMD) parallelism: a single instruction controls all processors in a group
- 4.0 x 10^12 FLOPS peak performance (arithmetic check below)
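As a rough sanity check (the slide does not state the GPU clock; the K20X runs at about 0.732 GHz, and each stream processor can issue one fused multiply-add, i.e., 2 FLOPs, per cycle), the single-precision peak works out to roughly the quoted figure:

$$2688 \times 2\,\tfrac{\text{FLOP}}{\text{cycle}} \times 0.732\ \text{GHz} \approx 3.9 \times 10^{12}\ \text{FLOPS}$$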

Titan Programming: Principle
- Solving a problem over a grid (e.g., a finite-element system); simulate its operation over time
- Bulk synchronous model: partition into regions, p regions for a p-node machine
- Map one region per processor

Titan Programming: Principle (cont.)
- Bulk synchronous model: map one region per processor
- Alternate between two phases (see the sketch below):
  - All nodes compute the behavior of their region (performed on the GPUs)
  - All nodes communicate boundary values
- Repeat: compute, communicate, compute, communicate, ...
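A minimal sketch of this bulk synchronous pattern, written as a 1D stencil with MPI halo exchange. This is an illustration only, not Titan's production code; on Titan the compute phase would run as a CUDA kernel on the node's GPU.

```cpp
// Bulk synchronous sketch: a 1D stencil partitioned across MPI ranks.
// Each timestep every rank (1) exchanges boundary ("halo") cells with its
// neighbors, then (2) updates its local region; a barrier keeps all ranks
// in lockstep, so each step is limited by the slowest node.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int local_n = 1000;                         // cells owned by this rank
    std::vector<double> u(local_n + 2, 0.0);          // +2 ghost cells
    std::vector<double> u_next(local_n + 2, 0.0);
    if (rank == 0) u[1] = 1.0;                        // arbitrary initial condition

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int t = 0; t < 100; t++) {
        // Communicate phase: exchange boundary values with both neighbors.
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                     &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,
                     &u[0],           1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Compute phase: on Titan this loop would be offloaded to the GPU.
        for (int i = 1; i <= local_n; i++)
            u_next[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);
        u.swap(u_next);

        MPI_Barrier(MPI_COMM_WORLD);                  // wait for the slowest node
    }

    MPI_Finalize();
    return 0;
}
```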

Bulk Synchronous Performance
- Performance is limited by the slowest processor: all of P1...P5 must finish a compute phase before communication can proceed
- Strive to keep the machine perfectly balanced:
  - Engineer the hardware to be highly reliable
  - Tune the software to make it as regular as possible
  - Eliminate "noise": operating system events, extraneous network activity

Titan Programming: Reality
- System level: the Message Passing Interface (MPI) supports node computation, synchronization, and communication
- Node level: OpenMP supports thread-level operation of the node's CPU; the CUDA programming environment drives the GPUs
- Performance degrades quickly without perfect balance among memories and processors
- Result: a single program is a complex combination of multiple programming paradigms (a skeletal example follows) and tends to be optimized for a specific hardware configuration
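A skeletal illustration of how these paradigms layer within one program. This is an assumed structure for illustration, not Titan's actual source: MPI across nodes, OpenMP across the node's CPU cores, and a comment marking where CUDA kernels would be launched.

```cpp
// One program, three paradigms: MPI (inter-node), OpenMP (intra-node CPU),
// and CUDA (GPU offload, indicated here only by a comment placeholder).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                 // system level: one MPI process per node
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                     // node level: one thread per CPU core
    {
        int tid = omp_get_thread_num();
        std::printf("node %d, CPU thread %d\n", rank, tid);
        // GPU level: the real code would launch a CUDA kernel here
        // (e.g. compute_region<<<blocks, threads>>>(...)) and copy the
        // results back before the MPI communication phase.
    }

    MPI_Barrier(MPI_COMM_WORLD);            // global synchronization across nodes
    MPI_Finalize();
    return 0;
}
```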

MPI Fault Tolerance
- Checkpoint: periodically store the state of all processes (significant I/O traffic)
- Restore: when a failure occurs, reset the state to that of the last checkpoint; all intervening computation is wasted
- Performance scaling: very sensitive to the number of failing components
- (A checkpoint/restart sketch follows.)
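A minimal checkpoint/restart sketch. The file layout and interval are assumptions for illustration (the lecture does not prescribe them); the point is that every rank periodically writes its whole state, and recovery means relaunching the job, reloading those files, and redoing everything computed since the checkpoint.

```cpp
// Checkpoint/restart sketch: every K steps each rank writes its state to a
// per-rank file. After a failure the job is restarted from the last
// checkpoint, so all steps since that point are recomputed (wasted work).
#include <mpi.h>
#include <cstdio>
#include <vector>

static void write_checkpoint(int rank, int step, const std::vector<double>& state) {
    char name[64];
    std::snprintf(name, sizeof(name), "ckpt_rank%03d.bin", rank);  // one file per process
    if (FILE* f = std::fopen(name, "wb")) {
        std::fwrite(&step, sizeof(step), 1, f);
        std::fwrite(state.data(), sizeof(double), state.size(), f);
        std::fclose(f);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int K = 100;                          // checkpoint interval (assumed)
    std::vector<double> state(1 << 20, 0.0);    // this rank's simulation state

    for (int step = 0; step < 10000; step++) {
        // ... compute & communicate for one timestep ...
        if (step % K == 0) {
            MPI_Barrier(MPI_COMM_WORLD);        // make the checkpoint globally consistent
            write_checkpoint(rank, step, state); // heavy I/O: every rank writes at once
        }
        // On failure, the job would be relaunched, each rank would reload its
        // ckpt_rank*.bin file, and execution would resume from the saved step.
    }

    MPI_Finalize();
    return 0;
}
```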

Supercomputer Programming Model
- Application programs and software packages sit on top of a machine-dependent programming model and the bare hardware
- Low-level programming to maximize node performance
- Keep everything globally synchronized and balanced
- Reliability: a single failure causes a major delay, so the hardware is engineered to minimize failures

Data-Intensive Computing Landscape
- Google data center / Internet-scale computing: web search, mapping/directions, language translation, video streaming (high data intensity)

Internet Computing
- Web search: aggregate text data from across the WWW; no definition of correct operation; does not need real-time updating
- Mapping services: huge amount of (relatively) static data; each customer requires individualized computation
- Online documents: must be stored reliably; must support real-time updating; (relatively) small data volumes

Other Data-Intensive Computing Applications
- Wal-Mart: 267 million items/day sold at 6,000 stores; HP built them a 4 PB data warehouse; mine the data to manage the supply chain, understand market trends, and formulate pricing strategies
- LSST: a Chilean telescope will scan the entire sky every 3 days with a 3.2-gigapixel digital camera, generating 30 TB/day of image data

Data-Intensive Application Characteristics
- Diverse classes of data: structured & unstructured; high & low integrity requirements
- Diverse computing needs: localized & global processing; numerical & non-numerical; real-time & batch processing

Google Data Centers
- The Dalles, Oregon: hydroelectric power at about 2 cents/kWh; 50 megawatts, enough to power 60,000 homes
- Engineered for low cost, modularity & power efficiency
- Container: 1,160 server nodes, 250 KW

Google Cluster
- Nodes 1..n connected by a local network; typically 1,000-2,000 nodes
- Each node contains 2 multicore CPUs, 2 disk drives, and DRAM

Hadoop Project
- File system with files distributed across nodes
- Stores multiple copies of each file (typically 3); if one node fails, the data is still available
- Logically, any node has access to any file (it may need to fetch it across the network)
- Map/Reduce programming environment: software manages the execution of tasks on nodes

Map/Reduce Operation
- Characteristics: computation broken into many short-lived tasks (mapping, reducing); tasks mapped onto processors dynamically; disk storage holds intermediate results (see the word-count sketch below)
- Strengths: flexibility in placement, scheduling, and load balancing; can access large data sets
- Weaknesses: higher overhead; lower raw performance
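The shape of a map/reduce computation is easiest to see in the canonical word-count example, sketched here as plain sequential C++. In Hadoop the map and reduce tasks would be short-lived jobs scheduled dynamically across nodes, with intermediate results held on disk between the phases.

```cpp
// Word count in map/reduce form (sequential illustration).
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Map: each input record (a line of text) is turned into (word, 1) pairs.
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(line);
    std::string word;
    while (in >> word) out.push_back({word, 1});
    return out;
}

// Reduce: all values sharing a key are combined (here, summed).
int reduce_fn(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    return total;
}

int main() {
    std::vector<std::string> lines = {"the quick brown fox", "the lazy dog"};

    // "Shuffle": group intermediate pairs by key before the reduce phase.
    std::map<std::string, std::vector<int>> grouped;
    for (const auto& line : lines)
        for (const auto& kv : map_fn(line))
            grouped[kv.first].push_back(kv.second);

    for (const auto& kv : grouped)
        std::cout << kv.first << ": " << reduce_fn(kv.second) << "\n";
    return 0;
}
```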

Map/Reduce Fault Tolerance
- Data integrity: store multiple copies of each file, including the intermediate results of each map/reduce stage (continuous checkpointing)
- Recovering from failure: simply recompute the lost result; the effect is localized; the dynamic scheduler keeps all processors busy
- Use software to build a reliable system on top of unreliable hardware

Cluster Programming Model
- Application programs are written in terms of high-level operations on data
- The runtime system controls scheduling, load balancing, and scaling
- Scaling challenges: the centralized scheduler becomes a bottleneck; copying to/from disk is very costly; it is hard to limit data movement, a significant performance factor

Recent Programming Systems
- Spark: project at U.C. Berkeley; has grown to have a large open-source community
- GraphLab: started as a project at CMU by Carlos Guestrin; an environment for describing machine-learning algorithms; sparse matrix structure is described by a graph; computation is based on updating node values

Computing Landscape Trends
- Modeling & simulation-driven science & engineering is mixing simulation with data (moving up the data-intensity axis)

Combining Simulation with Real Data
- Limitations: simulation alone makes it hard to know whether the model is correct; data alone makes it hard to understand causality and "what if"
- Combination: check and adjust the model during the simulation

Real-Time Analytics
- Millennium XXL simulation (2010): 3 x 10^9 particles; a simulation run of 9.3 days on 12,228 cores; 700 TB of total data generated, saved at only 4 time points (70 TB)
- Large-scale simulations generate large data sets
- What if? We could perform data analysis while the simulation is running

Computing Landscape Trends
- Google data center / Internet-scale computing is taking on sophisticated data analysis (moving up the computational-intensity axis)

Example Analytic Applications
- Microsoft Project Adam (image classification); image description; German text translation

Data Analysis with Deep Neural Networks
- Task: compute a classification of a set of input signals
- Training operation: use many training samples of the form input / desired output; compute weights that minimize the classification error (formalized below)
- Operation: propagate signals from input to output
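One standard way to write the training step the slide describes. The slide names no specific loss function or optimizer, so the notation here is illustrative: f(x_i; w) is the network's output for input x_i (forward propagation), y_i is the desired output, L is a loss such as cross-entropy, N is the number of training samples, and eta is a learning rate.

$$w^{*} = \arg\min_{w} \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i; w),\, y_i\big), \qquad w \leftarrow w - \eta\, \nabla_{w} \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i; w),\, y_i\big)$$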

DNN Application Example
- Facebook DeepFace architecture (figure)

Training DNNs
- Model size, training data, and training effort are all growing (charts)
- Characteristics: iterative numerical algorithm; regular data organization
- Project Adam training: 2B connections, 15M images, 62 machines, 10 days

Trends
- Internet-scale computing (Google data centers): sophisticated data analysis
- Modeling & simulation-driven science & engineering: mixing simulation with real-world data
- Convergence?

Challenges for Convergence
- Hardware: supercomputers are customized and optimized for reliability; data center clusters are consumer grade and optimized for low cost
- Run-time system: in supercomputers it is a source of "noise" and scheduling is static; in clusters it provides reliability and allocates resources dynamically
- Application programming: supercomputers use a low-level, processor-centric model; clusters use a high-level, data-centric model

Summary: Computation/Data Convergence
- Two important classes of large-scale computing: computationally intensive supercomputing, and data-intensive processing (Internet companies plus many other applications)
- They followed different evolutionary paths:
  - Supercomputers: get maximum performance from the available hardware
  - Data center clusters: maximize cost/performance over a variety of data-centric tasks
  - This yielded different approaches to hardware, runtime systems, and application programming
- A convergence would have important benefits for both computational and data-intensive applications, but it is not clear how to achieve it

GETTING TO EXASCALE

World's Fastest Machines
- Top500 ranking: High-Performance LINPACK (HPL) benchmark
  - Solve an N x N linear system using some variant of Gaussian elimination: about (2/3)N^3 + O(N^2) operations (worked out below)
  - The vendor can choose N to give the best performance (in FLOPS)
- Alternative: High-Performance Conjugate Gradient (HPCG)
  - Solve a sparse linear system (~27 nonzeros per row) with an iterative method
  - Higher communication-to-compute ratio
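In symbols, with the worked number purely illustrative:

$$\text{FLOPs}(N) \approx \tfrac{2}{3}N^{3} + O(N^{2}), \qquad \text{reported rate} = \frac{\text{FLOPs}(N)}{\text{elapsed time}}$$

For a hypothetical N = 10^7, the benchmark performs roughly (2/3)(10^7)^3 ~ 6.7 x 10^20 floating-point operations; at a sustained 100 PFLOPS that is about 6.7 x 10^3 seconds, or just under two hours. Choosing N large makes the O(N^2) terms and startup costs negligible, which is why vendors are allowed to pick N.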

Sunway TaihuLight (Wuxi, China; operational 2016)
- Machine: 40,960 processor chips; each chip contains 256 compute cores + 4 management cores; each core has a 4-wide SIMD vector unit (8 FLOPS/clock cycle)
- Performance: HPL 93.0 PF (world's top); HPCG 0.37 PF; 15.4 MW; 1.31 PB DRAM
- Ratios (bigger is better): 6.0 GigaFLOPS/Watt; 0.014 bytes/FLOP (arithmetic below)
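Both ratios follow directly from the numbers on the slide:

$$\frac{93.0 \times 10^{15}\ \text{FLOPS}}{15.4 \times 10^{6}\ \text{W}} \approx 6.0\ \text{GFLOPS/W}, \qquad \frac{1.31 \times 10^{15}\ \text{bytes}}{93.0 \times 10^{15}\ \text{FLOPS}} \approx 0.014\ \text{bytes/FLOP}$$

The same two ratios are computed for Tianhe-2 and Titan on the following slides.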

Tianhe-2 (Guangzhou, China; operational 2013)
- Machine: 16,000 nodes, each with 2 Intel Xeons and 3 Intel Xeon Phis
- Performance: HPL 33.9 PF; HPCG 0.58 PF (world's best); 17.8 MW; 1.02 PB DRAM
- Ratios (bigger is better): 1.9 GigaFLOPS/Watt; 0.030 bytes/FLOP

Titan (Oak Ridge, TN; operational 2012)
- Machine: 18,688 nodes, each with a 16-core Opteron and a Tesla K20X GPU
- Performance: HPL 17.6 PF; HPCG 0.32 PF; 8.2 MW; 0.71 PB DRAM
- Ratios (bigger is better): 2.2 GigaFLOPS/Watt; 0.040 bytes/FLOP

How Powerful is a Titan Node?
- Titan CPU: Opteron 6274 (Nov. 2011, 32 nm technology), 2.2 GHz, 16 cores (no hyperthreading), 16 MB L3 cache, 32 GB DRAM
- Titan GPU: Kepler K20X (Feb. 2013, 28 nm), CUDA capability 3.5, 3.9 TF peak (SP)
- GHC machine CPU: Xeon E5-1660 (June 2016, 14 nm technology), 3.2 GHz, 8 cores (2x hyperthreaded), 20 MB L3 cache, 32 GB DRAM
- GHC machine GPU: GeForce GTX 1080 (May 2016, 16 nm), CUDA capability 6.0, 8.2 TF peak (SP)

Performance of Top 500 Machines
- Machines fall far off peak when performing HPCG (chart from a presentation by Jack Dongarra)

What Lies Ahead: DOE CORAL Program
- Announced Nov. 2014; delivery in 2018
- Vendor #1: IBM + NVIDIA + Mellanox; 3,400 nodes, 10 MW, 150-300 PF peak
- Vendor #2: Intel + Cray; 50,000 nodes (Xeon Phis), 13 MW, 180 PF peak

TECHNOLOGY CHALLENGES

Moore's Law
- The basis for ever-increasing computer power
- We've come to expect it will continue

Challenges to Moore's Law: Technical
- Must continue to shrink feature sizes: roughly 4 nm features projected for 2022, while the Si lattice spacing is 0.54 nm, so we are approaching the atomic scale (a 4 nm feature spans only about 7 lattice constants)
- Difficulties: lithography at such small dimensions; statistical variations among devices

Challenges to Moore's Law: Economic
- Growing capital costs: a state-of-the-art fab line costs on the order of $20B
- Very high volumes are needed to amortize the investment
- This has led to major consolidations

Dennard Scaling
- Due to Robert Dennard, IBM, 1974; quantifies the benefits of Moore's Law
- How to shrink an IC process: reduce horizontal and vertical dimensions by k; reduce the voltage by k
- Outcomes: devices per chip increase by k^2; clock frequency increases by k; power per chip stays constant (derivation below)
- Significance: increased capacity and performance with no increase in power
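The constant-power outcome follows from the dynamic-power relation P ~ C V^2 f for a single device, where capacitance C scales with feature size. Scaling dimensions and voltage by 1/k gives C -> C/k and V -> V/k, allows f -> k f, and fits k^2 times as many devices on the chip:

$$P_{\text{chip}} \approx k^{2} \cdot \frac{C}{k} \cdot \left(\frac{V}{k}\right)^{2} \cdot (k f) = C V^{2} f \quad (\text{unchanged})$$

So capacity grows by k^2 and frequency by k while chip power stays flat, which is exactly what the next slide says broke down once voltages could no longer be lowered.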

End of Dennard Scaling
- What happened? The voltage can't be dropped below about 1 V; we reached the limit of power per chip in 2004
- More logic fits on a chip (Moore's Law), but it can't be made to run faster
- The response has been to increase the number of cores per chip

Research Challenges
- Supercomputers: Can they be made more dynamic and adaptive (a requirement for future scalability)? Can they be made easier to program, with abstract, machine-independent programming models?
- Data-intensive computing: Can it be adapted to provide better computational performance? Can it make better use of data locality, a performance- and power-limiting factor?
- Technology / economics: What will we do when Moore's Law comes to an end for CMOS? How can we ensure a stable manufacturing environment?

