Assessing Sustained System Performance - A Practical Solution . - XSEDE


Assessing Sustained System Performance - A Practical Solution for Heterogeneous Systems
Torsten Hoefler, Bill Kramer, Greg Bauer, & AUS@BW
All used images belong to the owner/creator!

The State of Performance Measurements
Most used metric: Floating Point Performance
  That’s what limited performance in the 80’s!
  Systems were balanced, peak was easy!
  FP performance was the limiting factor
Architecture Update (2012):
  Deep memory hierarchies
  Hard to predict and model
  Algorithmic structure and data locality matter
  Complicates things further

Rough Computational Algorithm Classification
High locality, moderate locality, low locality
Highly Structured
  Dense linear algebra (HPL)
  FFT
  Stencil
Semi-structured
  Adaptive refinements
  Sparse linear algebra
Unstructured
  Graph computations (Graph500)

How do we assess performance?
Microbenchmarks
  Libraries (DGEMM, FFT)
  Communication (p2p, collective)
Application Microbenchmark
  HPL (for historic reasons?)
  NAS (outdated)
Applications

We still somehow agree on FLOPS
  because that’s what we always did
  And it’s an OK metric
But the benchmarks should reflect the workload
  “Sustained performance”
  Cf. “real application performance”
In the Blue Waters context
  “Sustained Petascale Performance” (SPP)
  Reflects the NSF workload

The SPP Metric
Enables us to
  Compare different computer systems
  Verify system performance and correctness
  Monitor performance through lifetime
  Guide design of future systems
It has to represent the “average workload” and must still be of manageable size
We chose ten applications (8 x86, 4 GPU)
Performance is geometric mean of all apps
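
The aggregation step named on this slide can be sketched as follows. This is a minimal illustration of a geometric mean over per-application rates, not the Blue Waters SPP tooling; the function name `spp_geometric_mean` and the example rates are hypothetical.

```python
import math

def spp_geometric_mean(rates):
    """Geometric mean of per-application sustained rates (all must be > 0)."""
    if not rates:
        raise ValueError("need at least one application rate")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive")
    # Average the logarithms instead of multiplying, which avoids overflow
    # when many large rates are combined.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

# Hypothetical per-application rates in arbitrary units (not measured data).
example_rates = [1.0, 4.0, 16.0]
print(spp_geometric_mean(example_rates))
```

Compared with an arithmetic mean, the geometric mean keeps a single very fast application from dominating the aggregate, which matters when per-application rates differ widely.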

Validating a System Model – Memory I
Stride-1 word load/store/copy (32 MiB data):
  1 int core r/w/c: 3.8 / 4 / 3 GB/s
  16 int cores (1 IL) r/w/c: 32 / 16 / 9.6 GB/s
  32 int cores (2 IL) r/w/c: 32 / 16 / 9.6 GB/s
Comments:
  Very high fairness between cores
  Very low variance between measurements
Measured with Netgauge 2.4.7, pattern memory/stream

Validating a System Model – Memory II
CL latency (random pointer chase, 1 GiB data):
  1 int core: 110 ns
  16 int cores (1 IL): 257 ns
  32 int cores (2 IL): 258 ns
Comments:
  High fairness between cores
  Low variance between measurements
Measured with Netgauge 2.4.7, pattern memory/pchase
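
A random pointer chase can be set up roughly as sketched below. This is an assumption about the general technique, not Netgauge's implementation: the helper names `build_chase_cycle` and `chase` are hypothetical, and timings taken in Python would be dominated by interpreter overhead, so a real latency measurement would use a compiled language.

```python
import random

def build_chase_cycle(n, seed=0):
    """Return next_index so that following it from slot 0 visits all n slots once."""
    rng = random.Random(seed)
    next_index = list(range(n))
    # Sattolo's algorithm: yields a random permutation consisting of one cycle,
    # so the chase cannot get stuck in a short loop that fits in cache.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is always < i
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index

def chase(next_index, steps, start=0):
    """Follow the chain; every access depends on the result of the previous one."""
    pos = start
    for _ in range(steps):
        pos = next_index[pos]
    return pos

chain = build_chase_cycle(1 << 16)
print(chase(chain, len(chain)))  # a full lap returns to the start slot
```

Because each load address depends on the previous load, the hardware cannot overlap the accesses, which is why this pattern exposes latency rather than bandwidth.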

Validating a System Model – Memory III
Random word access bandwidth (32 MiB data):
  1 int core r/w/c: 453 / 422 / 228 MiB/s
  16 int cores (1 IL) r/w/c: 241 / 119 / 77 MiB/s
  32 int cores (2 IL) r/w/c: 241 / 119 / 77 MiB/s
Comments:
  96% of stream bandwidth
  Very high fairness between cores
  Very low variance between measurements
Measured with Netgauge 2.4.7, pattern memory/rand

Validating a System Model – Network Scaling
Average random latency and variance
[Figure: two panels, 1 process per node and 32 processes per node; annotated values: 5 us, 50 us]
Measured with Netgauge 2.4.7, pattern ebb

Validating a System Model – Collectives
Large message (4k) alltoall performance
Model: unclear (depends on mapping etc.)
[Figure: two panels, 1 process per node and 32 processes per node; annotated values: 20x, 10 MB/s/proc]
Measured with Netgauge 2.4.7, pattern nbcolls

The SPP Application Mix
Representative Blue Waters applications:
  NAMD – molecular dynamics
  MILC, Chroma – Lattice Quantum Chromodynamics
  VPIC, SPECFEM3D – Geophysical Science
  WRF – Atmospheric Science
  PPM – Astrophysics
  NWCHEM, GAMESS – Computational Chemistry
  QMCPACK – Materials Science

The Grand Modeling Vision
Our very high-level strategy consists of the following six steps:
  1) Identify input parameters that influence runtime
  2) Identify application kernels
  3) Determine communication pattern
  4) Determine communication/computation overlap
  5) Determine sequential baseline
  6) Determine communication parameters
[Slide labels grouping the steps: Analytic, Empiric]
Hoefler, Gropp, Snir, Kramer: Performance Modeling for Systematic Performance Tuning, SC11

A Simplified Modeling Method
Fix input problem (omit step 1)
  No fancy tools, simple library using PAPI (libPGT)
Determine performance-critical kernels
  We demonstrate a simple method to identify kernels
Analyze kernel performance
  Using black-box counter approach
  More accurate methods if time permits
Establish system bounds, roofline
  What can be improved?
  Are we hitting a bottleneck?
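
The black-box counter approach boils down to turning a few raw counts per kernel into derived metrics. The sketch below assumes the counts have already been collected and stored under the PAPI preset event names `PAPI_TOT_CYC`, `PAPI_TOT_INS` and `PAPI_FP_OPS`; it does not call PAPI or libPGT, and the function name and example values are hypothetical.

```python
def derived_metrics(counters, seconds):
    """Derive IPC, effective clock and FLOP rate from raw counter values."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cycles = counters["PAPI_TOT_CYC"]
    instructions = counters["PAPI_TOT_INS"]
    fp_ops = counters["PAPI_FP_OPS"]
    return {
        # Instructions per cycle: low values hint at stalls or branchy code.
        "ipc": instructions / cycles,
        # Cycles per wall-clock second approximate the effective clock only
        # if the core was busy for the whole measured interval.
        "effective_ghz": cycles / seconds / 1e9,
        "gflops": fp_ops / seconds / 1e9,
    }

# Hypothetical counter readings for one kernel (not measured data).
sample = {
    "PAPI_TOT_CYC": 2_400_000_000,
    "PAPI_TOT_INS": 1_200_000_000,
    "PAPI_FP_OPS": 300_000_000,
}
print(derived_metrics(sample, seconds=1.0))
```

Grouping kernels by metrics such as IPC is one plausible way to decide which ones deserve a closer, more accurate analysis.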

NAMD
Dynamic scheduling complicates model
Excellent cache locality
PME performs well but will slow down at scale (alltoall)
Good IPC

MILC
Five phases, CG most critical at scale
Low FLOPs and IPC
Turbo boost seems to help here!
Low FLOPs are under investigation (already using SSE)

PPM
Many micro-phases
  Hard to instrument
Very highly optimized by science team
  Cache blocking
  High FLOP rate
  High locality

QMCPACK
Variational Monte Carlo initializes
  Performance issues are investigated
Diffusion Monte Carlo:
  load balance (LB)
  update walker (uw)

WRF
Microphysics dominates
  Low performance, many branches
Planetary Boundary Layer also problematic
  Turbo Boost helps!
Runge Kutta is fast
  High locality

SPECFEM3D
Two phases, both do small mat-mat mult
Internal forces perform well

NWCHEM
Highly optimized
  Even running in turbo boost!
  Very good locality
Steps 3 and 4 decent
Step 5 close to peak!

Some Early Conclusions
Average Effective Frequency: 2.40 GHz
  Anticipated frequency: 2.45 GHz
Average FLOP rate: 1.48 GF (min: 398 MF (WRF), max: 6.876 GF (NWCHEM))
  15% of peak
  Standard deviation: 1.37 GF (!!!)
But what does that mean?
  Are we hitting any limits/bottlenecks?
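
The "15% of peak" figure can be checked with simple arithmetic. The sketch below takes the 1.48 GF average and the 2.45 GHz anticipated frequency from the slide; the value of 4 double-precision FLOPs per cycle per integer core is an assumption made here for illustration and is not stated in the slides.

```python
def fraction_of_peak(sustained_gflops, freq_ghz, flops_per_cycle):
    """Sustained rate divided by the theoretical per-core peak rate."""
    peak_gflops = freq_ghz * flops_per_cycle
    return sustained_gflops / peak_gflops

# 1.48 GF and 2.45 GHz come from the slide; 4 FLOPs/cycle is an assumption.
share = fraction_of_peak(1.48, 2.45, 4)
print(f"{share:.1%}")  # prints 15.1%
```

Under that assumption the per-core peak is 9.8 GF, which is consistent with the percentage quoted on the slide.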

Serial Performance - The Roofline Model
Williams et al.: “Roofline: An Insightful Visual Performance Model”, CACM 2009
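
The basic roofline bound from Williams et al. caps attainable performance at the smaller of the compute peak and the memory bandwidth times the arithmetic intensity. The sketch below is a minimal version of that bound; the 2 GB/s per-core bandwidth matches the stream figure in the backup slides, while the 9.8 GF peak and the function names are assumptions for illustration.

```python
def roofline_gflops(intensity_flops_per_byte, peak_gflops, bandwidth_gb_per_s):
    """Upper bound on attainable GFLOP/s for a kernel of given intensity."""
    return min(peak_gflops, bandwidth_gb_per_s * intensity_flops_per_byte)

def ridge_point(peak_gflops, bandwidth_gb_per_s):
    """Intensity (FLOPs/byte) above which a kernel becomes compute bound."""
    return peak_gflops / bandwidth_gb_per_s

PEAK_GF = 9.8   # assumed per-core peak, not taken from the slides
STREAM_GB = 2.0 # stream bandwidth per integer core from the backup slides

for intensity in (0.25, 1.0, 10.0):
    print(intensity, roofline_gflops(intensity, PEAK_GF, STREAM_GB))
```

A kernel whose measured rate sits well below this bound at its intensity has room for improvement; one that sits on the sloped part of the roof is limited by memory bandwidth rather than by the floating point units.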

The poster child - NWCHEM
Thanks to Victor Anisimov

MIMD Lattice Computation - MILC
Thanks to Greg Bauer

MIMD Lattice Computation - MILC
Cache-aware programming
Thanks to Greg Bauer

MIMD Lattice Computation - MILC
Micro Optimization
Thanks to Greg Bauer

Lessons learned and Discussion
Performance modeling is a powerful tool
  To detect bottlenecks and bounds
The Roofline Model
  PAPI/CrayPAT may fool you (uncore events)
  Bandwidth read or write?
  Shows most important characteristics of serial codes
  Room for improvement and interpretation
  Roofline may depend on critical parameters
Could use a tool to handle all of this!
Hoefler: “Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC”

Conclusions & Future Work
We modeled the performance of several SPP applications
  Gained insight on limits/bounds
Kernel classification through IPC works well
  Not automatic yet
Kernel profiling works in early stages
  Need better tools
Extending modeling towards communication
  “MPI counters”, congestion, etc.
Hoefler, Gropp, Snir, Kramer: Performance Modeling for Systematic Performance Tuning, SC11

Acknowledgments
Thanks to
  Gregory Bauer (pulling together the data)
  Victor Anisimov, Eric Bohm, Robert Brunner, Ryan Mokos, Craig Steffen, Mark Straka (SPP PoCs)
  Bill Kramer, Bill Gropp, Marc Snir (general modeling ideas/discussions)
  The Cray performance group (Joe Glenski et al.)
  The National Science Foundation

Backup Slides

Blue Waters in a Nutshell
XE6 with AMD Interlagos 2.3-2.6 (3.0?) GHz
  390k BD modules, 780k INT cores
XK6 with Kepler GPUs
  3k
Gemini Torus
  Very large (23x24x24), BB-challenged, torus
How do we make sure the (heterogeneous) system is ready to fulfill its mission?
  Well, confirm a certain SPP number ( 1PF!)

Performance Counter Sanity Checks
Running small test kernels to check counters
  [Figure legend: s = small, l = large]
Stream: 2 GB/s per integer core
LL CACHE MISSES are L2 misses!?
Still a proxy metric (use with caution!)

Upping my FLOPS (if I were a vendor)
Algorithms may have different FLOP counts
  Slow time to solution but high FLOPS (dense LA)
  Same time to solution, more FLOPS
  Single or half precision FLOPS (esp. GPUs)
  Redundant FLOPS for parallel codes
Performance counters are thus not reliable!
  They just count the observed, not the necessary FLOPS

Reference FLOP Counts
We establish “reference FLOP count”
  Specific to an input problem
  Ideally established analytically
  Or (if necessary) on reference code on x86
    Single-core run (or several parallel runs)
Input problem needs to be clearly defined
Set the right expectations
  Real, complete science run vs. maximum FLOPS
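
The idea of a reference FLOP count can be illustrated with a kernel whose count is known analytically. The sketch below uses the conventional 2n^3 count for a dense n x n matrix multiply and divides it by the time to solution; the function names, problem size and timings are hypothetical.

```python
def reference_matmul_flops(n):
    """Conventional analytic FLOP count for an n x n dense matrix multiply."""
    return 2 * n ** 3

def sustained_gflops(reference_flops, seconds):
    """Rate credited to a run: fixed reference work over time to solution."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return reference_flops / seconds / 1e9

n = 1000
runtime_s = 2.0  # hypothetical time to solution
reference = reference_matmul_flops(n)

# A run that executes 50% redundant FLOPs would report a higher rate from
# hardware counters, yet it solves the same problem in the same time.
counted = 1.5 * reference  # hypothetical counter reading
print(sustained_gflops(reference, runtime_s))  # rate based on reference count
print(counted / runtime_s / 1e9)               # inflated counter-based rate
```

Fixing the numerator per input problem means that only a shorter time to solution can raise the reported rate, which is the property the slide asks for.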
