AMD EPYC for HPC

AMD EPYC for HPC
Overview, Strategies, and Best Practices for the HPC Community
Part 1, Summer 2021

Overview of AMD EPYC in the HPC space
- What's all the fuss?
- History of EPYC
- HPC community adoption
Buying/Using EPYC for HPC
- What problems are you trying to solve?
- What does someone new to EPYC need to know?
- What to buy and why?
- What traps to avoid?
Best Practices
- BIOS, networking, OS
- Applications

- Humble employee of .
- Lifelong Blue Devil
- Grew up in Durham, North Carolina
- Duke '05 grad
- 5 national titles, a million amazing memories
- Still can't believe we lost to UConn in the '99 and '04 Final Fours

- Principal Program Manager, Azure HPC (2017 – present); lead for Azure H-series (CPUs, RDMA networking)
- Director, HPC Solutions, Cycle Computing (2016-2017)
- National Center for Supercomputing Applications, University of Illinois (2009-2016)

This talk is not:
- A clever Azure marketing ploy
- An advertisement for AMD
- An anti-Intel rant
- A PhD-level thesis
This talk is:
- A contribution to the broader HPC community from a group that has deployed a lot of AMD EPYC for HPC/AI
- Digestible, pragmatic guidance for those thinking of buying, or who have already bought, AMD EPYC
- Recommendations and data to help answer common questions, save you time, and support HPC workloads
- An open invite to ask questions and get my best, most data-driven answers

TL;DR: the EPYC CPU is a credible alternative to Intel in the datacenter for buyers and users of HPC:
- Leadership memory bandwidth and I/O
- Competitive FLOPS
- x86 compatibility
- Very good power efficiency
- Highly competitive economics
All things we in the HPC world really like!

EPYC generations at a glance:
- 2012: "Piledriver" core uArch, "Abu Dhabi" SoC; up to 16 cores; 64 GB/s DRAM bandwidth; PCIe 2.0
- 2017: "Zen 1" core uArch, "Naples" SoC; up to 32 cores; 260 GB/s DRAM bandwidth; up to 4 MB L3/core; PCIe 3.0
- 2019: "Zen 2" core uArch, "Rome" SoC; up to 64 cores; 340 GB/s DRAM bandwidth; up to 16 MB L3/core; PCIe 4.0
- 2021: "Zen 3" core uArch, "Milan" SoC; up to 64 cores; 340 GB/s DRAM bandwidth; up to 32 MB L3/core; PCIe 4.0

TL;DR: Chiplets help increase fab yields, shorten schedules, lower cost, and improve socket-level performance and power efficiency.
[Figure: EPYC "Rome" die shot showing a central I/O die surrounded by four quadrants, each with 2 channels of DDR4]
Pros: all of the above are good!
Tradeoffs: users and developers need to think of EPYC CPUs as almost "clusters on a chip" and be aware of how best to overlay software on top of this kind of hardware. E.g., the layout above is more "4 x 2-channel memory" than a monolithic "8-channel".

Zen 3 ("Milan") vs. Zen 2 ("Rome")
Similarities:
- Same core counts
- Same 280 W max TDP
- Same PCIe 4.0 support
- Same 8-channel DDR4-3200 (2 channels per quadrant)
- Same 16 GT/s xGMI
Differences:
- 2x addressable L3 cache per core
- 19% higher IPC for Zen 3 vs. Zen 2
- Higher frequencies
- Better memory latencies

Top500 – new petascale and exascale systems: 89 PF peak, 552 PF peak, 1.5 EF peak, and 2 EF peak.

Azure HPC VM families powered by EPYC:
- HBv1 (HB-series, CPU-based HPC): EPYC Gen 1 "Naples", 100 Gb EDR InfiniBand, Q2 2019. Docs: https://bit.ly/3CQbIox
- HBv2 (CPU-based HPC): EPYC Gen 2 "Rome", 200 Gb HDR InfiniBand, Q1 2020. Docs: https://bit.ly/3iT23Wi
- HBv3 (CPU-based HPC): EPYC Gen 3 "Milan", 200 Gb HDR InfiniBand, Q1 2021. Docs: https://bit.ly/37REkQ5
- NDv4 (ND A100 v4-series, GPU-based HPC/AI): EPYC Gen 2 "Rome", 8 x NVIDIA A100 NVLINK 40 GB, 8 x 200 Gb GDR, Q3 2021. Docs: https://bit.ly/3xSd1zE

InfiniBand network core:
- Up to 200 Gb HDR
- Non-blocking fat tree topology
- Hardware offload of MPI collectives
- Supports all MPI implementations
- 1.3 microsecond latencies
- Bare-metal passthrough
- Dynamic Connected Transport
- Intelligent adaptive routing
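If you want to sanity-check the latency figure above on your own nodes, a minimal MPI ping-pong along these lines is usually enough. This is only a sketch, not the vendor's benchmark: it assumes mpi4py and NumPy are installed and that exactly two ranks are launched (e.g. `mpirun -np 2 python pingpong.py`).

```python
# Rough point-to-point latency check over the fabric (sketch, two ranks).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype="b")   # 1-byte message
iters = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()

if rank == 0:
    # Each iteration is one round trip; half of that is the one-way latency.
    print(f"one-way latency: {(t1 - t0) / iters / 2 * 1e6:.2f} microseconds")
```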

[Chart: application performance of Azure HC and HBv2 VMs relative to the highest published result on other public clouds (normalized to 1x) for CP2K (quantum chemistry), Graph500 (graph analytics), Star-CCM+ (CFD), WRF (weather), and NAMD (biophysics); the Azure results are up to 3.1x faster.]

First step: what are the most important problems you are trying to solve for, and how do you stack-rank them?
- What is the relevant level(s) of scale?
- Pure performance?
- Performance/ ?
- Cost/performance?
- Simplest possible HPC evolution for my users?
- A platform supported by ISVs and/or required SW toolchains?
- A platform for accelerators?
- Lowest possible cost?
- Something else?
Frequent answer from Azure HPC customers: "best performance and cost/performance for my main workloads, with as minimal user education as possible"

EPYC performance can be extremely good for a CPU
- A typical Haswell/Broadwell to Rome/Milan move will seem like an enormous leap for most workloads
- How good depends on what your workload scales with (memory bandwidth? L3? compute? frequency?)
Realize (and explain) that performance, or cost per job, is what matters
- Infrastructure doesn't scale by "cores"; you buy or rent servers (nodes)
- Clock frequency is not performance (don't just chase "MOAR GIGAHURTZ!!")
- Performance scales per server (or VM), or by N scalable network endpoints (MPI)
- It doesn't matter whether you used all the cores (do you worry about this for RAM? cache? CUDA cores in a GPU? RDMA bandwidth?)
- Exception scenario: you are using expensive software licensed per core
Affinitize processes explicitly and with an understanding of the hardware topology
- It is generally advisable to distribute processes evenly across physical L3 boundaries (4 cores per L3 on Rome, 8 cores per L3 on Milan); see the sketch below
- Don't just throw N processes at the server and assume the app/OS will automagically figure out placement for you
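As one illustration of "distribute evenly across L3 boundaries", here is a minimal sketch of how a pinning list could be generated. The constants and the helper name are hypothetical, and it assumes contiguous core numbering in which each group of consecutive cores shares one L3/CCX (typical of EPYC bare metal and Azure HB-series VMs); adapt it to the topology your OS actually reports.

```python
# Spread MPI ranks evenly across L3 slices instead of packing them.
CORES_PER_L3 = 8     # 4 on Rome, 8 on Milan (assumption: adjust per CPU)
TOTAL_CORES  = 120   # e.g. a hypothetical 120-core node/VM
N_RANKS      = 30    # MPI ranks you plan to launch

def spread_ranks_over_l3(total_cores, cores_per_l3, n_ranks):
    """Return one core ID per rank, distributing ranks evenly over L3 slices."""
    n_l3 = total_cores // cores_per_l3
    ranks_per_l3, rem = divmod(n_ranks, n_l3)
    if rem:
        raise ValueError("rank count does not divide evenly across L3 slices")
    pinning = []
    for l3 in range(n_l3):
        base = l3 * cores_per_l3
        # take the first `ranks_per_l3` cores of each L3 slice
        pinning.extend(base + i for i in range(ranks_per_l3))
    return pinning

cores = spread_ranks_over_l3(TOTAL_CORES, CORES_PER_L3, N_RANKS)
# Feed this comma-separated core list to your MPI launcher's explicit
# binding option (or to numactl) rather than relying on default placement.
print(",".join(str(c) for c in cores))
```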

ANSYS Fluent 2019.5, aircraft 14M-cell case, 1x Azure HBv2 VM, scaling from 1 to 4 processes per NUMA domain.
Per-core performance depends heavily on how the cores in use in a node sub-divide global shared assets that have a significant impact on performance:
- DRAM bandwidth
- L3 cache capacity and bandwidth
- On-die and inter-socket bandwidth ("GMI" and "xGMI")
- Power and thermal headroom to increase clock frequencies
- For MPI workloads, network bandwidth/latency
In one circled configuration the cores appear to be 2.5x faster than the cores in the other. Are they? No, they are the exact same cores in the exact same server, just getting different allocations of global shared assets.
- 1 process per NUMA domain: 63% of the best possible performance, but 1/4 of the cores per node and licenses used (0.63 / 0.25 ≈ 2.5, which is where the apparent 2.5x per-core advantage comes from)
- 4 processes per NUMA domain: 100% of the best possible performance, but 4x the cores and licenses used; still just 1 node of infrastructure

Even for compute-bound apps, per-core performance depends on whether, and to what degree, global shared assets are being exhausted. The same phenomenon will generally occur on other CPUs, too (e.g., Intel Xeon).

HPL on Azure HBv3 ("Milan") vs. bare metal:
- Cores per CCD: 1 | 2 | 4 | 6
- Cores used: 16C | 32C | 64C | 96C
- HBv3 HPL: 0.76 | 1.40 | 2.23 | 2.86
- Bare-metal HPL: 0.76032 | 1.4257152 | 2.25344 | 2.87232
- Expected HPL efficiency: 90% | 90% | 87% | 85%
- VM as a % of metal: 1x | 0.98x | 0.99x | 1x

Note the decline in expected and delivered HPL efficiency; this is due to gradually running out of data fabric (GMI) bandwidth.
Lesson: target an EPYC CPU model with a core count that returns commensurate value for the increase in cost.
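For clarity, the "VM as a % of metal" row is simply the ratio of the HPL result measured inside the HBv3 VM to the result measured on equivalent bare metal; a quick sketch of that arithmetic, with the values copied from the table above:

```python
# "VM as a % of metal" = HBv3 VM HPL / bare-metal HPL (values from the table).
cores_used = [16, 32, 64, 96]
vm_hpl     = [0.76, 1.40, 2.23, 2.86]
metal_hpl  = [0.76032, 1.4257152, 2.25344, 2.87232]

for cores, vm, metal in zip(cores_used, vm_hpl, metal_hpl):
    print(f"{cores} cores: VM delivers {vm / metal:.0%} of bare metal")
```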

TL;DR: EPYC packs in so much memory bandwidth, L3 cache, data fabric performance, etc., that for many HPC apps, even at ISO core counts, it will often outperform Xeon.
Disclaimers:
- This is not shown for "Azure vs. AWS" purposes (Azure Skylake in the HC-series would look similar to AWS's Skylake in this case)
- Nor is Skylake used as representative of all Intel Xeon (e.g., Ice Lake would do better than Skylake here)
- Nor is OpenFOAM indicative of every HPC workload
Optimizing OpenFOAM Performance and Cost on Azure HBv2 VMs - https://bit.ly/3xUbOYo

Big differences in 1-node performance can shrink at scale, and small differences in 1-node performance can grow at scale. Both scenarios can change the calculus of which CPU platform to invest in, and how to configure those platforms.

Do I need AVX-512? TL;DR: *likely* not a big deal.
- Few HPC apps support AVX-512 as is, and even fewer are heavily optimized for it
- Anything that supports AVX-512 likely also has an AVX2 binary (e.g., GROMACS)
- For those that are optimized, EPYC's core count advantage makes up the difference (with no need for AVX-512 support):
  - Scenario 1: (2 CPUs/server) x (28 cores per Cascade Lake 8280) x (32 ops/cycle) x (1.9 GHz SIMD-bound frequency) ≈ 3.4 teraFLOPS FP64 (peak)
  - Scenario 2: (2 CPUs/server) x (64 cores per Rome 7742) x (16 ops/cycle) x (2.2 GHz SIMD-bound frequency) ≈ 4.5 teraFLOPS FP64 (peak)
- Exception: you have an AVX-512 app *AND* it's licensed per core *AND* SW costs dominate TCO *AND* the problem is not communication bound
- Big picture: if your app is that purely compute bound, you probably want a GPU anyway
[Chart: scaling from 16 to 512 nodes comparing Frontera (AVX-512, CLX), Frontera (AVX2, CLX), and Azure HBv2 (AVX2, Rome)]
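The two peak-FLOPS scenarios above are just multiplication; a small sketch of that arithmetic, with the values taken from the slide and an illustrative function name:

```python
# Peak FP64 per server = sockets * cores * FLOPs/cycle * SIMD-bound frequency.
def peak_fp64_tflops(sockets, cores_per_socket, flops_per_cycle, simd_freq_ghz):
    """Theoretical peak FP64 throughput of one server, in teraFLOPS."""
    return sockets * cores_per_socket * flops_per_cycle * simd_freq_ghz / 1e3

# Scenario 1: dual-socket Cascade Lake 8280 with AVX-512 (32 FP64 ops/cycle)
print(peak_fp64_tflops(2, 28, 32, 1.9))   # ~3.4 TFLOPS
# Scenario 2: dual-socket EPYC Rome 7742 with AVX2 (16 FP64 ops/cycle)
print(peak_fp64_tflops(2, 64, 16, 2.2))   # ~4.5 TFLOPS
```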

Do I need to worry about MKL? TL;DR: very much an "it depends".
- In general, MKL will run just fine on EPYC
- If you have access to source, AMD's libraries are optimized and well supported for EPYC: https://developer.amd.com/amd-aocl/
- A backup option for MKL (prior to 2020) is to use Debug Mode Type 5 (not necessarily recommended, though)
- But some apps take a hard dependency on MKL and, as a result, deliver better performance, performance per unit of cost, and cost/performance on Intel Xeon

DGEMM on EPYC:
- Single-core DGEMM: MKL (Debug Mode enabled) 51.36 GigaFLOPS | MKL (Debug Mode disabled) 47.684 GigaFLOPS | BLIS 50.65 GigaFLOPS
- Multi-core DGEMM: MKL (Debug Mode enabled) 3239 GigaFLOPS | MKL (Debug Mode disabled) 1778 GigaFLOPS | BLIS 4020 GigaFLOPS
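To compare BLAS backends yourself, a rough DGEMM timing sketch like the one below is usually enough. It assumes NumPy is linked against the BLAS you want to test (MKL, BLIS, etc.); the "Debug Mode Type 5" mentioned above corresponds to setting the MKL_DEBUG_CPU_TYPE=5 environment variable before the process starts, which only affects MKL releases prior to 2020.

```python
# Minimal DGEMM throughput sketch; link NumPy against the BLAS under test.
import time
import numpy as np

n = 4000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                      # DGEMM: roughly 2*n^3 floating-point operations
elapsed = time.perf_counter() - start

gflops = 2 * n**3 / elapsed / 1e9
print(f"DGEMM {n}x{n}: {gflops:.1f} GigaFLOPS")
```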

BIOS settings (1 of 2):
- L3 as NUMA: defines the NUMA boundary. Enabled: 1 NUMA domain for every L3 slice. Disabled: the number of NUMA domains follows how you define NPS (recommended).
- Nodes per Socket (NPS): determines how memory interleaving is done. NPS1: simplest presentation. NPS2: 2-way interleaving per socket (recommended). NPS4: 4-way interleaving per socket; NPS4 is not an option on 6-CCD EPYC parts.
- Determinism Mode: Performance brings every CPU in the cluster down to the lowest common denominator of silicon yield; Power lets the motherboard drive the CPU to its best frequencies based on the frequency/power curve of the given CPU (recommended).
- C-States: Enabled gives the best "Fmax" (recommended); Disabled limits "Fmax".
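To confirm how the NPS and "L3 as NUMA" choices actually present to the operating system, a quick look at Linux sysfs is enough; a minimal sketch, assuming the standard /sys/devices/system/node layout:

```python
# Count the NUMA nodes the OS sees and list the CPUs in each one (Linux).
import glob
import os

nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
print(f"{len(nodes)} NUMA node(s) visible to the OS")
for node in nodes:
    with open(os.path.join(node, "cpulist")) as f:
        print(f"{os.path.basename(node)}: CPUs {f.read().strip()}")
```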

BIOS settings (2 of 2):
- Preferred IO: key PCIe device, e.g. the InfiniBand NIC (recommended)
- LCLK for key PCIe device: set to 593 (improves NIC latency)
- cTDP: configurable power range
- Package Power Limit (PPL): hard governor of the socket power limit; depends on your datacenter power limit and how you are assessed OPEX costs
- Simultaneous Multi-Threading (SMT): Enabled = 2 threads/core; Disabled = 1 thread/core

Further reading:
- High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors - https://bit.ly/3k0oiZL
- High Performance Computing (HPC) Tuning Guide for AMD EPYC 7002 Series Processors - https://bit.ly/3xRJzd1
- HPC Performance and Scalability Results with Azure HBv2 VMs - https://bit.ly/2XD7Ebj
- HPC Performance and Scalability Results with Azure HBv3 VMs - https://bit.ly/37PyCOM
- AMD Presentation to NASA, "Why AMD for HPC" - https://go.nasa.gov/3CVOMEz
- AMD Optimizing CPU Libraries (AOCL) - https://bit.ly/3m9afnf
- Optimizing OpenFOAM Performance and Cost on Azure HBv2 VMs - https://bit.ly/3xUbOYo

Thank you! Feedback and Q&A.
