Stan Posey, HPC Industry Development, NVIDIA, Santa Clara


Stan Posey, HPC Industry Development
NVIDIA, Santa Clara, CA, USA
sposey@nvidia.com

NVIDIA Introduction and HPC Evolution of GPUs

- Public company based in Santa Clara, CA; $4B revenue; 5,500 employees
- Founded in 1999 with primary business in the semiconductor industry
- Products for graphics in workstations, notebooks, mobile devices, etc.
- Began R&D of GPUs for HPC in 2004; released first Tesla and CUDA in 2007
- Development of GPUs as a co-processing accelerator for x86 CPUs

HPC Evolution of GPUs (3 generations of Tesla in 3 years):
- 2004: Began strategic investments in GPU as an HPC co-processor
- 2006: G80, first GPU with built-in compute features, 128 cores; CUDA SDK Beta
- 2007: Tesla 8-series based on G80, 128 cores – CUDA 1.0, 1.1
- 2008: Tesla 10-series based on GT200, 240 cores – CUDA 2.0, 2.3
- 2009: Tesla 20-series, code named "Fermi", up to 512 cores – CUDA SDK 3.0

How NVIDIA Tesla GPUs are Deployed in Systems

Data Center Products:
- Tesla M2050 / M2070 (adapter): 1 Tesla GPU; 1030 Gigaflops single precision; 515 Gigaflops double precision; 3 GB / 6 GB memory; 148 GB/s memory bandwidth
- Tesla S2050 (1U system): 4 Tesla GPUs; 4120 Gigaflops single precision; 2060 Gigaflops double precision; 12 GB memory (3 GB / GPU); 148 GB/s memory bandwidth

Workstation:
- Tesla C2050 / C2070 (workstation board): 1 Tesla GPU; 1030 Gigaflops single precision; 515 Gigaflops double precision; 3 GB / 6 GB memory; 144 GB/s memory bandwidth

Engineering Disciplines and Related Software

- Computational Structural Mechanics (CSM), implicit, for strength (stress) and vibration
  Structural strength at minimum weight, low-frequency oscillatory loading, fatigue
  ANSYS; ABAQUS/Standard; MSC.Nastran; NX Nastran; Marc
- Computational Structural Mechanics (CSM), explicit, for impact loads and structural failure
  Impact over short duration; contacts – crashworthiness, jet engine blade failure, bird strike
  LS-DYNA; ABAQUS/Explicit; PAM-CRASH; RADIOSS
- Computational Fluid Dynamics (CFD) for flow of liquids (water) and gases (air)
  Aerodynamics; propulsion; reacting flows; multiphase; cooling/heat transfer
  ANSYS FLUENT; STAR-CD; STAR-CCM+; CFD++; ANSYS CFX; AcuSolve; PowerFLOW
- Computational Electromagnetics (CEM) for EM compatibility, interference, radar
  EMC for sensors, controls, antennas; low-observable signatures; radar cross section
  ANSYS HFSS; ANSYS Maxwell; ANSYS SIwave; XFdtd; FEKO; Xpatch; SIGLBC; CARLOS; MM3D

Motivation for CPU Acceleration with GPUs

GPU Progress Status for Engineering Codes

[Status matrix: codes are grouped into "Available Today", "Product Evaluation", and "Release Coming in 2011" by discipline.]
- Structural Mechanics: ANSYS Mechanical; Abaqus/Standard (beta); AFEA; LS-DYNA implicit; Marc; RADIOSS implicit; PAM-CRASH implicit; MD Nastran; NX Nastran; LS-DYNA; Abaqus/Explicit
- Fluid Dynamics: AcuSolve; Moldflow; Culises (OpenFOAM); Particleworks; CFD-ACE+; FLUENT/CFX; STAR-CCM+; CFD++; LS-DYNA CFD
- Electromagnetics: Nexxim; EMPro; CST MS; XFdtd; SEMCAD X; Xpatch; HFSS

GPU Considerations for Engineering Codes

- Initial efforts are linear solvers on the GPU, but that is not enough
  Linear solvers are ~50% of profile time, so only a 2x speed-up is possible (see the worked bound below)
  More of each application will be moved to GPUs in progressive stages
- Most codes use a parallel domain decomposition method
  This fits the GPU model very well and preserves the costly MPI investment
- All codes are parallel and scale across multiple CPU cores
  Fair GPU vs. CPU comparisons should be CPU-socket-to-GPU-socket
  Comparisons presented here are made against a 4-core Nehalem
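The 2x ceiling is just Amdahl's law applied to that profile; a minimal worked form (the solver fraction f and solver-only speed-up s are notation introduced here, not from the slide):

    \[
      S(f, s) \;=\; \frac{1}{(1 - f) + \frac{f}{s}},
      \qquad f = 0.5,\; s \to \infty
      \;\;\Rightarrow\;\; S \to \frac{1}{1 - 0.5} = 2
    \]

So even an infinitely fast GPU solver is capped near 2x until more of the application moves to the GPU.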

Leading ISVs Who Develop Engineering Codes

- ANSYS: ANSYS CFD (FLUENT and CFX); ANSYS Mechanical; HFSS
- SIMULIA: Abaqus/Standard; Abaqus/Explicit
- LSTC: LS-DYNA
- MSC.Software: MD Nastran; Marc; Adams
- CD-adapco: STAR-CD; STAR-CCM+
- Altair: RADIOSS
- Siemens: NX Nastran
- ESI Group: PAM-CRASH; PAM-STAMP
- Metacomp: CFD++
- ACUSIM: AcuSolve
- Autodesk: Moldflow

GPU Priority by ISV Market Opportunity and "Fit"

#1 Computational Structural Mechanics (CSM), implicit, for strength (stress) and vibration
- ANSYS; ABAQUS/Standard; MSC.Nastran; Marc; NX Nastran; LS-DYNA; RADIOSS
[Chart: typical computational profiles of implicit CSM]
NOTE: Tesla C2050 is 4x faster at DGEMM vs. a quad-core Nehalem

DGEMM Improved 36% With CUDA 3.2 (Nov 2010)

[Chart: DGEMM Gflops vs. matrix size, in multiples of 64 from 64 x 64 up to 8384 x 8384. Curves: cuBLAS 3.1 and cuBLAS 3.2 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL 10.2.4.32 on a quad-core Intel Xeon 5550, 2.67 GHz.]
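As context for how such a curve is measured, a minimal sketch of a DGEMM timing loop using the cuBLAS v2 interface (the modern API, not the CUDA 3.x-era calls referenced on the slide); the matrix size and fill values are arbitrary assumptions:

    // Minimal cuBLAS DGEMM timing sketch (cuBLAS v2 API); build with: nvcc dgemm_bench.cu -lcublas
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 4096;                        // one of the slide's test sizes (multiple of 64)
        const double alpha = 1.0, beta = 0.0;
        const size_t bytes = (size_t)n * n * sizeof(double);

        std::vector<double> hA((size_t)n * n, 1.0), hB((size_t)n * n, 1.0);
        double *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Time C = alpha*A*B + beta*C with CUDA events
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);  cudaEventCreate(&t1);
        cudaEventRecord(t0);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double gflops = 2.0 * n * n * (double)n / (ms * 1.0e6);   // 2*n^3 flops
        printf("DGEMM %d x %d: %.1f Gflops\n", n, n, gflops);

        cublasDestroy(handle);
        cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
        return 0;
    }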

Basics of Implicit CSM Implementations

- Implicit CSM – deployment of a multi-frontal direct sparse solver
[Schematic: representation of the stiffness matrix that is factorized in the solver, ranging from large dense matrix fronts to small dense matrix fronts]

Basics of Implicit CSM Implementations (continued)

- Implicit CSM – deployment of a multi-frontal direct sparse solver
- Upper threshold: fronts too large for a single GPU's memory need multiple GPUs
- Lower threshold: fronts too small to overcome PCIe data transfer costs stay on the CPU cores
[Schematic: large dense matrix fronts vs. small dense matrix fronts]
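A minimal sketch of the dispatch heuristic those two thresholds imply, deciding per dense front whether it stays on the CPU cores, fits on one GPU, or needs multiple GPUs; the Front struct, the threshold values, and the memory budget are illustrative assumptions, not taken from any ISV solver:

    // Illustrative front-dispatch heuristic for a multi-frontal direct sparse solver.
    #include <cstdio>
    #include <cstddef>

    struct Front {
        size_t rows;                                       // order of the dense frontal matrix
        size_t bytes() const { return rows * rows * sizeof(double); }
    };

    enum class Target { CpuCores, SingleGpu, MultiGpu };

    // Lower threshold: small fronts cannot amortize PCIe transfer costs, keep on CPU.
    // Upper threshold: fronts larger than one GPU's memory budget go to multiple GPUs.
    Target dispatch(const Front& f,
                    size_t lower_rows     = 1024,          // assumed cutoff in rows
                    size_t gpu_mem_budget = 2ull << 30)    // assume ~2 GB usable of a 3 GB card
    {
        if (f.rows < lower_rows)        return Target::CpuCores;
        if (f.bytes() > gpu_mem_budget) return Target::MultiGpu;
        return Target::SingleGpu;
    }

    int main() {
        const Front fronts[] = { {200}, {4000}, {20000} };
        const char* names[]  = { "CPU cores", "single GPU", "multiple GPUs" };
        for (const Front& f : fronts)
            printf("front of %zu rows (%.2f GB) -> %s\n",
                   f.rows, f.bytes() / 1.073741824e9, names[(int)dispatch(f)]);
        return 0;
    }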

ANSYS Performance Study by HP and NVIDIA

HP ProLiant SL390 Server Configuration:
- Single server node – 12 total CPU cores, 1 GPU
- 2 x Xeon X5650 HC 2.67 GHz CPUs (Westmere)
- 48 GB memory – 12 x 4 GB 1333 MHz DIMMs
- NVIDIA Tesla M2050 GPU with 3 GB memory
- RHEL 5.4, MKL 10.25, NVIDIA CUDA 3.1 – 256.40
- Study conducted at HP by Domain Engineering

HP Z800 Workstation Configuration:
- 2 x Xeon X5570 QC 2.8 GHz CPUs (Nehalem)
- 48 GB memory
- NVIDIA Tesla C2050 with 3 GB memory
- RHEL 5.4, Intel MKL 10.25, NVIDIA CUDA 3.1
- Study conducted at NVIDIA by Performance Lab

ANSYS Mechanical Model – V12sp-5:
- Turbine geometry, 2,100 K DOF and SOLID187 FEs
- Single load step, static, large-deflection nonlinear
- ANSYS Mechanical 13.0 direct sparse solver

ANSYS Mechanical for Westmere GPU Server
NOTE: Results based on the ANSYS Mechanical R13 SMP direct solver, Sep 2010

[Chart: ANSYS Mechanical times in seconds (lower is better) on a single HP SL390 server node, 2 x Xeon X5650 2.67 GHz CPUs, 48 GB memory (12 x 4 GB at 1333 MHz), MKL 10.25; Tesla M2050, CUDA 3.1. CPU-only times: 2656 s on 1 core, 1616 s on 2 cores (1.6x), 1062 s on 4 cores (1.5x), 668 s on 6 cores (1.6x), 521 s on 12 cores (1.3x). NOTE: scaling is effectively limited to one 6-core socket.]

V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one load step, direct sparse solver

ANSYS Mechanical for Westmere GPU Server (with Tesla M2050)
NOTE: Results based on the ANSYS Mechanical R13 SMP direct solver, Sep 2010

[Chart: same HP SL390 configuration as above with a Tesla M2050 added. A single core with the GPU runs about 4.4x faster than a single core alone; GPU-accelerated times shown include 448 s, 438 s, and 398 s, with the 12-core + GPU result 1.3x faster than the 521 s 12-core CPU-only run. NOTE: add the GPU to 4 cores and the job is now faster than on all 12 cores, leaving 8 cores free for other work.]

V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one load step, direct sparse solver

ANSYS Mechanical for Nehalem GPU Workstation
NOTE: Results based on the ANSYS Mechanical R13 SMP direct solver, Sep 2010

[Chart: ANSYS Mechanical times in seconds (lower is better) on an HP Z800 workstation, 2 x Xeon X5560 2.8 GHz CPUs, 48 GB memory, MKL 10.25; Tesla C2050, CUDA 3.1. CPU-only times: 2604 s on 1 core, 1412 s on 2 cores (1.8x), 830 s on 4 cores (1.7x), 690 s on 6 cores (1.2x), 593 s on 8 cores (1.1x). NOTE: scaling is effectively limited to one 4-core socket.]

V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one load step, direct sparse solver

ANSYS Mechanical for Nehalem GPU Workstation (with Tesla C2050)
NOTE: Results based on the ANSYS Mechanical R13 sparse direct solver, Sep 2010

[Chart: same HP Z800 configuration as above with a Tesla C2050 added. Times with the GPU: 561 s with 1 core (4.6x), 471 s with 2 cores (3.0x), 426 s with 4 cores (2.0x), 411 s with 6 cores (1.7x), 390 s with 8 cores (1.5x). NOTE: add the C2050 to 4 cores and the job is now faster than on all 8 cores, leaving 4 cores free for other tasks.]

V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one load step, direct sparse solver

Effects of System CPU Memory for the V12sp-5 Model
NOTE: Results based on the ANSYS Mechanical R13 SMP direct solver, Sep 2010

[Chart: ANSYS Mechanical times in seconds (lower is better) on an HP Z800 workstation, 2 x Xeon X5560 2.8 GHz CPUs, MKL 10.25; Tesla C2050, CUDA 3.1, using 4 cores (dual socket) with and without the GPU at three memory configurations.
- 24 GB (out-of-memory solution): 1524 s CPU, 1214 s CPU + GPU (1.3x)
- 32 GB (out-of-memory solution): 1155 s CPU, 682 s CPU + GPU (1.7x)
- 48 GB (in-memory solution): 830 s CPU, 426 s CPU + GPU (2.0x)
NOTE: Most of the CPU and CPU + GPU benefit comes with the in-memory solution.]

V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one load step, direct sparse solver; 34 GB required for an in-memory solution

Effects of System CPU Memory for the V12sp-5 Model (continued)
NOTE: Results based on the ANSYS Mechanical R13 SMP direct solver, Sep 2010

[Chart: same runs as above, annotated with the gain from each memory increment. Going from 24 GB to 32 GB improves the CPU time by 32% (1524 s to 1155 s) and the CPU + GPU time by 78% (1214 s to 682 s); going from 32 GB to 48 GB improves the CPU time by 39% (1155 s to 830 s) and the CPU + GPU time by 60% (682 s to 426 s). NOTE: GPU results are far more sensitive to an out-of-memory solution.]

V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one load step, direct sparse solver; 34 GB required for an in-memory solution

Economics of Engineering Codes in Practice

Cost Trends in CAE Deployment: Costs in People and Software Continue to Increase
- Historically, hardware was very expensive vs. ISV software and people
- Software budgets are now 4x vs. hardware
- Increasingly important that hardware choices drive cost efficiency in people and software

Abaqus/Standard for Nehalem GPU Workstation
Abaqus/Standard: based on the v6.10-EF direct solver – Tesla C2050, CUDA 3.1 vs. 4-core Nehalem

[Chart: Tesla C2050 speed-up vs. 4-core Nehalem (higher is better), for the solver and for total time, on three models; speed-ups shown range from 2.0x to 3.7x, and solver performance increases with the FP operation count.
- S4b (engine block model): 5 MM DOF, 1.03E+13 FP ops
- Case 2: 3.7 MM DOF, 1.68E+13 FP ops
- Case 3: 1.5 MM DOF, 1.70E+13 FP ops
CPU profiles: 75%, 71%, and 80% of time in the solver for the three models.]

Source: "Current and Future Trends of High Performance Computing with Abaqus", presentation by Matt Dunbar, SIMULIA Customer Conference, 27 May 2010. Results based on a 4-core CPU.

Abaqus and NVIDIA Automotive Case Study
NOTE: Preliminary results based on the Abaqus/Standard v6.10-EF direct solver

[Chart: Abaqus/Standard times in seconds (lower is better), split into solver and non-solver times.
- Xeon 5550 2.67 GHz, 4 cores: 5825 s total = 4967 s solver + 858 s non-solver (CPU profile: 85% in the solver)
- Xeon 5550 2.67 GHz, 4 cores + Tesla C2050: 2659 s total = 1809 s solver + 850 s non-solver
- 2.2x faster in total, 2.8x faster in the solver]

Auto engine block model: 1.5M DOF, 2 iterations, 5.8e12 ops per iteration

Abaqus and NVIDIA Automotive Case Study (continued)
Results based on the preliminary v6.10-EF direct solver

[Chart: Abaqus/Standard times in seconds (lower is better) on an HP Z800 workstation, 2 x Xeon X5550 2.67 GHz CPUs, 48 GB memory, MKL 10.25; Tesla C2050, CUDA 3.1.
- 4 cores: 5825 s CPU-only vs. 2659 s with the Tesla C2050 (2.2x)
- 8 cores: 3224 s CPU-only vs. 1881 s with the Tesla C2050 (1.7x)]

Engine model: 1.5M DOF, 2 iterations, 5.8e12 ops per iteration

LS-DYNA 971 Performance for GPU Acceleration
NOTE: Results of LS-DYNA total time for a 300K DOF implicit model

[Chart: total LS-DYNA time in seconds (lower is better), CPU-only, on 2 x quad-core Xeon Nehalem (8 cores total). Times: 2030 s on 1 core, 1085 s on 2 cores (1.9x), 605 s on 4 cores (1.8x), 350 s on 8 cores (1.7x). NOTE: the CPU scales to 8 cores for a 5.8x benefit over 1 core.]

OUTER3 model: 300K DOF, 1 RHS

LS-DYNA 971 Performance for GPU Acceleration (continued)
NOTE: Results of LS-DYNA total time for a 300K DOF implicit model

[Chart: same runs with a Tesla C2050 added to 1, 2, 4, and 8 cores. GPU acceleration gives up to a 4.8x gain, and 1 core + GPU is already faster than 6 CPU cores; adding more cores further speeds up the total time.]

OUTER3 model: 300K DOF, 1 RHS

Distributed CSM and NVIDIA GPU Clusters
NOTE: Illustration based on a simple example of 4 partitions and 4 compute nodes

- Model geometry is decomposed; partitions are sent to independent compute nodes (N1–N4) on a cluster
- Compute nodes operate distributed-parallel, using MPI communication to complete a solution per time step
- A global solution is developed at the completed time duration

Distributed CSM and NVIDIA GPU Clusters (continued)
NOTE: Illustration based on a simple example of 4 partitions and 4 compute nodes

- Model geometry is decomposed; partitions are sent to independent compute nodes (N1–N4) on a cluster
- Compute nodes operate distributed-parallel, using MPI communication to complete a solution per time step
- Each partition would be mapped to a GPU (G1–G4) and provide shared-memory OpenMP parallelism – a 2nd level of parallelism in a hybrid model
- A global solution is developed at the completed time duration
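A minimal sketch of the hybrid mapping this slide illustrates, assuming one MPI rank per partition bound to one local GPU, with OpenMP as the second level of parallelism; the rank-to-GPU modulo mapping and the per-step work are illustrative, not taken from any ISV code:

    // Illustrative hybrid mapping: one MPI rank per partition, one GPU per rank,
    // OpenMP threads as the second level of parallelism within the partition.
    // Build with your MPI compiler wrapper, OpenMP enabled, and the CUDA runtime linked.
    #include <cstdio>
    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // Bind this rank (one mesh partition) to a GPU on its node.
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
        int mygpu = (ngpus > 0) ? rank % ngpus : -1;
        if (mygpu >= 0) cudaSetDevice(mygpu);

        // Second level of parallelism: OpenMP threads inside the partition
        // (e.g. element assembly before handing the fronts to the GPU solver).
        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d of %d: GPU %d, %d OpenMP threads\n",
                   rank, nranks, mygpu, omp_get_num_threads());
        }

        // Per time step: GPU-accelerated solve, then MPI exchange of interface data.
        MPI_Barrier(MPI_COMM_WORLD);   // stands in for the interface communication

        MPI_Finalize();
        return 0;
    }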

GPU Priority by ISV Market Opportunity and "Fit"

#2 Computational Fluid Dynamics (CFD)
- ANSYS CFD (FLUENT/CFX); STAR-CCM+; AcuSolve; CFD++; Particleworks; OpenFOAM
[Chart: typical computational profile of implicit CFD]
NOTE: Tesla C2050 is 9x faster at SpMV vs. a quad-core Nehalem
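For reference, a minimal sketch of the kernel class behind that SpMV comparison: a scalar one-thread-per-row CSR sparse matrix-vector product. This is the textbook formulation, not the tuned kernel used for the benchmark above:

    // Scalar CSR SpMV: one thread per matrix row (textbook formulation).
    #include <cuda_runtime.h>

    __global__ void spmv_csr(int nrows,
                             const int*    __restrict__ row_ptr,
                             const int*    __restrict__ col_idx,
                             const double* __restrict__ vals,
                             const double* __restrict__ x,
                             double*       __restrict__ y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            double sum = 0.0;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += vals[j] * x[col_idx[j]];
            y[row] = sum;
        }
    }

    // Launch helper: device arrays in, one thread per row.
    void spmv(int nrows, const int* row_ptr, const int* col_idx,
              const double* vals, const double* x, double* y)
    {
        const int block = 256;
        const int grid  = (nrows + block - 1) / block;
        spmv_csr<<<grid, block>>>(nrows, row_ptr, col_idx, vals, x, y);
    }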

Performance of AcuSolve 1.8 on Tesla

AcuSolve: the profile is SpMV dominant, but a substantial portion still runs on the CPU

Performance of AcuSolve 1.8 on Tesla (continued)
AcuSolve: comparison of multi-core Xeon CPU vs. Xeon CPU + Tesla GPU

[Chart: times (lower is better) for an S-duct model with 80K DOF; a hybrid MPI/OpenMP layout is used for the multi-GPU test.
- 4-core Xeon Nehalem CPU: 549
- 1 CPU core + 1 GPU: 279
- 4-core Xeon Nehalem CPU: 549
- 1 CPU core + 2 GPUs: 165]

CFD Developments and Publications on GPUs

48th AIAA Aerospace Sciences Meeting, Jan 2010, Orlando, FL, USA:
- FEFLO: Porting of an Edge-Based CFD Solver to GPUs [AIAA-2010-0523] – Andrew Corrigan, Ph.D., Naval Research Lab; Rainald Lohner, Ph.D., GMU
- FAST3D: Using GPU on HPC Applications to Satisfy Low-Power Computational Requirements [AIAA-2010-0524] – Gopal Patnaik, Ph.D., US Naval Research Lab
- OVERFLOW: Rotor Wake Modeling with a Coupled Eulerian and Vortex Particle Method [AIAA-2010-0312] – Chris Stone, Ph.D., Intelligent Light

CFD on Future Architectures, Oct 2009, DLR Braunschweig, DE:
- Veloxi: Unstructured CFD Solver on GPUs – Jamil Appa, Ph.D., BAE Systems Advanced Technology Centre
- elsA: Recent Results with elsA on Many-Cores – Michel Gazaix and Steve Champagneux, ONERA / Airbus France
- Turbostream: A CFD Solver for Many-Core Processors – Tobias Brandvik, Ph.D., Whittle Lab, University of Cambridge

Parallel CFD 2009, May 2009, NASA Ames, Moffett Field, CA, USA:
- OVERFLOW: Acceleration of a CFD Code with a GPU – Dennis Jespersen, NASA Ames Research Center

GPU Results for Grid-Based Continuum CFD
Success demonstrated across a full range of time and spatial scales

[Chart: speed-ups of roughly 2x to 15x over a 4-core Xeon X5550 2.67 GHz, spanning implicit (usually incompressible) and explicit codes on structured and unstructured grids. Codes and cases named include S3D (DNS), FEFLO (building air blast), AcuSolve, Moldflow, ISV codes, a U.S. engine company's internal flows, a chemical mixer, auto climate, and aircraft aerodynamics.]

Culises: New CFD Solver Library for OpenFOAM

Prometech and Particle-Based CFD for Multi-GPUs

Particleworks from Prometech Software:
- MPS-based method developed at the University of Tokyo [Prof. Koshizuka]
- Preliminary results for Particleworks 2.5, with release planned for 2011
- Performance is relative to 4 cores of an Intel i7 CPU
- Contact Prometech for release details

IMPETUS AFEA Results for GPU Computing

[Chart: reference run of 4.5 hours on 4 CPU cores]

Summary of Engineering Code Progress for GPUs

- GPUs are an emerging HPC technology for ISVs
- Industry-leading ISV software is GPU-enabled today
- Initial GPU performance gains are encouraging – just the beginning of more performance and more applications
- NVIDIA continues to invest in ISV developments – joint technical collaborations at most engineering ISVs

Contributors to the ISV Performance Studies

SIMULIA:
- Mr. Matt Dunbar, Technical Staff, Parallel Solver Development
- Dr. Luis Crivelli, Technical Staff, Parallel Solver Development
ANSYS:
- Mr. Jeff Beisheim, Technical Staff, Solver Development
USC Institute for Information Sciences:
- Dr. Bob Lucas, Director of Numerical Methods
ACUSIM (now a division of Altair Engineering):
- Dr. Farzin Shakib, Founder and President

Thank You, Questions?

Stan Posey, CAE Market Development
NVIDIA, Santa Clara, CA, USA

