The Visual Computing Company - HPC Advisory Council


The Visual Computing Company
GPU Acceleration Benefits for Applied CAE
Axel Koehler, Senior Solutions Architect HPC, NVIDIA
HPC Advisory Council Meeting, April 2014, Lugano

Outline
- General overview of GPU efforts in CAE
- Computational Structural Mechanics (CSM): ANSYS Mechanical, SIMULIA Abaqus/Standard, MSC Nastran, MSC Marc
- Computational Fluid Dynamics (CFD): ANSYS Fluent, OpenFOAM (FluiDyna, PARALUTION)
- Computational Electromagnetics (CEM): CST Studio Suite
- Conclusion

Status Summary of ISVs and GPU Computing
- Every primary ISV has products available on GPUs or undergoing evaluation
- The 4 largest ISVs all have products based on GPUs: #1 ANSYS, #2 DS SIMULIA, #3 MSC Software, and #4 Altair
- 4 of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, SIMULIA Abaqus/Standard, MSC Nastran (LS-DYNA implicit only)
- Several other ISV applications are already ported to GPUs: AcuSolve, OptiStruct (Altair), NX Nastran (Siemens), Permas (Intes), Fire (AVL), Moldflow (Autodesk), AMLS, FastFRS (CDH), ...
- Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream
- Open-source CFD OpenFOAM is available on GPUs today with many options. Commercial options: FluiDyna, Vratis; open-source options: Cufflink, Symscape ofgpu, RAS, etc.

GPUs and Distributed Cluster Computing
[Diagram] The geometry is decomposed: partitions 1-4 are placed on independent cluster nodes N1-N4, each equipped with a GPU G1-G4. The nodes run distributed-parallel using MPI, while each GPU provides shared-memory parallelism (OpenMP-style) underneath the distributed level. Execution is on CPU + GPU, and the partition results are assembled into the global solution.
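To make this two-level pattern concrete, here is a minimal, generic sketch (not vendor code) of how a distributed solver binds one MPI rank per partition to one GPU per node. The use of mpi4py, the 4-GPU-per-node count, and the environment-variable binding are all illustrative assumptions.

```python
# Illustrative sketch of the slide's two-level pattern: MPI ranks own mesh
# partitions (N1..N4); each rank binds to one GPU (G1..G4) on its node.
# Assumes mpi4py is installed and one process is launched per partition.
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

GPUS_PER_NODE = 4                    # assumption: 4 GPUs per node, as drawn
local_rank = rank % GPUS_PER_NODE    # assumes ranks are packed node by node

# Restrict this rank to a single device before any GPU library initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

# ... each rank now factorizes/solves its partition on its GPU, while MPI
# handles the halo exchanges that assemble the global solution.
comm.Barrier()
```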

CAE Priority for ISV Software Development on GPUs
[Chart] The slide ranks four CAE application groups by GPU development priority (#1-#4; the rank-to-group mapping is garbled in this transcription):
- Implicit CSM: ANSYS / ANSYS Mechanical, SIMULIA / Abaqus/Standard, MSC Software / MSC Nastran, MSC Software / Marc, LSTC / LS-DYNA implicit, Altair / RADIOSS Bulk, Siemens / NX Nastran, Autodesk / Mechanical
- CFD: ANSYS / ANSYS Fluent, OpenFOAM (various ISVs), CD-adapco / STAR-CCM+, Autodesk Simulation CFD, ESI / CFD-ACE+, SIMULIA / Abaqus/CFD
- Explicit CSM: LSTC / LS-DYNA, SIMULIA / Abaqus/Explicit, Altair / RADIOSS, ESI / PAM-CRASH
- Also listed: ANSYS / ANSYS Mechanical, Altair / RADIOSS, Altair / AcuSolve (CFD), Autodesk / Moldflow

Computational Structural Mechanics
ANSYS Mechanical

ANSYS and NVIDIA Collaboration Roadmap
- 13.0 (Dec 2010). ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers. ANSYS EM: ANSYS Nexxim.
- 14.0 (Dec 2011). ANSYS Mechanical: Distributed ANSYS, multi-node support. ANSYS Fluent: radiation heat transfer (beta). ANSYS EM: ANSYS Nexxim.
- 14.5 (Nov 2012). ANSYS Mechanical: multi-GPU support, hybrid PCG, Kepler GPU support. ANSYS Fluent: radiation HT, GPU AMG solver (beta), single GPU. ANSYS EM: ANSYS Nexxim.
- 15.0 (Dec 2013). ANSYS Mechanical: CUDA 5, Kepler tuning. ANSYS Fluent: multi-GPU AMG solver, CUDA 5, Kepler tuning. ANSYS EM: ANSYS Nexxim, ANSYS HFSS (transient).

ANSYS Mechanical 15.0 on Tesla GPUs
[Chart] ANSYS Mechanical jobs/day on the V14sp-5 model (higher is better). Simulation productivity with an HPC license (2 CPU cores): 93 jobs/day on CPU only, 324 with a Tesla K20 (3.5X), 363 with a Tesla K40 (3.9X). Simulation productivity with an HPC Pack (8 CPU cores): 275 jobs/day on CPU only, 576 with 7 CPU cores + Tesla K20 (2.1X), 600 with 7 CPU cores + Tesla K40 (2.2X).
Model: turbine geometry, 2,100,000 DOF, SOLID187 finite elements, static nonlinear; Distributed ANSYS 15.0, direct sparse solver. Hardware: Intel Xeon E5-2697 v2, 2.7 GHz; Tesla K20 and Tesla K40 (with boost clocks).

Considerations for ANSYS Mechanical on GPUs
- Problems with high solver workloads benefit the most from GPUs; these are characterized by both high DOF counts and high factorization requirements
- Models with solid elements and at least 500K DOF see good speedups
- Performance is better in DMP mode than in SMP mode
- GPU and system memory both play important roles in performance
- Sparse solver: if the model exceeds 5M DOF, either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB of memory (e.g., Tesla K40); see the sizing sketch below
- PCG/JCG solver: the memory-saving (MSAVE) option must be turned off to enable GPUs; models with a lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
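The sparse-solver sizing guidance above can be condensed into a few lines. This is a hypothetical helper reflecting only the thresholds quoted on the slide, not an official ANSYS tool.

```python
# Minimal sketch of the slide's sizing guidance for the direct sparse solver
# (thresholds taken from the bullets above; function name is an assumption).
def suggest_gpu_config(dof):
    """Suggest a GPU setup for ANSYS Mechanical's direct sparse solver."""
    if dof < 500_000:
        return "Solver workload may be too small for a meaningful GPU speedup"
    if dof <= 5_000_000:
        return "A single 5-6 GB GPU (e.g. Tesla K20/K20X) is typically enough"
    return "Add a second 5-6 GB GPU, or use one 12 GB GPU (e.g. Tesla K40)"

print(suggest_gpu_config(2_100_000))   # the V14sp-5 turbine model above
```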

Computational Structural Mechanics
Abaqus/Standard

SIMULIA and Abaqus GPU Release Progression
- Abaqus 6.11 (June 2011): direct sparse solver accelerated on the GPU; single-GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)
- Abaqus 6.12 (June 2012): multiple GPUs per node; multi-node DMP clusters; flexibility to run jobs on specific GPUs; Fermi GPUs, plus Kepler via hotfix (since November 2012)
- Abaqus 6.13 (June 2013): unsymmetric sparse solver on the GPU; official Kepler support (Tesla K20/K20X/K40)

Rolls-Royce: Abaqus 3.5x Speedup with 5M DOF (Sandy Bridge + Tesla K20X, Single Server)
[Chart] Elapsed time in seconds (0-20,000) and speedup relative to 8 cores (1x): approximately 2.1x for 8 cores + 1 GPU, 2.4x for 8 cores + 2 GPUs, and 3.5x for 16 cores + 2 GPUs.
Model: 4.71M DOF (equations), 77 TFLOPs; nonlinear static (6 steps); direct sparse solver, 100 GB memory. Server: 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2.

Rolls-Royce: Abaqus Speedups on an HPC Cluster (Sandy Bridge + Tesla K20X, up to 4 Servers)
[Chart] Elapsed time in seconds (0-9,000). Adding 2 Tesla K20X per server speeds up the run by roughly 1.8x-2.2x over CPU-only at each scale: 24 cores vs. 24 cores + 4 GPUs (2 servers, about 2.2x), 36 cores vs. 36 cores + 6 GPUs (3 servers, about 1.9x), and 48 cores vs. 48 cores + 8 GPUs (4 servers, about 1.8x).
Model: 4.71M DOF (equations), 77 TFLOPs; nonlinear static (6 steps); direct sparse solver, 100 GB memory. Servers: 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2.

Abaqus/Standard: 15% Gain from K20X to K40
[Chart] Across the benchmark set, the K40 delivers on average 15% higher speedups than the K20X: reported speedup ranges move from 1.9x-4.1x (K20X) to 2.1x-4.8x (K40) in one panel, and from 1.5x-2.5x to 1.7x-2.9x in the other.

Abaqus 6.13-DEV Scaling on a Tesla GPU Cluster
[Chart] PSG cluster: Sandy Bridge nodes with 2x E5-2670 (8-core) 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, QDR InfiniBand, CUDA 5.

Abaqus Licensing in a Node and Across a Cluster
[Table] Cores vs. license tokens:
Cores:  1  2  3  4  5  6   7   8 (1 CPU)  9   10  11  12  13  14  15  16 (2 CPUs)
Tokens: 5  6  7  8  9  10  11  12         12  13  13  14  14  15  15  16
Cluster examples (each GPU costs about one extra token):
- 2 nodes (2x 16 cores, 2x 2 GPUs): 32 cores = 21 tokens; 32 cores + 4 GPUs = 22 tokens
- 3 nodes (3x 16 cores, 3x 2 GPUs): 48 cores = 25 tokens; 48 cores + 6 GPUs = 26 tokens
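The cores-to-tokens curve in this table is consistent with the commonly cited Abaqus token formula, floor(5 x N^0.422), with each GPU counted as one additional "core". The sketch below reproduces the table and the cluster examples under that assumption; consult SIMULIA's licensing documentation for the authoritative rule.

```python
import math

def abaqus_tokens(cores, gpus=0):
    # Assumption: tokens = floor(5 * N^0.422), with each GPU counted as one
    # extra core. This reproduces the table and cluster examples above, but
    # it is an inference from the slide, not SIMULIA documentation.
    n = cores + gpus
    return math.floor(5 * n ** 0.422)

print(abaqus_tokens(16))        # 16 tokens (one 16-core node)
print(abaqus_tokens(32))        # 21 tokens (2 nodes, 32 cores)
print(abaqus_tokens(32, 4))     # 22 tokens (32 cores + 4 GPUs)
print(abaqus_tokens(48, 6))     # 26 tokens (48 cores + 6 GPUs)
```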

Abaqus 6.12 Power consumption in a node

Computational Structural Mechanics
MSC Nastran

MSC Nastran 2013
- The Nastran direct equation solver is GPU-accelerated: sparse direct factorization (MSCLDL, MSCLU); real, complex, symmetric, and unsymmetric matrices
- Handles very large fronts with minimal use of pinned host memory
- Lowest-granularity GPU implementation of a sparse direct solver; solves unlimited sparse matrix sizes
- Impacts several solution sequences: high impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)
- Multi-GPU support on both Linux and Windows: with DMP > 1, multiple fronts are factorized concurrently on multiple GPUs, one GPU per matrix domain
- Supported NVIDIA GPUs include Tesla 20-series, Tesla K20/K20X, Tesla K40, Quadro 6000; CUDA 5 and later

MSC Nastran 2013: SMP GPU Acceleration of SOL101 and SOL103
[Chart] Speedup over serial (higher is better) for SOL101 (2.4M rows, 42K front) and SOL103 (2.6M rows, 18K front): serial = 1X, 4 cores about 1.9X, 4 cores + 1 GPU about 2.8X (SOL101) and 2.7X (SOL103).
The Lanczos solver (SOL103) comprises sparse matrix factorization, iteration on a block of vectors (solve), and orthogonalization of vectors.
Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory.

MSC Nastran 2013: Coupled Structural-Acoustic Simulation with SOL108 (European Auto OEM)
[Chart] Elapsed time in minutes (lower is better). Model: 710K nodes, 3.83M elements, 100 frequency increments (FREQ1), direct sparse solver. Speedups over the serial run (1X) reach 2.7X, 4.8X, 5.2X, and 5.5X across the intermediate configurations (1 core + 1 GPU, 4 cores SMP, 4 cores + 1 GPU, 8 cores DMP 2), with the best result, 11.1X, from 8 cores + 2 GPUs (DMP 2).
Server node: Sandy Bridge 2.6 GHz, 2x 8 cores, 2x Tesla K20X GPUs, 128 GB memory.

Computational Structural Mechanics
MSC Marc

MARC 2013

Computational Fluid Dynamics
ANSYS Fluent

ANSYS and NVIDIA Collaboration Roadmap (repeat of the roadmap slide shown in the ANSYS Mechanical section above)

How to Enable NVIDIA GPUs in ANSYS Fluent
Command line (Linux): fluent 3ddp -g -ssh -t2 -gpgpu 1 -i journal.jou
Cluster specification:
- nprocs: total number of Fluent solver processes
- M: number of machines
- ngpgpus: number of GPUs per machine
Requirement 1: nprocs mod M = 0 (same number of solver processes on each machine)
Requirement 2: (nprocs / M) mod ngpgpus = 0 (the number of processes per machine must be an integer multiple of the GPU count)
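The two requirements are easy to check programmatically. The helper below is a hypothetical illustration (not part of Fluent) using the nprocs, M, and ngpgpus definitions above.

```python
# Sketch of the two launch requirements stated above (hypothetical helper).
def valid_gpu_layout(nprocs, m, ngpgpus):
    if nprocs % m != 0:                    # Requirement 1: same number of
        return False                       # solver processes per machine
    per_machine = nprocs // m
    return per_machine % ngpgpus == 0      # Requirement 2: processes per machine
                                           # must be a multiple of the GPU count

print(valid_gpu_layout(16, 1, 2))   # True : 16 processes, 1 node, 2 GPUs
print(valid_gpu_layout(14, 2, 1))   # True : 7 processes per node, 1 GPU each
print(valid_gpu_layout(15, 2, 2))   # False: 15 processes will not split evenly
```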

Cluster Specification Examples
[Diagram] Single-node configurations: e.g., 16 MPI processes with 2 GPUs; 8 + 8 MPI processes with 4 GPUs; 5 + 5 + 5 MPI processes with 3 GPUs. Multi-node configurations: e.g., four nodes with 8 MPI processes and 2 GPUs each, or four nodes with 7 MPI processes and 1 GPU each.
Note: the problem must fit in GPU memory for the solution to proceed.

Considerations for ANSYS Fluent on GPUs
- GPUs accelerate the AMG solver of the CFD analysis; fine meshes and low-dissipation problems have a high %AMG, and the coupled solution scheme spends 65% of its time in AMG on average
- In many cases, pressure-based coupled solvers offer faster convergence than segregated solvers (problem-dependent)
- The system matrix must fit in GPU memory: for coupled PBNS, each 1 million cells needs about 4 GB of GPU memory, so high-memory GPUs such as the Tesla K40 or Quadro K6000 are ideal
- Performance is better at lower CPU core counts; a ratio of 4 CPU cores per GPU is recommended (see the sizing sketch below)
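A back-of-envelope sizing helper based on the two rules of thumb above (about 4 GB of GPU memory per million cells for coupled PBNS, and 4 CPU cores per GPU). The function name and rounding choices are illustrative assumptions, not ANSYS guidance.

```python
# Rough sizing from the slide's rules of thumb (assumptions noted above).
def fluent_gpu_estimate(million_cells, cpu_cores):
    mem_gb = 4.0 * million_cells        # system matrix must fit in GPU memory
    gpus = max(1, cpu_cores // 4)       # recommended 4 CPU cores per GPU
    return mem_gb, gpus

mem, gpus = fluent_gpu_estimate(3, 16)  # hypothetical 3M-cell coupled PBNS case
print(f"~{mem:.0f} GB of total GPU memory, spread across {gpus} GPUs")
```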

NVIDIA GPU Solution Fit for ANSYS Fluent
[Decision tree] For a CFD analysis: if it is not single-phase and flow-dominant, it is not ideal for GPUs. If it is, and you run the pressure-based coupled solver on a steady-state analysis, it is a best fit for GPUs. If you use the segregated solver, consider switching to the pressure-based coupled solver for better performance (faster convergence) and further speedups with GPUs.
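The same decision tree written out as a small function; this is just a restatement of the flowchart, with the branches not visible in the slide resolved conservatively toward the "consider switching" advice.

```python
# The solution-fit flowchart above as code (illustrative sketch only).
def gpu_solution_fit(single_phase_and_flow_dominant, coupled_solver, steady_state):
    if not single_phase_and_flow_dominant:
        return "Not ideal for GPUs"
    if coupled_solver and steady_state:
        return "Best fit for GPUs"
    return ("Consider switching to the pressure-based coupled solver for "
            "faster convergence and further speedups with GPUs")

print(gpu_solution_fit(True, coupled_solver=True, steady_state=True))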

ANSYS Fluent GPU Performance for Large Cases
Better speedups on larger and harder-to-solve problems.
[Chart] Fluent solution time in seconds per iteration (lower is better): truck body model, 14 million cells, 2X speedup for 36 CPU cores + 12 GPUs vs. 36 CPU cores; 111-million-cell case, 1.4X speedup for 144 CPU cores + 48 GPUs vs. 144 CPU cores.
ANSYS Fluent 15.0 performance, results by NVIDIA, Dec 2013. External aerodynamics; steady, k-epsilon turbulence; double-precision solver. CPU: Intel Xeon E5-2667, 12 cores per node; GPU: Tesla K40, 4 per node.

GPU Acceleration of Water Jacket Analysis
ANSYS Fluent 15.0 performance with the pressure-based coupled solver (lower is better; times are for 20 time steps).
[Chart] AMG solver time: 4,557 s (CPU only) vs. 775 s (CPU + GPU), a 5.9x speedup. Total solution time: 6,391 s vs. 2,520 s, a 2.5x speedup.
Water jacket model: unsteady RANS, fluid is water, internal flow. CPU: Intel Xeon E5-2680, 8 cores; GPU: 2x Tesla K40.

ANSYS Fluent GPU Study on Productivity Gains
[Chart] ANSYS Fluent jobs per day (higher is better), truck body model: 14M mixed cells; steady, k-epsilon turbulence; coupled PBNS, double precision; total solution times; all results fully converged. CPU: AMG F-cycle; GPU: FGMRES with AMG preconditioner. ANSYS Fluent 15.0 Preview 3, results by NVIDIA, Sep 2013.
- Same solution times: 64 cores (4 nodes x 2 CPUs) vs. 32 cores + 8 GPUs (2 nodes x 2 CPUs, 4 GPUs per node), at 16 jobs per day in both cases
- The GPU configuration frees up 32 CPU cores and HPC licenses for additional jobs
- Approximately 56% increase in overall productivity for a 25% increase in cost (worked out in the sketch below)
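The arithmetic behind the 56% figure, assuming the freed 32 cores can run roughly 9 additional jobs per day. That per-node job rate is an inference; the slide states only the totals and the percentage.

```python
# Productivity arithmetic for the slide's claim (extra-job count is inferred).
jobs_64_cores      = 16     # jobs/day on 4 nodes, 64 cores (from the chart)
jobs_32c_8gpu      = 16     # same throughput on 2 nodes + 8 GPUs (from the chart)
jobs_freed_32cores = 9      # assumption: extra jobs/day on the 2 freed nodes
total = jobs_32c_8gpu + jobs_freed_32cores
print(total / jobs_64_cores - 1)    # ~0.56 -> the ~56% productivity gain
```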

ANSYS 15.0: New HPC Licenses for GPUs
ANSYS 15.0 treats each GPU socket as a CPU core, which significantly increases the simulation productivity of your HPC licenses. Only 1 HPC task is needed to enable a GPU. All ANSYS HPC products unlock GPUs in 15.0, including HPC, HPC Pack, HPC Workgroup, and HPC Enterprise.

Computational Fluid Dynamics
OpenFOAM

NVIDIA GPU Strategy for OpenFOAM
- Provide technical support for GPU solver developments: FluiDyna (implementation of NVIDIA's AMG), Vratis, and PARALUTION; AMG development by ISP of the Russian Academy of Sciences (A. Monakov); Cufflink development by WUSTL, now Engys North America (D. Combest)
- Invest in strategic alliances with OpenFOAM developers: ESI and OpenCFD Foundation (H. Weller, M. Salari); Wikki and the OpenFOAM-extend community (H. Jasak)
- Conduct performance studies and customer evaluations; collaborate with developers, customers, and OEMs (Dell, SGI, HP, etc.)

Culises Library: Concept and Features
Culises = CUDA Library for Solving Linear Equation Systems (see www.culises.com). It couples to a simulation tool such as OpenFOAM.
- State-of-the-art solvers for the solution of linear systems; multi-GPU and multi-node capable; single or double precision
- Krylov subspace methods: CG, BiCGStab, GMRES for symmetric/non-symmetric matrices
- Preconditioning options: Jacobi (diagonal), incomplete Cholesky (IC), incomplete LU (ILU), algebraic multigrid (AMG)
- Stand-alone multigrid method: algebraic aggregation and classical coarsening; a multitude of smoothers (Jacobi, Gauss-Seidel, ILU, etc.)
- Flexible interfaces for arbitrary applications, e.g., the established coupling with OpenFOAM
Source: GPU Acceleration of CFD in Industrial Applications using Culises and aeroFluidX, GTC 2014
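To make the solver menu concrete, here is a minimal Jacobi-preconditioned conjugate gradient in NumPy, the same algorithm family Culises runs on the GPU. This is a CPU-only teaching sketch, not Culises code.

```python
# Minimal Jacobi-preconditioned CG for a symmetric positive-definite system
# (illustrates the CG + diagonal-preconditioner combination listed above).
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    M_inv = 1.0 / np.diag(A)          # Jacobi (diagonal) preconditioner
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # small SPD test matrix
print(pcg(A, np.array([1.0, 2.0])))      # ~[0.0909, 0.6364]
```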

Summary: Hybrid Approach (simulation tool, e.g., OpenFOAM, coupled to Culises)
Advantages:
- Universally applicable (couples to the simulation tool of choice)
- Full availability of existing flow models; easy or no validation needed
- An unsteady approach suits the hybrid better, due to large linear-solver times
Disadvantages:
- Hybrid CPU-GPU operation produces overhead
- If the solution of the linear system is not dominant, the application speedup can be limited
Source: GPU Acceleration of CFD in Industrial Applications using Culises and aeroFluidX, GTC 2014

aeroFluidX: an Extension of the Hybrid Approach
Where the hybrid approach keeps the CPU flow solver (e.g., OpenFOAM) and offloads only the linear solver, aeroFluidX is a GPU implementation of the solver pipeline itself: preprocessing, finite-volume (FV) discretization, linear solve (Culises), and postprocessing.
- Ports the discretization of the equations to the GPU: the finite-volume discretization module runs on the GPU
- Allows direct coupling to Culises, with zero overhead from CPU-GPU-CPU memory transfers and matrix-format conversion
- Solving the momentum equations on the GPU is also beneficial
- The OpenFOAM environment is supported, enabling a plug-in solution for OpenFOAM customers; communication with other input/output file formats is also possible
Source: GPU Acceleration of CFD in Industrial Applications using Culises and aeroFluidX, GTC 2014

aeroFluidX: Cavity Flow
Setup: CPU Intel E5-2650 (all 8 cores), GPU NVIDIA K40; 4M unstructured grid cells; 100 SIMPLE steps with:
- OpenFOAM (OF): pressure GAMG, velocity Gauss-Seidel
- OpenFOAM + Culises (OFC): pressure Culises AMGPCG (2.4x), velocity Gauss-Seidel
- aeroFluidX + Culises (AFXC): pressure Culises AMGPCG, velocity Culises Jacobi
[Chart] Normalized computing time, split into assembly of all linear systems and solution of all linear systems (pressure and velocity): the linear solve speeds up by about 2.1x (OFC) and 2.2x (AFXC), while assembly stays at 1x until aeroFluidX moves it to the GPU (about 2x). Total speedup: OF 1x, OFC 1.62x, AFXC 2.20x.
Source: GPU Acceleration of CFD in Industrial Applications using Culises and aeroFluidX, GTC 2014

PARALUTION
- C++ library providing various sparse iterative solvers and preconditioners
- Contains Krylov subspace solvers (CR, CG, BiCGStab, GMRES, IDR), multigrid (GMG, AMG), deflated PCG, etc.
- Multi-core/many-core CPU and GPU support
- Allows seamless integration with other scientific software
- Open source, released under GPL v3 (www.paralution.com)

PARALUTION OpenFOAM Plugin
The OpenFOAM plugin will be released soon.

Computational Electromagnetics
CST Studio Suite

CST: Company and Product Overview
CST AG is one of the two largest suppliers of 3D EM simulation software. CST STUDIO SUITE is an integrated solution for 3D EM simulation: it includes a parametric modeler, more than 20 solvers, and integrated post-processing. Currently, three solvers support GPU computing.
CST - COMPUTER SIMULATION TECHNOLOGY | www.cst.com

New GPU Cards: Quadro K6000 / Tesla K40
The Quadro K6000 is the new high-end graphics adapter of the Kepler series, whereas the Tesla K40 is the new high-end computing device. CST STUDIO SUITE 2013 supports both cards for GPU computing as of service pack 5.
[Chart] Speedup of the Quadro K6000/Tesla K40 vs. the K20: the K6000/K40 is about 30-35% faster than the K20, and its 12 GB of onboard RAM allows for larger model sizes.
CST - COMPUTER SIMULATION TECHNOLOGY | www.cst.com

GPU Computing Performance
GPU computing performance has been improved for CST STUDIO SUITE 2014, as CPU and GPU resources are now used in parallel.
[Chart] Speedup of the solver loop (scale 0-18) vs. number of Tesla K40 GPUs (1-4), comparing CST STUDIO SUITE 2013 and 2014 against the CPU baseline. Benchmark system: dual Xeon E5-2630 v2 (Ivy Bridge EP) processors and four Tesla K40 cards; the model has 80 million mesh cells.
CST - COMPUTER SIMULATION TECHNOLOGY | www.cst.com

MPI Computing Performance
CST STUDIO SUITE offers native support for high-speed, low-latency networks.
[Chart] Speedup of the solver loop on an MPI cluster (CST STUDIO SUITE frontend plus cluster nodes) over 1-4 nodes, comparing CPU-only, 2 GPUs (K20) per node, and 4 GPUs (K20) per node. Note: a GPU-accelerated cluster system requires a high-speed network in order to perform well!
Benchmark model features: open boundaries; dispersive and lossy materials. The base model size is 80 million cells, and the problem size is scaled up linearly with the number of cluster nodes (i.e., weak scaling). Hardware: dual Xeon E5-2650 processors, 128 GB RAM per node (1600 MHz), InfiniBand QDR interconnect (40 Gb/s).
CST - COMPUTER SIMULATION TECHNOLOGY | www.cst.com

Conclusion
- GPUs provide significant performance acceleration for solver-intensive, large jobs: shorter product engineering cycles (faster time-to-market) with improved product quality, lower energy consumption in the CAE process, and better total cost of ownership (TCO)
- GPUs supply a second level of parallelism that preserves the costly MPI investment
- GPU acceleration is contributing to growth in emerging CAE: new ISV developments in particle-based CFD (LBM, SPH, etc.), and rapid growth in GPU adoption across a range of CEM applications
- Simulations recently considered intractable are now possible: large eddy simulation (LES) with a high degree of arithmetic intensity, and parameter optimization with a greatly increased number of jobs

The Visual Computing Company
Axel Koehler, akoehler@nvidia.com
NVIDIA, the NVIDIA logo, GeForce, Quadro, Tegra, Tesla, GeForce Experience, GRID, GTX, Kepler, ShadowPlay, GameStream, SHIELD, and The Way It's Meant To Be Played are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. © 2014 NVIDIA Corporation. All rights reserved.
