
Speeding up a Finite Element Computation on GPU
Nelson Inoue

Summary
– Introduction
– Finite element implementation on GPU
– Results
– Conclusions

University and Researchers
Pontifical Catholic University of Rio de Janeiro – PUC-Rio
Group of Technology in Petroleum Engineering – GTEP
Research team:
– PhD Sergio Fontoura, Leader Researcher
– PhD Nelson Inoue, Senior Researcher
– PhD Carlos Emmanuel, Researcher
– MSc Guilherme Righetto, Researcher
– MSc Rafael Albuquerque, Researcher

Introduction
– Research & Development (R&D) project with Petrobras
– The project began in 2010
– The subject of the project is Reservoir Geomechanics
– There is great interest in this subject from the oil and gas industry
– The subject is still little researched

Introduction
What is Reservoir Geomechanics?
– The branch of petroleum engineering that studies the coupling between the problems of fluid flow and rock deformation (stress analysis)
Hydromechanical coupling:
– Oil production causes rock deformation
– Rock deformation contributes to oil production

Motivation
Geomechanical effects during reservoir production:
1. Surface subsidence
2. Bedding-parallel slip
3. Fault reactivation
4. Caprock integrity
5. Reservoir compaction

Challenge
Evaluate geomechanical effects in a real reservoir. Two major challenges must be overcome:
1. Using a reliable coupling scheme between fluid flow and stress analysis
2. Speeding up the stress analysis (Finite Element Method)
The finite element analysis consumes most of the simulation time.

Hydromechanical Coupling
Theoretical approach.
[Figure: coupling program flowchart]

Finite Element Method
– Partial differential equations arise in the mathematical modelling of many engineering problems
– The analytical (exact) solution is often very complicated
– Alternative: numerical solution, e.g. the finite element method, finite difference method, finite volume method, boundary element method, discrete element method, etc.

Finite Element Method
– The finite element method (FEM) is widely applied in stress analysis
– The domain is an assembly of finite elements
[Figure: continuous domain discretized into finite elements]

CHRONOS: FE Program
Chronos has been implemented on the GPU.
– CETUS: a computer with 4 GPUs (GeForce GTX Titan)
– Motivation: to reduce the simulation time of the hydromechanical analysis
– Why use a GPU? Much more processing power: 2880 cores per GPU versus 4-8 cores on a CPU

Motivation
GPU features (CUDA C Programming Guide):
– Highly parallel, multithreaded, manycore processor
– Tremendous computational horsepower and very high memory bandwidth
[Figure: floating-point operations per second and memory bandwidth, GPU vs. CPU]

Our Implementation
– GPUs offer good performance
– We have developed an optimized, parallel finite element program on the GPU
– The CUDA programming language is used to implement the finite element code
– Implemented on the GPU:
  – Assembly of the stiffness matrix
  – Solution of the system of linear equations
  – Evaluation of the strain state
  – Evaluation of the stress state

Global Memory Access on GPU
Getting maximum performance on the GPU requires coalesced access:
– Sequential/aligned access: good
– Strided access: not so good
– Random access: bad
Memory accesses are fully coalesced as long as all threads in a warp access the same relative address.

Development on CPU
The assembly of the global stiffness matrix in the conventional FEM (simple 1D problem). The element stiffness matrix of element $e$ ($e = 1, 2, 3$) is

$$k^{(e)} = \begin{bmatrix} k_{11}^{(e)} & k_{12}^{(e)} \\ k_{21}^{(e)} & k_{22}^{(e)} \end{bmatrix}$$

a) Real model; b) model discretization into nodes 1-4; c) three finite elements. The continuous model is discretized by elements.

Development on CPU
In terms of the CPU implementation: for $i = 1$ to $numel = 3$, evaluate the element stiffness matrix $k^{(i)}$ and add it into the global stiffness matrix $k_{global}$.

$i = 1$:
$$k_{global} = \begin{bmatrix} k_{11}^{(1)} & k_{12}^{(1)} & 0 & 0 \\ k_{21}^{(1)} & k_{22}^{(1)} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

$i = 2$:
$$k_{global} = \begin{bmatrix} k_{11}^{(1)} & k_{12}^{(1)} & 0 & 0 \\ k_{21}^{(1)} & k_{22}^{(1)} + k_{11}^{(2)} & k_{12}^{(2)} & 0 \\ 0 & k_{21}^{(2)} & k_{22}^{(2)} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

$i = 3$:
$$k_{global} = \begin{bmatrix} k_{11}^{(1)} & k_{12}^{(1)} & 0 & 0 \\ k_{21}^{(1)} & k_{22}^{(1)} + k_{11}^{(2)} & k_{12}^{(2)} & 0 \\ 0 & k_{21}^{(2)} & k_{22}^{(2)} + k_{11}^{(3)} & k_{12}^{(3)} \\ 0 & 0 & k_{21}^{(3)} & k_{22}^{(3)} \end{bmatrix}$$

The storage in memory: each element's entries are scattered across distant locations of the stored matrix, so the memory access is not coalesced.

Development on GPU
The assembly of the global stiffness matrix on the GPU (simple 1D problem). Each row of the global stiffness matrix is assembled per node, storing only the band (left neighbor, diagonal, right neighbor):

Node 1: $[k_{row\,1}] = [\,0 \;\; k_{11}^{(1)} \;\; k_{12}^{(1)}\,]$
Node 2: $[k_{row\,2}] = [\,k_{21}^{(1)} \;\; k_{22}^{(1)} + k_{11}^{(2)} \;\; k_{12}^{(2)}\,]$
Node 3: $[k_{row\,3}] = [\,k_{21}^{(2)} \;\; k_{22}^{(2)} + k_{11}^{(3)} \;\; k_{12}^{(3)}\,]$
Node 4: $[k_{row\,4}] = [\,k_{21}^{(3)} \;\; k_{22}^{(3)} \;\; 0\,]$

Four finite element nodes: the continuous model is discretized by nodes.

Development on GPU
In terms of the GPU implementation, one thread assembles one row:

Thread 1: $[k_{row\,1}] = [\,0 \;\; k_{11}^{(1)} \;\; k_{12}^{(1)}\,]$
Thread 2: $[k_{row\,2}] = [\,k_{21}^{(1)} \;\; k_{22}^{(1)} + k_{11}^{(2)} \;\; k_{12}^{(2)}\,]$
Thread 3: $[k_{row\,3}] = [\,k_{21}^{(2)} \;\; k_{22}^{(2)} + k_{11}^{(3)} \;\; k_{12}^{(3)}\,]$

All the threads perform the same calculation.

The storage in memory, column 1 of $k_{global}$: $[\,0 \;\; k_{21}^{(1)} \;\; k_{21}^{(2)} \;\; k_{21}^{(3)}\,]^T$, written by consecutive threads. The memory access is sequential and aligned.

Development on GPU
In terms of the GPU implementation, the threads then write column 2 of $k_{global}$ (the diagonal entries):

Column 2: $[\,k_{11}^{(1)} \;\; k_{22}^{(1)} + k_{11}^{(2)} \;\; k_{22}^{(2)} + k_{11}^{(3)} \;\; k_{22}^{(3)}\,]^T$

Consecutive threads write consecutive memory locations: the memory access is coalesced.

Development on GPU
Solution of the system of linear equations $Ax = b$:
– Direct or iterative solvers
– $A$: stiffness matrix; $x$: nodal displacement vector (unknowns); $b$: nodal force vector
– $A$ is symmetric and positive-definite
The Conjugate Gradient Method was chosen:
– Iterative algorithm
– Parallelizable on the GPU
– The operations of the conjugate gradient algorithm are well suited to GPU implementation

Development on GPU
Additional remarks:
– The stiffness matrix K is a sparse matrix: most of its entries are zero
– Assembling the stiffness matrix by nodes yields a compressed stiffness matrix
– The bottleneck is the compressed matrix-vector multiplication needed to map the compressed stiffness matrix

Development on GPU
Conjugate Gradient Method on GPU:
– Two operations of the Conjugate Gradient Method are shown next
– The algorithm has been implemented on 4 GPUs
– Each GPU receives a quarter of the stiffness matrix K and of the nodal force vector f
[Figure: K (stiffness matrix, 128 columns) and f (nodal force vector) partitioned across 4 GPUs]

Development on GPU
Conjugate Gradient Method on GPU: vector-vector multiplication $\delta_{new} = r^T d$
a) each GPU multiplies its portion of $r^T$ by its portion of $d$;
b) a reduction on each GPU produces a partial value $\delta_{new}^{(g)}$;
c) the partial values are gathered with cudaMemcpyPeer and summed: $\delta_{new} = \delta_{new}^{(1)} + \delta_{new}^{(2)} + \delta_{new}^{(3)} + \delta_{new}^{(4)}$

Development on GPU
Conjugate Gradient Method on GPU: matrix-vector multiplication $q = A d$
a) the pieces $d_1, d_2, d_3, d_4$ of $d$ are exchanged among the GPUs with cudaMemcpyPeer, so each GPU holds the full vector $d$;
b) each GPU computes $q = A \times d$ for its block of rows of $A$

Development on GPU
Conjugate Gradient Method on GPU: matrix-vector multiplication $q = A d$ (continued)
c) each thread multiplies entries of a row of $A$ by $d$, accumulating the partial products in an auxiliary vector $V_{aux}$;
d) a reduction in shared memory collapses $V_{aux}$ into the entries of $q$

Previous Results
Linear equation solution: conjugate gradient, optimized GPU vs. naive CPU algorithm (2010)

TABLE 1: Hardware Configuration
Device | Type | Number of cores | Memory size
GPU | GeForce GTX 285, 1.476 GHz | 240 | 1 GB global memory
CPU | Intel Xeon X3450, 2.67 GHz | 4 | 8 GB

TABLE 2: Results, simulation time (s)
Number of elements | CPU | 8600 GT | 9800 GTX | GTX 285
10,000 | 1.26 | 1.21 | 0.37 | 0.36 (3.5x)
40,000 | 10.90 | 9.05 | 0.99 | 0.61 (17.87x)
250,000 | 130.51 | 36.31 | 3.13 | 5.38 (24.25x)

Previous Results
Assembly of the stiffness matrix: optimized GPU vs. naive CPU algorithm (2011)

TABLE 3: Hardware Configuration
Device | Type | Number of cores | Memory size
GPU | GeForce GTX 460M, 1.35 GHz | 192 | 1 GB global memory
CPU | Intel Core i7-740QM, 1.73 GHz | 4 | 6 GB

TABLE 4: Results, simulation time (ms)
Number of nodes | CPU | GTX 460M
6,400 | 82.28 | 0.86 (96x)
8,100 | 122.77 | 1.02 (120x)
10,000 | 323.20 | 1.24 (261x)

Current Results
Finite element mesh: four discretizations of an oil and gas reservoir (81,000 cells), with 200,000; 500,000; 1,000,000; and 2,000,000 elements.

Current Results
The time spent in each operation in Chronos

TABLE 5: Time of each operation, per mesh size: time in s (share of total in %)
Operation | 200,000 el. | 500,000 el. | 1,000,000 el. | 2,000,000 el.
Reading of the input data | 0.390 (2.70) | 1.407 (3.75) | 2.253 (2.96) | 4.145 (2.70)
Preparation of the data | 0.985 (6.81) | 2.616 (6.97) | 5.600 (7.36) | 9.468 (6.17)
Assembly of the stiffness matrix | 0.001 (0.01) | 0.001 (0.00) | 0.001 (0.00) | 0.001 (0.00)
Solution of the system of linear equations | 7.375 (50.99) | 18.985 (50.59) | 37.841 (49.74) | 82.697 (53.93)
Evaluation of the strain state | 0.001 (0.01) | 0.001 (0.00) | 0.001 (0.00) | 0.001 (0.00)
Writing of the displacement field | 0.402 (2.78) | 0.950 (2.53) | 1.923 (2.53) | 3.521 (2.30)
Writing of the strain state | 5.311 (36.72) | 13.568 (36.15) | 28.463 (37.41) | 53.506 (34.89)
Total time | 14 (100) | 38 (100) | 76 (100) | 153 (100)

Current Results
The time spent in each operation in Chronos.
[Figure: chart of the operation times]

Current Results
Accuracy verification: Chronos vs. a well-known FE program (200,000 elements).
[Figure: comparison of results]

Current Results
Time comparison: Chronos vs. a well-known FE program

TABLE 6: Hardware Configuration
Device | Type | Number of cores | Memory size
4 x GPU | GeForce GTX Titan, 0.876 GHz | 2688 (each) | 6 GB global memory
CPU | Intel Core i7-4770, 3.40 GHz | 4 | 32 GB

TABLE 7: Results, simulation time (s)
Number of elements | Chronos (4 GPUs) | Well-known FE program | Performance improvement
200,000 | 21 | 516 (8.6 min) | 24.57x
500,000 | 43 | 3407 (56.78 min) | 79.23x
1,000,000 | 83 | insufficient memory | -
2,000,000 | 168 | insufficient memory | -

NVIDIA CUDA Research Center
Pontifical Catholic University of Rio de Janeiro is an NVIDIA CUDA Research Center.
[Figures: PUC-Rio homepage, CUDA Research Center logo, CUDA Research Center award letter]

Conclusions
– GPUs have shown great potential to speed up numerical analyses
– However, the speed-up is in general only achieved if programs or algorithms are implemented and optimized in a parallel way for GPUs

Acknowledgements
The authors would like to thank Petrobras for the financial support, and SIMULIA and CMG for providing the academic licenses for the Abaqus and Imex programs, respectively. We also thank NVIDIA for the opportunity to present our work at this conference.

Thank You
