Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched


Ahmad Abdelfattah (1), Marc Baboulin (2), Veselin Dobrev (3), Jack Dongarra (1,4), Chris Earl (3), Joel Falcou (2), Azzam Haidar (1), Ian Karlin (3), Tzanio Kolev (3), Ian Masliah (2), and Stan Tomov (1)

(1) Innovative Computing Laboratory, University of Tennessee, Knoxville
(2) University of Paris-Sud, France
(3) Lawrence Livermore National Laboratory, Livermore, CA, USA
(4) University of Manchester, Manchester, UK

SIAM Conference on Computational Science and Engineering (SIAM CSE'17)
Algorithms and Libraries for Tensor Contractions (MS47)
https://www.siam.org/meetings/cse17/
Atlanta, GA, U.S.A., February 26-March 3, 2017

Outline
- Introduction
- Tensors in numerical libraries
- Tensor formulation for high-order FEM
- Tensor contraction interfaces and code generation
- Algorithm design and tuning
- Performance
- Conclusions

Introduction

Numerous important applications can be expressed through tensors: high-order FEM simulations, signal processing, numerical linear algebra, numerical analysis, data mining, deep learning, graph analysis, neuroscience, and more.

The goal is to design:
- a high-performance package for tensor algebra;
- with built-in architecture awareness (GPU, Xeon Phi, multicore);
- and a user-friendly interface.

Tensors generalize relational data, e.g.:
- an item is a scalar (order 0);
- items form a vector (order 1);
- relations of pairs form a matrix (order 2);
- relations of 3-tuples form a 3-D array (order 3);
- relations of N-tuples form an N-D array (order N).

Examples: need for batched and/or tensor contraction routines in machine learning, e.g., convolutional neural networks (CNNs) used in computer vision.

The key computation is the convolution of filters F_n (feature detectors) with an input image D, producing output predictions (e.g., chicken 0.4, person 0.1, boat 0.3, dog 0.01). For every filter F_n and every channel, the computation of every output value O_{n,k} is a tensor contraction:

    O_{n,k} = sum_i D_{k,i} F_{n,i}

There is plenty of parallelism, but the operations are small and must be batched. With a data "reshape", the computation can be transformed into a batched GEMM (for efficiency; among other approaches).
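
To make the contraction concrete, here is a minimal sketch (our own illustration with hypothetical names, not the MAGMA or any framework API) that computes O_{n,k} = sum_i D_{k,i} F_{n,i} with explicit loops; with this layout the same computation is exactly one GEMM, O = F * D^T, so a batch of such small problems maps directly to a batched GEMM:

```cuda
#include <vector>
#include <cstddef>

// Naive contraction: O(n,k) = sum_i D(k,i) * F(n,i).
// N filters, K output values, I contraction length; row-major storage.
// Hypothetical helper for illustration only, not the MAGMA interface.
void contract_naive(std::size_t N, std::size_t K, std::size_t I,
                    const std::vector<double>& D,   // K x I
                    const std::vector<double>& F,   // N x I
                    std::vector<double>& O)         // N x K
{
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t k = 0; k < K; ++k) {
            double s = 0.0;
            for (std::size_t i = 0; i < I; ++i)
                s += D[k * I + i] * F[n * I + i];
            O[n * K + k] = s;   // identical to the GEMM O = F * D^T
        }
}
```

Because both operands are traversed with unit stride over i, the "reshape" costs no data movement here; a batched GEMM can then process many such small problems in a single launch.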

Examples: multi-physics problems need many small tensor contractions.

Collaboration with ORNL and the UTK physics department (Mike Guidry, Jay Billings, Ben Brock, Daniel Shyles, Andrew Belt). Many physical systems can be modeled by fluid dynamics plus a kinetic approximation. For example, in astrophysics, stiff equations must be integrated numerically:
- Implicitly: the standard approach, leading to the need for batched solvers (e.g., as in the XNet library);
- Explicitly: a new way to stabilize them with macro- plus microscopic equilibration, which needs batched tensor contractions of variable sizes.

[Figure: speedup of the solver for matrix size 150, comparing MKL, MA48, and MAGMA; additional acceleration through MAGMA Batched computation and CUDA streams gives a 3.7x speedup, for a 7x speedup overall.]

Reference: A. Haidar, S. Tomov, A. Abdelfattah, M. Guidry, J. Billings, and J. Dongarra, Optimisation Techniques Toward Accelerating Explicit Integration for Large Kinetic Networks, International Conference on Parallel Processing (ICPP 2016), Philadelphia, PA, USA, 2016.

Tensor abstractions and numerical dense linear algebra

A matrix A in tiled data layout can be viewed as a 4th-order tensor A_{i,j,m,n}, where (i,j) indexes the entry within a 64 x 64 tile and (m,n) indexes the tile in a 9 x 8 grid of tiles. A rank-64 update is then a tensor contraction on index k:

    for i = 0..63, j = 0..63, m = 1..8, n = 1..7:
        A_{i,j,m,n} -= sum_k A_{i,k,m,0} * A_{k,j,0,n}
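
The following sketch (our own illustration; the struct and function names are hypothetical, not the talk's library) spells out this tiled-layout view and the rank-64 update as plain loops:

```cuda
#include <vector>
#include <cstddef>

// A matrix in 64x64 tiled layout viewed as a 4th-order tensor A(i,j,m,n):
// tiles are stored contiguously and indexed by (m,n), column-major both
// within a tile and across the tile grid.
struct TiledMatrix {
    static constexpr int T = 64;              // tile dimension
    int mt, nt;                               // tile grid: mt x nt tiles
    std::vector<double> data;                 // mt*nt tiles of T*T entries

    TiledMatrix(int mt_, int nt_)
        : mt(mt_), nt(nt_), data(std::size_t(mt_) * nt_ * T * T) {}

    // Element (i,j) of tile (m,n).
    double& operator()(int i, int j, int m, int n) {
        return data[(std::size_t(n) * mt + m) * T * T + std::size_t(j) * T + i];
    }
};

// Rank-64 update as a contraction over k: for each trailing tile (m,n),
// A(:,:,m,n) -= A(:,:,m,0) * A(:,:,0,n) -- one small GEMM per tile.
void rank64_update(TiledMatrix& A) {
    constexpr int T = TiledMatrix::T;
    for (int m = 1; m < A.mt; ++m)
        for (int n = 1; n < A.nt; ++n)
            for (int i = 0; i < T; ++i)
                for (int j = 0; j < T; ++j) {
                    double s = 0.0;
                    for (int k = 0; k < T; ++k)
                        s += A(i, k, m, 0) * A(k, j, 0, n);
                    A(i, j, m, n) -= s;
                }
}
```

Each (m,n) iteration is an independent 64 x 64 x 64 GEMM, which is exactly the batched-GEMM pattern the talk targets.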

Tensor abstractions and numerical dense linear algebra: how to design it?

// Declare a 4th-order tensor A on the GPU
Tensor<64, 64, 9, 8, gpu_t> A;

// DSEL design using Einstein notation: a repeated
// index k means a summation/contraction.
// The range of the other indices is the full range, as
// given through the left assignment operand.
A(i, j, m:1..8, n:1..7) -= A(i,k,m,0) * A(k,j,0,n);

How to implement it? It can be cast to BLAS:
- very inefficiently, e.g., if implemented as dot products (Level 1 BLAS);
- better, if recognized as Level 2 BLAS;
- better still, as Level 3 BLAS;
- best, as batched Level 3 BLAS, e.g., GEMM, with on-the-fly data reshape.
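
One plausible lowering of the DSEL assignment above is to enumerate the independent (m,n) tile pairs as a batch and issue a single batched GEMM. The sketch below is our own illustration; it assumes MAGMA's magma_dgemm_batched and magma_setvector interfaces (verify the exact signatures against your MAGMA version's headers) and the tile layout from the previous sketch:

```cuda
#include <vector>
#include "magma_v2.h"

// Lower the DSEL update to one batched GEMM over the independent (m,n)
// tile pairs: A(:,:,m,n) -= A(:,:,m,0) * A(:,:,0,n).
void lower_to_batched_gemm(double* dA, int mt, int nt,  // tiled matrix on GPU
                           double** dA_array, double** dB_array,
                           double** dC_array,           // device ptr arrays,
                                                        // size >= (mt-1)*(nt-1)
                           magma_queue_t queue)
{
    const int T = 64;                       // tile size = contraction size
    std::vector<double*> hA, hB, hC;        // batch built on the host here
    for (int n = 1; n < nt; ++n)
        for (int m = 1; m < mt; ++m) {
            hA.push_back(dA + (size_t)(0 * mt + m) * T * T);  // tile (m,0)
            hB.push_back(dA + (size_t)(n * mt + 0) * T * T);  // tile (0,n)
            hC.push_back(dA + (size_t)(n * mt + m) * T * T);  // tile (m,n)
        }
    int batch = (int)hC.size();
    magma_setvector(batch, sizeof(double*), hA.data(), 1, dA_array, 1, queue);
    magma_setvector(batch, sizeof(double*), hB.data(), 1, dB_array, 1, queue);
    magma_setvector(batch, sizeof(double*), hC.data(), 1, dC_array, 1, queue);

    // C_i = -1 * A_i * B_i + 1 * C_i for every tile pair in the batch.
    magma_dgemm_batched(MagmaNoTrans, MagmaNoTrans, T, T, T,
                        -1.0, (double const* const*)dA_array, T,
                              (double const* const*)dB_array, T,
                        +1.0, dC_array, T, batch, queue);
}
```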

Must determine (in software) whether it is possible to do it through a batched GEMM formulation.

Tensors for high-order FEM

[1] V. Dobrev, T. Kolev, R. Rieben, High order curvilinear finite element methods for Lagrangian hydrodynamics, SIAM J. Sci. Comp. 34(5), B606-B641 (36 pages).

A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov, High-Performance Tensor Contractions for GPUs, The International Conference on Computational Science (ICCS 2016), San Diego, CA, June 6-8, 2016.

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL release number LLNL-POST-676632).

Tensor formulation for high-order FEM

Consider the FE mass matrix M_E for an element E with weight ρ, as a 2-D tensor. Using FE of order p, we have n_d = O(p^d) and n_q = O(p^d), so B is a dense O(p^d) x O(p^d) matrix. If the FE basis and the quadrature rule have tensor product structure, we can decompose the dof and quadrature point indices along logical coordinate axes,

    i = (i_1, ..., i_d),  j = (j_1, ..., j_d),  k = (k_1, ..., k_d),

so in 3D, for example (d = 3), M_{ij} can be viewed as a 6-dimensional tensor.

Tensor kernels for assembly/evaluation

Evaluations of M times V (referred to as equations (3) and (4) in [1]) can be implemented as tensor index reordering plus GEMM, or as something better: code generation and autotuning will be used. Tensor contractions are treated as batched linear algebra, released through MAGMA, http://icl.cs.utk.edu/magma/ (including LU, ...). For example,

    C_{i1,i2,i3} = sum_k A_{k,i1} B_{k,i2,i3}

can be written as [2]

    Reshape(C)_{nd1 x (nd2*nd3)} = A^T Reshape(B)_{nq1 x (nd2*nd3)}

[2] A. Haidar, T. Dong, S. Tomov, P. Luszczek, et al., ..., Germany, July 12-16, 2015.
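
Under the assumption of contiguous column-major storage (first index fastest), the reshape is purely a reinterpretation of the dimensions and the contraction collapses to one GEMM. A minimal sketch using the standard CBLAS interface (our own illustration, not the MAGMA code):

```cuda
#include <cblas.h>

// Contraction over the first index as a single GEMM:
//   C(i1,i2,i3) = sum_k A(k,i1) * B(k,i2,i3)
//   == Reshape(C)[nd1 x (nd2*nd3)] = A^T * Reshape(B)[nq1 x (nd2*nd3)]
// All tensors contiguous, column-major, so "Reshape" moves no data.
void contract_first_index(int nq1, int nd1, int nd2, int nd3,
                          const double* A,  // nq1 x nd1
                          const double* B,  // nq1 x nd2 x nd3
                          double* C)        // nd1 x nd2 x nd3
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                nd1, nd2 * nd3, nq1,
                1.0, A, nq1, B, nq1,
                0.0, C, nd1);
}
```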

Tensor contraction interfaces and code generation

Design goals:
- Convenience of use (dimension and data layout abstraction);
- Readability (considered a DSEL; decided C++14 is expressive enough);
- Performance (reshape to GEMMs, kernel design, autotuning, compiler code generation/templates).

We use the C++14 standard, and in particular the constexpr specifier (to evaluate the value of a function or variable at compile time).
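
As a small, self-contained illustration of that use of constexpr (our own example, not the talk's library), compile-time tensor extents let index arithmetic be evaluated and checked at compile time, so the compiler can fold offsets into constants and fully unroll loops:

```cuda
// Compile-time tensor shape: extents are template parameters, so the
// linear offset of an element is a constant expression.
template <int I, int J, int M, int N>
struct TensorShape {
    // Linear offset of element (i, j, m, n), first index fastest.
    static constexpr int offset(int i, int j, int m, int n) {
        return i + I * (j + J * (m + M * n));
    }
    static constexpr int size = I * J * M * N;
};

// Both checks happen entirely at compile time.
static_assert(TensorShape<64, 64, 9, 8>::offset(0, 0, 0, 0) == 0, "");
static_assert(TensorShape<64, 64, 9, 8>::size == 64 * 64 * 9 * 8, "");
```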

Algorithm designs

Importance of reshaping to GEMMs: as illustrated, not all flops are equal.

[Figure: Gflop/s vs. matrix size M = N (K = 32) for batched DAXPY vs. batched DGEMM (NN), batch count 500, on one Tesla K40c GPU; the batched GEMM formulation sustains far higher performance than the Level 1 BLAS formulation.]

Batched routines released in MAGMA

Implementation on current hardware: memory hierarchies are becoming challenging.

Memory hierarchies for different types of architectures:

                      | E5-2650 v3 (Haswell) | KNL 7250           | ARM        | K40c            | P100
Cores                 | 10 cores             | 68 cores           | 4 cores    | 15 SM x 192     | 56 SM x 64
Registers             | 16/core (AVX2)       | 32/core (AVX-512)  | 32/core    | 256 KB/SM       | 256 KB/SM
L1 cache / GPU shared | 32 KB/core           | 32 KB/core         | 32 KB/core | 64 KB/SM        | 64 KB/SM
L2 cache              | 256 KB/core          | 1024 KB/2 cores    | 2 MB       | 1.5 MB          | 4 MB
L3 cache              | 25 MB                | 0.16 GB            | N/A        | N/A             | N/A
Main memory           | 64 GB                | 384 + 16 GB (DDR4 + MCDRAM) | 4 GB | 12 GB        | 16 GB
Memory bandwidth      | 68 GB/s              | 115 + 421 GB/s     | 26 GB/s    | 288 GB/s        | 720 GB/s
PCI Express gen3 x16: 16 GB/s (all); interconnect (Cray Gemini): 6 GB/s (all).

Workshop on Batched, Reproducible, and Reduced Precision BLAS
Georgia Tech, Computational Science and Engineering, Atlanta, GA, February 23-25, 2017
http://bit.ly/Batch-BLAS-2017

Draft reports: Batched BLAS API draft (batched_api_03_30_2016.pdf), Batched BLAS poster, and Batched BLAS slides; webpage on Efficient Reproducible Floating Point Summation (EECS-2015-229.pdf).

Algorithm designs

- Reshape to GEMMs: the GEMM is multilevel blocked code from MAGMA, mapped to the GPU's hierarchical memory and parametrized for autotuning (blocking factors MBLK, NBLK, KBLK).
- Use batched execution: in general, one thread block (TB) per matrix; for very small matrices, use vectorization across matrices within a TB, which we denote TB concurrency (tbc).
- Templates and constexpr avoid parameter checking and give compiler-unrolled code.
- No pointers to batched matrices: they are passed through formulas in the tensor abstraction.

General kernel organization (see the sketch below):
1) Read A and B (or parts of them, if blocking) into fast memory, through functions in the tensor abstraction for layout; this allows for on-the-fly reshape (data for indices in the operation may not be in standard GEMM form).
2) Compute, e.g., A * B.
3) Update C.
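
The following CUDA sketch (our own simplification, not the tuned MAGMA kernel) mirrors that three-step organization with one very small matrix per thread block:

```cuda
// One thread block per matrix in the batch; tiles of A and B are staged in
// shared memory, then C is updated. Assumes square N x N column-major
// matrices, N <= 32, packed back to back in memory.
__global__ void batched_gemm_small(int N, const double* A, const double* B,
                                   double* C)
{
    extern __shared__ double smem[];                // 2 * N * N doubles
    double* sA = smem;
    double* sB = smem + N * N;

    const size_t off = (size_t)blockIdx.x * N * N;  // this batch entry
    int tx = threadIdx.x, ty = threadIdx.y;         // N x N thread grid

    // 1) Read A and B into fast (shared) memory; an on-the-fly reshape
    //    would change only this addressing step.
    sA[ty * N + tx] = A[off + ty * N + tx];
    sB[ty * N + tx] = B[off + ty * N + tx];
    __syncthreads();

    // 2) Compute one element of A * B per thread.
    double s = 0.0;
    for (int k = 0; k < N; ++k)
        s += sA[k * N + tx] * sB[ty * N + k];

    // 3) Update C.
    C[off + ty * N + tx] += s;
}

// Launch example: dim3 threads(N, N);
// batched_gemm_small<<<batch, threads, 2*N*N*sizeof(double)>>>(N, dA, dB, dC);
```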

Autotuning

1) Kernel variants: performance parameters are exposed through a templated kernel interface:

    template <typename T, int DIM_X, int DIM_Y,
              int BLK_M, int BLK_N, int BLK_K,
              int DIM_XA, int DIM_YA, int DIM_XB, int DIM_YB,
              int THR_M, int THR_N, int CONJA, int CONJB>
    static __device__ void tensor_template_device_gemm_nn(int M, int N, int K, ...);

2) CPU interfaces that call the GPU kernels as a batched computation:

    template <typename T, int DIM_X, int DIM_Y, ...>
    void tensor_template_batched_gemm_nn(int m, int n, int k, ...) {
        ...
        tensor_template_device_gemm_nn<T, DIM_X, DIM_Y, ...>
            <<<dimGrid, dimBlock, 0, queue>>>(m, n, k, ...);
    }

3) Python scripts that generate the search space for the parameters DIM_X, DIM_Y, ... (each NN_V_<index> fixes one combination of values):

    #define NN_V_0 ...
    #define NN_V_1 ...
    #define NN_V_2 ...

4) Scripts that run all versions in the search space, analyze the results, and return the best combination of parameters, which is stored in the library for subsequent use.
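
As a hypothetical illustration of how step 4's output might be consumed (function names and the size thresholds are placeholders of our own, not MAGMA's actual dispatch code), the winning parameter combinations can be baked in as template instantiations and selected by problem size at run time:

```cuda
#include <cstdio>

// Stub standing in for a launch of one tuned kernel instantiation.
template <int DIM_X, int DIM_Y, int BLK_M, int BLK_N, int BLK_K>
void gemm_variant(int m, int n, int k) {
    std::printf("m=%d: DIM=(%d,%d) BLK=(%d,%d,%d)\n",
                m, DIM_X, DIM_Y, BLK_M, BLK_N, BLK_K);
    (void)n; (void)k;
}

// Dispatch table produced offline by the autotuning scripts; the
// thresholds and parameter values below are placeholders.
void gemm_dispatch(int m, int n, int k) {
    if (m <= 8)       gemm_variant<4, 4,  8,  8, 8>(m, n, k);
    else if (m <= 16) gemm_variant<4, 8, 16, 16, 8>(m, n, k);
    else              gemm_variant<8, 8, 32, 32, 8>(m, n, k);
}
```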

Performance model

Let F be the flops for the computation and T_min the fastest time to solution; then P_max = F / T_min. For square matrices, F = 2n^3 and

    T_min = min_T ( T_Read(A,B,C) + T_Compute(C) + T_Write(C) ).

We need to read/write 4n^2 elements, i.e., 32n^2 bytes in double precision (DP); if the maximum bandwidth is B, we can take T_min = 32n^2 / B in DP. Thus,

    P_max = F / T_min = 2n^3 B / (32n^2) = n B / 16.

With ECC on, the peak bandwidth B on a K40c is 180 GB/s, so when n = 16, for example, we expect a theoretical maximum performance of 180 Gflop/s in DP.
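
A tiny helper (our own) that evaluates this bound, P_max = n * B / 16 Gflop/s for bandwidth B in GB/s:

```cuda
#include <cstdio>

// Bandwidth-limited bound for DP square GEMM: 2*n^3 flops over 32*n^2
// bytes moved gives P_max = n * B / 16.
double dp_gemm_bound_gflops(int n, double bandwidth_GBs) {
    return n * bandwidth_GBs / 16.0;
}

int main() {
    // K40c with ECC on: 180 GB/s peak bandwidth (from the slide).
    std::printf("n=16: %.0f Gflop/s\n", dp_gemm_bound_gflops(16, 180.0));
    std::printf("n=32: %.0f Gflop/s\n", dp_gemm_bound_gflops(32, 180.0));
    return 0;
}
```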

Performance results

Performance comparison of tensor contraction versions using batched C = αAB + βC on 100,000 square matrices of size n, on a K40c GPU and on 16 cores of Intel Xeon E5-2670 (2.60 GHz) CPUs; also shown is the effect of the thread block concurrency (tbc) technique, where several DGEMMs are performed on one TB simultaneously.

[Figure: Gflop/s vs. matrix size (2 to 32) on an Nvidia K40 and an Intel Xeon E5-2650 v3 (Haswell, 10 cores), comparing our MAGMA design (K40), cuBLAS (K40), MKL with OpenMP on the CPU, and the roofline bound.]

Performance results

[Figure: Batched DGEMM on an Nvidia P100, Gflop/s vs. matrix size (2 to 32), comparing MAGMA tensor dgemm with sizes predefined at compile time, the generic MAGMA batched dgemm for small sizes, and cuBLAS v8.0.]

Performance results

[Figure: Batched DGEMM on a Tegra ARM system, Gflop/s vs. matrix size (up to 35), comparing MAGMA, OpenBLAS, and ijk-loop and ikj-loop implementations against the upper bound.]

Performance: Batched DGEMM on CPUs

[Figure, left: 2 x Intel Xeon E5-2650 v3 (Haswell), 20 cores; Gflop/s vs. matrix size (up to 35) for generated code, MKL code, ijk code, ikj code, and MKL batched, against the upper bound. Figure, right: Intel Xeon E5-2650 v3 (Haswell) scaling: 20 cores with custom NUMA placement, 20 cores with interleaved allocation, 20 cores, and 10 cores, against the upper bound.]

I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Dongarra, High-performance matrix-matrix multiplications of very small matrices, Euro-Par'16, Grenoble, France, August 22-26, 2016.

Performance results: Batched DGEMM on Intel Xeon Phi

[Figure: batched dgemm on 10,000 matrices, Gflop/s vs. matrix size (2 to 32), on an Intel Xeon Phi KNC 7120 (60 cores) and an Intel Xeon Phi KNL 7250 (68 cores).]

Conclusions and future work

In conclusion:
- Developed tensor abstractions for high-order FEM in a multidisciplinary effort;
- Achieved 90% of the theoretical maximum on GPUs and multicore CPUs;
- Used on-the-fly tensor reshaping to cast tensor contractions as small but many GEMMs, executed using batched approaches;
- Custom-designed GEMM kernels for small matrices, with autotuning.

Future directions:
- Release a tensor contractions package through the MAGMA library;
- Integrate developments in BLAST;
- Complete autotuning and develop all kernels.

Collaborators and Support

MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma

Collaborating partners:
- University of Tennessee, Knoxville
- University of Manchester, Manchester, UK
- University of Paris-Sud, France
- Lawrence Livermore National Laboratory, Livermore, CA
- University of California, Berkeley
- University of Colorado, Denver
- INRIA, France (StarPU team)
- KAUST, Saudi Arabia

Contractions by the first index can often be represented as a tensor index reordering plus GEMM, which is a key factor in achieving high performance. We present ongoing work on the design of a high-performance package in MAGMA for tensor algebra that includes techniques to organize tensor contractions, data storage, and parametrization related to ...
