Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched


Ahmad Abdelfattah (1), Marc Baboulin (2), Veselin Dobrev (3), Jack Dongarra (1,4), Chris Earl (3), Joel Falcou (2), Azzam Haidar (1), Ian Karlin (3), Tzanio Kolev (3), Ian Masliah (2), and Stan Tomov (1)

(1) Innovative Computing Laboratory, University of Tennessee, Knoxville
(2) University of Paris-Sud, France
(3) Lawrence Livermore National Laboratory, Livermore, CA, USA
(4) University of Manchester, Manchester, UK

SIAM Conference on Computational Science and Engineering (SIAM CSE'17)
Algorithms and Libraries for Tensor Contractions (MS47)
https://www.siam.org/meetings/cse17/
Atlanta, GA, U.S.A., February 26-March 3, 2017

Outline
- Introduction
- Tensors in numerical libraries
- Tensor formulation for high-order FEM
- Tensor contraction interfaces and code generation
- Algorithm design and tuning
- Performance
- Conclusions

Introduction

Numerous important applications can be expressed through tensors: high-order FEM simulations, signal processing, numerical linear algebra, numerical analysis, data mining, deep learning, graph analysis, neuroscience, and more.

The goal is to design:
- a high-performance package for tensor algebra;
- with built-in architecture awareness (GPU, Xeon Phi, multicore);
- and a user-friendly interface.

Tensors generalize relational data, e.g.:
- an item is a scalar (order 0);
- items form a vector (order 1);
- relations of pairs form a matrix (order 2);
- relations of 3-tuples form a 3-D array (order 3);
- relations of N-tuples form an N-D array (order N).

Examples: need for batched and/or tensor contraction routines in machine learning, e.g., convolutional neural networks (CNNs) used in computer vision.

The key computation is the convolution of filters F_n (feature detectors) with an input image D, producing output predictions (e.g., chicken 0.4, person 0.1, boat 0.3, dog 0.01). For every filter F_n and every channel, the computation of every output value O_{n,k} is a tensor contraction:

    O_{n,k} = sum_i D_{k,i} F_{n,i}

There is plenty of parallelism, but the operations are small and must be batched. With a data "reshape", the computation can be transformed into a batched GEMM (for efficiency; among other approaches).
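
To make the contraction concrete, here is a minimal sketch (our own illustration with hypothetical names, not the MAGMA or any framework API) that computes O_{n,k} = sum_i D_{k,i} F_{n,i} with explicit loops; with this layout the same computation is exactly one GEMM, O = F * D^T, so a batch of such small problems maps directly to a batched GEMM:

```cuda
#include <vector>
#include <cstddef>

// Naive contraction: O(n,k) = sum_i D(k,i) * F(n,i).
// N filters, K output values, I contraction length; row-major storage.
// Hypothetical helper for illustration only, not the MAGMA interface.
void contract_naive(std::size_t N, std::size_t K, std::size_t I,
                    const std::vector<double>& D,   // K x I
                    const std::vector<double>& F,   // N x I
                    std::vector<double>& O)         // N x K
{
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t k = 0; k < K; ++k) {
            double s = 0.0;
            for (std::size_t i = 0; i < I; ++i)
                s += D[k * I + i] * F[n * I + i];
            O[n * K + k] = s;   // identical to the GEMM O = F * D^T
        }
}
```

Because both operands are traversed with unit stride over i, the "reshape" costs no data movement here; a batched GEMM can then process many such small problems in a single launch.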

Examples: multi-physics problems need many small tensor contractions.

Collaboration with ORNL and the UTK physics department (Mike Guidry, Jay Billings, Ben Brock, Daniel Shyles, Andrew Belt). Many physical systems can be modeled by fluid dynamics plus a kinetic approximation. For example, in astrophysics, stiff equations must be integrated numerically:
- Implicitly: the standard approach, leading to the need for batched solvers (e.g., as in the XNet library);
- Explicitly: a new way to stabilize them with macro- plus microscopic equilibration, which needs batched tensor contractions of variable sizes.

[Figure: speedup of the solver for matrix size 150, comparing MKL, MA48, and MAGMA; additional acceleration through MAGMA Batched computation and CUDA streams gives a 3.7x speedup, for a 7x speedup overall.]

Reference: A. Haidar, S. Tomov, A. Abdelfattah, M. Guidry, J. Billings, and J. Dongarra, Optimisation Techniques Toward Accelerating Explicit Integration for Large Kinetic Networks, International Conference on Parallel Processing (ICPP 2016), Philadelphia, PA, USA, 2016.

Tensor abstractions and numerical dense linear algebra

A matrix A in tiled data layout can be viewed as a 4th-order tensor A_{i,j,m,n}, where (i,j) indexes the entry within a 64 x 64 tile and (m,n) indexes the tile in a 9 x 8 grid of tiles. A rank-64 update is then a tensor contraction on index k:

    for i = 0..63, j = 0..63, m = 1..8, n = 1..7:
        A_{i,j,m,n} -= sum_k A_{i,k,m,0} * A_{k,j,0,n}
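
The following sketch (our own illustration; the struct and function names are hypothetical, not the talk's library) spells out this tiled-layout view and the rank-64 update as plain loops:

```cuda
#include <vector>
#include <cstddef>

// A matrix in 64x64 tiled layout viewed as a 4th-order tensor A(i,j,m,n):
// tiles are stored contiguously and indexed by (m,n), column-major both
// within a tile and across the tile grid.
struct TiledMatrix {
    static constexpr int T = 64;              // tile dimension
    int mt, nt;                               // tile grid: mt x nt tiles
    std::vector<double> data;                 // mt*nt tiles of T*T entries

    TiledMatrix(int mt_, int nt_)
        : mt(mt_), nt(nt_), data(std::size_t(mt_) * nt_ * T * T) {}

    // Element (i,j) of tile (m,n).
    double& operator()(int i, int j, int m, int n) {
        return data[(std::size_t(n) * mt + m) * T * T + std::size_t(j) * T + i];
    }
};

// Rank-64 update as a contraction over k: for each trailing tile (m,n),
// A(:,:,m,n) -= A(:,:,m,0) * A(:,:,0,n) -- one small GEMM per tile.
void rank64_update(TiledMatrix& A) {
    constexpr int T = TiledMatrix::T;
    for (int m = 1; m < A.mt; ++m)
        for (int n = 1; n < A.nt; ++n)
            for (int i = 0; i < T; ++i)
                for (int j = 0; j < T; ++j) {
                    double s = 0.0;
                    for (int k = 0; k < T; ++k)
                        s += A(i, k, m, 0) * A(k, j, 0, n);
                    A(i, j, m, n) -= s;
                }
}
```

Each (m,n) iteration is an independent 64 x 64 x 64 GEMM, which is exactly the batched-GEMM pattern the talk targets.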

Tensor abstractions and numerical dense linear algebra: how to design it?

// Declare a 4th-order tensor A on the GPU
Tensor<64, 64, 9, 8, gpu_t> A;

// DSEL design using Einstein notation: a repeated
// index k means a summation/contraction.
// The range of the other indices is the full range, as
// given through the left assignment operand.
A(i, j, m:1..8, n:1..7) -= A(i,k,m,0) * A(k,j,0,n);

How to implement it? It can be cast to BLAS:
- very inefficiently, e.g., if implemented as dot products (Level 1 BLAS);
- better, if recognized as Level 2 BLAS;
- better still, as Level 3 BLAS;
- best, as batched Level 3 BLAS, e.g., GEMM, with on-the-fly data reshape.
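
One plausible lowering of the DSEL assignment above is to enumerate the independent (m,n) tile pairs as a batch and issue a single batched GEMM. The sketch below is our own illustration; it assumes MAGMA's magma_dgemm_batched and magma_setvector interfaces (verify the exact signatures against your MAGMA version's headers) and the tile layout from the previous sketch:

```cuda
#include <vector>
#include "magma_v2.h"

// Lower the DSEL update to one batched GEMM over the independent (m,n)
// tile pairs: A(:,:,m,n) -= A(:,:,m,0) * A(:,:,0,n).
void lower_to_batched_gemm(double* dA, int mt, int nt,  // tiled matrix on GPU
                           double** dA_array, double** dB_array,
                           double** dC_array,           // device ptr arrays,
                                                        // size >= (mt-1)*(nt-1)
                           magma_queue_t queue)
{
    const int T = 64;                       // tile size = contraction size
    std::vector<double*> hA, hB, hC;        // batch built on the host here
    for (int n = 1; n < nt; ++n)
        for (int m = 1; m < mt; ++m) {
            hA.push_back(dA + (size_t)(0 * mt + m) * T * T);  // tile (m,0)
            hB.push_back(dA + (size_t)(n * mt + 0) * T * T);  // tile (0,n)
            hC.push_back(dA + (size_t)(n * mt + m) * T * T);  // tile (m,n)
        }
    int batch = (int)hC.size();
    magma_setvector(batch, sizeof(double*), hA.data(), 1, dA_array, 1, queue);
    magma_setvector(batch, sizeof(double*), hB.data(), 1, dB_array, 1, queue);
    magma_setvector(batch, sizeof(double*), hC.data(), 1, dC_array, 1, queue);

    // C_i = -1 * A_i * B_i + 1 * C_i for every tile pair in the batch.
    magma_dgemm_batched(MagmaNoTrans, MagmaNoTrans, T, T, T,
                        -1.0, (double const* const*)dA_array, T,
                              (double const* const*)dB_array, T,
                        +1.0, dC_array, T, batch, queue);
}
```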

Must determine (in software) whether it is possible to do it through a batched GEMM formulation.

Tensors for high-order FEM

[1] V. Dobrev, T. Kolev, R. Rieben, High order curvilinear finite element methods for Lagrangian hydrodynamics, SIAM J. Sci. Comp. 34(5), B606-B641 (36 pages).

A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov, High-Performance Tensor Contractions for GPUs, The International Conference on Computational Science (ICCS 2016), San Diego, CA, June 6-8, 2016.

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL release number LLNL-POST-676632).

Tensor formulation for high-order FEM

Consider the FE mass matrix M_E for an element E with weight ρ, as a 2-D tensor. Using FE of order p, we have n_d = O(p^d) and n_q = O(p^d), so B is a dense O(p^d) x O(p^d) matrix. If the FE basis and the quadrature rule have tensor product structure, we can decompose the dof and quadrature point indices along logical coordinate axes,

    i = (i_1, ..., i_d),  j = (j_1, ..., j_d),  k = (k_1, ..., k_d),

so in 3D, for example (d = 3), M_{ij} can be viewed as a 6-dimensional tensor.

Tensor kernels for assembly/evaluation

Evaluations of M times V (referred to as equations (3) and (4) in [1]) can be implemented as tensor index reordering plus GEMM, or as something better: code generation and autotuning will be used. Tensor contractions are treated as batched linear algebra, released through MAGMA, http://icl.cs.utk.edu/magma/ (including LU, ...). For example,

    C_{i1,i2,i3} = sum_k A_{k,i1} B_{k,i2,i3}

can be written as [2]

    Reshape(C)_{nd1 x (nd2*nd3)} = A^T Reshape(B)_{nq1 x (nd2*nd3)}

[2] A. Haidar, T. Dong, S. Tomov, P. Luszczek, et al., ..., Germany, July 12-16, 2015.
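
Under the assumption of contiguous column-major storage (first index fastest), the reshape is purely a reinterpretation of the dimensions and the contraction collapses to one GEMM. A minimal sketch using the standard CBLAS interface (our own illustration, not the MAGMA code):

```cuda
#include <cblas.h>

// Contraction over the first index as a single GEMM:
//   C(i1,i2,i3) = sum_k A(k,i1) * B(k,i2,i3)
//   == Reshape(C)[nd1 x (nd2*nd3)] = A^T * Reshape(B)[nq1 x (nd2*nd3)]
// All tensors contiguous, column-major, so "Reshape" moves no data.
void contract_first_index(int nq1, int nd1, int nd2, int nd3,
                          const double* A,  // nq1 x nd1
                          const double* B,  // nq1 x nd2 x nd3
                          double* C)        // nd1 x nd2 x nd3
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                nd1, nd2 * nd3, nq1,
                1.0, A, nq1, B, nq1,
                0.0, C, nd1);
}
```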

Tensor contraction interfaces and code generation

Design goals:
- Convenience of use (dimension and data layout abstraction);
- Readability (considered a DSEL; decided C++14 is expressive enough);
- Performance (reshape to GEMMs, kernel design, autotuning, compiler code generation/templates).

We use the C++14 standard, and in particular the constexpr specifier (to evaluate the value of a function or variable at compile time).
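
As a small, self-contained illustration of that use of constexpr (our own example, not the talk's library), compile-time tensor extents let index arithmetic be evaluated and checked at compile time, so the compiler can fold offsets into constants and fully unroll loops:

```cuda
// Compile-time tensor shape: extents are template parameters, so the
// linear offset of an element is a constant expression.
template <int I, int J, int M, int N>
struct TensorShape {
    // Linear offset of element (i, j, m, n), first index fastest.
    static constexpr int offset(int i, int j, int m, int n) {
        return i + I * (j + J * (m + M * n));
    }
    static constexpr int size = I * J * M * N;
};

// Both checks happen entirely at compile time.
static_assert(TensorShape<64, 64, 9, 8>::offset(0, 0, 0, 0) == 0, "");
static_assert(TensorShape<64, 64, 9, 8>::size == 64 * 64 * 9 * 8, "");
```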

Algorithm designs

Importance of reshaping to GEMMs: as illustrated, not all flops are equal.

[Figure: Gflop/s vs. matrix size M = N (K = 32) for batched DAXPY vs. batched DGEMM (NN), batch count 500, on one Tesla K40c GPU; the batched GEMM formulation sustains far higher performance than the Level 1 BLAS formulation.]

Batched routines released in MAGMA

Implementation on current hardware: memory hierarchies are becoming challenging.

Memory hierarchies for different types of architectures:

                      | E5-2650 v3 (Haswell) | KNL 7250           | ARM        | K40c            | P100
Cores                 | 10 cores             | 68 cores           | 4 cores    | 15 SM x 192     | 56 SM x 64
Registers             | 16/core (AVX2)       | 32/core (AVX-512)  | 32/core    | 256 KB/SM       | 256 KB/SM
L1 cache / GPU shared | 32 KB/core           | 32 KB/core         | 32 KB/core | 64 KB/SM        | 64 KB/SM
L2 cache              | 256 KB/core          | 1024 KB/2 cores    | 2 MB       | 1.5 MB          | 4 MB
L3 cache              | 25 MB                | 0.16 GB            | N/A        | N/A             | N/A
Main memory           | 64 GB                | 384 + 16 GB (DDR4 + MCDRAM) | 4 GB | 12 GB        | 16 GB
Memory bandwidth      | 68 GB/s              | 115 + 421 GB/s     | 26 GB/s    | 288 GB/s        | 720 GB/s
PCI Express gen3 x16: 16 GB/s (all); interconnect (Cray Gemini): 6 GB/s (all).

Workshop on Batched, Reproducible, and Reduced Precision BLAS
Georgia Tech, Computational Science and Engineering, Atlanta, GA, February 23-25, 2017
http://bit.ly/Batch-BLAS-2017

Draft reports: Batched BLAS API draft (batched_api_03_30_2016.pdf), Batched BLAS poster, and Batched BLAS slides; webpage on Efficient Reproducible Floating Point Summation (EECS-2015-229.pdf).

Algorithm designs

- Reshape to GEMMs: the GEMM is multilevel blocked code from MAGMA, mapped to the GPU's hierarchical memory and parametrized for autotuning (blocking factors MBLK, NBLK, KBLK).
- Use batched execution: in general, one thread block (TB) per matrix; for very small matrices, use vectorization across matrices within a TB, which we denote TB concurrency (tbc).
- Templates and constexpr avoid parameter checking and give compiler-unrolled code.
- No pointers to batched matrices: they are passed through formulas in the tensor abstraction.

General kernel organization (see the sketch below):
1) Read A and B (or parts of them, if blocking) into fast memory, through functions in the tensor abstraction for layout; this allows for on-the-fly reshape (data for indices in the operation may not be in standard GEMM form).
2) Compute, e.g., A * B.
3) Update C.
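
The following CUDA sketch (our own simplification, not the tuned MAGMA kernel) mirrors that three-step organization with one very small matrix per thread block:

```cuda
// One thread block per matrix in the batch; tiles of A and B are staged in
// shared memory, then C is updated. Assumes square N x N column-major
// matrices, N <= 32, packed back to back in memory.
__global__ void batched_gemm_small(int N, const double* A, const double* B,
                                   double* C)
{
    extern __shared__ double smem[];                // 2 * N * N doubles
    double* sA = smem;
    double* sB = smem + N * N;

    const size_t off = (size_t)blockIdx.x * N * N;  // this batch entry
    int tx = threadIdx.x, ty = threadIdx.y;         // N x N thread grid

    // 1) Read A and B into fast (shared) memory; an on-the-fly reshape
    //    would change only this addressing step.
    sA[ty * N + tx] = A[off + ty * N + tx];
    sB[ty * N + tx] = B[off + ty * N + tx];
    __syncthreads();

    // 2) Compute one element of A * B per thread.
    double s = 0.0;
    for (int k = 0; k < N; ++k)
        s += sA[k * N + tx] * sB[ty * N + k];

    // 3) Update C.
    C[off + ty * N + tx] += s;
}

// Launch example: dim3 threads(N, N);
// batched_gemm_small<<<batch, threads, 2*N*N*sizeof(double)>>>(N, dA, dB, dC);
```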

Autotuning

1) Kernel variants: performance parameters are exposed through a templated kernel interface:

    template <typename T, int DIM_X, int DIM_Y,
              int BLK_M, int BLK_N, int BLK_K,
              int DIM_XA, int DIM_YA, int DIM_XB, int DIM_YB,
              int THR_M, int THR_N, int CONJA, int CONJB>
    static __device__ void tensor_template_device_gemm_nn(int M, int N, int K, ...);

2) CPU interfaces that call the GPU kernels as a batched computation:

    template <typename T, int DIM_X, int DIM_Y, ...>
    void tensor_template_batched_gemm_nn(int m, int n, int k, ...) {
        ...
        tensor_template_device_gemm_nn<T, DIM_X, DIM_Y, ...>
            <<<dimGrid, dimBlock, 0, queue>>>(m, n, k, ...);
    }

3) Python scripts that generate the search space for the parameters DIM_X, DIM_Y, ... (each NN_V_<index> fixes one combination of values):

    #define NN_V_0 ...
    #define NN_V_1 ...
    #define NN_V_2 ...

4) Scripts that run all versions in the search space, analyze the results, and return the best combination of parameters, which is stored in the library for subsequent use.
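
As a hypothetical illustration of how step 4's output might be consumed (function names and the size thresholds are placeholders of our own, not MAGMA's actual dispatch code), the winning parameter combinations can be baked in as template instantiations and selected by problem size at run time:

```cuda
#include <cstdio>

// Stub standing in for a launch of one tuned kernel instantiation.
template <int DIM_X, int DIM_Y, int BLK_M, int BLK_N, int BLK_K>
void gemm_variant(int m, int n, int k) {
    std::printf("m=%d: DIM=(%d,%d) BLK=(%d,%d,%d)\n",
                m, DIM_X, DIM_Y, BLK_M, BLK_N, BLK_K);
    (void)n; (void)k;
}

// Dispatch table produced offline by the autotuning scripts; the
// thresholds and parameter values below are placeholders.
void gemm_dispatch(int m, int n, int k) {
    if (m <= 8)       gemm_variant<4, 4,  8,  8, 8>(m, n, k);
    else if (m <= 16) gemm_variant<4, 8, 16, 16, 8>(m, n, k);
    else              gemm_variant<8, 8, 32, 32, 8>(m, n, k);
}
```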

Performance model

Let F be the flops for the computation and T_min the fastest time to solution; then P_max = F / T_min. For square matrices, F = 2n^3 and

    T_min = min_T ( T_Read(A,B,C) + T_Compute(C) + T_Write(C) ).

We need to read/write 4n^2 elements, i.e., 32n^2 bytes in double precision (DP); if the maximum bandwidth is B, we can take T_min = 32n^2 / B in DP. Thus,

    P_max = F / T_min = 2n^3 B / (32n^2) = n B / 16.

With ECC on, the peak bandwidth B on a K40c is 180 GB/s, so when n = 16, for example, we expect a theoretical maximum performance of 180 Gflop/s in DP.
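
A tiny helper (our own) that evaluates this bound, P_max = n * B / 16 Gflop/s for bandwidth B in GB/s:

```cuda
#include <cstdio>

// Bandwidth-limited bound for DP square GEMM: 2*n^3 flops over 32*n^2
// bytes moved gives P_max = n * B / 16.
double dp_gemm_bound_gflops(int n, double bandwidth_GBs) {
    return n * bandwidth_GBs / 16.0;
}

int main() {
    // K40c with ECC on: 180 GB/s peak bandwidth (from the slide).
    std::printf("n=16: %.0f Gflop/s\n", dp_gemm_bound_gflops(16, 180.0));
    std::printf("n=32: %.0f Gflop/s\n", dp_gemm_bound_gflops(32, 180.0));
    return 0;
}
```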

Performance results

Performance comparison of tensor contraction versions using batched C = αAB + βC on 100,000 square matrices of size n, on a K40c GPU and on 16 cores of Intel Xeon E5-2670 (2.60 GHz) CPUs; also shown is the effect of the thread block concurrency (tbc) technique, where several DGEMMs are performed on one TB simultaneously.

[Figure: Gflop/s vs. matrix size (2 to 32) on an Nvidia K40 and an Intel Xeon E5-2650 v3 (Haswell, 10 cores), comparing our MAGMA design (K40), cuBLAS (K40), MKL with OpenMP on the CPU, and the roofline bound.]

Performance results

[Figure: Batched DGEMM on an Nvidia P100, Gflop/s vs. matrix size (2 to 32), comparing MAGMA tensor dgemm with sizes predefined at compile time, the generic MAGMA batched dgemm for small sizes, and cuBLAS v8.0.]

Performance results

[Figure: Batched DGEMM on a Tegra ARM system, Gflop/s vs. matrix size (up to 35), comparing MAGMA, OpenBLAS, and ijk-loop and ikj-loop implementations against the upper bound.]

Performance: Batched DGEMM on CPUs

[Figure, left: 2 x Intel Xeon E5-2650 v3 (Haswell), 20 cores; Gflop/s vs. matrix size (up to 35) for generated code, MKL code, ijk code, ikj code, and MKL batched, against the upper bound. Figure, right: Intel Xeon E5-2650 v3 (Haswell) scaling: 20 cores with custom NUMA placement, 20 cores with interleaved allocation, 20 cores, and 10 cores, against the upper bound.]

I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Dongarra, High-performance matrix-matrix multiplications of very small matrices, Euro-Par'16, Grenoble, France, August 22-26, 2016.

Performance results: Batched DGEMM on Intel Xeon Phi

[Figure: batched dgemm on 10,000 matrices, Gflop/s vs. matrix size (2 to 32), on an Intel Xeon Phi KNC 7120 (60 cores) and an Intel Xeon Phi KNL 7250 (68 cores).]

Conclusions and future work

In conclusion:
- Developed tensor abstractions for high-order FEM in a multidisciplinary effort;
- Achieved 90% of the theoretical maximum on GPUs and multicore CPUs;
- Used on-the-fly tensor reshaping to cast tensor contractions as small but many GEMMs, executed using batched approaches;
- Custom-designed GEMM kernels for small matrices, with autotuning.

Future directions:
- Release a tensor contractions package through the MAGMA library;
- Integrate developments in BLAST;
- Complete autotuning and develop all kernels.

Collaborators and Support

MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma

Collaborating partners:
- University of Tennessee, Knoxville
- University of Manchester, Manchester, UK
- University of Paris-Sud, France
- Lawrence Livermore National Laboratory, Livermore, CA
- University of California, Berkeley
- University of Colorado, Denver
- INRIA, France (StarPU team)
- KAUST, Saudi Arabia

Contractions by the first index can often be represented as a tensor index reordering plus GEMM, which is a key factor in achieving high performance. We present ongoing work on the design of a high-performance package in MAGMA for tensor algebra that includes techniques to organize tensor contractions, data storage, and parametrization related to ...
