Direct Sparse Linear Solvers, Preconditioners


Argonne Training Program on Extreme-Scale Computing
Direct Sparse Linear Solvers, Preconditioners - SuperLU, STRUMPACK, with hands-on examples
ATPESC 2020
X. Sherry Li, Pieter Ghysels
Lawrence Berkeley National Laboratory
August 4, 2020
exascaleproject.org

Tutorial Content
Part 1. Sparse direct solvers: SuperLU and STRUMPACK (30 min)
– Sparse matrix representations
– Algorithms: Gaussian elimination, sparsity and graphs, ordering, symbolic factorization
– Different types of factorizations
– Parallelism exploiting sparsity (trees, DAGs); task scheduling, avoiding communication
– Parallel performance
Part 2. Rank-structured approximate factorizations: STRUMPACK (15 min)
– Hierarchical matrices, butterfly matrices
Part 3. Hands-on examples in SuperLU or STRUMPACK (15 min)

Strategies for solving sparse linear systems
§ Iterative methods (e.g., Krylov, multigrid, ...)
– A is not changed (read-only)
– Key kernel: sparse matrix-vector multiply; easier to optimize and parallelize
– Low algorithmic complexity, but may not converge
§ Direct methods
– A is modified (factorized): A = L*U; harder to optimize and parallelize
– Numerically robust, but higher algorithmic complexity
§ Often use a direct method to precondition an iterative method
– Solve an easier system: M^{-1} A x = M^{-1} b

Sparse matrix representation: Compressed Row Storage (CRS)
§ Store nonzeros row by row contiguously
§ Example: N = 7, NNZ = 19
§ 3 arrays: nzval (nonzero values), colind (column indices), rowptr (start of each row)
– For the 7x7 example: rowptr = {1, 3, 5, 8, 11, 13, 17, 20}
§ Storage: NNZ reals, NNZ + N + 1 integers
[Figure: 7x7 example matrix with entries a, b, c, ..., l and its nzval / colind / rowptr arrays]
Many other data structures: "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", R. Barrett et al.
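To make the three arrays concrete, here is a minimal CRS sketch in C (1-based indices, as on the slide) for a small hypothetical 4x4 matrix rather than the 7x7 example in the figure, together with the sparse matrix-vector product that iterative methods rely on:

#include <stdio.h>

int main(void)
{
    /* Hypothetical 4x4 example:
     *   [ 1 0 2 0 ]
     *   [ 0 3 0 0 ]
     *   [ 4 0 5 6 ]
     *   [ 0 7 0 8 ]
     */
    double nzval[]  = {1, 2, 3, 4, 5, 6, 7, 8};  /* nonzeros, packed row by row   */
    int    colind[] = {1, 3, 2, 1, 3, 4, 2, 4};  /* column index of each nonzero  */
    int    rowptr[] = {1, 3, 4, 7, 9};           /* where each row starts (+ end) */
    int    n = 4;

    /* y = A*x : the key kernel of iterative methods */
    double x[] = {1, 1, 1, 1}, y[4];
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int k = rowptr[i] - 1; k < rowptr[i + 1] - 1; k++)
            y[i] += nzval[k] * x[colind[k] - 1];
    }
    for (int i = 0; i < n; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

Storage here is NNZ = 8 reals and NNZ + N + 1 = 13 integers, matching the count above.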

Distributed input interface
§ Matrices involved:
– A, B (turned into X) – input; users manipulate them
– L, U – output; users do not need to see them
§ A (sparse) and B (dense) are distributed by block rows across processes P0, P1, P2, ...
§ The local part of A is stored in compressed row format
[Figure: block-row distribution of A and B over processes P0, P1, P2]

Distributed input interface
§ Each process has a structure to store its local part of A (distributed compressed row storage):

typedef struct {
    int_t  nnz_loc;  /* number of nonzeros in the local submatrix                      */
    int_t  m_loc;    /* number of rows local to this processor                         */
    int_t  fst_row;  /* global index of the first local row                            */
    void  *nzval;    /* pointer to array of nonzero values, packed by row              */
    int_t *colind;   /* pointer to array of column indices of the nonzeros             */
    int_t *rowptr;   /* pointer to array of beginnings of rows in nzval[] and colind[] */
} NRformat_loc;

Distributed Compressed Row Storage (example: SuperLU_DIST/FORTRAN/f_5x5.f90)
A 5x5 matrix with entries s, u, u, l, u, l, p, e, u, l, l, r is distributed on 2 processors: P0 owns rows 0-1, P1 owns rows 2-4.
§ Processor P0 data structure:
– nnz_loc = 5
– m_loc = 2
– fst_row = 0   // 0-based indexing
– nzval  = { s, u, u, l, u }
– colind = { 0, 2, 4, 0, 1 }
– rowptr = { 0, 3, 5 }
§ Processor P1 data structure:
– nnz_loc = 7
– m_loc = 3
– fst_row = 2   // 0-based indexing
– nzval  = { l, p, e, u, l, l, r }
– colind = { 1, 2, 3, 4, 0, 1, 4 }
– rowptr = { 0, 2, 4, 7 }
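In SuperLU_DIST these local arrays are handed to the library by wrapping them in a SuperMatrix of storage type SLU_NR_loc. Below is a hedged sketch for P0's piece; the numeric values merely stand in for the symbolic entries s, u, l, ..., and dCreate_CompRowLoc_Matrix_dist, doubleMalloc_dist, intMalloc_dist are the library's creation and allocation helpers:

#include "superlu_ddefs.h"   /* SuperLU_DIST double-precision interface */

/* Wrap P0's local rows (fst_row = 0, m_loc = 2) of the global 5x5 matrix. */
void build_local_A(SuperMatrix *A)
{
    int_t m = 5, n = 5, nnz_loc = 5, m_loc = 2, fst_row = 0;

    double *nzval  = doubleMalloc_dist(nnz_loc);
    int_t  *colind = intMalloc_dist(nnz_loc);
    int_t  *rowptr = intMalloc_dist(m_loc + 1);

    /* Placeholder numbers for the symbolic entries { s, u, u, l, u }. */
    double vals[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    int_t  cols[] = {0, 2, 4, 0, 1};
    int_t  rptr[] = {0, 3, 5};
    for (int_t i = 0; i < nnz_loc; i++) { nzval[i] = vals[i]; colind[i] = cols[i]; }
    for (int_t i = 0; i <= m_loc; i++)  rowptr[i] = rptr[i];

    /* The SuperMatrix takes ownership of the three arrays. */
    dCreate_CompRowLoc_Matrix_dist(A, m, n, nnz_loc, m_loc, fst_row,
                                   nzval, colind, rowptr,
                                   SLU_NR_loc, SLU_D, SLU_GE);
}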

Algorithms: review of Gaussian Elimination (GE)
§ First step of GE:

  A = [ α  w^T ]  =  [ 1    0 ] [ α  w^T ]
      [ v  B   ]     [ v/α  I ] [ 0  C   ] ,   where  C = B - v·w^T/α

§ Repeat GE on C
§ Result is an LU factorization, A = LU
– L lower triangular with unit diagonal, U upper triangular
§ Then x is obtained by solving two triangular systems with L and U, which are easier to solve
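As a concrete illustration of this recursion, here is a small, unpivoted, dense right-looking LU sketch in C (for clarity only; sparse codes operate on nonzeros and add pivoting, as discussed next):

#include <stdio.h>

#define N 3

/* In-place LU without pivoting: afterwards the upper triangle of a holds U
 * and the strict lower triangle holds L (unit diagonal implied). Each step k
 * forms the multipliers v/α and applies the rank-1 update C = B - v*w^T/α. */
void lu_nopivot(double a[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                 /* multipliers: v / α      */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];   /* Schur complement update */
        }
}

int main(void)
{
    double a[N][N] = {{4, 2, 1}, {2, 5, 3}, {1, 3, 6}};
    lu_nopivot(a);
    for (int i = 0; i < N; i++)
        printf("%8.4f %8.4f %8.4f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}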

Fill-in in sparse LU
[Figure: nonzero pattern of the L and U factors of a sparse matrix; fill-in appears at positions that are zero in A]

Direct solver solution phases
1. Preprocessing: reorder equations to minimize fill and maximize parallelism (~10% of time)
– Sparsity structure of L & U depends on A, which can be changed by row/column permutations (vertex relabeling of the underlying graph)
– Ordering: combinatorial algorithms; NP-complete to find the optimum [Yannakakis '83], so heuristics are used
2. Preprocessing: predict the fill-in positions in L & U (~10% of time)
– Symbolic factorization (combinatorial algorithms)
3. Preprocessing: design efficient data structures for quick retrieval of the nonzeros
– Compressed storage schemes
4. Perform factorization and triangular solutions (~80% of time)
– Numerical algorithms (floating-point operations only on nonzeros)
– Usually dominate the total runtime
For sparse Cholesky and QR, the steps can be separate. For sparse LU with pivoting, steps 2 and 4 must be interleaved.
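In SuperLU_DIST all four phases are driven by one expert routine, pdgssvx. The outline below is a hedged sketch of the call sequence on an already created process grid (grid setup is shown later); the structure and routine names follow recent releases of the double-precision distributed interface, and some initializer and cleanup names differ slightly across versions:

#include "superlu_ddefs.h"

/* Sketch: solve A*x = b. A is the distributed SLU_NR_loc matrix built earlier;
 * b holds the local rows of the right-hand side and is overwritten by the
 * local rows of the solution. */
void solve_sketch(SuperMatrix *A, double *b, int ldb, int m, int n,
                  gridinfo_t *grid)
{
    superlu_dist_options_t options;
    dScalePermstruct_t ScalePermstruct;
    dLUstruct_t LUstruct;
    dSOLVEstruct_t SOLVEstruct;
    SuperLUStat_t stat;
    int nrhs = 1, info;
    double berr[1];                        /* one backward error per right-hand side */

    set_default_options_dist(&options);    /* phase-1 choices: ordering, pivoting, ... */
    dScalePermstructInit(m, n, &ScalePermstruct);
    dLUstructInit(n, &LUstruct);
    PStatInit(&stat);

    /* One call runs reordering, symbolic factorization, numeric factorization,
     * and the triangular solves (phases 1-4 above). */
    pdgssvx(&options, A, &ScalePermstruct, b, ldb, nrhs,
            grid, &LUstruct, &SOLVEstruct, berr, &stat, &info);

    PStatPrint(&options, &stat, grid);     /* per-phase timings and flop counts */
    PStatFree(&stat);
    /* ... free ScalePermstruct, LUstruct, SOLVEstruct when no longer needed ... */
}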

Numerical pivoting for stability
§ Goal of pivoting is to control element growth in L & U for stability
– For sparse factorizations, the pivoting rule is often relaxed to trade stability for better sparsity and parallelism (e.g., threshold pivoting, static pivoting, ...)
§ Partial pivoting: used in dense LU, sequential SuperLU and SuperLU_MT (GEPP)
– Can force diagonal pivoting (controlled by a diagonal threshold)
– Hard to implement scalably for sparse factorization
§ Relaxed pivoting strategies: restricted pivoting, static pivoting
§ Static pivoting: used in SuperLU_DIST (GESP)
– Before factoring, scale and permute A to maximize the diagonal: Pr Dr A Dc → A'
– During factorization A' = LU, replace tiny pivots by ε·||A||, without changing the data structures for L & U
– If needed, use a few steps of iterative refinement after the first solution
– Quite stable in practice
[Figure: sparsity patterns illustrating restricted pivoting]
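For reference, a hedged sketch of the option settings that correspond to this GESP strategy; the field and enum names are taken from recent SuperLU_DIST releases and should be checked against the superlu_defs.h of your installation:

#include "superlu_ddefs.h"

/* Static-pivoting (GESP) configuration as described above. */
void configure_static_pivoting(superlu_dist_options_t *options)
{
    set_default_options_dist(options);
    options->Equil            = YES;             /* row/column scaling: Dr, Dc          */
    options->RowPerm          = LargeDiag_MC64;  /* permute large entries to diagonal   */
    options->ReplaceTinyPivot = YES;             /* replace tiny pivots by eps*||A||    */
    options->IterRefine       = SLU_DOUBLE;      /* a few steps of iterative refinement */
}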

Can we reduce fill? -- various ordering algorithms
Reordering (= permutation of equations and variables) changes the amount of fill.
Classic example: a 5x5 "arrow" matrix whose first row and first column are dense.
– With the dense row/column ordered first, the matrix fills in completely after elimination.
– With the ordering reversed (dense row/column last), elimination produces no fill at all.
[Figure: the 5x5 arrow matrix under both orderings]

Ordering to preserve sparsity: Minimum Degree
Local greedy strategy: repeatedly eliminate a vertex of minimum degree, which minimizes an upper bound on the fill-in created at that step.
[Figure: eliminating vertex 1 connects its neighbors i, j, k, l into a clique, shown both in the matrix and in the graph]

Ordering to preserve sparsity: Nested Dissection
§ Model problem: discretized system Ax = b from certain PDEs, e.g., a 5-point stencil on a k x k grid, N = k^2
– Factorization flops: O(k^3) = O(N^{3/2})
§ Theorem: ND ordering gives optimal complexity in exact arithmetic [George '73, Hoffman/Martin/Rose]
[Figure: grid geometry, reordered matrix, and separator tree]
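The O(N^{3/2}) count follows from a standard recurrence argument, sketched here in LaTeX (this derivation is not spelled out on the slide):

% Eliminating the top-level separator of O(k) vertices is a dense
% factorization costing O(k^3); the grid then splits into four
% subgrids of side k/2:
\[
  F(k) = 4\,F(k/2) + O(k^3)
  \;\Longrightarrow\;
  F(k) = O(k^3) = O\bigl(N^{3/2}\bigr),
\]
% since the root's O(k^3) work dominates the geometric sum
% 4(k/2)^3 + 16(k/4)^3 + \dots = k^3(1/2 + 1/4 + \dots).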

ND Ordering
§ Generalized nested dissection [Lipton/Rose/Tarjan '79]
– Global graph partitioning: top-down, divide-and-conquer
– Best for large problems
– Parallel codes available: ParMetis, PT-Scotch
§ First level: a separator S splits the graph into parts A and B, giving the reordered block structure

  [ A  0  x ]
  [ 0  B  x ]
  [ x  x  S ]

§ Recurse on A and B
§ Goal: find the smallest possible separator S at each level
– Multilevel schemes: Chaco [Hendrickson/Leland '94], Metis [Karypis/Kumar '95]
– Spectral bisection [Simon et al. '90-'95, Ghysels et al. 2019-]
– Geometric and spectral bisection [Chan/Gilbert/Teng '94]

ND Ordering
[Figure: for a 2D mesh, the nonzero pattern of A and of the L & U factors, with ND ordering vs. row-wise (natural) ordering]

Ordering for LU with non-symmetric patterns
§ Can use a symmetric ordering on a symmetrized matrix
§ Case of partial pivoting (serial SuperLU, SuperLU_MT):
– Use ordering based on A^T A
§ Case of static pivoting (SuperLU_DIST):
– Use ordering based on A^T + A
§ Can find a better ordering based solely on A, without symmetrization
– Diagonal Markowitz [Amestoy/Li/Ng 06]: similar to minimum degree, but without symmetrization
– Hypergraph partitioning [Boman, Grigori, et al. 08]: similar to ND on A^T A, but no need to compute A^T A

User-controllable options in SuperLU_DIST
For stability and efficiency, the library factorizes a transformed matrix:
  Pc ( Pr (Dr A Dc) ) Pc^T
"options" fields with C enum constants:
§ Equil: {NO, YES}
§ RowPerm: {NOROWPERM, LargeDiag_MC64, LargeDiag_AWPM, MY_PERMR}
§ ColPerm: {NATURAL, MMD_ATA, MMD_AT_PLUS_A, COLAMD, METIS_AT_PLUS_A, PARMETIS, ZOLTAN, MY_PERMC}
Call routine set_default_options_dist(&options) to set default values.
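For example, a hedged sketch of overriding the defaults with the enum constants above (the particular choices are illustrative, not recommendations):

#include "superlu_ddefs.h"

/* Start from the library defaults, then select permutations explicitly. */
void choose_orderings(superlu_dist_options_t *options)
{
    set_default_options_dist(options);
    options->RowPerm = LargeDiag_AWPM;    /* Pr: approximate-weight perfect matching    */
    options->ColPerm = METIS_AT_PLUS_A;   /* Pc: serial METIS on the pattern of A^T + A */
    /* For very large problems a parallel ordering may scale better:
     * options->ColPerm = PARMETIS; */
}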

Algorithm variants, codes ... depending on matrix properties
Two families: supernodal (updates applied in-place) and multifrontal (partial updates floating around).
§ Symmetric; pos. def. (Cholesky LL') or indefinite (LDL'): supernodal – symPACK (DAG); multifrontal – MUMPS (tree)
§ Symmetric pattern, but non-symmetric values: multifrontal – MUMPS (tree), STRUMPACK (binary tree)
§ Non-symmetric everything: supernodal – SuperLU (DAG); multifrontal – UMFPACK (DAG)
§ Remark: SuperLU, MUMPS, UMFPACK can use any sparsity-reducing ordering
§ STRUMPACK can only use nested dissection (restricted to a binary tree)
§ Survey of sparse direct solvers (codes, algorithms, parallelism): .../superlu/SparseDirectSurvey.pdf

Sparse LU: two algorithm variants, depending on how updates are accumulated
§ Supernodal (DAG based): SuperLU
– Updates are applied to the destination one at a time: S(j) ← (( S(j) - D(k1) ) - D(k2) ) - ...
§ Multifrontal (tree based): STRUMPACK, MUMPS
– Partial updates are accumulated and passed up the tree before being applied: S(j) ← S(j) - ( ...( D(k1) + D(k2) ) + ... )
[Figure: the L & U blocks and the elimination DAG / tree for each variant]

Supernode
Exploit dense submatrices in the factors:
§ Can use Level 3 BLAS
§ Reduce inefficient indirect addressing (scatter/gather)
§ Reduce graph traversal time using a coarser graph

Distributed L & U factored matrices (internal to SuperLU)
§ 2D block-cyclic layout over an MPI process grid – specified by the user
§ Rule: the process grid should be as square as possible. Or, set the row dimension (nprow) slightly smaller than the column dimension (npcol), e.g., 2x3, 2x4, 4x4, 4x8
[Figure: 2x3 MPI process grid, block-cyclic mapping of the L & U blocks, look-ahead window]
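A hedged sketch of creating such a grid with superlu_gridinit (the 2x3 shape is just an example for a 6-rank run):

#include <stdio.h>
#include <mpi.h>
#include "superlu_ddefs.h"

/* Create a 2 x 3 process grid; nprow <= npcol, as recommended above. */
int main(int argc, char *argv[])
{
    gridinfo_t grid;
    int nprow = 2, npcol = 3;

    MPI_Init(&argc, &argv);
    superlu_gridinit(MPI_COMM_WORLD, nprow, npcol, &grid);

    if (grid.iam == 0)
        printf("Using a %d x %d process grid\n", nprow, npcol);

    /* ... build the distributed matrix and call pdgssvx here ... */

    superlu_gridexit(&grid);
    MPI_Finalize();
    return 0;
}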

Distributed separator-tree-based parallelism (internal to STRUMPACK)
§ Supernode = separator = frontal matrix
§ Map each sub-tree to a sub-process grid, proportional to its estimated work
§ ScaLAPACK 2D block-cyclic layout at each node
§ Multi-threaded ScaLAPACK through a system multithreaded BLAS
§ Allow idle processes for better communication
– e.g., a 2x3 process grid is better than 1x7

Comparison of LU factorization time from 3 direct solvers: STRUMPACK, SuperLU_DIST, MUMPS
§ Pure MPI on 8 nodes of Intel Ivy Bridge, 192 cores (2x12 cores / node), NERSC Edison
§ METIS ordering
[Figure: bar chart of factorization times for the three solvers over a set of test matrices]

SuperLU_DIST recent improvements
§ SpLU (factorization):
– GPU: 2D algorithm (baseline) → GPU off-load (master), ~3x (3.5x @ 4096 Titan nodes)
– Communication avoiding & hiding: 3D comm-avoiding algorithm (Version 7), 27x @ 32,000 cores
§ SpTRSV (triangular solve):
– GPU: 2D algorithm (baseline) → GPU trisolve (gpu_trisolve), 8.5x @ 1 Summit GPU
– Communication avoiding & hiding: 3D comm-avoiding, 7x @ 12,000 cores; 1-sided MPI (trisolve-fompi), 2.4x @ 12,000 KNL cores

Tips for Debugging Performance
§ Check the sparsity ordering
§ Diagonal pivoting is preferable
– e.g., when the matrix is diagonally dominant, ...
§ Need a good BLAS library (vendor, OpenBLAS, ATLAS)
§ May need to adjust the block size for each architecture (parameters modifiable in routine sp_ienv())
– Larger blocks are better for a uniprocessor
– Smaller blocks are better for parallelism and load balance
§ GPTune: ML algorithms for selecting the best parameters: https://github.com/gptune/GPTune/

Algorithm complexity (in a big-O sense)
§ Dense LU: O(N^3)
§ Model PDEs on a regular mesh with nested dissection ordering:
– Exact sparse LU, 2D problems (N = k^2): factor flops N^{3/2}, solve flops N log(N), memory N log(N)
– Exact sparse LU, 3D problems (N = k^3): factor flops N^2, solve flops N^{4/3}, memory N^{4/3}
– STRUMPACK with low-rank compression, 2D: factor flops N, solve flops N, memory N
– STRUMPACK with low-rank compression, 3D: factor flops N^α polylog(N) (α ≤ 2), solve flops N log(N), memory N log(N)

Software summary§ SuperLU: conventional direct solver for general unsymmetric linear systems.(X.S. Li, J. Demmel, J. Gilbert, L. Grigori, Y. Liu, P. Sao
