Direct Sparse Linear Solvers, Preconditioners


Argonne Training Program on Extreme-Scale Computing
Direct Sparse Linear Solvers, Preconditioners - SuperLU, STRUMPACK, with hands-on examples
ATPESC 2020
X. Sherry Li, Pieter Ghysels
Lawrence Berkeley National Laboratory
August 4, 2020
exascaleproject.org

Tutorial Content
Part 1. Sparse direct solvers: SuperLU and STRUMPACK (30 min)
– Sparse matrix representations
– Algorithms: Gaussian elimination, sparsity and graphs, ordering, symbolic factorization
– Different types of factorizations
– Parallelism exploiting sparsity (trees, DAGs); task scheduling, avoiding communication
– Parallel performance
Part 2. Rank-structured approximate factorizations: STRUMPACK (15 min)
– Hierarchical matrices, butterfly matrices
Part 3. Hands-on examples in SuperLU or STRUMPACK (15 min)

Strategies for solving sparse linear systems
§ Iterative methods (e.g., Krylov, multigrid, ...)
– A is not changed (read-only)
– Key kernel: sparse matrix-vector multiply; easier to optimize and parallelize
– Low algorithmic complexity, but may not converge
§ Direct methods
– A is modified (factorized): A = L*U; harder to optimize and parallelize
– Numerically robust, but higher algorithmic complexity
§ Often use a direct method to precondition an iterative method
– Solve an easier system: M^{-1} A x = M^{-1} b

Sparse matrix representation: Compressed Row Storage (CRS)
§ Store nonzeros row by row contiguously
§ Example: N = 7, NNZ = 19
§ 3 arrays: nzval (nonzero values), colind (column indices), rowptr (start of each row)
– For the 7x7 example: rowptr = {1, 3, 5, 8, 11, 13, 17, 20}
§ Storage: NNZ reals, NNZ + N + 1 integers
[Figure: 7x7 example matrix with entries a, b, c, ..., l and its nzval / colind / rowptr arrays]
Many other data structures: "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", R. Barrett et al.
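To make the three arrays concrete, here is a minimal CRS sketch in C (1-based indices, as on the slide) for a small hypothetical 4x4 matrix rather than the 7x7 example in the figure, together with the sparse matrix-vector product that iterative methods rely on:

#include <stdio.h>

int main(void)
{
    /* Hypothetical 4x4 example:
     *   [ 1 0 2 0 ]
     *   [ 0 3 0 0 ]
     *   [ 4 0 5 6 ]
     *   [ 0 7 0 8 ]
     */
    double nzval[]  = {1, 2, 3, 4, 5, 6, 7, 8};  /* nonzeros, packed row by row   */
    int    colind[] = {1, 3, 2, 1, 3, 4, 2, 4};  /* column index of each nonzero  */
    int    rowptr[] = {1, 3, 4, 7, 9};           /* where each row starts (+ end) */
    int    n = 4;

    /* y = A*x : the key kernel of iterative methods */
    double x[] = {1, 1, 1, 1}, y[4];
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int k = rowptr[i] - 1; k < rowptr[i + 1] - 1; k++)
            y[i] += nzval[k] * x[colind[k] - 1];
    }
    for (int i = 0; i < n; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

Storage here is NNZ = 8 reals and NNZ + N + 1 = 13 integers, matching the count above.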

Distributed input interface
§ Matrices involved:
– A, B (turned into X) – input; users manipulate them
– L, U – output; users do not need to see them
§ A (sparse) and B (dense) are distributed by block rows across processes P0, P1, P2, ...
§ The local part of A is stored in compressed row format
[Figure: block-row distribution of A and B over processes P0, P1, P2]

Distributed input interface
§ Each process has a structure to store its local part of A (distributed compressed row storage):

typedef struct {
    int_t  nnz_loc;  /* number of nonzeros in the local submatrix                      */
    int_t  m_loc;    /* number of rows local to this processor                         */
    int_t  fst_row;  /* global index of the first local row                            */
    void  *nzval;    /* pointer to array of nonzero values, packed by row              */
    int_t *colind;   /* pointer to array of column indices of the nonzeros             */
    int_t *rowptr;   /* pointer to array of beginnings of rows in nzval[] and colind[] */
} NRformat_loc;

Distributed Compressed Row Storage (example: SuperLU_DIST/FORTRAN/f_5x5.f90)
A 5x5 matrix with entries s, u, u, l, u, l, p, e, u, l, l, r is distributed on 2 processors: P0 owns rows 0-1, P1 owns rows 2-4.
§ Processor P0 data structure:
– nnz_loc = 5
– m_loc = 2
– fst_row = 0   // 0-based indexing
– nzval  = { s, u, u, l, u }
– colind = { 0, 2, 4, 0, 1 }
– rowptr = { 0, 3, 5 }
§ Processor P1 data structure:
– nnz_loc = 7
– m_loc = 3
– fst_row = 2   // 0-based indexing
– nzval  = { l, p, e, u, l, l, r }
– colind = { 1, 2, 3, 4, 0, 1, 4 }
– rowptr = { 0, 2, 4, 7 }
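In SuperLU_DIST these local arrays are handed to the library by wrapping them in a SuperMatrix of storage type SLU_NR_loc. Below is a hedged sketch for P0's piece; the numeric values merely stand in for the symbolic entries s, u, l, ..., and dCreate_CompRowLoc_Matrix_dist, doubleMalloc_dist, intMalloc_dist are the library's creation and allocation helpers:

#include "superlu_ddefs.h"   /* SuperLU_DIST double-precision interface */

/* Wrap P0's local rows (fst_row = 0, m_loc = 2) of the global 5x5 matrix. */
void build_local_A(SuperMatrix *A)
{
    int_t m = 5, n = 5, nnz_loc = 5, m_loc = 2, fst_row = 0;

    double *nzval  = doubleMalloc_dist(nnz_loc);
    int_t  *colind = intMalloc_dist(nnz_loc);
    int_t  *rowptr = intMalloc_dist(m_loc + 1);

    /* Placeholder numbers for the symbolic entries { s, u, u, l, u }. */
    double vals[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    int_t  cols[] = {0, 2, 4, 0, 1};
    int_t  rptr[] = {0, 3, 5};
    for (int_t i = 0; i < nnz_loc; i++) { nzval[i] = vals[i]; colind[i] = cols[i]; }
    for (int_t i = 0; i <= m_loc; i++)  rowptr[i] = rptr[i];

    /* The SuperMatrix takes ownership of the three arrays. */
    dCreate_CompRowLoc_Matrix_dist(A, m, n, nnz_loc, m_loc, fst_row,
                                   nzval, colind, rowptr,
                                   SLU_NR_loc, SLU_D, SLU_GE);
}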

Algorithms: review of Gaussian Elimination (GE)
§ First step of GE:

  A = [ α  w^T ]  =  [ 1    0 ] [ α  w^T ]
      [ v  B   ]     [ v/α  I ] [ 0  C   ] ,   where  C = B - v·w^T/α

§ Repeat GE on C
§ Result is an LU factorization, A = LU
– L lower triangular with unit diagonal, U upper triangular
§ Then x is obtained by solving two triangular systems with L and U, which are easier to solve
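As a concrete illustration of this recursion, here is a small, unpivoted, dense right-looking LU sketch in C (for clarity only; sparse codes operate on nonzeros and add pivoting, as discussed next):

#include <stdio.h>

#define N 3

/* In-place LU without pivoting: afterwards the upper triangle of a holds U
 * and the strict lower triangle holds L (unit diagonal implied). Each step k
 * forms the multipliers v/α and applies the rank-1 update C = B - v*w^T/α. */
void lu_nopivot(double a[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                 /* multipliers: v / α      */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];   /* Schur complement update */
        }
}

int main(void)
{
    double a[N][N] = {{4, 2, 1}, {2, 5, 3}, {1, 3, 6}};
    lu_nopivot(a);
    for (int i = 0; i < N; i++)
        printf("%8.4f %8.4f %8.4f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}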

Fill-in in sparse LU
[Figure: nonzero pattern of the L and U factors of a sparse matrix; fill-in appears at positions that are zero in A]

Direct solver solution phases
1. Preprocessing: reorder equations to minimize fill and maximize parallelism (~10% of time)
– Sparsity structure of L & U depends on A, which can be changed by row/column permutations (vertex relabeling of the underlying graph)
– Ordering: combinatorial algorithms; NP-complete to find the optimum [Yannakakis '83], so heuristics are used
2. Preprocessing: predict the fill-in positions in L & U (~10% of time)
– Symbolic factorization (combinatorial algorithms)
3. Preprocessing: design efficient data structures for quick retrieval of the nonzeros
– Compressed storage schemes
4. Perform factorization and triangular solutions (~80% of time)
– Numerical algorithms (floating-point operations only on nonzeros)
– Usually dominate the total runtime
For sparse Cholesky and QR, the steps can be separate. For sparse LU with pivoting, steps 2 and 4 must be interleaved.
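In SuperLU_DIST all four phases are driven by one expert routine, pdgssvx. The outline below is a hedged sketch of the call sequence on an already created process grid (grid setup is shown later); the structure and routine names follow recent releases of the double-precision distributed interface, and some initializer and cleanup names differ slightly across versions:

#include "superlu_ddefs.h"

/* Sketch: solve A*x = b. A is the distributed SLU_NR_loc matrix built earlier;
 * b holds the local rows of the right-hand side and is overwritten by the
 * local rows of the solution. */
void solve_sketch(SuperMatrix *A, double *b, int ldb, int m, int n,
                  gridinfo_t *grid)
{
    superlu_dist_options_t options;
    dScalePermstruct_t ScalePermstruct;
    dLUstruct_t LUstruct;
    dSOLVEstruct_t SOLVEstruct;
    SuperLUStat_t stat;
    int nrhs = 1, info;
    double berr[1];                        /* one backward error per right-hand side */

    set_default_options_dist(&options);    /* phase-1 choices: ordering, pivoting, ... */
    dScalePermstructInit(m, n, &ScalePermstruct);
    dLUstructInit(n, &LUstruct);
    PStatInit(&stat);

    /* One call runs reordering, symbolic factorization, numeric factorization,
     * and the triangular solves (phases 1-4 above). */
    pdgssvx(&options, A, &ScalePermstruct, b, ldb, nrhs,
            grid, &LUstruct, &SOLVEstruct, berr, &stat, &info);

    PStatPrint(&options, &stat, grid);     /* per-phase timings and flop counts */
    PStatFree(&stat);
    /* ... free ScalePermstruct, LUstruct, SOLVEstruct when no longer needed ... */
}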

Numerical pivoting for stability
§ Goal of pivoting is to control element growth in L & U for stability
– For sparse factorizations, the pivoting rule is often relaxed to trade stability for better sparsity and parallelism (e.g., threshold pivoting, static pivoting, ...)
§ Partial pivoting: used in dense LU, sequential SuperLU and SuperLU_MT (GEPP)
– Can force diagonal pivoting (controlled by a diagonal threshold)
– Hard to implement scalably for sparse factorization
§ Relaxed pivoting strategies: restricted pivoting, static pivoting
§ Static pivoting: used in SuperLU_DIST (GESP)
– Before factoring, scale and permute A to maximize the diagonal: Pr Dr A Dc → A'
– During factorization A' = LU, replace tiny pivots by ε·||A||, without changing the data structures for L & U
– If needed, use a few steps of iterative refinement after the first solution
– Quite stable in practice
[Figure: sparsity patterns illustrating restricted pivoting]
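For reference, a hedged sketch of the option settings that correspond to this GESP strategy; the field and enum names are taken from recent SuperLU_DIST releases and should be checked against the superlu_defs.h of your installation:

#include "superlu_ddefs.h"

/* Static-pivoting (GESP) configuration as described above. */
void configure_static_pivoting(superlu_dist_options_t *options)
{
    set_default_options_dist(options);
    options->Equil            = YES;             /* row/column scaling: Dr, Dc          */
    options->RowPerm          = LargeDiag_MC64;  /* permute large entries to diagonal   */
    options->ReplaceTinyPivot = YES;             /* replace tiny pivots by eps*||A||    */
    options->IterRefine       = SLU_DOUBLE;      /* a few steps of iterative refinement */
}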

Can we reduce fill? -- various ordering algorithms
Reordering (= permutation of equations and variables) changes the amount of fill.
Classic example: a 5x5 "arrow" matrix whose first row and first column are dense.
– With the dense row/column ordered first, the matrix fills in completely after elimination.
– With the ordering reversed (dense row/column last), elimination produces no fill at all.
[Figure: the 5x5 arrow matrix under both orderings]

Ordering to preserve sparsity: Minimum Degree
Local greedy strategy: repeatedly eliminate a vertex of minimum degree, which minimizes an upper bound on the fill-in created at that step.
[Figure: eliminating vertex 1 connects its neighbors i, j, k, l into a clique, shown both in the matrix and in the graph]

Ordering to preserve sparsity: Nested Dissection
§ Model problem: discretized system Ax = b from certain PDEs, e.g., a 5-point stencil on a k x k grid, N = k^2
– Factorization flops: O(k^3) = O(N^{3/2})
§ Theorem: ND ordering gives optimal complexity in exact arithmetic [George '73, Hoffman/Martin/Rose]
[Figure: grid geometry, reordered matrix, and separator tree]
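The O(N^{3/2}) count follows from a standard recurrence argument, sketched here in LaTeX (this derivation is not spelled out on the slide):

% Eliminating the top-level separator of O(k) vertices is a dense
% factorization costing O(k^3); the grid then splits into four
% subgrids of side k/2:
\[
  F(k) = 4\,F(k/2) + O(k^3)
  \;\Longrightarrow\;
  F(k) = O(k^3) = O\bigl(N^{3/2}\bigr),
\]
% since the root's O(k^3) work dominates the geometric sum
% 4(k/2)^3 + 16(k/4)^3 + \dots = k^3(1/2 + 1/4 + \dots).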

ND Ordering
§ Generalized nested dissection [Lipton/Rose/Tarjan '79]
– Global graph partitioning: top-down, divide-and-conquer
– Best for large problems
– Parallel codes available: ParMetis, PT-Scotch
§ First level: a separator S splits the graph into parts A and B, giving the reordered block structure

  [ A  0  x ]
  [ 0  B  x ]
  [ x  x  S ]

§ Recurse on A and B
§ Goal: find the smallest possible separator S at each level
– Multilevel schemes: Chaco [Hendrickson/Leland '94], Metis [Karypis/Kumar '95]
– Spectral bisection [Simon et al. '90-'95, Ghysels et al. 2019-]
– Geometric and spectral bisection [Chan/Gilbert/Teng '94]

ND Ordering
[Figure: for a 2D mesh, the nonzero pattern of A and of the L & U factors, with ND ordering vs. row-wise (natural) ordering]

Ordering for LU with non-symmetric patterns
§ Can use a symmetric ordering on a symmetrized matrix
§ Case of partial pivoting (serial SuperLU, SuperLU_MT):
– Use ordering based on A^T A
§ Case of static pivoting (SuperLU_DIST):
– Use ordering based on A^T + A
§ Can find a better ordering based solely on A, without symmetrization
– Diagonal Markowitz [Amestoy/Li/Ng 06]: similar to minimum degree, but without symmetrization
– Hypergraph partitioning [Boman, Grigori, et al. 08]: similar to ND on A^T A, but no need to compute A^T A

User-controllable options in SuperLU_DIST
For stability and efficiency, the library factorizes a transformed matrix:
  Pc ( Pr (Dr A Dc) ) Pc^T
"options" fields with C enum constants:
§ Equil: {NO, YES}
§ RowPerm: {NOROWPERM, LargeDiag_MC64, LargeDiag_AWPM, MY_PERMR}
§ ColPerm: {NATURAL, MMD_ATA, MMD_AT_PLUS_A, COLAMD, METIS_AT_PLUS_A, PARMETIS, ZOLTAN, MY_PERMC}
Call routine set_default_options_dist(&options) to set default values.
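For example, a hedged sketch of overriding the defaults with the enum constants above (the particular choices are illustrative, not recommendations):

#include "superlu_ddefs.h"

/* Start from the library defaults, then select permutations explicitly. */
void choose_orderings(superlu_dist_options_t *options)
{
    set_default_options_dist(options);
    options->RowPerm = LargeDiag_AWPM;    /* Pr: approximate-weight perfect matching    */
    options->ColPerm = METIS_AT_PLUS_A;   /* Pc: serial METIS on the pattern of A^T + A */
    /* For very large problems a parallel ordering may scale better:
     * options->ColPerm = PARMETIS; */
}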

Algorithm variants, codes ... depending on matrix properties
Two families: supernodal (updates applied in-place) and multifrontal (partial updates floating around).
§ Symmetric; pos. def. (Cholesky LL') or indefinite (LDL'): supernodal – symPACK (DAG); multifrontal – MUMPS (tree)
§ Symmetric pattern, but non-symmetric values: multifrontal – MUMPS (tree), STRUMPACK (binary tree)
§ Non-symmetric everything: supernodal – SuperLU (DAG); multifrontal – UMFPACK (DAG)
§ Remark: SuperLU, MUMPS, UMFPACK can use any sparsity-reducing ordering
§ STRUMPACK can only use nested dissection (restricted to a binary tree)
§ Survey of sparse direct solvers (codes, algorithms, parallelism): .../superlu/SparseDirectSurvey.pdf

Sparse LU: two algorithm variants, depending on how updates are accumulated
§ Supernodal (DAG based): SuperLU
– Updates are applied to the destination one at a time: S(j) ← (( S(j) - D(k1) ) - D(k2) ) - ...
§ Multifrontal (tree based): STRUMPACK, MUMPS
– Partial updates are accumulated and passed up the tree before being applied: S(j) ← S(j) - ( ...( D(k1) + D(k2) ) + ... )
[Figure: the L & U blocks and the elimination DAG / tree for each variant]

Supernode
Exploit dense submatrices in the factors:
§ Can use Level 3 BLAS
§ Reduce inefficient indirect addressing (scatter/gather)
§ Reduce graph traversal time using a coarser graph

Distributed L & U factored matrices (internal to SuperLU)
§ 2D block-cyclic layout over an MPI process grid – specified by the user
§ Rule: the process grid should be as square as possible. Or, set the row dimension (nprow) slightly smaller than the column dimension (npcol), e.g., 2x3, 2x4, 4x4, 4x8
[Figure: 2x3 MPI process grid, block-cyclic mapping of the L & U blocks, look-ahead window]
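A hedged sketch of creating such a grid with superlu_gridinit (the 2x3 shape is just an example for a 6-rank run):

#include <stdio.h>
#include <mpi.h>
#include "superlu_ddefs.h"

/* Create a 2 x 3 process grid; nprow <= npcol, as recommended above. */
int main(int argc, char *argv[])
{
    gridinfo_t grid;
    int nprow = 2, npcol = 3;

    MPI_Init(&argc, &argv);
    superlu_gridinit(MPI_COMM_WORLD, nprow, npcol, &grid);

    if (grid.iam == 0)
        printf("Using a %d x %d process grid\n", nprow, npcol);

    /* ... build the distributed matrix and call pdgssvx here ... */

    superlu_gridexit(&grid);
    MPI_Finalize();
    return 0;
}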

Distributed separator-tree-based parallelism (internal to STRUMPACK)
§ Supernode = separator = frontal matrix
§ Map each sub-tree to a sub-process grid, proportional to its estimated work
§ ScaLAPACK 2D block-cyclic layout at each node
§ Multi-threaded ScaLAPACK through a system multithreaded BLAS
§ Allow idle processes for better communication
– e.g., a 2x3 process grid is better than 1x7

Comparison of LU factorization time from 3 direct solvers: STRUMPACK, SuperLU_DIST, MUMPS
§ Pure MPI on 8 nodes of Intel Ivy Bridge, 192 cores (2x12 cores / node), NERSC Edison
§ METIS ordering
[Figure: bar chart of factorization times for the three solvers over a set of test matrices]

SuperLU_DIST recent improvements
§ SpLU (factorization):
– GPU: 2D algorithm (baseline) → GPU off-load (master), ~3x (3.5x @ 4096 Titan nodes)
– Communication avoiding & hiding: 3D comm-avoiding algorithm (Version 7), 27x @ 32,000 cores
§ SpTRSV (triangular solve):
– GPU: 2D algorithm (baseline) → GPU trisolve (gpu_trisolve), 8.5x @ 1 Summit GPU
– Communication avoiding & hiding: 3D comm-avoiding, 7x @ 12,000 cores; 1-sided MPI (trisolve-fompi), 2.4x @ 12,000 KNL cores

Tips for Debugging Performance
§ Check the sparsity ordering
§ Diagonal pivoting is preferable
– e.g., when the matrix is diagonally dominant, ...
§ Need a good BLAS library (vendor, OpenBLAS, ATLAS)
§ May need to adjust the block size for each architecture (parameters modifiable in routine sp_ienv())
– Larger blocks are better for a uniprocessor
– Smaller blocks are better for parallelism and load balance
§ GPTune: ML algorithms for selecting the best parameters: https://github.com/gptune/GPTune/

Algorithm complexity (in a big-O sense)
§ Dense LU: O(N^3)
§ Model PDEs on a regular mesh with nested dissection ordering:
– Exact sparse LU, 2D problems (N = k^2): factor flops N^{3/2}, solve flops N log(N), memory N log(N)
– Exact sparse LU, 3D problems (N = k^3): factor flops N^2, solve flops N^{4/3}, memory N^{4/3}
– STRUMPACK with low-rank compression, 2D: factor flops N, solve flops N, memory N
– STRUMPACK with low-rank compression, 3D: factor flops N^α polylog(N) (α ≤ 2), solve flops N log(N), memory N log(N)

Software summary§ SuperLU: conventional direct solver for general unsymmetric linear systems.(X.S. Li, J. Demmel, J. Gilbert, L. Grigori, Y. Liu, P. Sao
