Fault Tolerant Programming Abstractions and Failure Recovery Models for MPI Applications


Fault Tolerant Programming Abstractions and Failure Recovery Models for MPI Applications
Ignacio Laguna, Center for Applied Scientific Computing
Salishan Conference on High-Speed Computing, Apr 27-30, 2015
LLNL-PRES-670002. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

MPI IS WIDELY USED, AND WILL CONTINUE TO BE
We use MPI workloads to design future machines:
- 75% of CORAL tier-1 benchmarks use MPI. (CORAL is the recent DOE procurement to deliver next-generation (petaflops) supercomputers.)
- MPI is widely cited: 46,600 hits are returned by Google Scholar for the term "message passing interface".
- Many implementations are available: C/C++, Java, Matlab, Python, R, ...
- MPI+X will remain a common programming model.

MOST NODE/PROCESS FAILURES SHOW UP IN MPI
MPI is the dominant "glue" for HPC applications.
[Figure: nodes and processes connected through MPI; a failure in any node or process surfaces in MPI communication]
Examples:
- Application error (bug)
- Hardware error (soft error)

MPI DOES NOT PROVIDE FAULT TOLERANCE
Failures are not an option in MPI. From the MPI standard:
- "...after an error is detected, the state of MPI is undefined"
- "MPI itself provides no mechanisms for handling processor failures."
MPI doesn't provide guarantees about failure detection and/or notification. The resource manager kills the job (by default).
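For context: the default error handler on communicators is MPI_ERRORS_ARE_FATAL, which aborts the job on the first error. A minimal sketch, using only standard MPI, of asking the library to return error codes instead; even then, the standard leaves MPI's state undefined after a failure, so there is no portable way to actually recover:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Default is MPI_ERRORS_ARE_FATAL: any error aborts the job.
           MPI_ERRORS_RETURN makes calls return an error code instead,
           but MPI's state is still undefined after a failure. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI_Barrier failed: %s\n", msg);
            /* No portable recovery here; typically checkpoint and abort. */
        }

        MPI_Finalize();
        return 0;
    }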

WHY INVEST IN FAULT TOLERANCE IN MPI?
1. MPI will continue to be used.
2. It is a natural layer in which to detect failures.
3. There are no resilience abstractions in the standard.
Solution?

PUZZLE PIECES OF THE PROBLEM
Roadmap of the talk:
1. Problem Description: why adding FT to MPI is difficult; challenges & areas of concern
2. Approaches: current solutions to the problem; proposals in the MPI Forum
3. Experimental Evaluation: modeling & simulation; early evaluation results
4. Lessons Learned: where do we go from here?; summary

FIXING A FAILED MPI RANK TRANSPARENTLY IS HARD
The devil is in the details. The ideal fault-tolerance strategy would transparently replace a failed process. Some implementation questions/considerations:
1. How to bring a new MPI process up to date?
2. How to handle in-transit messages and operations?
3. Where to re-inject control in the application?
This is difficult to implement correctly and efficiently in MPI libraries.

MOST CODES ASSUME NO ERROR CHECKING
Reasoning about error propagation in a complex code is hard.

Ideal world:
    for (...) { err = MPI_Isend(...); if (err) recover(); }
    for (...) { err = MPI_Irecv(...); if (err) recover(); }
    err = MPI_Waitall(...); if (err) recover();
    err = MPI_Barrier(...);  if (err) recover();

Real world:
    for (...) MPI_Isend(...);
    for (...) MPI_Irecv(...);
    MPI_Waitall(...);
    MPI_Barrier(...);

MPI programs don't check for errors. Fault detection that relies on error codes would be hard to use; most codes will recover from failures via checkpoint/restart.
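To make the contrast concrete, here is a self-contained sketch of the "ideal world" pattern above. The exchange() function, its peer list, and the recover() stub are illustrative assumptions standing in for application logic:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical recovery hook; real codes would roll back to a checkpoint. */
    static void recover(void) { MPI_Abort(MPI_COMM_WORLD, 1); }

    /* Error-checked exchange sketch: every MPI call's return code is
       tested, which is what almost no production MPI code actually does. */
    void exchange(double *sendbuf, double *recvbuf, int n,
                  const int *peers, int npeers) {
        MPI_Request *reqs = malloc(2 * npeers * sizeof(MPI_Request));
        int err, i;

        for (i = 0; i < npeers; i++) {
            err = MPI_Isend(&sendbuf[i * n], n, MPI_DOUBLE, peers[i], 0,
                            MPI_COMM_WORLD, &reqs[i]);
            if (err != MPI_SUCCESS) recover();
        }
        for (i = 0; i < npeers; i++) {
            err = MPI_Irecv(&recvbuf[i * n], n, MPI_DOUBLE, peers[i], 0,
                            MPI_COMM_WORLD, &reqs[npeers + i]);
            if (err != MPI_SUCCESS) recover();
        }
        err = MPI_Waitall(2 * npeers, reqs, MPI_STATUSES_IGNORE);
        if (err != MPI_SUCCESS) recover();
        err = MPI_Barrier(MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) recover();
        free(reqs);
    }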

OPEN CHALLENGES AND QUESTIONS
- What failures should the MPI standard consider? Node/process failures? Communication errors? Silent errors?
- Should the application continue executing after a failure? How? (Forward vs. backward recovery)
- Can we design fault-tolerant APIs that don't require many code changes?
- Should fault tolerance be provided as a library?

[Roadmap slide repeated; the talk now moves to part 2: Approaches]

POSSIBLE SOLUTIONS TO THE PROBLEM
Resilient programming abstractions for MPI:
1. ULFM: user-level failure mitigation (local, shrinking recovery strategy)
2. Reinit interface (global, non-shrinking recovery strategy)
3. Fault-tolerant libraries, e.g., Local Failure Local Recovery (LFLR)
4. ?

ULFM: USER-LEVEL FAILURE MITIGATION
Current proposal for MPI 4.0; a shrinking recovery strategy.
- Shrinking recovery: the available resources after a failure are shrunk or reduced.
- Focus on process failures: communication that involves a failed process would fail.
- New error codes: MPI_ERR_PROC_FAILED
- New MPI calls: MPI_COMM_REVOKE, MPI_COMM_SHRINK, MPI_COMM_AGREE, MPI_COMM_FAILURE_ACK
- Communicators can be revoked, which enables fault propagation.
- Communicators can be shrunk: code must create new communicators with fewer processes.
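A minimal sketch of the shrinking-recovery idiom these calls enable, written against the MPIX_-prefixed names of the ULFM prototype in Open MPI (the proposal's MPI_Comm_revoke/shrink); the wrapper function and its recovery policy are illustrative assumptions:

    #include <mpi.h>
    #include <mpi-ext.h>   /* MPIX_* extensions in the ULFM Open MPI prototype */

    /* Sketch of shrinking recovery around one collective. Assumes the
       communicator's error handler has been set to MPI_ERRORS_RETURN. */
    int allreduce_with_recovery(double *in, double *out, MPI_Comm *comm) {
        int rc = MPI_Allreduce(in, out, 1, MPI_DOUBLE, MPI_SUM, *comm);
        if (rc == MPI_SUCCESS)
            return MPI_SUCCESS;

        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass != MPIX_ERR_PROC_FAILED && eclass != MPIX_ERR_REVOKED)
            return rc;  /* not a process failure; let the caller decide */

        /* Propagate the failure: make every rank's pending operations on
           this communicator fail, so all ranks reach the shrink together. */
        MPIX_Comm_revoke(*comm);

        /* Build a replacement communicator that excludes the dead ranks.
           (This sketch leaks the old, revoked communicator.) */
        MPI_Comm shrunk;
        MPIX_Comm_shrink(*comm, &shrunk);
        *comm = shrunk;

        /* The application must now rebalance its work onto fewer ranks,
           typically after rolling back to a consistent state. */
        return rc;
    }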

PROS AND CONS OF ULFM
Works well for master-slave codes:
- Only a few processes need to know of a failure; some may roll back (local recovery).
Difficult to use in bulk synchronous codes:
- All processes need to know of failures (global recovery); everyone must roll back.
- Codes must roll back to a previous checkpoint.
Most codes cannot handle shrinking recovery:
- They cannot re-decompose the problem onto fewer processes.
- It requires load balancing.

DELAYED DETECTION IS DIFFICULT TO USE FOR ALGORITHMS THAT USE NON-BLOCKING OPERATIONS
Data exchange pattern:

    for (i = 0; i < nsends; i++) {
        /* computation */
        MPI_Isend(...);      /* failure here? */
    }
    for (i = 0; i < nrecvs; i++) {
        /* computation */
        MPI_Irecv(...);      /* failure here? */
    }
    MPI_Waitall(...);        /* failure here? delayed detection? */
    /* computation */
    MPI_Barrier(...);

Where in the loop do we re-inject control? With ULFM, faults are "eventually" delivered to the application. Global recovery avoids this issue: all processes roll back to a known safe state.

REINIT INTERFACE
A global, non-shrinking recovery strategy.

    MPI_Init();
    MPI_Reinit();            /* restart point */
    MPI_Error_handlers();    /* stack of error handlers: 1, 2, 3 */
    for (...) MPI_Isend(...);
    for (...) MPI_Irecv(...);
    MPI_Waitall(...);
    MPI_Barrier(...);
    ...
    MPI_Finalize();

The MPI library performs failure detection and failure notification. The code specifies cleanup functions (a stack of error handlers), which emulates exception handling.
Advantages: the job is not killed; faster checkpoint/restart.
Disadvantages: difficult to clean up the state of multithreaded code (OpenMP); won't work if the application's initialization takes too much time.
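Reinit was a research proposal, so exact signatures varied across prototypes. The following is a hypothetical sketch of the intended usage pattern only: the MPI_Reinit signature, the resilient_main callback, and the application stubs (load_checkpoint, timestep, write_checkpoint, done) are all illustrative assumptions, not a standardized interface:

    #include <mpi.h>

    /* Illustrative application hooks (stubs for this sketch). */
    static void load_checkpoint(void)  { /* read latest checkpoint */ }
    static void timestep(void)         { /* Isend/Irecv/Waitall/Barrier */ }
    static void write_checkpoint(void) { /* save state */ }
    static int  done(void)             { return 1; }

    /* Hypothetical Reinit-style entry point: on a process/node failure,
       the MPI library cleans up, replaces failed ranks, and re-enters
       this function on every rank instead of killing the job. */
    static int resilient_main(int argc, char **argv) {
        load_checkpoint();        /* always roll back to last saved state */
        while (!done()) {
            timestep();
            write_checkpoint();
        }
        return 0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        /* Assumed signature in the spirit of the proposal; not standard MPI. */
        MPI_Reinit(argc, argv, resilient_main);
        MPI_Finalize();
        return 0;
    }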

FAULT TOLERANT LIBRARIES
Approach: use ULFM's functionality to provide fault tolerance as a library.
Example: Local Failure Local Recovery (LFLR).
[Figure: ranks 0..N run; a spare rank N+1 waits in standby and joins in place of a rank that suffers a fault]
Reference: Keita Teranishi and Michael A. Heroux, "Toward Local Failure Local Recovery Resilience Model using MPI-ULFM", EuroMPI/ASIA '14.
Advantages: handles fault tolerance transparently.
Disadvantages: applications cannot use other tools/libraries; inherits any performance issues and/or bottlenecks from ULFM.

POSSIBLE SOLUTIONS TO THE PROBLEM
Resilient programming abstractions for MPI:
1. ULFM: user-level failure mitigation (local, shrinking recovery strategy)
2. Reinit interface (global, non-shrinking recovery strategy)
3. Fault-tolerant libraries, e.g., Local Failure Local Recovery (LFLR)
4. Don't integrate fault tolerance into MPI? Rely on checkpoint/restart.

[Roadmap slide repeated; the talk now moves to part 3: Experimental Evaluation]

TESTBED APPLICATION: ddcMD
A scalable molecular dynamics application (not a proxy/mini/benchmark code).
- The problem can be decomposed onto any number of processes.
- Includes load balancing.
- Uses a few communicators, which simplifies implementing shrinking recovery: we have to shrink only one communicator (MPI_COMM_SHRINK).

ELIMINATING A PROCESS FROM A COMMUNICATOR TAKES TOO MUCH TIME
[Figure: time (sec, 0-12) to shrink MPI_COMM_WORLD when a process fails, versus number of MPI processes (0-300). Open MPI 1.7, Sierra cluster at LLNL (InfiniBand)]

SHRINKING RECOVERY IS ONLY USEFUL IN SOME CASES
Most codes will use non-shrinking recovery at large scale.
[Figure: penalty factor (log scale, 0.1-10) versus mean time between failures (0-40 hours), for shrinking vs. non-shrinking recovery]
Shrinking recovery only works when:
- The application can balance loads quickly after failures.
- The system experiences high failure rates.
- The application can re-decompose the problem onto fewer processes/nodes.
Most codes/systems don't have these capabilities.

REINIT PERFORMANCE MEASUREMENTS ARE PROMISING
Recovery time is reduced compared to traditional job restarts.
- Prototype Reinit implemented in Open MPI; tests on a Cray XC30 system (BTL network).
- Applications: Lattice Boltzmann transport code (LBMv3) and molecular dynamics code (ddcMD).
[Figure: time (sec, 0-45) to recover from a failure using Reinit versus a standard job restart, at 64, 128, and 200 MPI processes]
Insight: with Reinit, we believe that data from recent checkpoints is likely cached in the filesystem buffers, since the job is not killed.

[Roadmap slide repeated; the talk now moves to part 4: Lessons Learned]

SOME LESSONS LEARNED
- The MPI community should carefully evaluate the pros and cons of the current fault-tolerance proposals.
- It is important to consider a broad range of applications; pay special attention to legacy scalable codes (e.g., BSP).
- Viewing the problem only from the system perspective doesn't work.
- We must design interfaces after consulting with several users.

FUTURE DIRECTIONS
How do we solve this problem?
1. Evaluate multiple resilient programming abstractions (other than ULFM and Reinit).
2. Test models on a broad range of applications.
3. Evaluate not only performance, but also programmability.
Only then do we propose modifications to the MPI standard.

ACKNOWLEDGMENTS
Smart people who contribute to this effort:
- Martin Schulz, LLNL
- David Richards, LLNL
- Bronis R. de Supinski, LLNL
- Kathryn Mohror, LLNL
- Todd Gamblin, LLNL
- Howard Pritchard, LANL
- Adam Moody, LLNL

Thank you!

ULFM IS SUITABLE ONLY FOR A SUBSET OF APPLICATIONS
It is hard to use ULFM in bulk synchronous codes.
[Figure: matrix of application classes (bulk synchronous, master-slave) against recovery models (shrinking vs. non-shrinking, local vs. global, backward vs. forward). Legend: ULFM = suitable for ULFM (easy to implement with few changes in the application); APP = the application can "naturally" support this model]
Reference: Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. de Supinski, "Evaluating User-Level Fault Tolerance for MPI Applications", EuroMPI/ASIA, Kyoto, Japan, Sep 9-12, 2014.

REINIT SUPPORTS BACKWARD RECOVERY
In contrast, the focus of ULFM is forward recovery.
- Backward recovery attempts to restart the application from a previously saved state. The Reinit interface restarts from a checkpoint and gets a "fresh" MPI state.
- Forward recovery attempts to find a new state from which the application can continue. ULFM fixes communicators and continues, attempting to "fix" MPI state.

