Fault Tolerant Programming Abstractions and Failure Recovery Models for MPI Applications


Fault Tolerant Programming Abstractions and Failure Recovery Models for MPI Applications
Ignacio Laguna, Center for Applied Scientific Computing
Salishan Conference on High-Speed Computing, Apr 27-30, 2015
LLNL-PRES-670002. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

MPI IS WIDELY USED, AND WILL CONTINUE TO BE
We use MPI workloads to design future machines:
- 75% of CORAL tier-1 benchmarks use MPI. (CORAL is the recent DOE procurement to deliver next-generation (petaflops) supercomputers.)
- MPI is widely cited: 46,600 hits are returned by Google Scholar for the term "message passing interface".
- Many implementations are available: C/C++, Java, Matlab, Python, R, ...
- MPI+X will remain a common programming model.

MOST NODE/PROCESS FAILURES SHOW UP IN MPI
MPI is the dominant "glue" for HPC applications.
[Figure: nodes and processes connected through MPI; a failure in any node or process surfaces in MPI communication]
Examples:
- Application error (bug)
- Hardware error (soft error)

MPI DOES NOT PROVIDE FAULT TOLERANCE
Failures are not an option in MPI. From the MPI standard:
- "...after an error is detected, the state of MPI is undefined"
- "MPI itself provides no mechanisms for handling processor failures."
MPI doesn't provide guarantees about failure detection and/or notification. The resource manager kills the job (by default).
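For context: the default error handler on communicators is MPI_ERRORS_ARE_FATAL, which aborts the job on the first error. A minimal sketch, using only standard MPI, of asking the library to return error codes instead; even then, the standard leaves MPI's state undefined after a failure, so there is no portable way to actually recover:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Default is MPI_ERRORS_ARE_FATAL: any error aborts the job.
           MPI_ERRORS_RETURN makes calls return an error code instead,
           but MPI's state is still undefined after a failure. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI_Barrier failed: %s\n", msg);
            /* No portable recovery here; typically checkpoint and abort. */
        }

        MPI_Finalize();
        return 0;
    }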

WHY INVEST IN FAULT TOLERANCE IN MPI?
1. MPI will continue to be used.
2. It is a natural layer in which to detect failures.
3. There are no resilience abstractions in the standard.
Solution?

PUZZLE PIECES OF THE PROBLEM
Roadmap of the talk:
1. Problem Description: why adding FT to MPI is difficult; challenges & areas of concern
2. Approaches: current solutions to the problem; proposals in the MPI Forum
3. Experimental Evaluation: modeling & simulation; early evaluation results
4. Lessons Learned: where do we go from here?; summary

FIXING A FAILED MPI RANK TRANSPARENTLY IS HARD
The devil is in the details. The ideal fault-tolerance strategy would transparently replace a failed process. Some implementation questions/considerations:
1. How to bring a new MPI process up to date?
2. How to handle in-transit messages and operations?
3. Where to re-inject control in the application?
This is difficult to implement correctly and efficiently in MPI libraries.

MOST CODES ASSUME NO ERROR CHECKING
Reasoning about error propagation in a complex code is hard.

Ideal world:
    for (...) { err = MPI_Isend(...); if (err) recover(); }
    for (...) { err = MPI_Irecv(...); if (err) recover(); }
    err = MPI_Waitall(...); if (err) recover();
    err = MPI_Barrier(...);  if (err) recover();

Real world:
    for (...) MPI_Isend(...);
    for (...) MPI_Irecv(...);
    MPI_Waitall(...);
    MPI_Barrier(...);

MPI programs don't check for errors. Fault detection that relies on error codes would be hard to use; most codes will recover from failures via checkpoint/restart.
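To make the contrast concrete, here is a self-contained sketch of the "ideal world" pattern above. The exchange() function, its peer list, and the recover() stub are illustrative assumptions standing in for application logic:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical recovery hook; real codes would roll back to a checkpoint. */
    static void recover(void) { MPI_Abort(MPI_COMM_WORLD, 1); }

    /* Error-checked exchange sketch: every MPI call's return code is
       tested, which is what almost no production MPI code actually does. */
    void exchange(double *sendbuf, double *recvbuf, int n,
                  const int *peers, int npeers) {
        MPI_Request *reqs = malloc(2 * npeers * sizeof(MPI_Request));
        int err, i;

        for (i = 0; i < npeers; i++) {
            err = MPI_Isend(&sendbuf[i * n], n, MPI_DOUBLE, peers[i], 0,
                            MPI_COMM_WORLD, &reqs[i]);
            if (err != MPI_SUCCESS) recover();
        }
        for (i = 0; i < npeers; i++) {
            err = MPI_Irecv(&recvbuf[i * n], n, MPI_DOUBLE, peers[i], 0,
                            MPI_COMM_WORLD, &reqs[npeers + i]);
            if (err != MPI_SUCCESS) recover();
        }
        err = MPI_Waitall(2 * npeers, reqs, MPI_STATUSES_IGNORE);
        if (err != MPI_SUCCESS) recover();
        err = MPI_Barrier(MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) recover();
        free(reqs);
    }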

OPEN CHALLENGES AND QUESTIONS
- What failures should the MPI standard consider? Node/process failures? Communication errors? Silent errors?
- Should the application continue executing after a failure? How? (Forward vs. backward recovery)
- Can we design fault-tolerant APIs that don't require many code changes?
- Should fault tolerance be provided as a library?

[Roadmap slide repeated; the talk now moves to part 2: Approaches]

POSSIBLE SOLUTIONS TO THE PROBLEM
Resilient programming abstractions for MPI:
1. ULFM: user-level failure mitigation (local, shrinking recovery strategy)
2. Reinit interface (global, non-shrinking recovery strategy)
3. Fault-tolerant libraries, e.g., Local Failure Local Recovery (LFLR)
4. ?

ULFM: USER-LEVEL FAILURE MITIGATION
Current proposal for MPI 4.0; a shrinking recovery strategy.
- Shrinking recovery: the available resources after a failure are shrunk or reduced.
- Focus on process failures: communication that involves a failed process would fail.
- New error codes: MPI_ERR_PROC_FAILED
- New MPI calls: MPI_COMM_REVOKE, MPI_COMM_SHRINK, MPI_COMM_AGREE, MPI_COMM_FAILURE_ACK
- Communicators can be revoked, which enables fault propagation.
- Communicators can be shrunk: code must create new communicators with fewer processes.
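A minimal sketch of the shrinking-recovery idiom these calls enable, written against the MPIX_-prefixed names of the ULFM prototype in Open MPI (the proposal's MPI_Comm_revoke/shrink); the wrapper function and its recovery policy are illustrative assumptions:

    #include <mpi.h>
    #include <mpi-ext.h>   /* MPIX_* extensions in the ULFM Open MPI prototype */

    /* Sketch of shrinking recovery around one collective. Assumes the
       communicator's error handler has been set to MPI_ERRORS_RETURN. */
    int allreduce_with_recovery(double *in, double *out, MPI_Comm *comm) {
        int rc = MPI_Allreduce(in, out, 1, MPI_DOUBLE, MPI_SUM, *comm);
        if (rc == MPI_SUCCESS)
            return MPI_SUCCESS;

        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass != MPIX_ERR_PROC_FAILED && eclass != MPIX_ERR_REVOKED)
            return rc;  /* not a process failure; let the caller decide */

        /* Propagate the failure: make every rank's pending operations on
           this communicator fail, so all ranks reach the shrink together. */
        MPIX_Comm_revoke(*comm);

        /* Build a replacement communicator that excludes the dead ranks.
           (This sketch leaks the old, revoked communicator.) */
        MPI_Comm shrunk;
        MPIX_Comm_shrink(*comm, &shrunk);
        *comm = shrunk;

        /* The application must now rebalance its work onto fewer ranks,
           typically after rolling back to a consistent state. */
        return rc;
    }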

PROS AND CONS OF ULFM
Works well for master-slave codes:
- Only a few processes need to know of a failure; some may roll back (local recovery).
Difficult to use in bulk synchronous codes:
- All processes need to know of failures (global recovery); everyone must roll back.
- Codes must roll back to a previous checkpoint.
Most codes cannot handle shrinking recovery:
- They cannot re-decompose the problem onto fewer processes.
- It requires load balancing.

DELAYED DETECTION IS DIFFICULT TO USE FOR ALGORITHMS THAT USE NON-BLOCKING OPERATIONS
Data exchange pattern:

    for (i = 0; i < nsends; i++) {
        /* computation */
        MPI_Isend(...);      /* failure here? */
    }
    for (i = 0; i < nrecvs; i++) {
        /* computation */
        MPI_Irecv(...);      /* failure here? */
    }
    MPI_Waitall(...);        /* failure here? delayed detection? */
    /* computation */
    MPI_Barrier(...);

Where in the loop do we re-inject control? With ULFM, faults are "eventually" delivered to the application. Global recovery avoids this issue: all processes roll back to a known safe state.

REINIT INTERFACE
A global, non-shrinking recovery strategy.

    MPI_Init();
    MPI_Reinit();            /* restart point */
    MPI_Error_handlers();    /* stack of error handlers: 1, 2, 3 */
    for (...) MPI_Isend(...);
    for (...) MPI_Irecv(...);
    MPI_Waitall(...);
    MPI_Barrier(...);
    ...
    MPI_Finalize();

The MPI library performs failure detection and failure notification. The code specifies cleanup functions (a stack of error handlers), which emulates exception handling.
Advantages: the job is not killed; faster checkpoint/restart.
Disadvantages: difficult to clean up the state of multithreaded code (OpenMP); won't work if the application's initialization takes too much time.
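Reinit was a research proposal, so exact signatures varied across prototypes. The following is a hypothetical sketch of the intended usage pattern only: the MPI_Reinit signature, the resilient_main callback, and the application stubs (load_checkpoint, timestep, write_checkpoint, done) are all illustrative assumptions, not a standardized interface:

    #include <mpi.h>

    /* Illustrative application hooks (stubs for this sketch). */
    static void load_checkpoint(void)  { /* read latest checkpoint */ }
    static void timestep(void)         { /* Isend/Irecv/Waitall/Barrier */ }
    static void write_checkpoint(void) { /* save state */ }
    static int  done(void)             { return 1; }

    /* Hypothetical Reinit-style entry point: on a process/node failure,
       the MPI library cleans up, replaces failed ranks, and re-enters
       this function on every rank instead of killing the job. */
    static int resilient_main(int argc, char **argv) {
        load_checkpoint();        /* always roll back to last saved state */
        while (!done()) {
            timestep();
            write_checkpoint();
        }
        return 0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        /* Assumed signature in the spirit of the proposal; not standard MPI. */
        MPI_Reinit(argc, argv, resilient_main);
        MPI_Finalize();
        return 0;
    }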

FAULT TOLERANT LIBRARIES
Approach: use ULFM's functionality to provide fault tolerance as a library.
Example: Local Failure Local Recovery (LFLR).
[Figure: ranks 0..N run; a spare rank N+1 waits in standby and joins in place of a rank that suffers a fault]
Reference: Keita Teranishi and Michael A. Heroux, "Toward Local Failure Local Recovery Resilience Model using MPI-ULFM", EuroMPI/ASIA '14.
Advantages: handles fault tolerance transparently.
Disadvantages: applications cannot use other tools/libraries; inherits any performance issues and/or bottlenecks from ULFM.

POSSIBLE SOLUTIONS TO THE PROBLEM
Resilient programming abstractions for MPI:
1. ULFM: user-level failure mitigation (local, shrinking recovery strategy)
2. Reinit interface (global, non-shrinking recovery strategy)
3. Fault-tolerant libraries, e.g., Local Failure Local Recovery (LFLR)
4. Don't integrate fault tolerance into MPI? Rely on checkpoint/restart.

[Roadmap slide repeated; the talk now moves to part 3: Experimental Evaluation]

TESTBED APPLICATION: ddcMD
A scalable molecular dynamics application (not a proxy/mini/benchmark code).
- The problem can be decomposed onto any number of processes.
- Includes load balancing.
- Uses a few communicators, which simplifies implementing shrinking recovery: we have to shrink only one communicator (MPI_COMM_SHRINK).

ELIMINATING A PROCESS FROM A COMMUNICATOR TAKES TOO MUCH TIME
[Figure: time (sec, 0-12) to shrink MPI_COMM_WORLD when a process fails, versus number of MPI processes (0-300). Open MPI 1.7, Sierra cluster at LLNL (InfiniBand)]

SHRINKING RECOVERY IS ONLY USEFUL IN SOME CASES
Most codes will use non-shrinking recovery at large scale.
[Figure: penalty factor (log scale, 0.1-10) versus mean time between failures (0-40 hours), for shrinking vs. non-shrinking recovery]
Shrinking recovery only works when:
- The application can balance loads quickly after failures.
- The system experiences high failure rates.
- The application can re-decompose the problem onto fewer processes/nodes.
Most codes/systems don't have these capabilities.

REINIT PERFORMANCE MEASUREMENTS ARE PROMISING
Recovery time is reduced compared to traditional job restarts.
- Prototype Reinit implemented in Open MPI; tests on a Cray XC30 system (BTL network).
- Applications: Lattice Boltzmann transport code (LBMv3) and molecular dynamics code (ddcMD).
[Figure: time (sec, 0-45) to recover from a failure using Reinit versus a standard job restart, at 64, 128, and 200 MPI processes]
Insight: with Reinit, we believe that data from recent checkpoints is likely cached in the filesystem buffers, since the job is not killed.

[Roadmap slide repeated; the talk now moves to part 4: Lessons Learned]

SOME LESSONS LEARNED
- The MPI community should carefully evaluate the pros and cons of the current fault-tolerance proposals.
- It is important to consider a broad range of applications; pay special attention to legacy scalable codes (e.g., BSP).
- Viewing the problem only from the system perspective doesn't work.
- We must design interfaces after consulting with several users.

FUTURE DIRECTIONS
How do we solve this problem?
1. Evaluate multiple resilient programming abstractions (other than ULFM and Reinit).
2. Test models on a broad range of applications.
3. Evaluate not only performance, but also programmability.
Only then do we propose modifications to the MPI standard.

ACKNOWLEDGMENTS
Smart people who contribute to this effort:
- Martin Schulz, LLNL
- David Richards, LLNL
- Bronis R. de Supinski, LLNL
- Kathryn Mohror, LLNL
- Todd Gamblin, LLNL
- Howard Pritchard, LANL
- Adam Moody, LLNL

Thank you!

ULFM IS SUITABLE ONLY FOR A SUBSET OF APPLICATIONS
It is hard to use ULFM in bulk synchronous codes.
[Figure: matrix of application classes (bulk synchronous, master-slave) against recovery models (shrinking vs. non-shrinking, local vs. global, backward vs. forward). Legend: ULFM = suitable for ULFM (easy to implement with few changes in the application); APP = the application can "naturally" support this model]
Reference: Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. de Supinski, "Evaluating User-Level Fault Tolerance for MPI Applications", EuroMPI/ASIA, Kyoto, Japan, Sep 9-12, 2014.

REINIT SUPPORTS BACKWARD RECOVERY
In contrast, the focus of ULFM is forward recovery.
- Backward recovery attempts to restart the application from a previously saved state. The Reinit interface restarts from a checkpoint and gets a "fresh" MPI state.
- Forward recovery attempts to find a new state from which the application can continue. ULFM fixes communicators and continues, attempting to "fix" MPI state.

