Designing For Understandability: The Raft Consensus Algorithm

2y ago
25 Views
2 Downloads
774.61 KB
29 Pages
Last View : 6d ago
Last Download : 3m ago
Upload by : Ronan Garica
Transcription

Designing for Understandability:the Raft Consensus AlgorithmDiego OngaroJohn OusterhoutStanford University

Algorithms Should Be Designed For ity!August 29, 2016The Raft Consensus AlgorithmSlide 2

Overview Consensus: Allows collection of machines to work as coherent group Continuous service, even if some machines fail Paxos has dominated discussion for 25 years Hard to understand Not complete enough for real implementations New consensus algorithm: Raft Primary design goal: understandability (intuition, ease of explanation) Complete foundation for implementation Different problem decomposition Results: User study shows Raft more understandable than Paxos Widespread adoptionAugust 29, 2016The Raft Consensus AlgorithmSlide 3

State Machine Responds to external stimuli Manages internal staterequest Examples: many storagesystems, services Clients resultMemcachedRAMCloudHDFS name node.August 29, 2016StateMachineThe Raft Consensus AlgorithmSlide 4

Replicated State MachineClientsz chineConsensusModuleStateMachineServersx 1Logy 3 x 4z xx 1Logy 3 x 4z xx 1Logy 3 x 4z x Replicated log ensures state machines execute same commands in same order Consensus module ensures proper log replication System makes progress as long as any majority of servers are up Failure model: delayed/lost messages, fail-stop (not Byzantine)August 29, 2016The Raft Consensus AlgorithmSlide 5

Paxos (Single Decree)ProposersAcceptorsChoose unique proposal #proposal # any previous?Majority? Select value forhighest proposal # returned;if none, choose own valueproposal # any previous?Majority? Value chosenAugust 29, 2016The Raft Consensus AlgorithmSlide 6

Paxos Problems Impenetrable: hard to develop intuitions Why does it work? What is the purpose of each phase? Incomplete Only agrees on single value Doesn’t address liveness Choosing proposal values? Cluster membership management? Inefficient Two rounds of messages to choose one value“The dirty little secret of the NSDIcommunity is that at most fivepeople really, truly understand everypart of Paxos :-)”— NSDI reviewer“There are significant gaps betweenthe description of the Paxosalgorithm and the needs of a realworld system . the final system willbe based on an unproven protocol”— Chubby authors No agreement on the detailsNot a good foundation for practical implementationsAugust 29, 2016The Raft Consensus AlgorithmSlide 7

Raft Challenge Is there a different consensus algorithm that’s easier tounderstand? Make design decisions based on understandability: Which approach is easier to explain? Techniques: Problem decomposition Minimize state space Handle multiple problems with a single mechanism Eliminate special cases Maximize coherence Minimize nondeterminismAugust 29, 2016The Raft Consensus AlgorithmSlide 8

Raft Decomposition1. Leader election: Select one server to act as leader Detect crashes, choose new leader2. Log replication (normal operation) Leader accepts commands from clients, appends to its log Leader replicates its log to other servers (overwrites inconsistencies)3. Safety Keep logs consistent Only servers with up-to-date logs can become leaderAugust 29, 2016The Raft Consensus AlgorithmSlide 9

Server States and RPCsstartFollowerdiscoverhighertermPassive (but expectsregular heartbeats)noheartbeatCandidateIssues RequestVote RPCsto get elected as leaderwinelectionLeaderAugust 29, 2016Issues AppendEntries RPCs: Replicate its log Heartbeats to maintain leadershipThe Raft Consensus AlgorithmSlide 10

TermsTerm 1Term 2Term 3Term 4Term 5timeElectionsNormalOperation At most 1 leader per termSplitVote Some terms have no leader (failed election) Each server maintains current term value (no global view) Exchanged in every RPC Peer has later term? Update term, revert to follower Incoming RPC has obsolete term? Reply with errorTerms identify obsolete informationAugust 29, 2016The Raft Consensus AlgorithmSlide 11

Leader ElectionBecome candidatetimeoutcurrentTerm ,vote for selfSend RequestVote RPCsto other serversvotes from majorityBecome leader,send heartbeatsAugust 29, 2016The Raft Consensus AlgorithmRPC from leaderBecomefollowerSlide 12

Election Correctness Safety: allow at most one winner per term Each server gives only one vote per term (persist on disk) Majority required to win electionB can’t alsoget majorityVoted forcandidate AServers Liveness: some candidate must eventually win Choose election timeouts randomly in [T, 2T] (e.g. 150-300 ms) One server usually times out and wins election before others time out Works well if T broadcast time Randomized approach simpler than rankingAugust 29, 2016The Raft Consensus AlgorithmSlide 13

Normal Operation Client sends command to leader Leader appends command to its log Leader sends AppendEntries RPCs to all followers Once new entry committed: Leader executes command in its state machine, returns result to client Leader notifies followers of committed entries in subsequent AppendEntries RPCs Followers execute committed commands in their state machines Crashed/slow followers? Leader retries AppendEntries RPCs until they succeed Optimal performance in common case: One successful RPC to any majority of serversAugust 29, 2016The Raft Consensus AlgorithmSlide 14

Log Structureterm123456789101112233333x 3 q 8 j 2 x q z 5 y 1 y 3 q j x 4 z 6commandlog indexleader for term 31112233x 3 q 8 j 2 x q z 5 y 1 y 31112233333x 3 q 8 j 2 x q z 5 y 1 y 3 q j x 4 z 611x 3 q 8followers11122333x 3 q 8 j 2 x q z 5 y 1 y 3 q jcommitted entries Must survive crashes (store on disk) Entry committed if safe to execute in state machines Replicated on majority of servers by leader of its termAugust 29, 2016The Raft Consensus AlgorithmSlide 15

Log InconsistenciesCrashes can result in log inconsistencies:123456789s111122333x 3 q 8 j 2 x q z 5 y 1 y 3 q js21112233x 3 q 8 j 2 x q z 5 y 1 y 3s3111223333x 3 q 8 j 2 x q z 5 y 1 y 3 q j x 4s411x 3 q 8s5111222222x 3 q 8 j 2 x q z 5 y 3 q j x 8 x 410log indexleader for term 4followersRaft minimizes special code for repairing inconsistencies: Leader assumes its log is correct Normal operation will repair all inconsistenciesAugust 29, 2016The Raft Consensus AlgorithmSlide 16

Log Matching PropertyGoal: high level of consistency between logs If log entries on different servers have same index and term: They store the same command The logs are identical in all preceding entries123456789101112233333x 3 q 8 j 2 x q z 5 y 1 y 3 q j x 4 z 611122344x 3 q 8 j 2 x q z 5 y 1 x z y 7 If a given entry is committed, all preceding entries are alsocommittedAugust 29, 2016The Raft Consensus AlgorithmSlide 17

AppendEntries Consistency Check AppendEntries RPCs include index, term of entry preceding new one(s) Follower must contain matching entry; otherwise it rejects request Leader retries with lower log index Implements an induction step, ensures Log Matching Property1leader:follower before:follower after:August 29, 201623412345123451123x 3 q 8 x q y 11123x 3 q 8 x q y 11123x 3 q 8 x q y 1112x 3 q 8 x q11111x 3 q 8 j 2 y 6 a x11111x 3 q 8 j 2 y 6 a x1123x 3 q 8 x q y 111111x 3 q 8 j 2 y 6 y 61123x 3 q 8 x q y 1Example #1: successExample #2: mismatchExample #3: successThe Raft Consensus AlgorithmSlide 18

Safety: Leader Completeness Once log entry committed, all futureleaders must store that entryLeader election for term 4: Servers with incomplete logs must notget elected: Candidates include index and term of lastlog entry in RequestVote RPCs Voting server denies vote if its log is moreup-to-date Logs ranked by lastTerm, lastIndex August 29, 2016The Raft Consensus Algorithm1 2 3 4 5 6 7 8 9s1 1 1 1 2 2 3 3 3s2 1 1 1 2 2 3 3s3 1 1 1 2 2 3 3 3 3s4 1 1 1 2 2 3 3 3s5 1 1 1 2 2 2 2 2 2Slide 19

Raft Evaluation Formal proof of safety Ongaro dissertation UW mechanically checked proof (50 klines) C implementation (2000 lines) 100’s of clusters deployed by Scale Computing Performance analysis of leader election Converges quickly even with 12-24 ms timeouts User study of understandabilityAugust 29, 2016The Raft Consensus AlgorithmSlide 20

User Study: Is Raft Simpler than Paxos? 43 students in 2 graduate OS classes (Berkeley and Stanford) Group 1: Raft video, Raft quiz, then Paxos video, Paxos quiz Group 2: Paxos video, Paxos quiz, then Raft video, Raft quiz Instructional videos: Same instructor (Ousterhout) Covered same functionality: consensus, replicated log, cluster reconfiguration Fleshed out missing pieces for Paxos Videos available on YouTube Quizzes: Questions in 3 general categories Same weightings for both tests Experiment favored Paxos slightly: 15 students had prior experience with PaxosAugust 29, 2016The Raft Consensus AlgorithmSlide 21

User Study ResultsAugust 29, 2016The Raft Consensus AlgorithmSlide 22

ImpactHard to publish:Widely adopted: Rejected 3 times at major 25 implementations before paperconferencespublished Finally published in USENIX ATC2014 Challenges: PCs uncomfortable withunderstandability as metric Hard to evaluate Complexity impresses PCsAugust 29, 2016 83 implementations currently listed onRaft home page 10 versions in production Taught in graduate OS classes MIT, Stanford, Washington, Harvard, Duke,Brown, Colorado, .The Raft Consensus AlgorithmSlide 23

Additional Information Other aspects of Raft (see paper or Ongaro dissertation): Communication with clients (linearizability) Cluster liveness Log truncation Other consensus algorithms: Viewstamped Replication (Oki & Liskov, MIT) ZooKeeper (Hunt, Konar, Junqueira, Read, Yahoo!)August 29, 2016The Raft Consensus AlgorithmSlide 24

Conclusions Understandability deserves more emphasis in algorithm design Decompose the problem Minimize state space Making a system simpler can have high impact Raft better than Paxos for teaching and implementation: Easier to understand More completeAugust 29, 2016The Raft Consensus AlgorithmSlide 25

Why “Raft”?ReplicatedAndFaultTolerantPaxosAugust 29, 2016The Raft Consensus AlgorithmSlide 26

Extra Slides

Raft Properties Election Safety: at most one leader can be elected in a given term Leader Append-Only: a leader never modifies or deletes entries in itslog Log Matching: if two logs contain an entry with the same index andterm, then the logs are identical in all entries up through the given index Leader Completeness: if a log entry is committed, then that entry willbe present in the logs of all future leaders State Machine Safety: if a server has applied a log entry at a givenindex to its state machine, no other server will ever apply a different logentry for the same indexAugust 29, 2016The Raft Consensus AlgorithmSlide 28

Leader Changes Logs may be inconsistentafter leader changelog indexleader forterm 8 No special steps by new Leader’s log is “the truth”1 1 1 4 4 5 5 6 6 6f1 1 1 1 4 4 5 5 6 6leader: Start normal operation Followers’ logs will eventuallymatch leader1 2 3 4 5 6 7 8 9 10 11 12f2 1 1 1 4possiblefollowersMissingEntriesf3 1 1 1 4 4 5 5 6 6 6 6f4 1 1 1 4 4 5 5 6 6 6 7 7f5 1 1 1 4 4 4 4ExtraneousEntriesf6 1 1 1 2 2 2 3 3 3 3August 29, 2016The Raft Consensus AlgorithmSlide 29

Log Consensus Module State Machine x 1 y 3 x 4 August 29, 2016 The Raft Consensus Algorithm Slide 5 Replicated State Machine Replicated log ensures state machines execute same commands in same order Consensus module ensures proper log replication System makes progress as long as any majority of servers are up Failure model: delayed/lost messages, fail-stop (not Byzantine)

Related Documents:

May 02, 2018 · D. Program Evaluation ͟The organization has provided a description of the framework for how each program will be evaluated. The framework should include all the elements below: ͟The evaluation methods are cost-effective for the organization ͟Quantitative and qualitative data is being collected (at Basics tier, data collection must have begun)

Silat is a combative art of self-defense and survival rooted from Matay archipelago. It was traced at thé early of Langkasuka Kingdom (2nd century CE) till thé reign of Melaka (Malaysia) Sultanate era (13th century). Silat has now evolved to become part of social culture and tradition with thé appearance of a fine physical and spiritual .

analyses of un-piled and piled raft foundations with sandy soil. For the un-piled raft, the normalized settlement parameter (IR) for the raft sizes of 8mx8m and 15mx15m ranged as 1.03-1.17mm and 0.66- 0.83mm respectively. In the case of the piled raft with raft thickness of 0.25, 0.40, 0.80, 1.50, 3.0m, the corresponding maximum settlements

On an exceptional basis, Member States may request UNESCO to provide thé candidates with access to thé platform so they can complète thé form by themselves. Thèse requests must be addressed to esd rize unesco. or by 15 A ril 2021 UNESCO will provide thé nomineewith accessto thé platform via their émail address.

̶The leading indicator of employee engagement is based on the quality of the relationship between employee and supervisor Empower your managers! ̶Help them understand the impact on the organization ̶Share important changes, plan options, tasks, and deadlines ̶Provide key messages and talking points ̶Prepare them to answer employee questions

Dr. Sunita Bharatwal** Dr. Pawan Garga*** Abstract Customer satisfaction is derived from thè functionalities and values, a product or Service can provide. The current study aims to segregate thè dimensions of ordine Service quality and gather insights on its impact on web shopping. The trends of purchases have

, respectively. Raft foundations fail when the resistance of the raft is less than the action caused by the applied load. The raft resistance is measured using the design moment strength M c while the raft action is measured by the external bending moments M e as shown in Figure 1. The raft limit state function is given by the following equation:

analysis and design. For foundations of such high rise building, normally raft foundation, pile foundation or piled-raft foundation are used. The raft will be used for economical consideration. The raft foundation is a kind of