Presentation Title A Generous Amount Of Presented Space .


Chaos Engineering (slide 1)
Building confidence in your application and team through failure experimentation
Christopher Meiklejohn
October 29, 2020

Administrivia (slide 2)
- Homework 4B was due Nov 3rd, now the 4th.
- Homework 4C was due Nov 5th, now the 6th.

Learning Goals (slide 3)
- Identify the need for chaos and resilience engineering
- Understand the principles of chaos engineering

Exercise: Monolithic Application (slide 4)
[Diagram: Mayan EDMS, PostgreSQL, ML Model. Legend: Process Call, Container, Microservice]
- What kind of failures can happen here?
- How likely is that error to happen?
- How do I fix it?

Exercise: Microservice Application (slide 5)
[Diagram: Mayan EDMS, PostgreSQL, ML Model, each in its own container. Legend: Remote Call, Container, Microservice]
- What kind of failures can happen here?
- How likely is that error to happen?
Remember, these calls are messages sent on an unreliable network.

Failures in Microservice Architectures (slide 6)
1. Network may be partitioned
2. Server instance may be down
3. Communication between services may be delayed
4. Server could be overloaded and responses delayed
5. Server could run out of memory or CPU
All of these issues can be indistinguishable from one another!
Making the calls across the network to multiple machines makes the probability that the system is operating under failure much higher.
These are the problems of latency and partial failure.
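The "much higher" claim can be made concrete with a little arithmetic. The sketch below is only an illustration and rests on an assumption the slides do not state: each of n remote dependencies is failing independently with the same probability p. Under that assumption, the chance that at least one of them is failing at a given moment is 1 - (1 - p)^n. The function name is made up for this example.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n dependencies is failing.

    Assumes (for illustration only) that failures are independent and
    that every dependency fails with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Compare a single in-process call path with progressively more
    # remote dependencies, each failing 0.1% of the time.
    for n in (1, 5, 20, 100):
        print(f"{n:>3} dependencies: {prob_any_failure(0.001, n):.4f}")
```

Even a small per-call failure probability adds up as the number of machines on the request path grows, which is the point the slide makes about partial failure.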

Where Do We Start? (slide 7)
- How do we even begin to test these scenarios?
- Is there any software that can be used to test these types of failures?
Let’s look at a few ways companies do this.

Game Days (slide 8)
Purposely injecting failures into critical systems in order to:
- Identify flaws and “latent defects”
- Identify subtle dependencies (which may or may not lead to a flaw/defect)
- Prepare a response for a disastrous event
Comes from “resilience engineering” typical in high-risk industries.
Practiced by Amazon, Google, Microsoft, Etsy, Facebook, Flickr, etc.

Game Days (slide 9)
Our applications are built on and with “unreliable” components.
Failure is inevitable (fraction of percent; at Google scale, multiple times).
Goals:
- Preemptively trigger the failure, observe, and fix the error
- Script testing of previous failures and ensure system remains resilient
- Build the necessary relationships between teams before disaster strikes

Example: Amazon GameDay (slide 10)
Full data center destruction (Amazon EC2 region):
- No advance notice of which data center will be taken offline
- No notice of when the data center will be taken offline
- Only advance notice (months) that a GameDay will be happening
- Real failures in the production environment
Not all failures can actually be performed; some must be simulated!
Discovered latent defect where the monitoring infrastructure responsible for detecting errors and paging employees was located in the zone of the failure!

(slide 11)
Yes, and the exercise should be designed to make people feel a little uncomfortable. The truth is that things often break in ways that people cannot possibly imagine.
John Allspaw, Former CTO, Etsy

I’ve got a crazy story
Kelly Sommers, Cassandra MVP

Google GameDays (slide 12)
Experiments take roughly 24 – 96 hours:
- 00 – 24h: the initial response, the appearance of the ‘big’ problems.
- 24 – 48h: team to team testing and response, bi-directional testing
- 72 – 96h: exhaustion; part of the test to identify the human response. In a real emergency, you might not have the option of handing off work at the end of your shift.
Kripa Krishnan, Director, Cloud Ops & Site Reliability Engineering, Google

AWS Outage: April 21, 2011 (slide 13)
EC2 outage in us-east-1 (Northern Virginia)
Outage affects:
- Foursquare
- Quora
- Reddit
Outage results in performance problems and in some cases data loss.

AWS Outage: April 21, 2011 (slide 14)
At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another.

AWS Outage: April 21, 2011 (slide 15)
When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

Cornerstones of Resilience (slide 17)
“[resilient is the] ability to sustain operations before, during, and after an unexpected disturbance”
1. Anticipation: know what to expect
2. Monitoring: know what to look for
3. Response: know what to do
4. Learning: know what just happened (e.g., postmortems)

Anticipation (slide 18)
“[...] get people throughout the organization to start building their anticipation muscles by thinking about what might possibly go wrong.”
These experiments form a cycle where developers begin to anticipate what might possibly go wrong during development, which adds to the overall resilience of the system.

Response: Etsy’s Substitution Test (slide 19)
A developer runs a command that brings down the site.
- Grab another engineer who had no involvement in the incident
- Explain the context of the problem
- Fill that engineer in on the details known to the original developer at the time
- Ask what they would do
The engineer says they would run the same command, almost every time.
Identify the reasons why this seemed the right decision at the time.

Some Example Google Issues (slide 20)
Terminate network in Sao Paulo for testing:
- Hidden dependency takes down links in Mexico, which would have remained undiscovered without testing
Turn off a data center to find that machines won’t come back:
- Ran out of DHCP leases (for IP address allocation) when a large number of machines come back online unexpectedly.
Complexities are introduced as new capabilities are developed. [...] It gets progressively harder to see where our dependencies are and what might lead to cascading failures.

Netflix: Background (slide 21)
Started as a DVD-by-mail business because Reed Hastings was annoyed with Blockbuster late fees.
- Problem: when new movies come out, there are only hundreds of DVDs to serve many thousands of requests
Stream movies instead of purchasing and mailing DVDs out to customers.
- Problem: must purchase enough compute to handle peaks (7pm weekends) vs. valleys (noon, weekday)

Netflix: Cloud Computing (slide 22)
Significant deployment in Amazon Web Services in order to remain elastic in times of high and low load (first public, 100% w/o content delivery.)
Pushes code into production and modifies runtime configuration hundreds of times a day.
Key metric: availability
“a customer who can’t watch a video because of a service outage might not be a customer for long.”
“Chaos Engineering”, Basiri et al., IEEE Software 2016

Chaos Engineering: The History (slide 23)
Experimentation to build confidence around a system to withstand turbulent conditions in production.
Netflix’s Simian Army (the original):
- Chaos Monkey: randomly terminates EC2 instances in production
- Chaos Kong: simulates the failure of an entire EC2 region in AWS
- Latency Monkey: injects latency to simulate overload of service and ensures upstream services react appropriately
Have Chaos Monkey crash development instances, too!
“Chaos Monkey has proven successful; today all Netflix engineers design their services to handle instance failures as a matter of course.”
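The core idea behind Chaos Monkey, randomly terminating a running instance, can be illustrated with a small in-memory simulation. This is a toy sketch, not Netflix's tool: the `Instance`, `Fleet`, and `chaos_monkey_step` names are hypothetical, and nothing here talks to EC2 or any cloud API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A stand-in for a server instance (hypothetical, in-memory only)."""
    instance_id: str
    running: bool = True


@dataclass
class Fleet:
    """A group of instances that together provide one service."""
    instances: list = field(default_factory=list)

    def healthy(self) -> list:
        return [inst for inst in self.instances if inst.running]


def chaos_monkey_step(fleet: Fleet, rng: random.Random):
    """Terminate one randomly chosen running instance.

    Returns the terminated instance, or None if nothing is running.
    """
    candidates = fleet.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    fleet = Fleet([Instance(f"i-{n}") for n in range(5)])
    victim = chaos_monkey_step(fleet, random.Random())
    print(f"terminated {victim.instance_id}; {len(fleet.healthy())} still running")
```

Passing the random number generator in as a parameter keeps the simulation repeatable in tests, while a service designed "to handle instance failures as a matter of course" should keep working whichever instance is picked.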

Netflix UI: AppBoot (slide 24)
[Diagram: AppBoot calling My List, Bookmarks, Search, User Profiles, Ratings, Recommendations. Legend: Remote Call, Microservice]
What happens if the bookmark service is down?

Principles of Chaos Engineering (slide 25)
1. Build a hypothesis around steady state behavior
2. Vary real-world events
   - experimental events, crashes, etc.
3. Run experiments in production
   - control group vs. experimental group
   - draw conclusions, invalidate hypothesis
4. Automate experiments to run continuously
Does everything seem to be working properly? Are users complaining?
However, “works properly” is too vague a basis for designing experiments.

Graceful Degradation: Anticipating Failure (slide 26)
Allow the system to degrade in a way it’s still usable.
Fallbacks:
- Cache miss due to failure of cache; go to the bookmarks service and use value at possible latency penalty
Personalized content, use a reasonable default instead:
- What happens if recommendations are unavailable?
- What happens if bookmarks are unavailable? Default to starting videos at the beginning rather than providing a “resume from previous location” option.
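The fallback chain on this slide (cache, then bookmarks service, then a reasonable default) might look roughly like the sketch below. All names are hypothetical, the two lookups are passed in as plain callables rather than real clients, and treating `ConnectionError` and `TimeoutError` as the failure signals is an assumption made for the example.

```python
from typing import Callable, Optional

# Reasonable default from the slide: start the video at the beginning.
DEFAULT_POSITION_SECONDS = 0

# A lookup takes (user_id, video_id) and returns a position in seconds,
# or None when it has no value (for example, a cache miss).
Lookup = Callable[[str, str], Optional[int]]


def resume_position(cache_get: Lookup, service_get: Lookup,
                    user_id: str, video_id: str) -> int:
    """Return where playback should resume, degrading gracefully.

    Tries the cache first, then the bookmarks service (at a possible
    latency penalty), and finally falls back to the default.
    """
    for lookup in (cache_get, service_get):
        try:
            position = lookup(user_id, video_id)
        except (ConnectionError, TimeoutError):
            # This source is unavailable; try the next one.
            continue
        if position is not None:
            return position
    return DEFAULT_POSITION_SECONDS
```

The caller never sees the outage: when both sources fail, the user simply loses the "resume from previous location" convenience instead of losing the whole page.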

Steady State Behavior (slide 27)
Back to quality attributes: availability!
“SPS is the primary indicator of the system’s overall health.”
“Ultimately, what we care about is whether users can find content to watch and successfully watch it.”

Netflix UI: AppBoot (slide 28)
[Diagram: AppBoot calling My List, Bookmarks, Search, User Profiles, Ratings, Recommendations. Legend: Remote Call, Microservice]
What happens if the bookmark service is down?

AppBoot: Bookmarks Down Scenario (Imaginary) (slide 29)
SPS as core metric.
Experiment 1: Outage of bookmarks service causes UI to fail to load, SPS decreases. Code fixed to hide bookmarks if call fails.
Experiment 2: Outage of bookmarks service hides bookmarks on UI, SPS stays normal.

Exercise: Quality Attributes (slide 30)
1. What would a quality attribute be for an e-commerce website to characterize the steady-state behavior of the system?
2. What would a quality attribute be for an advertisement platform to characterize the steady-state behavior of the system?
3. What would a quality attribute be for an admissions system to characterize the steady-state behavior of the system?

Making Hypotheses (slide 31)
No trivial hypotheses:
- Overloading the system will increase the CPU, etc.
- Hypothesis should be made w.r.t. overall system health metric
Monitor finer-grained metrics:
- Monitor the CPU, other resources
- Indicators of degraded mode operation, etc.
- Use alerting to identify these issues to catch them early and anticipate

Varying Real-World Events (slide 32)
1. Clients send malformed requests
2. Servers may send malformed responses
3. Servers die
4. Hard disks fill up
5. Memory is exhausted
6. CPU is overloaded
7. Latencies spike
8. Load from clients can spike
“A recent study reported that 92% of catastrophic system failures resulted from incorrect handling of nonfatal errors.”

Sampling of Netflix’s Candidate Faults (slide 33)
1. Terminate virtual machine instances
2. Inject latency into requests between different services
3. Fail requests between services
4. Fail an entire service
5. Make an entire Amazon region unavailable
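Faults 2 and 3 in the list above (injecting latency and failing requests between services) can be approximated in a test harness by wrapping the function that makes the call. The wrapper below is a minimal sketch under that assumption; `with_faults`, `InjectedFault`, and the parameter names are invented for this example and are not part of any Netflix tooling.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real response when a failure is injected."""


def with_faults(call, *, failure_rate=0.0, added_latency_s=0.0,
                rng=None, sleep=time.sleep):
    """Wrap `call` so that it is delayed and/or fails some of the time.

    failure_rate    -- fraction of calls (0.0 to 1.0) that raise InjectedFault
    added_latency_s -- extra delay in seconds added before every call
    rng, sleep      -- injectable for deterministic tests
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if added_latency_s > 0:
            sleep(added_latency_s)
        if rng.random() < failure_rate:
            raise InjectedFault("injected request failure")
        return call(*args, **kwargs)

    return wrapped


if __name__ == "__main__":
    def get_bookmarks(user_id):
        return {"user": user_id, "bookmarks": []}

    flaky = with_faults(get_bookmarks, failure_rate=0.5, added_latency_s=0.1)
    try:
        print(flaky("user-1"))
    except InjectedFault as exc:
        print("call failed:", exc)
```

Wrapping at the call site keeps the fault close to where the caller has to handle it, which is where missing timeouts and missing fallbacks tend to show up.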

Two Example Netflix Errors (slide 34)
1. Server is overloaded and takes longer and longer to respond
   - Client requests are placed in a queue to be serviced
   - Local queue becomes exhausted, runs out of memory
   - Client service crashes
2. Client makes a request to a server that uses a cache
   - Error (transient) is returned to the client
   - Server caches the error
   - Future clients read the cached error value
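Both errors have well-known defensive patterns, sketched below as an illustration rather than as what Netflix actually did: bound the request queue so that overload leads to rejected requests instead of memory exhaustion, and cache only successful results so that a transient error is not served to later clients. The helper names are hypothetical; `queue.Queue` is Python's standard-library queue.

```python
import queue


def make_request_queue(max_pending: int) -> queue.Queue:
    """Create a bounded queue so pending requests cannot grow without limit."""
    if max_pending <= 0:
        # queue.Queue treats maxsize <= 0 as unbounded, which is the bug.
        raise ValueError("max_pending must be positive")
    return queue.Queue(maxsize=max_pending)


def enqueue_or_reject(pending: queue.Queue, request) -> bool:
    """Queue the request, or reject it immediately when the queue is full."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        return False  # shed load instead of running out of memory


class ResultCache:
    """Caches successful results only; errors are never stored."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        if key in self._store:
            return self._store[key]
        value = compute(key)  # if this raises, nothing is cached
        self._store[key] = value
        return value
```

With this cache, a request that fails once is simply retried by the next caller, so a transient error stays transient.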

Chaos Engineering as Continuous Process (slide 35)
“Our system at Netflix changes continuously. Because of these changes, our confidence in past experiments’ results decreases over time.”
“Chaos Monkey runs continuously during weekdays, and we run Chaos Kong exercises monthly.” (2016)

Netflix Today: CHaP (slide 36)
- Automatic experimentation and failure injection with FIT
- Automatic instrumentation of key performance metrics
- Automatic termination based on key metrics
- Automatic experiment design with Monocle
Reference: https://www.youtube.com/watch?v=3WRVgC8SiGc

How to run a Chaos Experiment (slide 37)
1. Define steady state as some measurable output of a system that indicates normal behavior
2. Hypothesize that this steady state will continue in both the control group and experimental group
3. Introduce variables that reflect real-world events such as server crashes, hard drives malfunctioning, and network connections being severed
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
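Step 4 boils down to comparing one steady-state metric between the two groups. The sketch below shows one simple way to do that, assuming the metric is a list of samples (for example, SPS readings) and that a 5% relative difference in the mean is an acceptable tolerance. Both the function name and the threshold are assumptions made for illustration; a real experiment would more likely use a proper statistical test.

```python
from statistics import mean


def steady_state_holds(control, experiment, tolerance=0.05):
    """Return True if the experimental group stays close to the control.

    control, experiment -- samples of the steady-state metric per group
    tolerance           -- allowed relative difference between the means
    """
    if not control or not experiment:
        raise ValueError("both groups need at least one sample")
    baseline = mean(control)
    if baseline == 0:
        raise ValueError("control baseline must be non-zero")
    relative_diff = abs(mean(experiment) - baseline) / baseline
    return relative_diff <= tolerance


if __name__ == "__main__":
    control_sps = [100, 102, 98, 101]
    # Hypothetical samples from the group with the fault injected.
    experiment_sps = [99, 100, 101, 97]
    if steady_state_holds(control_sps, experiment_sps):
        print("hypothesis not disproved: steady state held")
    else:
        print("hypothesis disproved: investigate the difference")
```

In the imaginary AppBoot scenario from the earlier slide, Experiment 1 would return False (SPS drops when the UI fails to load) and Experiment 2 would return True (SPS stays normal once bookmarks are hidden on failure).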
