Carnegie Mellon University


Carnegie Mellon University
CARNEGIE INSTITUTE OF TECHNOLOGY

THESIS
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

TITLE: Improving the Dependability of Distributed Systems through AIR Software Upgrades
PRESENTED BY: Tudor A. Dumitraș
ACCEPTED BY THE DEPARTMENT OF: Electrical & Computer Engineering

ADVISOR, MAJOR PROFESSOR (DATE)
DEPARTMENT HEAD (DATE)

APPROVED BY THE COLLEGE COUNCIL
DEAN (DATE)

Improving the Dependability of Distributed Systems through AIR Software Upgrades

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical & Computer Engineering

Tudor A. Dumitraș
B.S., Computer Science, “Politehnica” University, Bucharest, Romania
Diplôme d’Ingénieur, Ecole Polytechnique, Paris, France
M.S., Electrical & Computer Engineering, Carnegie Mellon University

Carnegie Mellon University
Pittsburgh, PA
December, 2010

To my parents and my teachers, who showed me the way. To my friends, who gave me a place to stand on. Pentru Tanti Lola.

Abstract

Traditional fault-tolerance mechanisms concentrate almost entirely on responding to, avoiding, or tolerating unexpected faults or security violations. However, scheduled events, such as software upgrades, account for most of the system unavailability and often introduce data corruption or latent errors. Through two empirical studies, this dissertation identifies the leading causes of upgrade failure (breaking hidden dependencies) and of planned downtime (complex data conversions) in distributed enterprise systems. These findings represent the foundation of a new benchmark for software-upgrade dependability.

This dissertation further introduces the AIR properties (Atomicity, Isolation and Runtime-testing) required for improving the dependability of distributed systems that undergo major software upgrades. The AIR properties are realized in Imago, a system designed to reduce both planned and unplanned downtime by upgrading distributed systems end-to-end. Imago builds upon the idea of isolating the production system from the upgrade operations, in order to avoid breaking hidden dependencies and to decouple the data conversions from the normal system operation. Imago includes novel mechanisms, such as providing a parallel universe for the new version, performing data conversions opportunistically, intercepting the live workload at the ingress and egress points or executing an atomic switchover to the new version, which allow it to deliver the AIR properties.

Imago harnesses opportunities provided by the emerging cloud-computing technologies, by trading resource overhead (needed by the parallel universe) for an improved dependability of the software upgrades. This approach separates the functional aspects of the upgrade from the mechanisms for online upgrade, enabling an upgrade-as-a-service model. This dissertation also describes techniques for assessing the impact of software upgrades, in order to reason about the implications of relaxing the AIR guarantees.

This is for the one that I ate, 32 years ago.

Acknowledgments

Since I started to learn about Computer Science, I have been fascinated by our ability to write computer programs that can protect themselves from various adverse conditions in their environments. I equally enjoy building dependable systems firsthand and studying empirically the behavior of systems large and small. This is the result of my interactions with a few great mentors, who helped me discover the secrets of Computer Science and who showed me the ways of scientific research.

At age 13, I received a great gift from my uncle, Florin Covaciu. It was an HC-85 computer, a Romanian replica of the Sinclair ZX Spectrum (one of the world’s first personal computers). I learned to program on this machine, and, since then, I was not able to stay away from programming for too long. This path took me to the Computer-Science High School in Bucharest (Liceul de Informatică, now Colegiul Național Tudor Vianu), where teachers Mihai Budiu and Raluca Vasilescu opened my eyes to many computing concepts and to their power to transform human society. These teachers shared their great knowledge and passion for computing, and this inspired me to continue my studies in this field. I was not too surprised when I met Mihai and Raluca again, years later, in the graduate program at Carnegie Mellon University.

Among my professors at the “Politehnica” University in Bucharest were Mircea Petrescu, who had supervised my father’s Honors thesis (diploma de licență) twenty-five years previously, Adrian Petrescu and Francisc Iacob, the creators of the HC-85, and Nicolae Țăpuș, who supervised my own Honors thesis on Argo, a search engine with a distributed Web crawler. Most of all, I am grateful to Zoea Racoviță for encouraging me to continue my graduate studies and to apply to the Ph.D. program at Carnegie Mellon University.

My education at the Ecole Polytechnique in Paris taught me to be responsible, confident, determined and to keep my commitments. I have to thank Jean-Marc Steyaert for guiding me through those years. Atom, a final-year project supervised by Sam Toueg, focused on implementing a practical group-communication system based on unreliable failure detectors (an exclusively theoretical technique, at the time) and seeded my curiosity about distributed systems.

When I arrived at Carnegie Mellon, Radu Mărculescu taught me the paramount importance of originality in all academic endeavors and the value of anticipating future technology trends for systems-oriented research. Our first paper together, which identified the need for system-level fault tolerance in the emerging networks-on-chip (NoC) and reevaluated fundamental trade-offs made by classical networking protocols, remains my most cited publication. Phil Koopman taught me everything I know about dependability, and he always gave me good advice throughout my graduate studies.

Many people contributed to the approach described in this dissertation. The members of my thesis committee, Greg Ganger, Bruce Maggs and Asit Dan, encouraged me to pursue the topic of online software upgrades and provided invaluable feedback in all the stages of this research. Dan Siewiorek taught me how to create a rigorous taxonomy. Jiaqi Tan and Zhengheng Gho helped me design some of the core algorithms of Imago. At my request, Lorenzo Keller wrote a user manual for ConfErr, his fault-injection tool for mutating configuration files. A dinner-time discussion with Eli Tilevich turned into a publication about the risks of software upgrades across multiple administrative domains. Douglas Schmidt showed me how to present my research in an effective way. Jean-Charles Fabre has always been a great mentor and a true friend. Je n’oublierai jamais que je dois mon premier emploi à ton soutien sans réserve, Jean-Charles.

Daniela Roșu, Bich Le and Alan Downing (my internship supervisors at IBM Research, VMware and Oracle, respectively) provided me with a wealth of information about the practical challenges of performing software upgrades. My approach also incorporates extensive feedback from the industry members of the Parallel Data Lab consortium. During my dissertation research, I received financial support from the NSF CAREER Award CCR-0238381, the DARPA PCES contract F33615-03-C-4110, as well as Carnegie Mellon’s CyLab and Parallel Data Lab.

While not directly involved in this research, my collaborators Danny Dig and Iulian Neamtiu helped me establish the series of workshops on Hot Topics in Software Upgrades (HotSWUp). The first two editions of the workshop provided a venue for insightful discussions about how upgrades are performed at various levels in the front-end of cloud computing infrastructures, in EJB-based enterprise applications, in databases, in long-running servers, in middleware frameworks or in satellites orbiting the Earth. Most importantly, HotSWUp emphasized that the challenge of upgrading distributed systems end-to-end calls for an inter-disciplinary approach, combining ideas and techniques from several areas of Computer Science.

Above all, I would like to thank my dissertation advisor, Prof. Priya Narasimhan, for all her guidance. She taught me the secrets of fault-tolerant middleware, and she showed me how to enhance legacy applications with replication and recovery mechanisms by transparently intercepting the application’s system calls. She shared with me her great gift for writing and presenting technical material, she taught me the scientific method and she showed me how to ask the questions that matter. She passed down the teachings of Gottfried Wilhelm Leibniz, our academic ancestor, about how to turn an abundance of experimental data into rigorous scientific findings. She also passed down her aversion to adjectives and hyperbolae. She transformed a student into a researcher.

Outside of Computer Science, my good friends, too many to list here, have always been a source of inspiration and vitality. My aunt, Ștefania Dumitraș, taught me about the practical things in life, and she also taught me French. My cousin, Ioana Căprar, has been close as a sister. My parents, Dan Dumitraș and Monica Dumitraș, are my ultimate role models. Working at the Institute of Atomic Physics (where the first Romanian computer, CIFA, was built in 1955), their group of friends consisted of many other researchers, recognized internationally. I grew up among idealistic and passionate people, who were pushing the limits of science and engineering, and this has influenced who I am today. I am grateful for all the intellectual gifts you gave me. I also thank Corina, with all my heart, for her support and encouragement during the final stages of my dissertation writing. Ești un peștișor de aur.

This dissertation is the fruit of your labor, as much as mine. For my part, this experience helped me to grow up, professionally, and to understand the difference between good research and great research. I continue to marvel at the beauty of Computer Science and at all the things that are out there, for us to discover.

Contents

1 Introduction
  1.1 The dependability of software upgrades
  1.2 The next step forward
  1.3 AIR software upgrades
  1.4 Contributions

2 Related Work
  2.1 Causes of upgrade-induced downtime
  2.2 Properties of software upgrades
  2.3 Approaches for dependable upgrade
  2.4 Dependability benchmarking for software upgrades
  2.5 Impact assessment for online upgrades

3 Why Do Software Upgrades Fail?
  3.1 Classification method
  3.2 Upgrade-centric fault model
  3.3 Tolerating upgrade faults
  3.4 Summary of findings

4 Why Do Upgrades Need Planned Downtime?
  4.1 Experimental method
  4.2 Leading causes of planned downtime
  4.3 Existing techniques for avoiding planned downtime
  4.4 Summary of findings

5 The AIR Properties

6 Design and Implementation of Imago
  6.1 AIR upgrades with Imago
  6.2 Implementation details
  6.3 Upgrade-as-a-service
  6.4 Summary of findings

7 Dependability Benchmarking for Software Upgrades
  7.1 A benchmark for upgrade dependability
  7.2 Availability and overhead without faults
  7.3 Availability under upgrade-faults
  7.4 Upgrade reliability
  7.5 Summary of findings

8 Relaxing the Isolation Property
  8.1 Isolation level provided by SOA
  8.2 Distributed framework for upgrade-impact assessment
  8.3 Design and implementation of Ecotopia
  8.4 Case study: Software upgrades a service-oriented enterprise system
  8.5 Summary of findings

9 Relaxing the Atomicity Property
  9.1 Mixed-version races
  9.2 Upgrade risk model
  9.3 Qualitative validation of the analytical risk model
  9.4 Summary of findings

10 Conclusion
  10.1 Summary
  10.2 Open questions and future work

Appendices
  A NP-Completeness of the Package-Upgrade Problem
  B List of Upgrade Faults
  C Upgrade Risk Model: Implementation

Bibliography

Index

List of Figures

1.1 Example of dependencies in a single-host system
1.2 Conceptual overview of AIR upgrades
3.1 Four ways of violating an upgrade procedure
3.2 Statistical cluster analysis of upgrade faults
3.3 Upgrade-centric fault model
3.4 Impact of upgrade faults
4.1 Wikipedia architecture
4.2 Example of schema change that requires planned downtime
4.3 Implementation of schema changes during offline and online upgrades
4.4 Major schema reorganization at Wikipedia
4.5 Planned downtime imposed by MediaWiki upgrades
6.1 Dependable software upgrades with Imago
6.2 Imago’s upgrade procedure
6.3 Imago’s upgrade procedure (details)
6.4 Performing database-schema changes with Imago
6.5 The atomic switchover protocol
6.6 Implementation of the egress interceptor
6.7 Implementation of the ingress interceptor
6.8 Communication protocol used during the testing phase
6.9 Inputs required by the upgrade mechanism
7.1 Current approaches for online upgrade in distributed enterprise systems
7.2 Faults and failures during software upgrades
7.3 Planned downtime imposed by Imago
7.4 Breakdown of Imago’s overhead
7.5 Runtime overhead imposed by online-upgrade mechanisms
7.6 Impact of upgrade faults
8.1 Planning software upgrades and other system changes
8.2 Distributed framework for upgrade-impact assessment
8.3 Representation of a key performance indicator
8.4 Scheduling algorithm in Ecotopia
8.5 The scheduling loop of Ecotopia
8.6 Sample system managed by Ecotopia
8.7 Database upgrade scenario.
8.8 Comparison of the Ecotopia scheduling algorithms
9.1 Anatomy of a mixed-version race
9.2 Analytical risk model
9.3 Events leading to a mixed-version inconsistency
9.4 Discrete risk values

List of Tables

1.1 Comparison of several studies of distributed-system availability
3.1 Classification features for upgrade faults
3.2 Examples of hidden dependencies
4.1 Database schema changes in Wikipedia
6.1 Structure of Imago’s code
7.1 Description of the upgrade faults injected.
7.2 Trade-offs for implementing online upgrades
8.1 “What-if” API for distributed impact assessment
9.1 Notations from the upgrade risk model
9.2 To upgrade or not to upgrade?

“We hear desperate cries for a silver bullet, something to make software costs drop as rapidly as computer hardware costs do.”
F. Brooks, No silver bullet, 1987

Chapter 1
Introduction

Modern distributed systems are perhaps the most intricate structures ever engineered, and their benefits for society are impaired by our inability to make end-to-end dependability guarantees. Software dependability remains challenging despite recent advances in preventing and finding software bugs, or improvements in unit and integration testing. While these techniques reduce the complexity of an accidental task (the need to express conceptual specifications in a programming language), managing change in software systems represents one of the essential obstacles for their dependability [Brooks, 1987].

Even after their deployment in the field, successful distributed systems are expected to change frequently, in order to add new features, to improve performance and scalability, to conform with government regulations or to reduce operating costs by switching software vendors. Unlike hardware or mechanical systems, computer programs can be modified with relative ease. However, when deploying these changes in an actively-used system, through software upgrades, we must preserve the ecosystem of dependencies from the operational environment. For this reason, the question “How to perform software upgrades dependably?” represents a grand challenge for distributed-systems research [Kaashoek et al., 2005].

Traditional approaches for ensuring dependability [Avižienis et al., 2004] concentrate almost entirely on responding to, avoiding, or tolerating unexpected faults or security violations. However, intentional software changes, such as software upgrades, account for 66%–86% of the time when the service is not available, reportedly (see Table 1.1). The need for such planned downtime stems from the current limitations of upgrade mechanisms, which are unable to upgrade distributed systems atomically, end-to-end. Furthermore, software
