The Datacenter as a Computer
An Introduction to the Design of Warehouse-Scale Machines
Synthesis Lectures on Computer Architecture

Editor
Mark D. Hill, University of Wisconsin, Madison

Synthesis Lectures on Computer Architecture publishes 50 to 150 page publications on topics pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009

Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, James Laudon
2007

Transactional Memory
James R. Larus, Ravi Rajwar
2007

Quantum Computing for Computer Architects
Tzvetan S. Metodi, Frederic T. Chong
2006
Copyright 2009 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means: electronic, mechanical, photocopy, recording, or any other, except for brief quotations in printed reviews, without the prior permission of the publisher.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
www.morganclaypool.com

ISBN: 9781598295566 paperback
ISBN: 9781598295573 ebook
DOI: 10.2200/S00193ED1V01Y200905CAC006

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Lecture #6
Series Editor: Mark D. Hill, University of Wisconsin, Madison

Series ISSN
ISSN 1935-3235 print
ISSN 1935-3243 electronic
The Datacenter as a Computer
An Introduction to the Design of Warehouse-Scale Machines

Luiz André Barroso and Urs Hölzle
Google Inc.

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #6
Abstract

As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today’s WSCs on a single board.

Keywords

computer organization and design, Internet services, energy efficiency, fault-tolerant computing, cluster computing, data centers, distributed systems, cloud computing.
Acknowledgments

While we draw from our direct involvement in Google’s infrastructure design and operation over the past several years, most of what we have learned and now report here is the result of the hard work, the insights, and the creativity of our colleagues at Google. The work of our Platforms Engineering, Hardware Operations, Facilities, Site Reliability and Software Infrastructure teams is most directly related to the topics we cover here, and therefore, we are particularly grateful to them for allowing us to benefit from their experience. Ricardo Bianchini, Fred Chong and Mark Hill provided extremely useful feedback despite being handed a relatively immature early version of the text. Our Google colleagues Jeff Dean and Jimmy Clidaras also provided extensive and particularly useful feedback on earlier drafts. Thanks to the work of Kristin Weissman at Google and Michael Morgan at Morgan & Claypool, we were able to make this lecture available electronically without charge, which was a condition for our accepting this task. We were fortunate that Gerry Kane volunteered his technical writing talent to significantly improve the quality of the text. We would also like to thank Catherine Warner for her proofreading and improvements to the text at various stages. Finally, we are very grateful to Mark Hill and Michael Morgan for inviting us to this project, for their relentless encouragement and much needed prodding, and their seemingly endless patience.
Contents

1. Introduction
    1.1 Warehouse-Scale Computers
    1.2 Emphasis on Cost Efficiency
    1.3 Not Just a Collection of Servers
    1.4 One Datacenter vs. Several Datacenters
    1.5 Why WSCs Might Matter to You
    1.6 Architectural Overview of WSCs
        1.6.1 Storage
        1.6.2 Networking Fabric
        1.6.3 Storage Hierarchy
        1.6.4 Quantifying Latency, Bandwidth, and Capacity
        1.6.5 Power Usage
        1.6.6 Handling Failures

2. Workloads and Software Infrastructure
    2.1 Datacenter vs. Desktop
    2.2 Performance and Availability Toolbox
    2.3 Cluster-Level Infrastructure Software
        2.3.1 Resource Management
        2.3.2 Hardware Abstraction and Other Basic Services
        2.3.3 Deployment and Maintenance
        2.3.4 Programming Frameworks
    2.4 Application-Level Software
        2.4.1 Workload Examples
        2.4.2 Online: Web Search
        2.4.3 Offline: Scholar Article Similarity
    2.5 A Monitoring Infrastructure
        2.5.1 Service-Level Dashboards
        2.5.2 Performance Debugging Tools
        2.5.3 Platform-Level Monitoring
    2.6 Buy vs. Build
    2.7 Further Reading

3. Hardware Building Blocks
    3.1 Cost-Efficient Hardware
        3.1.1 How About Parallel Application Performance?
        3.1.2 How Low-End Can You Go?
        3.1.3 Balanced Designs

4. Datacenter Basics
    4.1 Datacenter Tier Classifications
    4.2 Datacenter Power Systems
        4.2.1 UPS Systems
        4.2.2 Power Distribution Units
    4.3 Datacenter Cooling Systems
        4.3.1 CRAC Units
        4.3.2 Free Cooling
        4.3.3 Air Flow Considerations
        4.3.4 In-Rack Cooling
        4.3.5 Container-Based Datacenters

5. Energy and Power Efficiency
    5.1 Datacenter Energy Efficiency
        5.1.1 Sources of Efficiency Losses in Datacenters
        5.1.2 Improving the Energy Efficiency of Datacenters
    5.2 Measuring the Efficiency of Computing
        5.2.1 Some Useful Benchmarks
        5.2.2 Load vs. Efficiency
    5.3 Energy-Proportional Computing
        5.3.1 Dynamic Power Range of Energy-Proportional Machines
        5.3.2 Causes of Poor Energy Proportionality
        5.3.3 How to Improve Energy Proportionality
    5.4 Relative Effectiveness of Low-Power Modes
    5.5 The Role of Software in Energy Proportionality
    5.6 Datacenter Power Provisioning
        5.6.1 Deployment and Power Management Strategies
        5.6.2 Advantages of Oversubscribing Facility Power
    5.7 Trends in Server Energy Usage
    5.8 Conclusions
        5.8.1 Further Reading

6. Modeling Costs
    6.1 Capital Costs
    6.2 Operational Costs
    6.3 Case Studies
        6.3.1 Real-World Datacenter Costs
        6.3.2 Modeling a Partially Filled Datacenter

7. Dealing with Failures and Repairs
    7.1 Implications of Software-Based Fault Tolerance
    7.2 Categorizing Faults
        7.2.1 Fault Severity
        7.2.2 Causes of Service-Level Faults
    7.3 Machine-Level Failures
        7.3.1 What Causes Machine Crashes?
        7.3.2 Predicting Faults
    7.4 Repairs
    7.5 Tolerating Faults, Not Hiding Them

8. Closing Remarks
    8.1 Hardware
    8.2 Software
    8.3 Economics
    8.4 Key Challenges
        8.4.1 Rapidly Changing Workloads
        8.4.2 Building Balanced Systems from Imbalanced Components
        8.4.3 Curbing Energy Usage
        8.4.4 Amdahl’s Cruel Law
    8.5 Conclusions

References

Author Biographies
CHAPTER 1

Introduction

The ARPANET is about to turn forty, and the World Wide Web is approaching its 20th anniversary. Yet the Internet technologies that were largely sparked by these two remarkable milestones continue to transform industries and our culture today and show no signs of slowing down. More recently, the emergence of such popular Internet services as Web-based email, search, and social networks, plus the increased worldwide availability of high-speed connectivity, have accelerated a trend toward server-side or “cloud” computing.

Increasingly, computing and storage are moving from PC-like clients to large Internet services. While early Internet services were mostly informational, today many Web applications offer services that previously resided in the client, including email, photo and video storage, and office applications. The shift toward server-side computing is driven not only by the need for user experience improvements, such as ease of management (no configuration or backups needed) and ubiquity of access (a browser is all you need), but also by the advantages it offers to vendors. Software as a service allows faster application development because it is simpler for software vendors to make changes and improvements. Instead of updating many millions of clients (with a myriad of peculiar hardware and software configurations), vendors need only coordinate improvements and fixes inside their datacenters and can restrict their hardware deployment to a few well-tested configurations. Moreover, datacenter economics allow many application services to run at a low cost per user. For example, servers may be shared among thousands of active users (and many more inactive ones), resulting in better utilization. Similarly, the computation itself may become cheaper in a shared service (e.g., an email attachment received by multiple users can be stored once rather than many times). Finally, servers and storage in a datacenter can be easier to manage than the desktop or laptop equivalent because they are under control of a single, knowledgeable entity.

Some workloads require so much computing capability that they are a more natural fit for a massive computing infrastructure than for client-side computing. Search services (Web, images, etc.) are a prime example of this class of workloads, but applications such as language translation can also run more effectively on large shared computing installations because of their reliance on massive-scale language models.
The trend toward server-side computing and the exploding popularity of Internet services have created a new class of computing systems that we have named warehouse-scale computers, or WSCs. The name is meant to call attention to the most distinguishing feature of these machines: the massive scale of their software infrastructure, data repositories, and hardware platform. This perspective is a departure from a view of the computing problem that implicitly assumes a model where one program runs in a single machine. In warehouse-scale computing, the program is an Internet service, which may consist of tens or more individual programs that interact to implement complex end-user services such as email, search, or maps. These programs might be implemented and maintained by different teams of engineers, perhaps even across organizational, geographic, and company boundaries (e.g., as is the case with mashups).

The computing platform required to run such large-scale services bears little resemblance to a pizza-box server or even the refrigerator-sized high-end multiprocessors that reigned in the last decade. The hardware for such a platform consists of thousands of individual computing nodes with their corresponding networking and storage subsystems, power distribution and conditioning equipment, and extensive cooling systems. The enclosure for these systems is in fact a building structure and often indistinguishable from a large warehouse.

1.1 WAREHOUSE-SCALE COMPUTERS

Had scale been the only distinguishing feature of these systems, we might simply refer to them as datacenters. Datacenters are buildings where multiple servers and communication gear are co-located because of their common environmental requirements and physical security needs, and for ease of maintenance. In that sense, a WSC could be considered a type of datacenter. Traditional datacenters, however, typically host a large number of relatively small- or medium-sized applications, each running on a dedicated hardware infrastructure that is de-coupled and protected from other systems in the same facility. Those datacenters host hardware and software for multiple organizational units or even different companies. Different computing systems within such a datacenter often have little in common in terms of hardware, software, or maintenance infrastructure, and tend not to communicate with each other at all.

WSCs currently power the services offered by companies such as Google, Amazon, Yahoo, and Microsoft’s online services division. They differ significantly from traditional datacenters: they belong to a single organization, use a relatively homogeneous hardware and system software plat