DevOps Done Right


DevOps Done Right: Best Practices to Knock Down Barriers to Success
© 2018 New Relic, Inc. | US 888-643-8776 | www.newrelic.com | www.twitter.com/newrelic | blog.newrelic.com

Table of Contents

Introduction
Chapter 1: Balancing SLO With Fast Application Delivery
Chapter 2: Creating a Fair and Effective On-Call Policy
Chapter 3: Responding to Incidents Effectively
Chapter 4: Overcoming Microservices Complexity
Chapter 5: Using Data to Speed Software Development
Conclusion

Introduction

Your team has embraced DevOps. You're establishing new processes, adopting new tools, and forming a culture that emphasizes cross-functional collaboration. But you haven't yet reached maximum velocity. There's something missing, something that's keeping your organization from truly becoming a high-performing DevOps machine.

Often that missing piece is measurement of data. Although measurement is one of the five pillars of the CALMS framework (Culture, Automation, Lean, Measurement, Sharing) coined by DevOps expert Jez Humble, it's frequently neglected by DevOps teams in their push for increased velocity and autonomy. However, this can create huge problems, as accurate data is critical to the successful functioning of a DevOps team, from effective incident response to navigating microservices complexity and more.

New Relic then and now: The Journey to DevOps

  THEN                  NOW
  Monolith Ruby         200 microservices
  Siloed teams          50 engineering teams with embedded site reliability engineers
  Infrequent releases   Up to 70 deploys per day
  Reactive response     Proactive monitoring and response

This ebook is for all of the teams and organizations that have been dipping their toes in the water and are now ready to take the plunge into all things DevOps. It's also aimed at those who are treading water without making the DevOps progress they need to achieve a true digital transformation.

By sharing real-world experiences, particularly lessons we've learned here at New Relic, we want to help you knock down your remaining barriers to DevOps success. From understanding how to set reliability goals to untangling the unique communications and development requirements of your microservices approach, we're bringing together proven best practices that show you how to move faster and more effectively than ever before.

CHAPTER 1
Balancing SLO With Fast Application Delivery

Balancing SLO With Fast Application Delivery

Your development cycles are faster and you're deploying code more frequently, but how's your reliability? Quality and reliability are equally important outcomes of a successful DevOps approach.

That's where SRE comes in. Site reliability engineering (SRE) is a cross-functional role, assuming responsibilities traditionally dedicated and segregated within development, operations, and other IT groups. Because SRE relies on both dev and ops collaboration, it goes hand-in-hand with the DevOps culture. And while DevOps and SRE have much in common, SRE elevates the focus on continuous improvement and managing to measurable outcomes, particularly through the use of service level objectives (SLOs).

"Fundamentally, [SRE is] what happens when you ask a software engineer to design an operations function."
Ben Treynor Sloss, Vice President of Engineering, Google

Let's start with some important definitions:

  Service level indicator (SLI): The SLI is your core measurement of performance.
    Example: "Customers can log in and view their data…"
  Service level objective (SLO): SLOs are the target values or goals for the performance of your system. SLOs represent an ongoing commitment.
    Example: "…99.9% of the time…"
  Service level agreement (SLA): The SLA defines what happens if you don't meet your SLI/SLO commitments.
    Example: "…or they can request a refund for losses incurred due to unavailability of the service."

Learn more about SRE in our ebook, Site Reliability Engineering: Philosophies, Habits, and Tools for SRE Success.

Setting appropriate SLIs and SLOs

While industry best practices for SRE call for setting SLIs and SLOs for each service that you provide, it can be quite challenging to define and deploy them if you haven't done this before. Here are seven steps that we use at New Relic to set SLOs and SLIs:

1. Identify system boundaries: A system boundary is where one or more components expose one or more capabilities to external customers. While internally your platform may have many moving parts (service nodes, database, load balancer, and so on), the individual pieces aren't considered system boundaries because they aren't directly exposing a capability to customers. Instead, multiple pieces work together as a whole to expose capabilities. For example, a login service that exposes an API with the capability of authenticating user credentials is a logical group of components that work together as a system. Before you set your SLIs, start by grouping elements of your platform into systems and defining their boundaries. This is where you'll focus your effort in the remaining steps, because boundary SLIs and SLOs are the most useful.

2. Define capabilities exposed by each system: Now group the components of the platform into logical units (e.g., UI/API tier, login service, data storage/query tier, legacy data tier, data ingest & routing). Here at New Relic, our system boundaries line up with our engineering team boundaries. Using these groupings, articulate the set of capabilities that are exposed at each system boundary.

[Figure: Systems and boundaries within a platform]
[Figure: Capabilities defined at a system boundary. Example: the data ingest tier of an "Acme Monitoring Product" exposes multiple capabilities: data ingested, data routed.]

3. Create a clear definition of "available" for each capability: For example, "delivery of messages to the correct destination" is a way to describe expectations of availability for a data-routing capability. Using plain English to describe what is expected for availability, versus using technical terms that not everyone is familiar with, helps avoid misunderstandings.

4. Define corresponding technical SLIs: Now it's time to define one or more SLIs per capability, using your definition of availability for each capability. Building on our example above, an SLI for a data-routing capability could be "time to deliver message to correct destination."

5. Measure to get a baseline: Obviously, monitoring is how you'll know whether you are achieving your availability goals or not. Using your monitoring tool, gather baseline data for each SLI before you actually set your SLOs.

6. Apply SLO targets (per SLI/capability): Once you have the data, but before you set your SLOs, ask your customers questions that help you identify what their expectations are and how you can align your SLOs to meet them. Then choose SLO targets based on your baselines, customer input, what your team can commit to supporting, and what's feasible based on your current technical reality. Following our SLI example for data routing, the SLO could be "99.5% of messages delivered in less than 5 seconds." Don't forget to configure an alert trigger in your monitoring application with a warning threshold for the SLOs you define.

[Figure: SLI example for the data-routing capability]

7. Iterate and tune: Don't take a set-it-and-forget-it approach to SLOs and SLIs. You should assume that they will, and should, evolve over time as your services and customer needs change.
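As a minimal sketch of what checking a target like step 6's could look like, assuming a plain list of recorded delivery latencies rather than any particular monitoring product (the sample data and helper name are illustrative):

```python
# Check the example SLO "99.5% of messages delivered in less than
# 5 seconds" against recorded delivery latencies (illustrative data).

def slo_compliance(latencies_s, threshold_s=5.0):
    """Return the fraction of samples that met the latency threshold."""
    if not latencies_s:
        return 1.0  # no traffic: vacuously compliant
    good = sum(1 for t in latencies_s if t < threshold_s)
    return good / len(latencies_s)

samples = [0.8, 1.2, 4.9, 6.3, 0.5, 2.2, 5.1, 1.0, 0.9, 3.4]
ratio = slo_compliance(samples)
target = 0.995
print(f"SLI: {ratio:.1%} delivered < 5s (SLO target {target:.1%})")
print("SLO met" if ratio >= target else "SLO missed: investigate")
```

In practice the baseline from step 5 would feed the same calculation, and the warning threshold would fire an alert somewhat before the ratio falls below the target.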

Additional tips for SLOs and SLIs

- Make sure each logical instance of a system has its own SLO: For instance, for hard-sharded (versus horizontally scaled) systems, measure SLIs and SLOs separately for each shard.
- Know that SLIs are not the same as alerts: The SRE process is not a replacement for thorough alerting.
- Use compound SLOs where appropriate: You can express a single, compound SLO to capture multiple SLI conditions and make it easier for customers to understand.
- Create customer-specific SLOs as needed: It's not unusual for major customers to receive SLAs that give better availability of services than those provided to other customers.

"To achieve operational excellence, we measure everything. Only in that way can we manage and improve everything."
Craig Vandeputte, Director of DevOps, CarRentals.com
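A compound SLO can be sketched as a single "good event" predicate that folds several SLI conditions together; the conditions, fields, and sample events below are illustrative assumptions:

```python
# Compound SLO sketch: a request counts as "good" only when several SLI
# conditions hold at once (here: success status AND acceptable latency).

def is_good(event):
    return event["status"] < 500 and event["latency_s"] < 2.0

events = [
    {"status": 200, "latency_s": 0.4},
    {"status": 200, "latency_s": 2.5},  # too slow: not "good"
    {"status": 503, "latency_s": 0.3},  # failed: not "good"
    {"status": 200, "latency_s": 1.1},
]
good_ratio = sum(map(is_good, events)) / len(events)
print(f"Compound SLI: {good_ratio:.0%} of requests good")
```

Customers then see one number ("x% of requests were good") instead of separate availability and latency figures.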

CHAPTER 2
Creating a Fair and Effective On-Call Policy

Creating a Fair and Effective On-Call Policy

The next step in improving reliability while accelerating deployments is to make sure that your organization can handle any software issues that arise, anytime day or night, quickly and effectively. For this, you need an on-call policy.

Wait, don't skip to the next chapter yet. We know the "on call" term can evoke many emotional responses in people. But that's primarily because many organizations get the concept of on-call rotation wrong. And getting it wrong means not only the stress and negative attention of missing your SLAs with customers, but working in an unproductive, unpleasant culture with a team of exhausted and frustrated engineers.

Start with the fundamentals

An effective and fair on-call policy starts with two important prerequisites:

1. Structured system and organization: Responding to issues effectively is far easier when both your systems (services or applications) and your product teams are well organized and structured into logical units. For instance, at New Relic our 57 engineering teams support 200 individual services, with each team acting autonomously to own at least three services throughout the product lifecycle, from design to deployment to maintenance.

2. A culture of accountability: With DevOps, each team is accountable for the code that it deploys into production. Teams naturally make different decisions about changes and deployments when they are responsible and on call for the service, versus traditional environments where someone else is responsible for supporting code once it's running in production.

Apply these best practices to improve your on-call practice

Structure your team and organization fairly
Here at New Relic, every engineer and engineering manager in the product organization rotates on-call responsibilities for the team's services. Teams are responsible for at least three services, but the number of services supported depends on the complexity of the services and the size of the team. For your organization, look at the size of the total engineering organization and of individual teams before choosing an on-call rotation approach. For instance, if the team has six engineers, then each engineer could be the primary person on call every six weeks.

Be flexible and creative when designing rotations
Consider letting each team design and implement its own on-call rotation policy. Give teams the freedom and autonomy to think out-of-the-box about ways to organize rotations that best suit their individual needs. At New Relic, each team has the autonomy to create and implement its own on-call system. For instance, one team uses a script that randomly rotates the on-call order of the non-primary person.

Track metrics and monitor incidents
An important part of making the on-call rotation fair and effective is monitoring and tracking incident metrics. Here at New Relic, we track number of pages, number of hours paged, and number of off-hours pages. We look at these

metrics at the engineer, team, and group levels. Tracking metrics helps draw attention to teams that are faced with unmanageable call loads (if a team averages more than one off-hours page per week, that team is considered to have a high on-call burden). Staying on top of these metrics lets us shift priorities to paying down a team's technical debt or providing more support to improve the services.

COMPARED TO LOW PERFORMERS, HIGH PERFORMERS IN DEVOPS HAVE 46x MORE CODE DEPLOYS AND 96x FASTER MTTR (Source: "2017 State of DevOps Report," Puppet and DORA)

"[DevOps] high performers were twice as likely to exceed their own goals for profitability, market share, and productivity."

Adapt your policy to align with your company's situation

An on-call policy that works for a team at New Relic might be completely unsustainable for your company. To create an on-call rotation that is both fair and effective, consider additional inputs such as:

- Growth: How fast is your company and your engineering group growing? How much turnover are you experiencing?
- Geography: Is your engineering organization centralized or geographically distributed? Do you have the resources to deploy "follow-the-sun" rotations?
- Complexity: How complex are your applications and how are they structured? How complex are dependencies across services?
- Tooling: Do you have incident response tools that give engineers automatic, actionable problem notification?
- Culture: Have you made being on call an essential part of the job in your engineering culture? Do you have a blameless culture that is focused on finding and solving the root cause instead of seeking to lay blame?
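The off-hours burden rule described above (more than one off-hours page per week on average) can be sketched as follows; the timestamps and the 9:00-18:00 weekday business-hours window are illustrative assumptions, not New Relic's actual definition:

```python
# Flag a team as having a high on-call burden when it averages more
# than one off-hours page per week (illustrative data and hours).
from datetime import datetime

def is_off_hours(ts):
    # Assume business hours of 9:00-18:00, Monday-Friday.
    return ts.weekday() >= 5 or not (9 <= ts.hour < 18)

def high_burden(page_times, weeks):
    off = sum(1 for ts in page_times if is_off_hours(ts))
    return off / weeks > 1.0

pages = [
    datetime(2018, 5, 7, 3, 12),    # Monday 03:12  -> off-hours
    datetime(2018, 5, 8, 14, 0),    # Tuesday 14:00 -> business hours
    datetime(2018, 5, 12, 11, 30),  # Saturday      -> off-hours
    datetime(2018, 5, 14, 23, 45),  # Monday 23:45  -> off-hours
]
print("high burden" if high_burden(pages, weeks=2) else "manageable")
```

Running the same calculation per engineer, per team, and per group gives the three levels of visibility described above.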

CHAPTER 3
Responding to Incidents Effectively

Responding to Incidents Effectively

Concomitant to on-call rotations is the concept of incident management. What's an incident? That's when a system behaves in an unexpected way that might negatively impact customers (or partners or employees).

A core competency within the "you build it, you own it" DevOps approach, incident management is often given short shrift, with teams losing interest once an issue is resolved. Often organizations without effective incident management take on "firefighting" responsibilities using ad-hoc organization, methods, and communications. When something blows up, everyone scrambles to work out a plan to solve the problem.

There's a much better way to approach incidents, one that not only minimizes the duration and frequency of outages, but also gives responsible engineers the support they need to respond efficiently and effectively.

Creating an effective incident management process

1. Define severities: Severities determine how much support will be needed and the potential impact on customers. For example, at New Relic we use a scale of 1 to 5 for severities:
   - Level 5 does not impact customers and may be used to raise awareness about an issue.
   - Level 4 involves minor bugs or minor data lags that affect, but don't hinder, customers.
   - Level 3 is for major data lags or unavailable features.
   - Levels 2 and 1 are serious incidents that cause outages.

2. Instrument your services: Every service should have monitoring and alerting for proactive incident reporting. The goal is to discover incidents before customers do, to avoid worst-case scenarios where irritated customers are calling support or posting comments on social media. With proactive incident reporting, you can respond to and resolve incidents as quickly as possible.

3. Define responder roles: At New Relic, team members from engineering and support fill the following roles during an incident: incident commander (drives resolutions), tech lead (diagnoses and fixes), communications lead (keeps everyone informed), communications manager (coordinates emergency communication strategy), incident liaison (interacts with support and the business for severity 1s), emergency commander (optional for severity 1s), and engineering manager (manages the post-incident process).

4. Create a game plan: This is the series of tasks by role that covers everything that happens throughout the lifecycle of an incident, including declaring an incident, setting the severity, determining the appropriate tech leads to contact, debugging and fixing the issue, managing the flow of communications, handing off responsibilities, ending the incident, and conducting a retrospective.

5. Implement appropriate tools and automation to support the entire process: From monitoring and alerts, to dashboards and incident tracking, automating the process is critical to keeping the appropriate team members informed and on task, and executing the game plan efficiently.
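The severity scale and responder roles above lend themselves to simple automation, for example so a paging tool can staff incidents consistently. A sketch, with the staffing logic condensed from steps 1 and 3 (the exact policy encoding is illustrative, not New Relic's tooling):

```python
# Encode the 1-5 severity scale and the responder roles so automation
# can route incidents consistently (illustrative policy encoding).

SEVERITIES = {
    5: "No customer impact; used to raise awareness about an issue",
    4: "Minor bugs or minor data lags that affect, but don't hinder, customers",
    3: "Major data lags or unavailable features",
    2: "Serious incident causing an outage",
    1: "Serious incident causing an outage (highest impact)",
}

def responders(severity):
    """Return the roles staffed for an incident of the given severity."""
    roles = ["incident commander", "tech lead", "communications lead",
             "communications manager", "engineering manager"]
    if severity == 1:
        # Severity 1 adds business-facing and optional escalation roles.
        roles += ["incident liaison", "emergency commander (optional)"]
    return roles

print(SEVERITIES[3])
print(responders(1))
```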

6. Conduct retrospectives: After the incident, require your teams to conduct a retrospective within one or two days of the incident. Emphasize that the retrospective is blameless and should focus instead on uncovering the true root causes of a problem.

7. Implement a Don't Repeat Incidents (DRI) policy: If a service issue impacts your customers, then it's time to identify and pay down technical debt. A DRI policy says that your team stops any new work on that service until the root cause of the issue has been fixed or mitigated.

[Chart: How DevOps teams find out about issues. Source: "DevOps Survey Results," 2nd Watch, 2018.]
[Figure: Example incident declared in Slack]
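A DRI policy can be enforced as a simple gate in planning tooling: new feature work on a service is blocked while it has unresolved root-cause follow-ups. A minimal sketch, with hypothetical service names and follow-up items:

```python
# Don't Repeat Incidents (DRI) gate sketch: a service with open
# root-cause follow-ups blocks new feature work until they're resolved.

open_followups = {
    "login-service": ["fix connection-pool exhaustion under load"],
    "billing-service": [],
}

def can_start_new_work(service):
    """New work is allowed only when no root-cause follow-ups remain."""
    return not open_followups.get(service, [])

for svc in ("billing-service", "login-service"):
    status = "clear for new work" if can_start_new_work(svc) else "DRI: blocked"
    print(f"{svc}: {status}")
```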

CHAPTER 4
Overcoming Microservices Complexity

Overcoming Microservices Complexity

Like peanut butter and chocolate, microservices and DevOps are better together. By now, companies understand that transforming monolithic applications into decomposed services can drive dramatic gains in productivity, speed, agility, scalability, and reliability. But while teams recognize the changes required in developing, testing, and deploying microservices, they often overlook the substantial changes required in collaboration and communications. Engineers at New Relic developed the following best practices to foster a collaborative environment that simplifies the complexities and communication challenges inherent in a microservices world.

Good communication practices for a microservices environment

- Create an announcements-only channel: It's essential to have a single source of truth for important announcements rather than expecting teams to glean important information from multiple boards, discussions, and emails.
- Let upstream and downstream dependencies know of major changes: Before you deploy any major changes to your microservice, notify the teams that depend on it both upstream and downstream so that in case any issues arise, they won't waste time trying to track down the root cause.
- Communicate early and often: This is particularly important for versioning, deprecations, and situations where you may need to temporarily …
- Focus on your service "neighborhood": Your upstream, downstream, infrastructure, and security teams are all "neighbors" of your service. As a good neighbor, you should attend their demos and standups, give your neighbors access to your service's roadmap, and maintain a contact list.
- Treat documentation as essential: Give each service documentation that includes an architecture diagram, description, instructions for running locally, and information on how to contribute.

Using data to better understand how microservices are working

Breaking up your monolithic applications into microservices isn't an easy task. First you need deep understanding of a system before you can parti…
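One way to keep such announcements uniform is to generate them from a template before posting to the single announcements-only channel. A sketch of building a deprecation notice; the channel, service, endpoint, and date are hypothetical, and the actual posting (e.g., via a chat webhook) is left out:

```python
# Build a structured deprecation announcement for a single
# announcements-only channel (all names and dates are hypothetical).

def deprecation_notice(service, endpoint, sunset_date, dependents):
    """Return a chat-message payload announcing a deprecation."""
    return {
        "channel": "#service-announcements",
        "text": (
            f"[{service}] Deprecation: {endpoint} will be removed on "
            f"{sunset_date}. Affected dependents: {', '.join(dependents)}. "
            "Reply in thread with questions."
        ),
    }

msg = deprecation_notice(
    "login-service", "POST /v1/auth", "2018-09-01",
    dependents=["ui-tier", "data-ingest"],
)
print(msg["text"])
```

Generating the message from the service's dependency list also doubles as a check that upstream and downstream neighbors were actually notified.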

