Book Club: The DevOps Handbook (Chapter 19. Enable and Inject Learning into Daily Work)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles and novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the HealthCare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 19

Institutionalize rituals that increase safety, continuous improvement, and learning by doing the following:

  • Establish a just culture to make safety possible
  • Inject production failures to create resilience
  • Convert local discoveries into global improvements
  • Reserve time to create organizational improvements and learning

When teams work within a complex system, it’s impossible to predict all the outcomes for the actions they take. To enable teams to safely work within complex systems, organizations must become ever better at diagnostics and improvement activities. They must be skilled at detecting problems, solving them, and multiplying the effects by making the solutions available throughout the organization.

Resilient organizations are “skilled at detecting problems, solving them, and multiplying the effect by making the solutions available throughout the organization.” These organizations can heal themselves. “For such an organization, responding to crises is not idiosyncratic work. It is something that is done all the time. It is this responsiveness that is their source of reliability.”

Dr. Steven Spear

Chaos Monkey – A Netflix tool that simulates failures in the system to help build resiliency.

When Netflix first ran Chaos Monkey in their production environments, services failed in ways they never could have predicted or imagined – by constantly finding and fixing these issues, Netflix engineers quickly and iteratively created a more resilient service, while simultaneously creating organizational learnings.
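The Chaos-Monkey-style approach described above can be illustrated with a minimal sketch. This is a toy in-memory “fleet” with hypothetical field names, not Netflix’s actual tool; real tooling would call the cloud provider’s termination API rather than flip a dictionary value.

```python
import random

def pick_victim(instances, seed=None):
    """Randomly choose one running instance to terminate, so the
    fleet's ability to survive the loss can be exercised."""
    rng = random.Random(seed)  # seedable for repeatable drills
    running = [i for i in instances if i["state"] == "running"]
    if not running:
        return None  # nothing to disrupt
    return rng.choice(running)

def inject_failure(instances, seed=None):
    """Mark the chosen victim as terminated, simulating the fault.
    A real implementation would invoke the provider's terminate call."""
    victim = pick_victim(instances, seed)
    if victim is not None:
        victim["state"] = "terminated"
    return victim
```

Running this continuously against a service forces teams to find and fix the failure modes that only surface when instances disappear unexpectedly.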

Establish a Just, Learning Culture

In a just culture, when accidents occur (which they undoubtedly will), the response to those accidents is seen as “just” by the people involved.

“When responses to incidents and accidents are seen as unjust, it can impede safety investigations, promoting fear rather than mindfulness in people who do safety-critical work, making organizations more bureaucratic rather than more careful, and cultivating professional secrecy, evasion, and self-protection.”

Dr. Sidney Dekker

Dr. Dekker calls this notion of eliminating error by eliminating the people who caused the errors the Bad Apple Theory. He asserts this notion is invalid, because “human error is not the cause of our troubles; instead, human error is a consequence of the design of the tools that we gave them.” Instead of “naming, blaming, and shaming” the person who caused the failure, our goal should always be to maximize opportunities for organizational learning.

If teams punish the engineer who caused a failure, everyone is deterred from providing the details necessary to understand the mechanism and operation of the failure, which guarantees that the failure will occur again. Two effective practices that help create a just, learning-based culture are blameless post-mortems and the controlled introduction of failures into production.

Schedule Blameless Post-Mortem Meetings After Accidents Occur

To conduct a blameless post-mortem, the process should include:

(1) Construct a timeline and gather details from multiple perspectives on failures, ensuring teams don’t punish people for making mistakes.
(2) Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures.
(3) Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future.
(4) Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight.
(5) Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-ups.

To enable teams to gain this understanding, the following stakeholders need to be present at the meeting:

  • The people involved in decisions that may have contributed to the problem
  • The people who identified the problem
  • The people who responded to the problem
  • The people who diagnosed the problem
  • The people who were affected by the problem
  • Anyone else who is interested in attending the meeting

The first task in the blameless post-mortem meeting is to record the best understanding of the timeline of relevant events as they occurred. During the meeting and the subsequent resolution, we should explicitly disallow the phrases “would have” or “could have,” as they are counterfactual statements that result from our human tendency to create possible alternatives to events that have already occurred. In the meeting, teams must reserve enough time for brainstorming and deciding which countermeasures to implement.
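Step (5) of the process above calls for recording countermeasures with a target date and an owner. One way to keep those follow-ups auditable is a minimal record like the sketch below (the field names and the `overdue` helper are illustrative, not from the book):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Countermeasure:
    """One follow-up action from a blameless post-mortem.
    The owner and target date make the item trackable."""
    description: str
    owner: str
    target_date: date
    done: bool = False

def overdue(items, today):
    """Return the countermeasures that missed their target date,
    so they can be raised at the next review."""
    return [c for c in items if not c.done and c.target_date < today]
```

Reviewing the overdue list at a recurring meeting is one simple way to ensure countermeasures are actually implemented rather than forgotten after the post-mortem.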

Publish Post-Mortems As Widely As Possible

Teams should widely announce the availability of the meeting notes and any associated artifacts (e.g., timelines, IRC chat logs, external communications). This information should be placed in a centralized location where the entire organization can access it and learn from the incident. Doing this helps us translate local learnings and improvements into global learnings and improvements.

Etsy’s Morgue, their internal tool for recording post-mortems, captures:

  • Whether the problem was due to a scheduled or an unscheduled incident
  • The post-mortem owner
  • Relevant IRC chat logs (especially important for 2 a.m. issues when accurate note-taking may not happen)
  • Relevant JIRA tickets for corrective actions and their due dates (information particularly important to management)
  • Links to customer forum posts (where customers complain about issues)

Decrease Incident Tolerances to Find Ever-Weaker Failure Signals

As organizations learn how to see and solve problems efficiently, they need to decrease the threshold of what constitutes a problem in order to keep learning.

Organizations are often structured in one of two models: (1) a standardized model, where routine and systems govern everything, including strict compliance with timelines and budgets; or (2) an experimental model, where, every day, every exercise and every piece of new information is evaluated and debated in a culture that resembles a research and design laboratory.

Redefine Failure and Encourage Calculated Risk Taking

To reinforce a culture of learning and calculated risk-taking, teams need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures. “DevOps must allow this sort of innovation and the resulting risks of people making mistakes. Yes, you’ll have more failures in production. But that’s a good thing, and should not be punished.” – Roy Rapoport of Netflix

Inject Production Failures to Enable Resilience and Learning

As Michael Nygard, author of “Release It! Design and Deploy Production-Ready Software”, writes, “Like building crumple zones into cars to absorb impacts and keep passengers safe, you can decide what features of the system are indispensable and build in failure modes that keep cracks away from those features. If you do not design your failure modes, then you will get whatever unpredictable—and usually dangerous—ones happen to emerge.”

Resilience requires that teams first define failure modes and then perform testing to ensure that these failure modes operate as designed. One way to accomplish this is by injecting faults into the production environment and rehearsing large-scale failures to build confidence in recovering from accidents when they occur, ideally without impacting customers.

Institute Game Days to Rehearse Failures

The concept of Game Days comes from the discipline of resilience engineering. Jesse Robbins defines resilience engineering as “an exercise designed to increase resilience through large-scale fault injection across critical systems.” The goal for a Game Day is to help teams simulate and rehearse accidents to give them the ability to practice.

The Game Day process involves:

(1) Schedule a catastrophic event, such as the simulated destruction of an entire data center, to happen at some point in the future.
(2) Give teams time to prepare, to eliminate all the single points of failure and to create the necessary monitoring procedures, failover procedures, etc.
(3) The Game Day team defines and executes drills, such as conducting database failovers or turning off an important network connection to expose problems in the defined processes.
(4) Any problems or difficulties that are encountered are identified, addressed, and tested again.

By executing Game Days, teams progressively create a more resilient service and a higher degree of assurance that they can resume operations when inopportune events occur, as well as create more learnings and a more resilient organization.
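Steps (3) and (4) of the Game Day process, executing drills and collecting the problems to address and re-test, can be sketched as a tiny drill harness (the structure and names are illustrative, not from the book):

```python
def run_drill(drill):
    """Execute a single Game Day drill and report whether the
    documented recovery procedure actually worked."""
    try:
        drill["execute"]()  # e.g. fail over a database, cut a network link
        return {"drill": drill["name"], "passed": drill["verify"]()}
    except Exception as exc:
        # The drill itself blowing up is a finding, not a test error.
        return {"drill": drill["name"], "passed": False, "error": str(exc)}

def game_day(drills):
    """Run every drill and collect the problems that must be
    addressed and re-tested before the next Game Day."""
    results = [run_drill(d) for d in drills]
    problems = [r for r in results if not r["passed"]]
    return results, problems
```

The key design point is that failures are captured rather than aborting the exercise: every problem the drills expose becomes an item to fix and re-test, which is the whole purpose of the Game Day.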

Some of the learnings gained during these Game Day exercises included:

  • When connectivity was lost, the failover to the engineer workstations didn’t work.
  • Engineers didn’t know how to access the conference call bridge, the bridge had capacity for only fifty people, or they needed a new conference call provider who would allow them to kick off engineers who had subjected the entire conference to hold music.
  • When the data centers ran out of diesel for the backup generators, no one knew the procedures for making emergency purchases through the supplier, resulting in someone using a personal credit card to purchase $50,000 worth of diesel.

Latent defects are the problems that appear only because faults were injected into the system.
