The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.
The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.
Background on The DevOps Handbook
More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.
And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.
Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.
The DevOps Handbook
- Creating telemetry to enable seeing and solving problems
- Using telemetry to better anticipate problems and achieve goals
- Integrating user research and feedback into the work of product teams
- Enabling feedback so Dev and Ops can safely perform deployments
- Enabling feedback to increase the quality of work through peer reviews and pair programming
Telemetry is the process of recording and transmitting the readings of an instrument.
During an outage, teams may not be able to determine whether the issue is due to:
- A failure in our application (e.g., defect in the code)
- A failure in our environment (e.g., a networking problem, server configuration problem)
- Something entirely external to us (e.g., a massive denial of service attack)
Operations Rule of Thumb: When something goes wrong in production, we just reboot the server.
Telemetry can be redefined as “an automated communications process by which measurements and other data are collected at remote points and are subsequently transmitted to receiving equipment for monitoring.”
“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it…. Tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy…. We enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.” (Ian Malpass)
Research has shown high performers could resolve production incidents 168 times faster than their peers.
Create Centralized Telemetry Infrastructure
For decades, companies have ended up with silos of information, where Development only creates logging events that are interesting to developers, and Operations only monitors whether the environments are up or down. As a result, when inopportune events occur, no one can determine why the entire system is not operating as designed or which specific component is failing, impeding a team’s ability to bring the system back to a working state.
Monitoring involves two components: data collection at the business logic, application, and environment layers (events, logs, metrics), and an event router responsible for storing events and metrics, which in turn enables visualization, trending, alerting, and anomaly detection. By transforming logs into metrics, teams can perform statistical operations on them, such as using anomaly detection to find outliers and variances even earlier in the problem cycle.
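As a rough illustration of turning logs into metrics, the Python sketch below counts ERROR lines per minute and then flags minutes whose count sits well above the mean. The log line format, the function names, and the threshold of two standard deviations are all assumptions made for this example; the book does not prescribe a particular method.

```python
from collections import Counter
from statistics import mean, pstdev


def errors_per_minute(log_lines):
    """Turn raw log lines into a metric: the count of ERROR lines per minute.

    Assumes each line looks like 'YYYY-MM-DDTHH:MM:SS LEVEL message'
    (a hypothetical format chosen for this sketch).
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed format
        timestamp, level = parts[0], parts[1]
        minute = timestamp[:16]  # 'YYYY-MM-DDTHH:MM'
        if level == "ERROR":
            counts[minute] += 1
        else:
            counts.setdefault(minute, 0)  # keep quiet minutes in the series
    return dict(sorted(counts.items()))


def find_outliers(series, threshold=2.0):
    """Return the keys whose value is more than `threshold` standard
    deviations above the mean of the series."""
    values = list(series.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [key for key, value in series.items() if (value - mu) / sigma > threshold]
```

Once the log stream is a numeric series, the same series can feed graphs, trend lines, and alerts, not just this simple outlier check.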
Ensure it’s easy to enter and retrieve information from our telemetry infrastructure.
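One way to keep the cost of adding a metric close to zero is a one-line call that sends a small UDP datagram to a metrics daemon. The sketch below uses the StatsD line protocol (`name:value|c` for counters, `name:value|ms` for timers). It assumes a StatsD-compatible daemon is running; the host, the metric names, and the helper function names are hypothetical.

```python
import socket


def format_counter(name, value=1):
    """Format a counter increment in the StatsD line protocol: 'name:value|c'."""
    return f"{name}:{value}|c"


def format_timer(name, milliseconds):
    """Format a timing measurement in the StatsD line protocol: 'name:value|ms'."""
    return f"{name}:{milliseconds}|ms"


def send_metric(payload, host="127.0.0.1", port=8125):
    """Send one metric as a UDP datagram (fire-and-forget).

    The host is an assumption for this sketch; 8125 is the port a StatsD
    daemon conventionally listens on.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))
```

A developer could then record an event with a single line such as `send_metric(format_counter("login.success"))`, with no configuration change required to introduce the new metric name.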
Create Application Logging Telemetry That Helps Production
- DEBUG level – program specific
- INFO level – user driven or system specific (credit card transaction)
- WARN level – conditions that can become an error (long DB call)
- ERROR level – error conditions like API call fail
- FATAL level – when to terminate (network daemon can’t bind to a network socket)
Choosing the right logging level is important. Dan North, a former ThoughtWorks consultant who was involved in several projects in which the core continuous delivery concepts took shape, observes, “When deciding whether a message should be ERROR or WARN, imagine being woken up at 4 a.m. Low printer toner is not an ERROR.”
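The levels above map closely onto Python's standard `logging` module, as the short sketch below shows. Note that Python spells two of them differently: WARN is `WARNING` and FATAL is `CRITICAL` (with `logging.FATAL` kept as an alias). The logger name and the messages are made up for illustration.

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")  # hypothetical service name

# DEBUG: program-specific detail, usually switched off in production
log.debug("cart contents serialized: %d items", 3)
# INFO: user-driven or system-specific actions
log.info("credit card transaction approved: order=%s", "A1001")
# WARN: a condition that could become an error
log.warning("database call took %d ms", 1800)
# ERROR: an error condition such as a failed API call
log.error("payment API call failed: status=%d", 503)
# FATAL: the process must terminate
log.critical("cannot bind to port %d, shutting down", 8080)
```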
Significant Logging Events: Authentication/authorization decisions (including logoff), System and data access, System and application changes (especially privileged changes), Data changes (such as adding, editing, or deleting data), Invalid input (possible malicious injection, threats, etc.), Resources (RAM, disk, CPU, bandwidth, or any other resource that has hard or soft limits), Health and availability, Startups and shutdowns, Faults and errors, Circuit breaker trips, Delays, and Backup success/failure.
Use Telemetry To Guide Problem Solving
When there is a culture of blame around outages and problems, groups may avoid documenting changes and displaying telemetry where everyone can see them to avoid being blamed for outages. The so-called “mean time until declared innocent” is how quickly someone can convince everyone else that they didn’t cause the outage.
Questions to ask during problem resolution:
- What evidence do we have from monitoring that a problem is actually occurring?
- What are the relevant events and changes in applications and environments that could have contributed to the problem?
- What hypotheses can be formulated to confirm the link between the proposed causes and effects?
- How can these hypotheses be proven to be correct and successfully effect a fix?
Enable the Creation of Production Metrics as Part of Daily Work
Create Self-Service Access to Telemetry and Information Radiators
Information Radiator: defined by the Agile Alliance as “the generic term for any of a number of handwritten, drawn, printed, or electronic displays which a team places in a highly visible location, so that all team members as well as passers-by can see the latest information at a glance: count of automated tests, velocity, incident reports, continuous integration status, and so on. This idea originated as part of the Toyota Production System.”
By putting information radiators in highly visible places, teams promote responsibility among team members, actively demonstrating the following values: (1) The team has nothing to hide from its visitors and (2) The team has nothing to hide from itself — they acknowledge and confront problems.
Find and Fill Telemetry Gaps
- Business level: Examples include the number of sales transactions, revenue of sales transactions, user signups, churn rate, A/B testing results, etc.
- Application level: Examples include transaction times, user response times, application faults, etc.
- Infrastructure level (e.g., database, operating system, networking, storage): Examples include web server traffic, CPU load, disk usage, etc.
- Deployment pipeline level: Examples include build pipeline status (e.g., red or green for our various automated test suites), change deployment lead times, deployment frequencies, test environment promotions, and environment status.
Application and Business Metrics
At the application level, the goal is to ensure teams are generating telemetry not only around application health, but also around how well they’re achieving organizational goals (e.g., number of new users, user login events, user session lengths, percent of users active, how often certain features are being used).
If a team has a service that’s supporting e-commerce, they should ensure telemetry around all of the user events that lead up to a successful transaction that generates revenue. The team can then instrument all the user actions that are required for desired customer outcomes.
E-commerce sites may want to maximize the time users spend on the site; search engines, however, may want to reduce it, since long sessions may indicate that users are having difficulty finding what they’re looking for. Business metrics will be part of a customer acquisition funnel, which describes the theoretical steps a potential customer takes to make a purchase. For instance, in an e-commerce site, the measurable journey events include total time on site, product link clicks, shopping cart adds, and completed orders.
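To make the funnel idea concrete, here is a minimal sketch that computes the conversion rate between consecutive funnel steps. The step names follow the journey events listed above; the counts are invented purely for illustration, and the function name is hypothetical.

```python
def funnel_conversion(step_counts):
    """Given ordered (step_name, user_count) pairs, return the fraction of
    users from each step who reached the next one."""
    rates = []
    for (name, count), (next_name, next_count) in zip(step_counts, step_counts[1:]):
        rate = next_count / count if count else 0.0
        rates.append((f"{name} -> {next_name}", rate))
    return rates


# Made-up numbers for illustration only.
funnel = [
    ("product link clicks", 1000),
    ("shopping cart adds", 200),
    ("completed orders", 50),
]
for transition, rate in funnel_conversion(funnel):
    print(f"{transition}: {rate:.0%}")
```

Tracking these ratios over time, rather than only the raw counts, makes it easier to see which step of the journey a change helped or hurt.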
The goal for production and non-production infrastructure is to ensure teams are generating enough telemetry so that if a problem occurs in any environment, they can quickly determine whether infrastructure is a contributing cause of the problem.
Overlaying Other Relevant Information Onto Metrics
Operational side effects are not just outages, but also significant disruptions and deviations from standard operations. Make work visible by overlaying all production deployment activities on graphs.
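Overlaying deployments on a metric amounts to merging two time series: the metric samples and the deployment events. The sketch below does this in plain Python without a graphing library. The data shapes, function names, and the 30-minute lookback window are assumptions for the example.

```python
from datetime import datetime, timedelta


def overlay(metric_points, deployments):
    """Merge metric samples and deployment markers into one time-ordered list.

    metric_points: list of (datetime, value) pairs
    deployments:   list of {"time": datetime, "name": str} dicts
    """
    timeline = [(t, "metric", value) for t, value in metric_points]
    timeline += [(d["time"], "deploy", d["name"]) for d in deployments]
    return sorted(timeline, key=lambda item: item[0])


def deployments_before(event_time, deployments, window=timedelta(minutes=30)):
    """Return the deployments that happened within `window` before event_time."""
    return [
        d for d in deployments
        if timedelta(0) <= event_time - d["time"] <= window
    ]


# Made-up sample data: response time in ms, with one deployment in between.
metrics = [
    (datetime(2024, 1, 1, 10, 0), 120),
    (datetime(2024, 1, 1, 10, 5), 125),
    (datetime(2024, 1, 1, 10, 10), 480),
]
deploys = [{"time": datetime(2024, 1, 1, 10, 7), "name": "web release"}]

for when, kind, detail in overlay(metrics, deploys):
    print(when.strftime("%H:%M"), kind, detail)
```

With the deployment marker sitting between the 10:05 and 10:10 samples, the jump in response time can be read directly against the change that preceded it.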
By having all elements of a service emitting telemetry that can be analyzed, whether it’s the application, database, or environment, and making that telemetry widely available, teams can find and fix problems long before they cause something catastrophic.