Book Club: The DevOps Handbook (Chapter 15. Analyze Telemetry to Better Anticipate Problems and Achieve Goals)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants benefit from group discussion of foundational principles and novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes by integrating Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 15

Outlier Detection: “abnormal running conditions from which significant performance degradation may well result, such as an aircraft engine rotation defect or a flow problem in a pipeline.”

Teams create better alerts by increasing the signal-to-noise ratio, focusing on the variances or outliers that matter.

Adapted from The DevOps Handbook

Instrument and Alert on Undesired Outcomes

Analyze the most severe incidents in the recent past and create a list of telemetry that could have enabled earlier / faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented.

For instance, if an NGINX web server stopped responding to requests, a team could look at the leading indicators that could have warned the team that the system was starting to deviate from standard operations, such as those below (a simple threshold check over these indicators is sketched after the list):

  • Application level: increasing web page load times, etc.
  • OS level: server free memory running low, disk space running low, etc.
  • Database level: database transaction times taking longer than normal, etc.
  • Network level: number of functioning servers behind the load balancer dropping, etc.
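As a rough illustration, the indicators above could feed a simple threshold-based alert. This is a minimal sketch, assuming a hypothetical get_metric() collector; the metric names and thresholds are illustrative, not from the book.

```python
# Hypothetical leading-indicator checks for an NGINX web service.
# Each tuple: (metric name, threshold, direction of concern).
LEADING_INDICATORS = [
    ("app.page_load_time_ms",      2000, "above"),   # application level
    ("os.free_memory_mb",           256, "below"),   # OS level
    ("db.transaction_time_ms",      500, "above"),   # database level
    ("net.healthy_backend_count",     2, "below"),   # network level
]

def check_indicators(get_metric):
    """Return alerts for any indicator deviating from standard operations."""
    alerts = []
    for name, threshold, direction in LEADING_INDICATORS:
        value = get_metric(name)
        breached = value > threshold if direction == "above" else value < threshold
        if breached:
            alerts.append(f"ALERT: {name}={value} is {direction} threshold {threshold}")
    return alerts

# Example usage with fake metric values:
fake_metrics = {
    "app.page_load_time_ms": 3100,
    "os.free_memory_mb": 900,
    "db.transaction_time_ms": 120,
    "net.healthy_backend_count": 1,
}
for alert in check_indicators(fake_metrics.get):
    print(alert)
```

In practice these checks would run inside a monitoring system rather than ad hoc code; the point is that each indicator gives earlier warning than waiting for the server to stop responding.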

Problems That Arise When Telemetry Data Has a Non-Gaussian Distribution

Using means and standard deviations to detect variance can be extremely useful. However, applying these techniques to many of the telemetry data sets used in Operations will not generate the desired results. When a data set does not follow a Gaussian bell curve, the properties associated with standard deviations do not apply.

Many production data sets have a non-Gaussian distribution.

“In Operations, many of our data sets have what we call ‘chi squared’ distribution. Using standard deviations for this data not only results in over- or under-alerting, but it also results in nonsensical results.” She continues, “When you compute the number of simultaneous downloads that are three standard deviations below the mean, you end up with a negative number, which obviously doesn’t make sense.”

Dr. Nicole Forsgren, The DevOps Handbook
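A quick demonstration of this point: for a right-skewed, chi-squared-like data set, the "mean minus three standard deviations" threshold comes out negative, which is nonsensical for a count such as simultaneous downloads. The simulation below is illustrative only.

```python
import random

random.seed(42)
# Simulate simultaneous-download counts with a chi-squared shape
# (sum of squared standard normal draws, k=2 degrees of freedom).
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(2)) for _ in range(10_000)]

mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
std = variance ** 0.5

lower = mean - 3 * std
upper = mean + 3 * std
print(f"mean={mean:.2f} std={std:.2f}")
print(f"3-sigma band: [{lower:.2f}, {upper:.2f}]")  # lower bound is negative
```

For a chi-squared distribution with two degrees of freedom, the mean is about 2 and the standard deviation is about 2, so the three-sigma lower bound lands near -4: an alert threshold no download count can ever cross.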

Another tool developed at Netflix to increase service quality, Scryer, addresses some of the shortcomings of Amazon Auto Scaling (AAS), which dynamically increases and decreases AWS compute server counts based on workload data. Scryer works by predicting customer demand from historical usage patterns and provisioning the necessary capacity in advance.

Scryer addressed three problems with AAS:

  • Difficulty dealing with rapid spikes in demand
  • Removing too much compute capacity after outages, when the rapid decrease in customer demand caused AAS to scale in below what incoming demand would soon require
  • Failing to factor known usage traffic patterns into compute capacity scheduling

Adapted from The DevOps Handbook
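To make the idea concrete, here is a simplified sketch of predictive scaling: provisioning capacity from historical usage patterns rather than reacting only to current load. This illustrates the concept, not Scryer's actual algorithm; all names and numbers are assumptions.

```python
import math
from collections import defaultdict

def predict_hourly_demand(history):
    """history: list of (hour_of_day, requests_per_sec) observations.
    Returns the average demand observed for each hour of the day."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, rps in history:
        totals[hour] += rps
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

def plan_capacity(predicted_rps, rps_per_server, headroom=1.2):
    """Convert a demand forecast into a server count, with headroom."""
    return math.ceil(predicted_rps * headroom / rps_per_server)

# Example: schedule capacity for 9 a.m. from past observations.
history = [(9, 4200), (9, 4600), (9, 4400), (3, 300), (3, 350)]
forecast = predict_hourly_demand(history)
print(plan_capacity(forecast[9], rps_per_server=500))  # e.g. 11 servers
```

Because the schedule comes from history rather than the current load, a post-outage dip in traffic would not cause capacity to be scaled in prematurely, which addresses the second problem in the list above.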

Using Anomaly Detection Techniques

Anomaly Detection is “the search for items or events which do not conform to an expected pattern.”

Smoothing uses moving averages (or rolling averages), which transform the data by averaging each point with all the other data within a sliding window. This has the effect of smoothing out short-term fluctuations and highlighting longer-term trends or cycles.

Adapted from The DevOps Handbook
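Below is a minimal sketch of a moving average over a sliding window; the latency series and window size are made up for illustration.

```python
def moving_average(series, window=3):
    """Average each point with the other data in a trailing sliding window."""
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Short-term spikes are damped; the longer-term trend remains visible.
latency_ms = [100, 102, 180, 101, 99, 103, 250, 104, 106, 105]
print([round(x, 1) for x in moving_average(latency_ms, window=3)])
```

The isolated spikes at 180 and 250 are damped in the smoothed series, making the underlying trend easier to see and to alert on.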