Book Club: The DevOps Handbook (Chapter 15. Analyze Telemetry to Better Anticipate Problems and Achieve Goals)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants benefit from group discussion of foundational principles and novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes by integrating Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 15

Outlier Detection: “abnormal running conditions from which significant performance degradation may well result, such as an aircraft engine rotation defect or a flow problem in a pipeline.”

Teams create better alerts by increasing the signal-to-noise ratio, focusing on the variances or outliers that matter.

Adapted from The DevOps Handbook

Instrument and Alert on Undesired Outcomes

Analyze the most severe incidents in the recent past and create a list of telemetry that could have enabled earlier / faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented.

For instance, if an NGINX web server stopped responding to requests, a team could look at the leading indicators that could have warned the team that the system was starting to deviate from standard operations, such as those below (a simple threshold check over these indicators is sketched after the list):

  • Application level: increasing web page load times, etc.
  • OS level: server free memory running low, disk space running low, etc.
  • Database level: database transaction times taking longer than normal, etc.
  • Network level: number of functioning servers behind the load balancer dropping, etc.
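As a rough illustration, the indicators above could feed a simple threshold-based alert. This is a minimal sketch, assuming a hypothetical get_metric() collector; the metric names and thresholds are illustrative, not from the book.

```python
# Hypothetical leading-indicator checks for an NGINX web service.
# Each tuple: (metric name, threshold, direction of concern).
LEADING_INDICATORS = [
    ("app.page_load_time_ms",      2000, "above"),   # application level
    ("os.free_memory_mb",           256, "below"),   # OS level
    ("db.transaction_time_ms",      500, "above"),   # database level
    ("net.healthy_backend_count",     2, "below"),   # network level
]

def check_indicators(get_metric):
    """Return alerts for any indicator deviating from standard operations."""
    alerts = []
    for name, threshold, direction in LEADING_INDICATORS:
        value = get_metric(name)
        breached = value > threshold if direction == "above" else value < threshold
        if breached:
            alerts.append(f"ALERT: {name}={value} is {direction} threshold {threshold}")
    return alerts

# Example usage with fake metric values:
fake_metrics = {
    "app.page_load_time_ms": 3100,
    "os.free_memory_mb": 900,
    "db.transaction_time_ms": 120,
    "net.healthy_backend_count": 1,
}
for alert in check_indicators(fake_metrics.get):
    print(alert)
```

In practice these checks would run inside a monitoring system rather than ad hoc code; the point is that each indicator gives earlier warning than waiting for the server to stop responding.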

Problems That Arise When Telemetry Data Has a Non-Gaussian Distribution

Using means and standard deviations to detect variance can be extremely useful. However, applying these techniques to many of the telemetry data sets used in Operations will not generate the desired results. When a data set does not follow a Gaussian bell curve, the properties associated with standard deviations do not apply.

Many production data sets have a non-Gaussian distribution.

“In Operations, many of our data sets have what we call ‘chi squared’ distribution. Using standard deviations for this data not only results in over- or under-alerting, but it also results in nonsensical results.” She continues, “When you compute the number of simultaneous downloads that are three standard deviations below the mean, you end up with a negative number, which obviously doesn’t make sense.”

Dr. Nicole Forsgren, The DevOps Handbook
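A quick demonstration of this point: for a right-skewed, chi-squared-like data set, the "mean minus three standard deviations" threshold comes out negative, which is nonsensical for a count such as simultaneous downloads. The simulation below is illustrative only.

```python
import random

random.seed(42)
# Simulate simultaneous-download counts with a chi-squared shape
# (sum of squared standard normal draws, k=2 degrees of freedom).
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(2)) for _ in range(10_000)]

mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
std = variance ** 0.5

lower = mean - 3 * std
upper = mean + 3 * std
print(f"mean={mean:.2f} std={std:.2f}")
print(f"3-sigma band: [{lower:.2f}, {upper:.2f}]")  # lower bound is negative
```

For a chi-squared distribution with two degrees of freedom, the mean is about 2 and the standard deviation is about 2, so the three-sigma lower bound lands near -4: an alert threshold no download count can ever cross.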

Another tool developed at Netflix to increase service quality, Scryer, addresses some of the shortcomings of Amazon Auto Scaling (AAS), which dynamically increases and decreases AWS compute server counts based on workload data. Scryer works by predicting customer demand from historical usage patterns and provisioning the necessary capacity in advance.

Scryer addressed three problems with AAS:

  • Difficulty dealing with rapid spikes in demand
  • Removing too much compute capacity after outages, when the rapid decrease in customer demand caused AAS to scale in below what incoming demand would soon require
  • Failing to factor known usage traffic patterns into compute capacity scheduling

Adapted from The DevOps Handbook
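To make the idea concrete, here is a simplified sketch of predictive scaling: provisioning capacity from historical usage patterns rather than reacting only to current load. This illustrates the concept, not Scryer's actual algorithm; all names and numbers are assumptions.

```python
import math
from collections import defaultdict

def predict_hourly_demand(history):
    """history: list of (hour_of_day, requests_per_sec) observations.
    Returns the average demand observed for each hour of the day."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, rps in history:
        totals[hour] += rps
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

def plan_capacity(predicted_rps, rps_per_server, headroom=1.2):
    """Convert a demand forecast into a server count, with headroom."""
    return math.ceil(predicted_rps * headroom / rps_per_server)

# Example: schedule capacity for 9 a.m. from past observations.
history = [(9, 4200), (9, 4600), (9, 4400), (3, 300), (3, 350)]
forecast = predict_hourly_demand(history)
print(plan_capacity(forecast[9], rps_per_server=500))  # e.g. 11 servers
```

Because the schedule comes from history rather than the current load, a post-outage dip in traffic would not cause capacity to be scaled in prematurely, which addresses the second problem in the list above.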

Using Anomaly Detection Techniques

Anomaly Detection is “the search for items or events which do not conform to an expected pattern.”

Smoothing uses moving averages (or rolling averages), which transform the data by averaging each point with all the other data within a sliding window. This has the effect of smoothing out short-term fluctuations and highlighting longer-term trends or cycles.

Adapted from The DevOps Handbook
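Below is a minimal sketch of a moving average over a sliding window; the latency series and window size are made up for illustration.

```python
def moving_average(series, window=3):
    """Average each point with the other data in a trailing sliding window."""
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Short-term spikes are damped; the longer-term trend remains visible.
latency_ms = [100, 102, 180, 101, 99, 103, 250, 104, 106, 105]
print([round(x, 1) for x in moving_average(latency_ms, window=3)])
```

The isolated spikes at 180 and 250 are damped in the smoothed series, making the underlying trend easier to see and to alert on.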