Book Club: The DevOps Handbook (Chapter 18. Create Review and Coordination Processes to Increase Quality of Our Current Work)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 18

The theme of this section is enabling Development and Operations to reduce the risk of production changes before they are made.

The peer review process at GitHub is an example of how inspection can increase quality, make deployments safe, and be integrated into the flow of everyone’s daily work. They pioneered the process called “pull request”, one of the most popular forms of peer review that span Dev and Ops. Once a pull request is sent, interested parties can review the set of changes, discuss potential modifications, and even push follow-up commits if necessary.

At GitHub, pull requests are the mechanism used to deploy code into production through a collective set of practices called “GitHub Flow”. The process is how engineers request code reviews, integrate feedback, and declare that code will be deployed to production.

GitHub Flow consists of five steps:

  1. To work on something new, the engineer creates a descriptively named branch off of master.
  2. The engineer commits to that branch locally, regularly pushing their work to the same named branch on the server.
  3. When they need feedback or help, or when they think the branch is ready for merging, they open a pull request.
  4. When they get their desired reviews and get any necessary approvals of the feature, the engineer can then merge it into master.
  5. Once the code changes are merged and pushed to master, the engineer deploys them into production.
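
As an illustration of step 3, a pull request can also be opened programmatically through GitHub's REST API. The sketch below is a minimal, hedged example: the repository name, branch names, and the `GITHUB_TOKEN` environment variable are placeholders, not details from the book.

```python
import os
import requests  # third-party HTTP client (pip install requests)

# Hypothetical repository and branch names, used purely for illustration.
REPO = "example-org/example-app"
HEAD_BRANCH = "add-user-avatars"   # descriptively named branch off of master
BASE_BRANCH = "master"             # GitHub Flow merges back into master

def open_pull_request(title: str, body: str) -> dict:
    """Open a pull request so teammates can review the branch (GitHub Flow step 3)."""
    response = requests.post(
        f"https://api.github.com/repos/{REPO}/pulls",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": title, "head": HEAD_BRANCH, "base": BASE_BRANCH, "body": body},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    pr = open_pull_request(
        title="Add user avatars to profile page",
        body="Why: improve engagement. How: new avatar component. Risks: none known.",
    )
    print("Review the change at:", pr["html_url"])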

The Dangers of the Change Approval Process

When high-profile deployment incidents occur, there are typically two responses. The first narrative is that the accident was due to a change control failure, which seems valid because it is easy to imagine a situation where better change control practices could have detected the risk earlier and prevented the change from going into production. The second narrative is that the accident was due to a testing failure.

The reality is that in environments with low-trust, command-and-control cultures, the outcomes of these types of change control and testing countermeasures often result in an increased likelihood that problems will occur again.

Potential Dangers of “Overly Controlling Changes”

Traditional change controls can lead to unintended outcomes, such as contributing to long lead times, and reducing the strength and immediacy of feedback from the deployment process.

Common controls include:

  • Adding more questions that need to be answered to the change request form.
  • Requiring more authorizations, such as one more level of management approval or more stakeholders.
  • Requiring more lead time for change approvals so that change requests can be properly evaluated.
Adapted from The DevOps Handbook

Enable Coordination and Scheduling of Changes

Whenever multiple groups work on systems that share dependencies, changes will likely need to be coordinated to ensure that they don’t interfere with each other. For more complex organizations and organizations with more tightly-coupled architectures, teams may need to deliberately schedule changes, where representatives from the teams get together, not to authorize changes, but to schedule and sequence their changes in order to minimize accidents.

Enable Peer Review of Changes

Instead of requiring approval from an external body prior to deployment, require engineers to get peer reviews of their changes. The goal is to find errors by having fellow engineers close to the work scrutinize changes.

This review improves the quality of changes, which also creates the benefits of cross-training, peer learning, and skill improvement. A logical place to require reviews is prior to committing code to trunk in source control, where changes could potentially have a team-wide or global impact.

The principle of small batch sizes also applies to code reviews. The larger the size of the change that needs to be reviewed, the longer it takes to understand and the larger the burden on the reviewing engineer.

“There is a non-linear relationship between the size of the change and the potential risk of integrating that change—when you go from a ten line code change to a one hundred line code change, the risk of something going wrong is more than ten times higher, and so forth.”

Randy Shoup

“Ask a programmer to review ten lines of code, he’ll find ten issues. Ask him to do five hundred lines, and he’ll say it looks good.”

Giray Özil

Guidelines for Code Reviews include:

  • Everyone must have someone to review their changes before committing to trunk.
  • Everyone should monitor the commit stream of their fellow team members so that potential conflicts can be identified and reviewed.
  • Define which changes qualify as high risk and may require review from a designated subject matter expert.
  • If someone submits a change that is too large to reason about easily, then it should be split up into multiple, smaller changes that can be understood at a glance.

Code Review Formats:

  • Pair programming: programmers work in pairs.
  • “Over-the-shoulder”: One developer looks over the author’s shoulder as the latter walks through the code.
  • Email pass-around: A source code management system emails code to reviewers automatically after the code is checked in.
  • Tool-assisted code review: Authors and reviewers use specialized tools designed for peer code review or facilities provided by the source code repositories.
Adapted from The DevOps Handbook

Potential Danger of Doing More Manual Testing and Change Freezes

When testing failures occur, the typical reaction is to do more testing. This is especially problematic if the testing is performed manually, because manual testing is naturally slower and more tedious than automated testing.

Manual testing often has the consequence of taking significantly longer to test, which means deploying less frequently, thus increasing the deployment batch size. Instead of performing testing on large batches of changes that are scheduled around change freeze periods, fully integrate testing into daily work as part of the smooth and continual flow into production.

Enable Pair Programming to Improve Changes

Pair programming is when two engineers work together at the same workstation, a method popularized by Extreme Programming and Agile in the early 2000s.

In one common pattern of pairing, one engineer fills the role of the driver, the person who actually writes the code, while the other engineer acts as the navigator, observer, or pointer, the person who reviews the work as it is being performed. The driver focuses their attention on the tactical aspects of completing the task, using the observer as a safety net and guide.

Dr. Laurie Williams performed a study in 2001 that showed “paired programmers are 15% slower than two independent individual programmers, while ‘error-free’ code increased from 70% to 85%.”

“Pairs typically consider more design alternatives than programmers working alone and arrive at simpler, more maintainable designs; they also catch design defects early.”

Dr. Laurie Williams

Pair programming has the additional benefit of spreading knowledge throughout the organization and increasing information flow within the team.

Evaluating the Effectiveness of the Pull Request Process

One method to evaluate the effectiveness of peer review is to look at production outages and examine the peer review process for any relevant changes.

Ryan Tomayko, CIO and co-founder of GitHub:

  • “A bad pull request is one that doesn’t have enough context for the reader, having little or no documentation of what the change is intended to do.”
  • “A great pull request has sufficient detail on why the change is being made, how the change was made, as well as any identified risks and resulting countermeasures.”

Fearlessly Cut Bureaucratic Process

Many companies still have long-standing processes for approval that require months to navigate. These approval processes can significantly increase lead times, not only preventing teams from delivering value quickly to customers, but potentially increasing the risk to our organizational objectives.

“A great metric to publish widely is how many meetings and work tickets are mandatory to perform a release—the goal is to relentlessly reduce the effort required for engineers to perform work and deliver it to the customer.”

Adrian Cockcroft

Lessons Learned

By implementing feedback loops, teams can enable everyone to work together toward shared goals, see problems as they occur, and ensure that features not only operate as designed in production, but also achieve organizational goals and contribute to organizational learning.

Book Club: The DevOps Handbook (Chapter 17. Integrate Hypothesis-Driven Development and A/B Testing into Our Daily Work)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 17

All too often in software projects, developers work on features for months or years, spanning multiple releases, without ever confirming whether the desired business outcomes are being met, such as whether a particular feature is achieving the desired results or even being used at all.

Before building a feature, teams should ask themselves: “Should we build it, and why?”

A Brief History of A/B Testing

A/B testing techniques were pioneered in direct response marketing, which is one of the two major categories of marketing strategies. The other is called mass marketing or brand marketing; it relies on placing as many ad impressions in front of people as possible to influence buying decisions.

In previous eras, before email and social media, direct response marketing meant sending thousands of postcards or flyers via postal mail, and asking prospects to accept an offer by calling a telephone number, returning a postcard, or placing an order.

Integrating A/B Testing Into Feature Testing

The most commonly used A/B technique in modern UX practice involves a website where visitors are randomly selected to be shown one of two versions of a page, either a control (“A”) or a treatment (“B”).

A/B tests are also known as online controlled experiments and split tests. Performing meaningful user research and experiments ensures that development efforts help achieve customer and organizational goals.

Integrate A/B Testing Into Releases

Fast and iterative A/B testing is made possible by being able to quickly and easily do production deployments on demand, using feature toggles and potentially delivering multiple versions of our code simultaneously to customer segments.

Integrate A/B Testing Into Feature Planning

Product owners should think about each feature as a hypothesis and use production releases as experiments with real users to prove or disprove that hypothesis.

Hypothesis-Driven Development:

  • We Believe that increasing the size of hotel images on the booking page
  • Will Result in improved customer engagement and conversion.
  • We Will Have Confidence To Proceed When we see a 5% increase in customers who review hotel images who then proceed to book in forty-eight hours.
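
Expressed as a quick experiment, the hypothesis above could be evaluated along these lines. The sketch below is purely illustrative and not code from the book: the experiment name, the hash-based assignment, the toy event log, and the conversion-rate comparison are all assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "larger-hotel-images") -> str:
    """Deterministically assign a user to control (A) or treatment (B)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

def conversion_rate(events: list[dict], variant: str) -> float:
    """Share of users in a variant who booked within forty-eight hours."""
    cohort = [e for e in events if e["variant"] == variant]
    if not cohort:
        return 0.0
    return sum(e["booked_within_48h"] for e in cohort) / len(cohort)

# Toy event log; in practice these events come from production telemetry.
events = [
    {"variant": assign_variant(f"user-{i}"), "booked_within_48h": i % 7 == 0}
    for i in range(1000)
]

control, treatment = conversion_rate(events, "A"), conversion_rate(events, "B")
lift = (treatment - control) / control if control else float("inf")
print(f"control={control:.3f} treatment={treatment:.3f} lift={lift:+.1%}")
# Proceed with the feature only if the observed lift meets the 5% threshold
# (a real analysis would also test for statistical significance).
```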

Book Club: The DevOps Handbook (Chapter 16. Enable Feedback So Development and Operations Can Safely Deploy Code)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 16

The goal is to catch errors in the deployment pipeline before they get into production. However, there will still be errors teams don’t detect, and so they must rely on production telemetry to quickly restore service.

Solutions available to teams:

  • Turn off broken features with feature toggles
  • Fix forward (make code changes to fix the defect that are pushed into production through the deployment pipeline)
  • Roll back (switch back to the previous release by taking broken servers out of rotation using the blue-green or canary release patterns)

Since production deployments are one of the top causes of production issues, each deployment and change event is overlaid onto our metric graphs to ensure that everyone in the value stream is aware of relevant activity, enabling better communication and coordination, as well as faster detection and recovery.

Developers Share Production Duties With Operations

Even when production deployments and releases go flawlessly, in any complex service there will still be unexpected problems, such as incidents and outages that happen at inopportune times. Even if the problem results in a defect being assigned to the feature team, it may be prioritized below the delivery of new features.

As Patrick Lightbody, SVP of Product Management at New Relic, observed in 2011, “We found that when we woke up developers at 2 a.m., defects were fixed faster than ever.” This practice helps Development management see that business goals are not achieved simply because features have been marked as “done”. Instead, the feature is only done when it is performing as designed in production, without causing excessive escalations or unplanned work for either Development or Operations.

When developers get feedback on how their applications perform in production, which includes fixing it when it breaks, they become closer to the customer.

Have Developers Follow Work Downstream

One of the most powerful techniques in interaction and user experience design (UX) is contextual inquiry. This is when the product team watches a customer use the application in their natural environment, often working at their desk. Doing so often uncovers ways that customers struggle with the application, such as:

  • Requiring scores of clicks to perform simple tasks in their daily work
  • Cutting and pasting text from multiple screens
  • Writing down notes on paper

Developers should follow their work downstream, so they can see how downstream work centers must interact with their product to get it running in production. Teams create feedback on the non-functional aspects of their code and identify ways to improve deployability, manageability, operability, etc.

Have Developers Initially Self-Manage Their Production Service

Even when Developers are writing and running their code in production-like environments in their daily work, Operations may still experience disastrous production releases because it’s the first time the application is under true production conditions. This result occurs because operational learnings often occur too late in the software life cycle.

One potential countermeasure is to do what Google does, which is have Development groups self-manage their services in production before they become eligible for a centralized Ops group to manage. By having developers be responsible for deployment and production support, teams are more likely to have a smooth transition to Operations.

Teams could define launch requirements that must be met in order for services to interact with real customers and be exposed to real production traffic.

Launch Guidance:

  • Defect counts and severity: Does the application actually perform as designed?
  • Type/frequency of pager alerts: Is the application generating an unsupportable number of alerts in production?
  • Monitoring coverage: Is the coverage of monitoring sufficient to restore service when things go wrong?
  • System architecture: Is the service loosely-coupled enough to support a high rate of changes and deployments in production?
  • Deployment process: Is there a predictable, deterministic, and sufficiently automated process to deploy code into production?
  • Production hygiene: Is there evidence of enough good production habits that would allow production support to be managed by anyone else?

Google’s Service Handback Mechanism

When a production service becomes sufficiently fragile, Operations has the ability to return production support responsibility back to Development. When a service goes back into a developer-managed state, the role of Operations shifts from production support to consultation, helping the team make the service production-ready.

Adapted from The DevOps Handbook

Google created two sets of safety checks for two critical stages of releasing new services, called the Launch Readiness Review (LRR) and the Hand-Off Readiness Review (HRR). The LRR must be performed and signed off on before any new Google service is made publicly available to customers and receives live production traffic. The HRR is performed when the service is transitioned to an Ops-managed state. The HRR is far more stringent and has higher acceptance standards.

The practice of SREs helping product teams early is an important cultural norm that is continually reinforced at Google. Helping product teams is a long-term investment that will pay off many months later when it comes time to launch. It is a form of ‘good citizenship’ and ‘community service’ that is valued, and it is routinely considered when evaluating engineers for SRE promotions.

Adapted from The DevOps Handbook

Common Regulatory Concerns to Answer

  • Does the service generate a significant amount of revenue?
  • Does the service have high user traffic or have high outage/impairment costs?
  • Does the service store payment cardholder information, such as credit card numbers, or personally identifiable information, such as Social Security numbers or patient care records? Are there other security issues that could create regulatory, contractual obligation, privacy, or reputation risk?
  • Does the service have any other regulatory or contractual compliance requirements associated with it, such as US export regulations, PCI-DSS, HIPAA, and so forth?

Book Club: The DevOps Handbook (Chapter 15. Analyze Telemetry to Better Anticipate Problems and Achieve Goals)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 15

Outlier Detection: “abnormal running conditions from which significant performance degradation may well result, such as an aircraft engine rotation defect or a flow problem in a pipeline.”

Teams create better alerts by increasing the signal-to-noise ratio, focusing on the variances or outliers that matter.

Adapted from The DevOps Handbook

Instrument and Alert on Undesired Outcomes

Analyze the most severe incidents in the recent past and create a list of telemetry that could have enabled earlier / faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented.

For instance, if an NGINX web server stopped responding to requests, a team could look at the leading indicators that could have warned the team the system was starting to deviate from standard operations, such as:

  • Application level: increasing web page load times, etc.
  • OS level: server free memory running low, disk space running low, etc.
  • Database level: database transaction times taking longer than normal, etc.
  • Network level: number of functioning servers behind the load balancer dropping, etc.

Problems That Arise When Telemetry Data Has Non-Gaussian Distribution

Using means and standard deviations to detect variance can be extremely useful. However, using these techniques on many of the telemetry data sets that we use in Operations will not generate the desired results. When the distribution of the data set does not have the Gaussian bell curve described earlier, the properties associated with standard deviations do not apply.

Many production data sets have a non-Gaussian distribution.

“In Operations, many of our data sets have what we call ‘chi squared’ distribution. Using standard deviations for this data not only results in over- or under-alerting, but it also results in nonsensical results.” She continues, “When you compute the number of simultaneous downloads that are three standard deviations below the mean, you end up with a negative number, which obviously doesn’t make sense.”

Dr. Nicole Forsgren, The DevOps Handbook
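
The point about nonsensical thresholds is easy to reproduce. The sketch below is an illustrative example (not from the book) using NumPy: it draws right-skewed, chi-squared samples standing in for simultaneous download counts and shows that a "mean minus three standard deviations" alert line falls below zero.

```python
import numpy as np

rng = np.random.default_rng(42)
# Right-skewed, non-negative data, e.g., simultaneous downloads per minute.
downloads = rng.chisquare(df=2, size=10_000)

mean, std = downloads.mean(), downloads.std()
lower_band = mean - 3 * std  # the "three sigma" lower alert threshold

print(f"mean={mean:.2f} std={std:.2f} lower_band={lower_band:.2f}")
# lower_band is negative, an impossible download count, so a three-sigma
# rule would either never fire or produce nonsensical alerts on this data.
```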

Another tool developed at Netflix to increase service quality, Scryer, addresses some of the shortcomings of Amazon Auto Scaling (AAS), which dynamically increases and decreases AWS compute server counts based on workload data. Scryer works by predicting what customer demands will be based on historical usage patterns and provisions the necessary capacity.

Scryer addressed three problems with AAS:

  • Dealing with rapid spikes in demand
  • After outages, the rapid decrease in customer demand led to AAS removing too much compute capacity to handle future incoming demand
  • AAS didn’t factor in known usage traffic patterns when scheduling compute capacity
Adapted from The DevOps Handbook

Using Anomaly Detection Techniques

Anomaly Detection is “the search for items or events which do not conform to an expected pattern.”

Smoothing is using moving averages (or rolling averages), which transform data by averaging each point with all the other data within a sliding window. This has the effect of smoothing out short-term fluctuations and highlighting longer-term trends or cycles.

Adapted from The DevOps Handbook
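
A moving average is straightforward to compute. The sketch below is an illustrative example only (not taken from the book); the window size and the sample latency series are arbitrary assumptions.

```python
import numpy as np

def moving_average(series: np.ndarray, window: int = 12) -> np.ndarray:
    """Average each point with its neighbors inside a sliding window."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

# Noisy telemetry with an upward trend, e.g., request latency samples.
rng = np.random.default_rng(7)
raw = np.linspace(100, 140, 288) + rng.normal(0, 15, 288)

smoothed = moving_average(raw, window=12)
print("raw std:     ", round(raw.std(), 1))
print("smoothed std:", round(smoothed.std(), 1))  # short-term noise is damped
```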

Book Club: The DevOps Handbook (Chapter 14. Create Telemetry to Enable Seeing and Solving Problems)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 14

Enhancement activities:

  • Creating telemetry to enable seeing and solving problems
  • Using telemetry to better anticipate problems and achieve goals
  • Integrating user research and feedback into the work of product teams
  • Enabling feedback so Dev and Ops can safely perform deployments
  • Enabling feedback to increase the quality of work through peer reviews and pair programming

Telemetry is the process of recording and transmitting the readings of an instrument.

During an outage teams may not be able to determine whether the issue is due to:

  • A failure in our application (e.g., a defect in the code)
  • A failure in our environment (e.g., a networking problem, server configuration problem)
  • Something entirely external to us (e.g., a massive denial of service attack)

Operations Rule of Thumb: When something goes wrong in production, we just reboot the server.

Telemetry can be redefined as “an automated communications process by which measurements and other data are collected at remote points and are subsequently transmitted to receiving equipment for monitoring.”

“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it….Tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy….We enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.”

Ian Malpass

Research has shown high performers could resolve production incidents 168 times faster than their peers.

Mean Time To Resolution (MTTR) for (left to right) High, Medium, and Low performers. Adapted from The DevOps Handbook.

Create Centralized Telemetry Infrastructure

For decades, companies have ended up with silos of information, where Development only creates logging events that are interesting to developers, and Operations only monitors whether the environments are up or down. As a result, when inopportune events occur, no one can determine why the entire system is not operating as designed or which specific component is failing, impeding a team's ability to bring the system back to a working state.

Monitoring involves data collection at the business logic, application, and environment layers (events, logs, metrics) and an event router responsible for storing events and metrics (visualization, trending, alerting, anomaly detection). By transforming logs into metrics, teams can perform statistical operations on them, such as using anomaly detection to find outliers and variances even earlier in the problem cycle.

Ensure it’s easy to enter and retrieve information from our telemetry infrastructure.

Adapted from The DevOps Handbook

Create Application Logging Telemetry That Helps Production

Logging Levels:

  • DEBUG level – program specific
  • INFO level – user driven or system specific (credit card transaction)
  • WARN level – conditions that can become an error (long DB call)
  • ERROR level – error conditions like API call fail
  • FATAL level – when to terminate (network daemon can’t bind to a network socket)
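
These levels map directly onto standard logging libraries. The sketch below is a minimal illustration using Python's built-in `logging` module (which names the WARN and FATAL levels WARNING and CRITICAL); the logger name, messages, and values are hypothetical.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("payments")

log.debug("Entering charge_card() with card token %s", "tok_123")    # DEBUG: program specific
log.info("Credit card transaction approved for order %s", "A-1001")  # INFO: user/system driven
log.warning("Database call took %.1fs, above the 2s budget", 4.2)    # WARN: could become an error
log.error("Payment gateway API call failed: %s", "timeout")          # ERROR: error condition
log.critical("Cannot bind network daemon to port %d, terminating", 8443)  # FATAL equivalent
```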

Choosing the right logging level is important. Dan North, a former ThoughtWorks consultant who was involved in several projects in which the core continuous delivery concepts took shape, observes, “When deciding whether a message should be ERROR or WARN, imagine being woken up at 4 a.m. Low printer toner is not an ERROR.”

Significant Logging Events:

  • Authentication/authorization decisions (including logoff)
  • System and data access
  • System and application changes (especially privileged changes)
  • Data changes (such as adding, editing, or deleting data)
  • Invalid input (possible malicious injection, threats, etc.)
  • Resources (RAM, disk, CPU, bandwidth, or any other resource that has hard or soft limits)
  • Health and availability
  • Startups and shutdowns
  • Faults and errors
  • Circuit breaker trips
  • Delays
  • Backup success/failure

Use Telemetry To Guide Problem Solving

When there is a culture of blame around outages and problems, groups may avoid documenting changes and displaying telemetry where everyone can see them to avoid being blamed for outages. The so-called “mean time until declared innocent” is how quickly someone can convince everyone else that they didn’t cause the outage.

Questions to ask during problem resolution:

  • What evidence is there from monitoring that a problem is actually occurring?
  • What are the relevant events and changes in applications and environments that could have contributed to the problem?
  • What hypotheses can be formulated to confirm the link between the proposed causes and effects?
  • How can these hypotheses be proven to be correct and successfully effect a fix?

Enable the Creation of Production Metrics as Part of Daily Work

Adapted from The DevOps Handbook
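
The idea in this section is that creating a production metric should be a one-line addition to daily work, in the style of StatsD (which originated at Etsy). The sketch below is an assumption-laden illustration: it uses the community `statsd` Python client, a StatsD daemon on localhost, and hypothetical metric names and a hypothetical login function.

```python
from statsd import StatsClient  # community statsd client (pip install statsd)

# Assumes a StatsD daemon listening on localhost:8125 (the default).
statsd = StatsClient(host="localhost", port=8125, prefix="webapp")

def log_in(user, password):
    if not user.check_password(password):   # hypothetical domain object
        # One line of instrumentation alongside the feature code:
        statsd.incr("login.failures")
        return False
    statsd.incr("login.successes")
    return True

# Timing a block of work is just as lightweight:
with statsd.timer("payment.duration"):
    pass  # call the payment gateway here
```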

Create Self-Service Access to Telemetry and Information Radiators

Information Radiator: defined by the Agile Alliance as “the generic term for any of a number of handwritten, drawn, printed, or electronic displays which a team places in a highly visible location, so that all team members as well as passers-by can see the latest information at a glance: count of automated tests, velocity, incident reports, continuous integration status, and so on. This idea originated as part of the Toyota Production System.”

By putting information radiators in highly visible places, teams promote responsibility among team members, actively demonstrating the following values: (1) The team has nothing to hide from its visitors and (2) The team has nothing to hide from itself — they acknowledge and confront problems.

Find and Fill Telemetry Gaps

Teams should identify and fill gaps in telemetry at each of the following levels:

  • Business level: Examples include the number of sales transactions, revenue of sales transactions, user signups, churn rate, A/B testing results, etc.
  • Application level: Examples include transaction times, user response times, application faults, etc.
  • Infrastructure level (e.g., database, operating system, networking, storage): Examples include web server traffic, CPU load, disk usage, etc.
  • Client software level (e.g., JavaScript on the client browser, mobile application): Examples include application errors and crashes, user measured transaction times, etc.
  • Deployment pipeline level: Examples include build pipeline status (e.g., red or green for our various automated test suites), change deployment lead times, deployment frequencies, test environment promotions, and environment status.

Application and Business Metrics

At the application level, the goal is to ensure teams are generating telemetry not only around application health, but also to measure to what extent they’re achieving organizational goals (e.g., number of new users, user login events, user session lengths, percent of users active, how often certain features are being used).

If a team has a service that’s supporting e-commerce, they should ensure telemetry around all of the user events that lead up to a successful transaction that generates revenue. The team can then instrument all the user actions that are required for desired customer outcomes.

For e-commerce sites, they may want to maximize the time spent on the site; however, for search engines, they may want to reduce the time spent on the site, since long sessions may indicate that users are having difficulty finding what they’re looking for. Business metrics will be part of a customer acquisition funnel, which describes the theoretical steps a potential customer takes to make a purchase. For instance, in an e-commerce site, the measurable journey events include total time on site, product link clicks, shopping cart adds, and completed orders.

Adapted from The DevOps Handbook

Infrastructure Metrics

The goal for production and non-production infrastructure is to ensure teams are generating enough telemetry so that if a problem occurs in any environment, they can quickly determine whether infrastructure is a contributing cause of the problem.

Overlaying Other Relevant Information Onto Metrics

Operational side effects are not just outages, but also significant disruptions and deviations from standard operations. Make work visible by overlaying all production deployment activities on graphs.

By having all elements of a service emitting telemetry that can be analyzed, whether it’s the application, database, or environment, and making that telemetry widely available, teams can find and fix problems long before they cause something catastrophic.

Book Club: The DevOps Handbook (Chapter 13. Architect for Low-Risk Releases)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 13

Strangler Application Pattern — instead of “ripping out and replacing” old services with architectures that no longer support our organizational goals, the existing functionality is put behind an API and further changes to the old service are avoided. All new functionality is then implemented in the new services that use the new desired architecture, making calls to the old system when necessary.

Charles Betz, author of Architecture and Patterns for IT Service Management, Resource Planning, and Governance, observes, “[IT project owners] are not held accountable for their contributions to overall system entropy.”

Reducing overall complexity and increasing the productivity of all development teams is rarely the goal of an individual project.

An Architecture That Enables Productivity, Testability, and Safety

A loosely-coupled architecture with well-defined interfaces that enforce how modules connect with each other promotes productivity and safety. It enables small, productive, two-pizza teams that are able to make small changes that can be safely and independently deployed. Since each service also has a well-defined API, it enables easier testing of services and the creation of contracts and SLAs between teams.

Adapted from The DevOps Handbook

Architecture Patterns Pros and Cons

Monolithic (all functionality in one application):

  • Pro: Simple at first
  • Pro: Low inter-process latencies
  • Pro: Single codebase, one deployment unit
  • Pro: Resource-efficient at small scales
  • Con: Coordination overhead increases as team grows
  • Con: Poor enforcement of modularity
  • Con: Poor scaling
  • Con: All-or-nothing deploy (downtime, failures)
  • Con: Long build times

Monolithic (set of monolithic tiers: “front end”, “application server”, “database layer”):

  • Pro: Simple at first
  • Pro: Join queries are easy
  • Pro: Single schema, deployment
  • Pro: Resource-efficient at small scales
  • Con: Tendency for increased coupling over time
  • Con: Poor scaling and redundancy (all or nothing, vertical only)
  • Con: Difficult to tune properly
  • Con: All-or-nothing schema management

Microservice (modular, independent, graph relationship vs tiers, isolated persistence):

  • Pro: Each unit is simple
  • Pro: Independent scaling and performance
  • Pro: Independent testing and deployment
  • Pro: Can optimally tune performance (caching, replication, etc.)
  • Con: Many cooperating units
  • Con: Many small repos
  • Con: Requires more sophisticated tooling and dependency management
  • Con: Network latencies

Use the Strangler Application Pattern to Safely Evolve Enterprise Architecture

The term strangler application was coined by Martin Fowler in 2004 after he was inspired by seeing massive strangler vines during a trip to Australia, writing, “They seed in the upper branches of a fig tree and gradually work their way down the tree until they root in the soil. Over many years they grow into fantastic and beautiful shapes, meanwhile strangling and killing the tree that was their host.”

The strangler application pattern involves placing existing functionality behind an API, where it remains unchanged, and implementing new functionality using the desired architecture, making calls to the old system when necessary. When teams implement strangler applications, they seek to access all services through versioned APIs, also called versioned services or immutable services.
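
In practice, the pattern often begins as a thin routing facade: calls for functionality that has been rebuilt go to the new service, and everything else is proxied to the legacy system, which stays untouched behind its API. The sketch below is a minimal illustration; the service URLs, endpoint names, and migrated-endpoint set are hypothetical.

```python
import requests

LEGACY_BASE = "https://legacy.example.internal/api/v1"   # old system, left unchanged
NEW_BASE = "https://orders.example.internal/api/v2"      # new desired architecture

# Endpoints that have been re-implemented in the new service so far.
MIGRATED_ENDPOINTS = {"orders", "invoices"}

def handle(endpoint: str, params: dict) -> dict:
    """Route each call to the new service if migrated, otherwise to the legacy API."""
    base = NEW_BASE if endpoint in MIGRATED_ENDPOINTS else LEGACY_BASE
    response = requests.get(f"{base}/{endpoint}", params=params, timeout=5)
    response.raise_for_status()
    return response.json()

# Over time, more endpoints move into MIGRATED_ENDPOINTS until the
# legacy system is fully "strangled" and can be retired.
```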

From The Pipeline v33.0

The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

On the Diverse And Fantastical Shapes of Testing

Martin Fowler walks through recent discussion on testing models and the loose definition of “unit test”, with some historical background. The test pyramid posits that most testing is done as unit tests, whereas the honeycomb and the trophy instead call for a relatively small number of unit tests and focus mostly on integration tests.

Why You Shouldn’t Use Cucumber for API Testing

“Many people misunderstand the purpose of Cucumber. Because it seems to yield clearer, plain-language test scripts, testers want to use Cucumber as a general-purpose testing tool, including for API tests. But its true purpose is as a BDD framework. You may be thinking, what’s the harm? Here’s why it makes a difference—and why you should choose another tool for API testing.”

Value Stream Thinking: The Next Level of DevOps

Rather than focusing solely on automation, DevOps is much bigger than a CI/CD pipeline. In this article from CloudBees, they run through five reasons to apply value stream thinking. Those categories are: (1) DevOps isn’t just pipelines and automation, (2) Visibility identifies issues and creates consensus, (3) Measurement + value stream thinking = The where and the how, (4) Value should be added at every stage, and (5) Value stream thinking helps negotiate complexity.

Accessibility Testing on Foldable Smartphones

Foldable smartphones are the next generation of smartphones. Native app development teams will have to adjust non-functional testing areas such as accessibility, security, performance, and UX. For accessibility specifically, there will be scans for both opened and folded modes.

How to Decide if You Should Automate a Test Case

Test automation is imperative for the fast-paced agile projects of today. Testers need to continuously plan, design and execute automated tests to ensure the quality of the software. But the most important task is to decide what to automate first. Here, we have compiled a list of questions to help you prioritize what you should automate next and guide your test automation strategy.

Book Club: The DevOps Handbook (Chapter 12. Automate and Enable Low-Risk Releases)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 12

Kent Beck, the creator of the Extreme Programming methodology, one of the leading proponents of Test Driven Development, and technical coach at Facebook, provided details on their code release strategy at Facebook.

“Chuck Rossi made the observation that there seem to be a fixed number of changes Facebook can handle in one deployment. If we want more changes, we need more deployments. This has led to a steady increase in deployment pace over the past five years, from weekly to daily to thrice daily deployments of our PHP code and from six to four to two-week cycles for deploying our mobile apps. This improvement has been driven primarily by the release engineering team.”

Kent Beck, The DevOps Handbook

Automate The Deployment Process

Some recommended good practices for automating the deployment process include:

  • Packaging code in ways suitable for deployment
  • Creating pre-configured virtual machine images or containers
  • Automating the deployment and configuration of middleware
  • Copying packages or files onto production servers
  • Restarting servers, applications, or services
  • Generating configuration files from templates
  • Running automated smoke tests to make sure the system is working and correctly configured
  • Running testing procedures
  • Scripting and automating database migrations

Deployment Pipeline Requirements:

Deploying the same way to every environment: By using the same deployment mechanism for every environment, production deployments are likely to be far more successful since the team knows that it’s been successfully deployed many times already earlier in the pipeline.

Smoke testing deployments: During the deployment process, test connections to any supporting systems (e.g., databases, message buses, external services) and run a single test “transaction” through the system to ensure that the system is performing as designed. If any of these tests fail, the deployment should be failed.
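
A deployment smoke test can be as simple as exercising one end-to-end "transaction" and the connections to supporting systems, failing the deployment if anything does not respond. The sketch below is an assumption-heavy illustration: the service URL, health endpoint, and PostgreSQL connection string are placeholders for whatever the real system uses.

```python
import sys
import requests
import psycopg2  # assumption: the app's supporting database is PostgreSQL

APP_URL = "https://app.example.internal"          # placeholder service URL
DB_DSN = "dbname=app host=db.example.internal"    # placeholder connection string

def smoke_test() -> bool:
    try:
        # 1. Supporting system: can we reach the database?
        psycopg2.connect(DB_DSN, connect_timeout=5).close()
        # 2. One test "transaction" through the system itself.
        response = requests.get(f"{APP_URL}/health", timeout=5)
        response.raise_for_status()
        return response.json().get("status") == "ok"
    except Exception as exc:
        print(f"Smoke test failed: {exc}")
        return False

if __name__ == "__main__":
    # A non-zero exit code fails the deployment in the pipeline.
    sys.exit(0 if smoke_test() else 1)
```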

Ensure consistent environments: In previous steps, the team created a single-step environment build process so that the development, test, and production environments had a common build mechanism. The team must continually ensure that these environments remain synchronized.

Adapted from The DevOps Handbook

Enable Automated Service Deployments

Tim Tischler, Director of Operations Automation at Nike, describes the common experience of a generation of developers: “As a developer, there has never been a more satisfying point in my career than when I wrote the code, when I pushed the button to deploy it, when I could see the production metrics confirm that it actually worked in production, and when I could fix it myself if it didn’t.”

The Puppet Labs 2013 State of DevOps Report, which surveyed over four thousand technology professionals, found that there was no statistically significant difference in the change success rates between organizations where Development deployed code and those where Operations deployed code.

Changes to Deployment Strategy:

Build: The deployment pipeline must create packages from version control that can be deployed to any environment, including production.

Test: Anyone should be able to run any or all of our automated test suite on their workstation or on test systems.

Deploy: Anybody should be able to deploy these packages to any environment.

Integrate Code Deployments Into The Deployment Pipeline

Deployment automation must provide the following capabilities:

  • Ensure packages created during the continuous integration process are suitable for deployment into production
  • Show the readiness of production environments at a glance
  • Provide a push-button, self-service method for any suitable version of the packaged code to be deployed into production
  • Record automatically — for auditing and compliance purposes — which commands were run on which machines when, who authorized it, and what the output was
  • Run a smoke test to ensure the system is operating correctly and the configuration settings — including items such as database connection strings — are correct
  • Provide fast feedback for the deployer so they can quickly determine whether their deployment was successful
Adapted from The DevOps Handbook

Etsy provides solid insight into the state of their deployment process and its capabilities.

The goal at Etsy has been to make it easy and safe to deploy into production with the fewest number of steps and the least amount of ceremony. For instance, deployments are performed by anyone who wants to perform a deployment. Engineers who want to deploy their code first go to a chat room, where engineers add themselves to the deploy queue, see the deployment activity in progress, see who else is in the queue, broadcast their activities, and get help from other engineers when they need it. They execute 4,500 unit tests locally, with all external calls stubbed out. After they check in their changes to trunk in version control, over seven thousand automated trunk tests are instantly run on their continuous integration (CI) servers.

Decouple Deployments From Releases

Deployment is the installation of a specified version of software to a given environment (e.g., deploying code into an integration test environment or deploying code into production). Specifically, a deployment may or may not be associated with a release of a feature to customers.

Release is when the team makes a feature available to all our customers or a segment of customers. The code and environments should be architected in such a way that the release of functionality does not require changing our application code.

There are two broad categories of release patterns:

Environment-based Release Patterns: two or more environments to deploy into, but only one environment is receiving live customer traffic. New code is deployed into a non-live environment, and the release is performed by moving traffic to this environment. These patterns include blue-green deployments, canary releases, and cluster immune systems.

Application-based Release Patterns: modify the application to selectively release and expose specific application functionality by small configuration changes. For instance, feature flags can progressively expose new functionality in production to the development team, all internal employees, 1% of the customers, or the entire customer base. This enables a technique called dark launching, where all the functionality to be launched is staged in production and is tested with production traffic before the release.

Environment-based Release Patterns

The simplest of the three patterns is called blue-green deployment. In this pattern, there are two production environments: blue and green. At any time, only one of these is serving customer traffic. The benefits are enabling the team to perform deployments during normal business hours and conduct simple changeovers.

Adapted from The DevOps Handbook

To implement the pattern, create two databases (a blue and a green database): each version, blue (old) and green (new), has its own database. During the release, the blue database is put into read-only mode, a backup of it is taken and restored into the green database, and finally traffic is switched to the green environment.

Additionally, the team must decouple database changes from application changes. Instead of supporting two databases, the team decouples the release of database changes from the release of application changes by doing two things: (1) make only additive changes to the database, which means never mutating existing database objects; and, (2) make no assumptions in the application about which database version will be in production.

The canary release pattern automates the release process of promoting to successively larger and more critical environments as the team confirms the code is operating as designed. The term canary release comes from the tradition of coal miners bringing caged canaries into mines to provide early detection of toxic levels of carbon monoxide. If there was too much gas in the cave, it would kill the canaries before it killed the miners, alerting them to evacuate.

Adapted from The DevOps Handbook

For the above diagram:

  • A1 group: Production servers that only serve internal employees.
  • A2 group: Production servers that only serve a small percentage of customers and are deployed when certain acceptance criteria have been met (either automated or manual).
  • A3 group: The rest of the production servers, which are deployed after the software running in the A2 cluster meets certain acceptance criteria.

There are two benefits to this type of safeguard: (1) the team protects against defects that are hard to find through automated tests; and, (2) the time required to detect and respond to the degraded performance by the change is reduced.
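
The promotion logic itself can be automated: deploy to the next group only while the error rate observed in the current group stays within an acceptance threshold. The sketch below is a simplified illustration; the group names mirror the A1/A2/A3 clusters above, and the deploy step, soak time, threshold, and error-rate query are placeholders.

```python
import time

GROUPS = ["A1-internal", "A2-small-customer-slice", "A3-rest-of-production"]
ERROR_RATE_THRESHOLD = 0.01   # acceptance criterion: under 1% errors
SOAK_SECONDS = 600            # how long to observe each group before promoting

def deploy_to(group: str, version: str) -> None:
    print(f"deploying {version} to {group}")   # placeholder for the real deploy step

def error_rate(group: str) -> float:
    return 0.0                                 # placeholder for a real telemetry query

def canary_release(version: str) -> bool:
    for group in GROUPS:
        deploy_to(group, version)
        time.sleep(SOAK_SECONDS)               # let telemetry accumulate
        if error_rate(group) > ERROR_RATE_THRESHOLD:
            print(f"{group} failed acceptance; halting rollout and rolling back")
            return False
    return True
```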

Application-based Patterns To Enable Safer Releases

Feature Toggles benefits:

  • Easy Roll Back – features that create problems or interruptions in production can be quickly and safely disabled by merely changing the feature toggle setting.
  • Gracefully degrade performance – when a service experiences extremely high loads that would normally require an increase in capacity or risk having the service fail in production, feature toggles can reduce the quality of service.
  • Increase resilience through a service-oriented architecture – if a feature relies on another service that isn’t complete yet, the team can still deploy the feature into production but hide it behind a feature toggle. When that service finally becomes available, the feature can be toggled on.

Feature toggles allow features to be deployed into production without making them accessible to users, enabling a technique known as dark launching.
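
A feature toggle is often nothing more than a conditional around the new code path, with the toggle value read from configuration that can be changed without a deployment. The sketch below is a minimal illustration; the toggle file, feature names, and stand-in search functions are hypothetical.

```python
import json
from pathlib import Path

TOGGLE_FILE = Path("feature_toggles.json")  # e.g. {"new_search": true, "beta_checkout": false}

def is_enabled(feature: str, default: bool = False) -> bool:
    """Read the toggle at call time so flipping the file takes effect without a deploy."""
    try:
        return bool(json.loads(TOGGLE_FILE.read_text()).get(feature, default))
    except (OSError, json.JSONDecodeError):
        return default  # fail closed: degrade gracefully to the old behavior

def search(query: str) -> list[str]:
    if is_enabled("new_search"):
        return new_search_service(query)   # dark-launched path, hidden until toggled on
    return legacy_search(query)            # existing behavior remains the default

def new_search_service(query: str) -> list[str]:
    return [f"new:{query}"]     # stand-in for the not-yet-released service

def legacy_search(query: str) -> list[str]:
    return [f"legacy:{query}"]  # stand-in for the existing code path
```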

Dark Launch benefits:

  • Deploy all the functionality into production and then perform testing of that functionality while it’s still invisible to customers.
  • Safely simulate production-like loads, providing confidence that the service will perform as expected.

From The Pipeline v32.0

The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

THE LEGENDS OF RUNETERRA CI/CD PIPELINE

In a look into the game industry, a software engineer at Riot shared details on how they build, test, and deploy “Legends of Runeterra”, an online card game. The team switched from Perforce to Git with a hierarchical branch-based workflow because the team was breaking the main branch build too often with trunk-based development. They also create new test environments as needed for each branch so developers have isolated sandboxes to test in. Riot also uses HTTP servers on debug builds of their game for direct control of the game during functional automated testing. Another cool feature Riot has developed is a custom GUI tool for the game so non-technical contributors can more easily use Git.

X-ray Vision and Exploratory Testing

“Imagine you have X-ray vision. Instead of seeing through walls, you can see the inner structure of programs, the bugs lying inside, and how to expose them. Anyone could execute the steps you gave them to reproduce the bugs. The difficulty in testing, then, is not in executing steps; it is figuring out what steps to take. How do you find those hidden bugs? We need to be the X-ray vision.”

Tips for engineering managers learning to lead remotely

GitLab team members share how they managed the shift from in-person, co-located work to working and managing teams remotely at GitLab, to help others make the transition to remote work more easily. Clear communication is key, especially when looking for a quick answer as opposed to a formal meeting. There is a bias towards over-communication when working remotely. Another challenge is to build connected and engaged teams. To help, teams should proactively work to build interpersonal connections with activities such as coffee chats, sharing of non-work hobbies, and team building activities.

Building CI/CD Pipeline with Jenkins, Kubernetes & GitHub: Part 2

This article is the second in a series on implementing a CI/CD pipeline that will cover multibranch pipelines and GitHub Organization pipelines. Give this article a read if you’re interested in learning how to build from the ground up, starting with credential management, configuring pipelines, and using Kubernetes. The article also links to other training materials on fundamentals of Kubernetes and deploying with Kubernetes.

Test Flakiness – One of the main challenges of automated testing (Part II)

The Google testing blog has posted part two in their series on test flakiness. In this edition, they explore the four conditions that can cause flakiness, give advice on triaging those failures, and explain how to remedy the problems at their source. The tests themselves can introduce flakiness, including test data, test workflows, initial setup of test prerequisites, and the initial state of other dependencies. Additionally, an unreliable test-running framework can introduce flakiness. The application and the underlying services and libraries that the testing framework depends upon can cause flakiness. Lastly, the OS and hardware that the application and testing framework depend upon can cause flakiness.

Book Club: The DevOps Handbook (Chapter 11. Enable and Practice Continuous Integration)

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 11

The ability to “branch” in version control systems enables developers to work on different parts of the software system in parallel, without the risk of individual developers checking in changes that could destabilize or introduce errors into trunk. Integration problems result in rework to get the application into a deployable state, including conflicting changes that must be manually merged or merges that cause test failures, which can require multiple developers to resolve.

Development Practices to Support Innovation Time:

  • Continuous integration and trunk-based development
  • Investment in test automation
  • Creation of a hardware simulator so tests could be run on a virtual platform
  • The reproduction of test failures on developer workstations
  • Architecture to support running off a common build and release

“Without automated testing, continuous integration is the fastest way to get a big pile of junk that never compiles or runs correctly.”

DevOps Handbook, Chapter 11

Small Batch Development and What Happens When Code is Committed to Trunk Infrequently

Significant problems result when developers work in long-lived private branches (also known as “feature branches”), only merging back into trunk sporadically, resulting in a large batch size of changes.

Jeff Atwood, founder of the Stack Overflow site and author of the Coding Horror blog, observes that while there are many branching strategies they can all be put on the following spectrum:

  • Optimize for individual productivity – Every single person on the project works in their own private branch. Everyone works independently, and nobody can disrupt anyone else’s work; however, merging becomes a nightmare.
  • Optimize for team productivity – Everyone works in the same common area. There are no branches, just a long, unbroken straight line of development. There’s nothing to understand, so commits are simple, but each commit can break the entire project and bring all progress to a screeching halt.

When merging is difficult, teams become less able and motivated to improve and refactor code because refactorings are more likely to cause rework for everyone else. When this happens, teams are more reluctant to modify code that has dependencies throughout the codebase, which is where they could have the highest payoffs.

“When we do not aggressively refactor our codebase, it becomes more difficult to make changes and to maintain over time, slowing down the rate at which we can add new features.”

Ward Cunningham on Technical Debt

Solving the merge problem was one of the primary reasons behind the creation of continuous integration and trunk-based development practices — to optimize for team productivity over individual productivity.

Adopt Trunk-Based Development Practices

One countermeasure to large batch size merges is to institute continuous integration and trunk-based development practices, where all developers check in their code to trunk at least once per day.

Frequent code commits to trunk means each team can run all automated tests on their application as a whole and receive alerts when a change breaks some other part of the application or interferes with the work of another developer.

Gated Commits – the deployment pipeline first confirms that the submitted change will successfully merge, build as expected, and pass all the automated tests before actually being merged into trunk.
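
A gated commit can be approximated with a pre-merge pipeline job that performs a trial merge of the submitted change into trunk, builds it, runs the automated tests, and only allows the real merge when every step succeeds. The sketch below illustrates that flow under stated assumptions: the branch names, install command, and `pytest` test step are placeholders for whatever the project actually uses.

```python
import subprocess
import sys

def run(*cmd: str) -> None:
    """Run a command and raise if it fails, which fails the gate."""
    subprocess.run(cmd, check=True)

def gate(candidate_branch: str, trunk: str = "master") -> None:
    run("git", "fetch", "origin", trunk, candidate_branch)
    run("git", "checkout", f"origin/{trunk}")
    # 1. Confirm the change merges cleanly into trunk (trial merge, not pushed).
    run("git", "merge", "--no-ff", "--no-edit", f"origin/{candidate_branch}")
    # 2. Confirm the merged result builds and passes the automated tests.
    run("python", "-m", "pip", "install", "-e", ".")   # placeholder build step
    run("python", "-m", "pytest")                      # placeholder test step
    # Only now would the pipeline push the merge to trunk.

if __name__ == "__main__":
    try:
        gate(sys.argv[1] if len(sys.argv) > 1 else "feature/my-change")
    except subprocess.CalledProcessError as exc:
        sys.exit(f"Gate failed: {exc}")
```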