Book Club: The DevOps Handbook (Chapter 15. Analyze Telemetry to Better Anticipate Problems and Achieve Goals)

This entry is part 16 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 15

Outlier Detection: “abnormal running conditions from which significant performance degradation may well result, such as an aircraft engine rotation defect or a flow problem in a pipeline.”

Teams create better alerts by increasing the signal-to-noise ratio, focusing on the variances or outliers that matter.

Adapted from The DevOps Handbook

Instrument and Alert on Undesired Outcomes

Analyze the most severe incidents in the recent past and create a list of telemetry that could have enabled earlier and faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented.

For instance, if an NGINX web server stopped responding to requests, a team could look at the leading indicators that could have warned the team the system was starting to deviate from standard operations, such as:

  • Application level: increasing web page load times, etc.
  • OS level: server free memory running low, disk space running low, etc.
  • Database level: database transaction times taking longer than normal, etc.
  • Network level: number of functioning servers behind the load balancer dropping, etc.

Problems That Arise When Telemetry Data Has Non-Gaussian Distribution

Using means and standard deviations to detect variance can be extremely useful. However, using these techniques on many of the telemetry data sets that we use in Operations will not generate the desired results. When the distribution of the data set does not have the Gaussian bell curve described earlier, the properties associated with standard deviations do not apply.

Many production data sets have a non-Gaussian distribution.

“In Operations, many of our data sets have what we call ‘chi squared’ distribution. Using standard deviations for this data not only results in over- or under-alerting, but it also results in nonsensical results.” She continues, “When you compute the number of simultaneous downloads that are three standard deviations below the mean, you end up with a negative number, which obviously doesn’t make sense.”

Dr. Nicole Forsgren, The DevOps Handbook
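
The failure mode Forsgren describes is easy to reproduce. Below is a minimal sketch (with synthetic, chi-squared-like data, not figures from the book) showing how a mean plus-or-minus three-standard-deviation band on skewed, non-negative telemetry yields a negative, meaningless lower bound:

```python
# Minimal sketch: why mean +/- 3 sigma breaks down on skewed, non-negative
# telemetry such as simultaneous downloads (synthetic data, not from the book).
import numpy as np

rng = np.random.default_rng(42)
downloads = rng.chisquare(df=2, size=10_000) * 100  # heavily right-skewed

mean, std = downloads.mean(), downloads.std()
lower, upper = mean - 3 * std, mean + 3 * std

print(f"mean={mean:.1f}, std={std:.1f}")
print(f"3-sigma band: [{lower:.1f}, {upper:.1f}]")   # lower bound is negative
print(f"share above upper bound: {(downloads > upper).mean():.2%}")
# Gaussian data would put ~0.135% of points above +3 sigma; this heavier tail
# means the same rule over- or under-alerts depending on the distribution.
```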

Another tool developed at Netflix to increase service quality, Scryer, addresses some of the shortcomings of Amazon Auto Scaling (AAS), which dynamically increases and decreases AWS compute server counts based on workload data. Scryer works by predicting what customer demands will be based on historical usage patterns and provisions the necessary capacity.

Scryer addressed three problems with AAS:

  • Dealing with rapid spikes in demand
  • After outages, the rapid decrease in customer demand led to AAS removing too much compute capacity to handle future incoming demand
  • AAS didn’t factor in known usage traffic patterns when scheduling compute capacity

Adapted from The DevOps Handbook

Using Anomaly Detection Techniques

Anomaly Detection is “the search for items or events which do not conform to an expected pattern.”

Smoothing is using moving averages (or rolling averages), which transform data by averaging each point with all the other data within a sliding window. This has the effect of smoothing out short-term fluctuations and highlighting longer-term trends or cycles.

Adapted from The DevOps Handbook
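
As a minimal sketch of the technique (pure Python with made-up data points; in practice a library call such as pandas' rolling mean would do the same):

```python
# Rolling (moving) average: each point is averaged with the points in a
# trailing sliding window, damping short-term spikes.
def rolling_average(points, window):
    smoothed = []
    for i in range(len(points)):
        start = max(0, i - window + 1)
        chunk = points[start : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

raw = [12, 14, 13, 90, 15, 14, 16, 13, 12, 88, 14, 15]  # two short-lived spikes
print(rolling_average(raw, window=4))
# The spikes are damped, so the longer-term trend is easier to see and a
# single noisy sample is less likely to trigger an alert.
```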

Book Club: The DevOps Handbook (Chapter 14. Create Telemetry to Enable Seeing and Solving Problems)

This entry is part 15 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 14

Enhancement activities covered in this part of the book:

  • Creating telemetry to enable seeing and solving problems
  • Using telemetry to better anticipate problems and achieve goals
  • Integrating user research and feedback into the work of product teams
  • Enabling feedback so Dev and Ops can safely perform deployments
  • Enabling feedback to increase the quality of work through peer reviews and pair programming

Telemetry is the process of recording and transmitting the readings of an instrument.

During an outage teams may not be able to determine whether the issue is due to:

  • A failure in our application (e.g., a defect in the code)
  • A problem in our environment (e.g., a networking problem, a server configuration problem)
  • Something entirely external to us (e.g., a massive denial of service attack)

Operations Rule of Thumb: When something goes wrong in production, we just reboot the server.

Telemetry can be redefined as “an automated communications process by which measurements and other data are collected at remote points and are subsequently transmitted to receiving equipment for monitoring.”

“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it….Tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy….We enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.”

Ian Malpass

Research has shown high performers could resolve production incidents 168 times faster than their peers.

Mean Time To Resolution (MTTR) for (left to right) High, Medium, and Low performers. Adapted from The DevOps Handbook.

Create Centralized Telemetry Infrastructure

For decades, companies have ended up with silos of information, where Development only creates logging events that are interesting to developers, and Operations only monitors whether the environments are up or down. As a result, when inopportune events occur, no one can determine why the entire system is not operating as designed or which specific component is failing, impeding a team’s ability to bring the system back to a working state.

Monitoring involves data collection at the business logic, application, and environment layers (events, logs, metrics) and an event router responsible for storing events and metrics (visualization, trending, alerting, anomaly detection). By transforming logs into metrics, teams can perform statistical operations on them, such as using anomaly detection to find outliers and variances even earlier in the problem cycle.

Ensure it’s easy to enter and retrieve information from our telemetry infrastructure.

Adapted from The DevOps Handbook

Create Application Logging Telemetry That Helps Production

Logging Levels:

  • DEBUG level – program specific
  • INFO level – user driven or system specific (credit card transaction)
  • WARN level – conditions that can become an error (long DB call)
  • ERROR level – error conditions like API call fail
  • FATAL level – when to terminate (network daemon can’t bind to a network socket)

Choosing the right logging level is important. Dan North, a former ThoughtWorks consultant who was involved in several projects in which the core continuous delivery concepts took shape, observes, “When deciding whether a message should be ERROR or WARN, imagine being woken up at 4 a.m. Low printer toner is not an ERROR.”
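
As a sketch of how these levels map onto a real logging API, here is Python's standard logging module with illustrative module and message names (Python names the FATAL level CRITICAL):

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
log = logging.getLogger("payments")

log.debug("Entering charge_card() for order %s", 1234)           # program-specific detail
log.info("Credit card transaction accepted for order %s", 1234)  # user-driven event
log.warning("Database call took %.1fs (threshold 2.0s)", 4.3)    # could become an error
log.error("Payment gateway API call failed: %s", "timeout")      # error condition
log.critical("Cannot bind to port %d; shutting down", 8443)      # FATAL-equivalent
```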

Significant Logging Events:

  • Authentication/authorization decisions (including logoff)
  • System and data access
  • System and application changes (especially privileged changes)
  • Data changes (such as adding, editing, or deleting data)
  • Invalid input (possible malicious injection, threats, etc.)
  • Resources (RAM, disk, CPU, bandwidth, or any other resource that has hard or soft limits)
  • Health and availability
  • Startups and shutdowns
  • Faults and errors
  • Circuit breaker trips
  • Delays
  • Backup success/failure

Use Telemetry To Guide Problem Solving

When there is a culture of blame around outages and problems, groups may avoid documenting changes and displaying telemetry where everyone can see them to avoid being blamed for outages. The so-called “mean time until declared innocent” is how quickly someone can convince everyone else that they didn’t cause the outage.

Questions to ask during problem resolution:

  • What evidence from monitoring shows that a problem is actually occurring?
  • What are the relevant events and changes in applications and environments that could have contributed to the problem?
  • What hypotheses can be formulated to confirm the link between the proposed causes and effects?
  • How can these hypotheses be proven to be correct and successfully effect a fix?

Enable the Creation of Production Metrics as Part of Daily Work

Adapted from The DevOps Handbook
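
One way teams make metric creation part of daily work is a StatsD-style client, where a single line of application code creates a counter or timer. A sketch assuming the `statsd` Python package and a StatsD daemon on localhost:8125 (the metric names and `authenticate()` are illustrative):

```python
import statsd

metrics = statsd.StatsClient("localhost", 8125)

def authenticate(user):
    ...  # stand-in for the real authentication logic

def log_in(user):
    metrics.incr("login.successes")        # counter: one more successful login
    with metrics.timer("login.duration"):  # timer: how long authentication took
        authenticate(user)
```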

Create Self-Service Access to Telemetry and Information Radiators

Information Radiator: defined by the Agile Alliance as “the generic term for any of a number of handwritten, drawn, printed, or electronic displays which a team places in a highly visible location, so that all team members as well as passers-by can see the latest information at a glance: count of automated tests, velocity, incident reports, continuous integration status, and so on. This idea originated as part of the Toyota Production System.”

By putting information radiators in highly visible places, teams promote responsibility among team members, actively demonstrating the following values: (1) The team has nothing to hide from its visitors and (2) The team has nothing to hide from itself — they acknowledge and confront problems.

Find and Fill Telemetry Gaps

  • Business level: Examples include the number of sales transactions, revenue of sales transactions, user signups, churn rate, A/B testing results, etc.
  • Application level: Examples include transaction times, user response times, application faults, etc.
  • Infrastructure level (e.g., database, operating system, networking, storage): Examples include web server traffic, CPU load, disk usage, etc.
  • Client software level (e.g., JavaScript on the client browser, mobile application): Examples include application errors and crashes, user measured transaction times, etc.
  • Deployment pipeline level: Examples include build pipeline status (e.g., red or green for our various automated test suites), change deployment lead times, deployment frequencies, test environment promotions, and environment status.

Application and Business Metrics

At the application level, the goal is to ensure teams are generating telemetry not only around application health, but also to measure to what extent they’re achieving organizational goals (e.g., number of new users, user login events, user session lengths, percent of users active, how often certain features are being used).

If a team has a service that’s supporting e-commerce, they should ensure telemetry around all of the user events that lead up to a successful transaction that generates revenue. The team can then instrument all the user actions that are required for desired customer outcomes.

For e-commerce sites, teams may want to maximize the time spent on the site; for search engines, they may want to reduce it, since long sessions may indicate that users are having difficulty finding what they’re looking for. Business metrics will be part of a customer acquisition funnel, which describes the theoretical steps a potential customer takes to make a purchase. For instance, in an e-commerce site, the measurable journey events include total time on site, product link clicks, shopping cart adds, and completed orders.

Adapted from The DevOps Handbook

Infrastructure Metrics

The goal for production and non-production infrastructure is to ensure teams are generating enough telemetry so that if a problem occurs in any environment, they can quickly determine whether infrastructure is a contributing cause of the problem.

Overlaying Other Relevant Information Onto Metrics

Operational side effects are not just outages, but also significant disruptions and deviations from standard operations. Make work visible by overlaying all production deployment activities on graphs.

By having all elements of a service emitting telemetry that can be analyzed, whether it’s the application, database, or environment, and making that telemetry widely available, teams can find and fix problems long before they cause something catastrophic.

Book Club: The DevOps Handbook (Chapter 13. Architect for Low-Risk Releases)

This entry is part 14 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 13

Strangler Application Pattern — instead of “ripping out and replacing” old services with architectures that no longer support our organizational goals, the existing functionality is put behind an API and further changes to the old service are avoided. All new functionality is then implemented in the new services that use the new desired architecture, making calls to the old system when necessary.

Charles Betz, author of Architecture and Patterns for IT Service Management, Resource Planning, and Governance, observes, “[IT project owners] are not held accountable for their contributions to overall system entropy.”

Reducing overall complexity and increasing the productivity of all development teams is rarely the goal of an individual project.

An Architecture That Enables Productivity, Testability, and Safety

A loosely coupled architecture, with well-defined interfaces that enforce how modules connect with each other, promotes productivity and safety. It enables small, productive, two-pizza teams that are able to make small changes that can be safely and independently deployed. Since each service also has a well-defined API, it enables easier testing of services and the creation of contracts and SLAs between teams.

Adapted from The DevOps Handbook

Architecture Patterns Pros and Cons

Monolithic (all functionality in one application):

  • Pro: Simple at first
  • Pro: Low inter-process latencies
  • Pro: Single codebase, one deployment unit
  • Pro: Resource-efficient at small scales
  • Con: Coordination overhead increases as team grows
  • Con: Poor enforcement of modularity
  • Con: Poor scaling
  • Con: All-or-nothing deploy (downtime, failures)
  • Con: Long build times

Monolithic (set of monolithic tiers: “front end”, “application server”, “database layer”):

  • Pro: Simple at first
  • Pro: Join queries are easy
  • Pro: Single schema, deployment
  • Pro: Resource-efficient at small scales
  • Con: Tendency for increased coupling over time
  • Con: Poor scaling and redundancy (all or nothing, vertical only)
  • Con: Difficult to tune properly
  • Con: All-or-nothing schema management

Microservice (modular, independent, graph relationship vs tiers, isolated persistence):

  • Pro: Each unit is simple
  • Pro: Independent scaling and performance
  • Pro: Independent testing and deployment
  • Pro: Can optimally tune performance (caching, replication, etc.)
  • Con: Many cooperating units
  • Con: Many small repos
  • Con: Requires more sophisticated tooling and dependency management
  • Con: Network latencies

Use the Strangler Application Pattern to Safely Evolve Enterprise Architecture

The term strangler application was coined by Martin Fowler in 2004 after he was inspired by seeing massive strangler vines during a trip to Australia, writing, “They seed in the upper branches of a fig tree and gradually work their way down the tree until they root in the soil. Over many years they grow into fantastic and beautiful shapes, meanwhile strangling and killing the tree that was their host.”

The strangler application pattern involves placing existing functionality behind an API, where it remains unchanged, and implementing new functionality using the desired architecture, making calls to the old system when necessary. When teams implement strangler applications, they seek to access all services through versioned APIs, also called versioned services or immutable services.
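
As a sketch of the routing idea (not the book's implementation), here is a minimal facade in Python with Flask: routes that have been strangled out are handled by the new service, while everything else is proxied to the unchanged legacy system. Flask, the host names, and the routes are assumptions for illustration:

```python
from flask import Flask, request
import requests

LEGACY_BASE = "http://legacy-app.internal"   # hypothetical legacy host
MIGRATED_PREFIXES = ("/v1/orders",)          # routes already moved to the new service

app = Flask(__name__)

def handle_in_new_service(path):
    return {"handled_by": "new-service", "path": path}  # new architecture

@app.route("/v1/<path:subpath>", methods=["GET", "POST"])
def route(subpath):
    path = f"/v1/{subpath}"
    if path.startswith(MIGRATED_PREFIXES):
        return handle_in_new_service(path)
    # Fall through to the old system, which stays behind this API unchanged.
    resp = requests.request(request.method, LEGACY_BASE + path,
                            data=request.get_data(), timeout=5)
    return resp.content, resp.status_code
```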

From The Pipeline v33.0

This entry is part 33 of 34 in the series From the Pipeline

The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

On the Diverse And Fantastical Shapes of Testing

Martin Fowler walks through recent discussion on testing models and the loose definition of “unit test” with some historical background. The test pyramid posits that most testing should be done as unit tests, whereas the honeycomb and testing trophy favor a relatively small number of unit tests and focus mostly on integration tests.

Why You Shouldn’t Use Cucumber for API Testing

“Many people misunderstand the purpose of Cucumber. Because it seems to yield clearer, plain-language test scripts, testers want to use Cucumber as a general-purpose testing tool, including for API tests. But its true purpose is as a BDD framework. You may be thinking, what’s the harm? Here’s why it makes a difference—and why you should choose another tool for API testing.”

Value Stream Thinking: The Next Level of DevOps

Rather than focusing solely on automation, DevOps is much bigger than a CI/CD pipeline. This article from CloudBees runs through five reasons to apply value stream thinking: (1) DevOps isn’t just pipelines and automation, (2) visibility identifies issues and creates consensus, (3) measurement + value stream thinking = the where and the how, (4) value should be added at every stage, and (5) value stream thinking helps negotiate complexity.

Accessibility Testing on Foldable Smartphones

Foldable smartphones are the next generation of smartphones. Native app development teams will have to adjust for non-functional testing areas such as accessibility, security, performance, and UX. For accessibility specifically, there will be scans for both opened and folded modes.

How to Decide if You Should Automate a Test Case

Test automation is imperative for the fast-paced agile projects of today. Testers need to continuously plan, design and execute automated tests to ensure the quality of the software. But the most important task is to decide what to automate first. Here, we have compiled a list of questions to help you prioritize what you should automate next and guide your test automation strategy.

Book Club: The DevOps Handbook (Chapter 12. Automate and Enable Low-Risk Releases)

This entry is part 13 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 12

Kent Beck, the creator of the Extreme Programming methodology, one of the leading proponents of Test Driven Development, and a technical coach at Facebook, provided details on Facebook’s code release strategy.

“Chuck Rossi made the observation that there seem to be a fixed number of changes Facebook can handle in one deployment. If we want more changes, we need more deployments. This has led to a steady increase in deployment pace over the past five years, from weekly to daily to thrice daily deployments of our PHP code and from six to four to two-week cycles for deploying our mobile apps. This improvement has been driven primarily by the release engineering team.”

Kent Beck, The DevOps Handbook

Automate The Deployment Process

Some recommended good practices for automating the deployment process include:

  • Packaging code in ways suitable for deployment
  • Creating pre-configured virtual machine images or containers
  • Automating the deployment and configuration of middleware
  • Copying packages or files onto production servers
  • Restarting servers, applications, or services
  • Generating configuration files from templates
  • Running automated smoke tests to make sure the system is working and correctly configured
  • Running testing procedures
  • Scripting and automating database migrations

Deployment Pipeline Requirements:

Deploying the same way to every environment: By using the same deployment mechanism for every environment, production deployments are likely to be far more successful since the team knows that it’s been successfully deployed many times already earlier in the pipeline.

Smoke testing deployments: During the deployment process, test connections to any supporting systems (e.g., databases, message buses, external services) and run a single test “transaction” through the system to ensure that the system is performing as designed. If any of these tests fail, the deployment should be failed.

Ensure consistent environments: In previous steps, the team created a single-step environment build process so that the development, test, and production environments had a common build mechanism. The team must continually ensure that these environments remain synchronized.

Adapted from The DevOps Handbook
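
A smoke-testing step like the one described above can be a short script the pipeline runs after every deployment, failing the deployment on a non-zero exit. A minimal sketch with illustrative endpoints (not from the book):

```python
import sys
import requests

def smoke_test(base_url):
    checks = {
        "database connection": f"{base_url}/health/db",
        "message bus connection": f"{base_url}/health/queue",
        "test transaction": f"{base_url}/orders/smoke-test",
    }
    for name, url in checks.items():
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            print(f"FAIL: {name} returned {resp.status_code}")
            return False
        print(f"ok: {name}")
    return True

if __name__ == "__main__":
    # A failed smoke test fails the deployment.
    sys.exit(0 if smoke_test("https://staging.example.com") else 1)
```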

Enable Automated Service Deployments

Tim Tischler, Director of Operations Automation at Nike, describes the common experience of a generation of developers: “As a developer, there has never been a more satisfying point in my career than when I wrote the code, when I pushed the button to deploy it, when I could see the production metrics confirm that it actually worked in production, and when I could fix it myself if it didn’t.”

The Puppet Labs 2013 State of DevOps Report, which surveyed over four thousand technology professionals, found that there was no statistically significant difference in the change success rates between organizations where Development deployed code and those where Operations deployed code.

Changes to Deployment Strategy:

Build: The deployment pipeline must create packages from version control that can be deployed to any environment, including production.

Test: Anyone should be able to run any or all of our automated test suite on their workstation or on test systems.

Deploy: Anybody should be able to deploy these packages to any environment.

Integrate Code Deployments Into The Deployment Pipeline

Deployment automation must provide the following capabilities:

  • Ensure packages created during the continuous integration process are suitable for deployment into production
  • Show the readiness of production environments at a glance
  • Provide a push-button, self-service method for any suitable version of the packaged code to be deployed into production
  • Record automatically — for auditing and compliance purposes — which commands were run on which machines when, who authorized it, and what the output was
  • Run a smoke test to ensure the system is operating correctly and the configuration settings — including items such as database connection strings — are correct
  • Provide fast feedback for the deployer so they can quickly determine whether their deployment was successful

Adapted from The DevOps Handbook

Etsy provides solid insight into the state of their deployment process and its capabilities.

The goal at Etsy has been to make it easy and safe to deploy into production with the fewest number of steps and the least amount of ceremony. For instance, deployments are performed by anyone who wants to perform a deployment. Engineers who want to deploy their code first go to a chat room, where they add themselves to the deploy queue, see the deployment activity in progress, see who else is in the queue, broadcast their activities, and get help from other engineers when they need it. They execute 4,500 unit tests locally, with all external calls stubbed out. After they check in their changes to trunk in version control, over seven thousand automated trunk tests are instantly run on their continuous integration (CI) servers.

Decouple Deployments From Releases

Deployment is the installation of a specified version of software to a given environment (e.g., deploying code into an integration test environment or deploying code into production). Specifically, a deployment may or may not be associated with a release of a feature to customers.

Release is when the team makes a feature available to all our customers or a segment of customers. The code and environments should be architected in such a way that the release of functionality does not require changing our application code.

There are two broad categories of release patterns:

Environment-based Release Patterns: two or more environments to deploy into, but only one environment receives live customer traffic. New code is deployed into a non-live environment, and the release is performed by moving traffic to this environment. These patterns include blue-green deployments, canary releases, and cluster immune systems.

Application-based Release Patterns: modify the application to selectively release and expose specific application functionality by small configuration changes. For instance, feature flags can progressively expose new functionality in production to the development team, all internal employees, 1% of the customers, or the entire customer base. This enables a technique called dark launching, where all the functionality to be launched is staged in production and is tested with production traffic before the release.

Environment-based Release Patterns

The simplest of the three patterns is called blue-green deployment. In this pattern, there are two production environments: blue and green. At any time, only one of these is serving customer traffic. The benefits include enabling the team to perform deployments during normal business hours and to conduct simple, fast changeovers.

Adapted from The DevOps Handbook

To implement the pattern, create two databases (a blue and a green database): each version, blue (old) and green (new), has its own database. During the release, the blue database is put into read-only mode, a backup is taken and restored into the green database, and finally traffic is switched to the green environment.

Alternatively, the team can decouple database changes from application changes. Instead of supporting two databases, the team decouples the release of database changes from the release of application changes by doing two things: (1) making only additive changes to the database, never mutating existing database objects; and (2) making no assumptions in the application about which database version will be in production.
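
A minimal sketch of the additive-only rule, using sqlite3 purely for illustration: the migration only adds a column, and the application code makes no assumption about whether the new column exists or has been backfilled (table and column names are hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('alice')")

# Additive change: add a column; never rename or drop existing objects.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def display_name(row):
    # Tolerates both schema versions: column missing, or present but not backfilled.
    if "display_name" in row.keys() and row["display_name"]:
        return row["display_name"]
    return row["name"]

db.row_factory = sqlite3.Row
row = db.execute("SELECT * FROM users").fetchone()
print(display_name(row))  # -> 'alice'
```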

The canary release pattern automates the release process of promoting to successively larger and more critical environments as the team confirms the code is operating as designed. The term canary release comes from the tradition of coal miners bringing caged canaries into mines to provide early detection of toxic levels of carbon monoxide. If there was too much gas in the mine, it would kill the canaries before it killed the miners, alerting them to evacuate.

Adapted from The DevOps Handbook

For the above diagram:

  • A1 group: Production servers that only serve internal employees.
  • A2 group: Production servers that only serve a small percentage of customers and are deployed when certain acceptance criteria have been met (either automated or manual).
  • A3 group: The rest of the production servers, which are deployed after the software running in the A2 cluster meets certain acceptance criteria.

There are two benefits to this type of safeguard: (1) the team protects against defects that are hard to find through automated tests; and (2) the time required to detect and respond to degraded performance caused by the change is reduced.
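
The promotion logic for these groups can be expressed compactly. A rough sketch, where the telemetry query and the acceptance criterion are stand-in assumptions:

```python
GROUPS = ["A1-internal", "A2-small-customer-slice", "A3-everyone-else"]
MAX_ERROR_RATE = 0.01  # acceptance criterion: stay under 1% errors

def error_rate(group):
    """Stand-in for querying production telemetry for the group."""
    return {"A1-internal": 0.002,
            "A2-small-customer-slice": 0.004,
            "A3-everyone-else": 0.0}[group]

def deploy_to(group):
    print(f"deploying to {group}")

def canary_release():
    for group in GROUPS:
        deploy_to(group)
        if error_rate(group) > MAX_ERROR_RATE:
            print(f"rolling back: {group} exceeded {MAX_ERROR_RATE:.0%} errors")
            return False
    return True

canary_release()
```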

Application-based Patterns To Enable Safer Releases

Feature Toggles benefits:

  • Easy Roll Back – features that create problems or interruptions in production can be quickly and safely disabled by merely changing the feature toggle setting.
  • Gracefully Degrade Performance – when the service experiences extremely high loads that would normally require an increase in capacity or risk failure in production, feature toggles can reduce the quality of service.
  • Increase Resilience Through a Service-Oriented Architecture – if a feature relies on another service that isn’t complete yet, the team can still deploy the feature into production but hide it behind a feature toggle. When that service finally becomes available, the feature can be toggled on.

Feature toggles allow features to be deployed into production without making them accessible to users, enabling a technique known as dark launching.
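
A feature toggle can be as simple as a flag checked at runtime. The sketch below (the flag store and service names are illustrative) shows both the dark-launch and graceful-degradation behaviors described above:

```python
FEATURE_FLAGS = {"new_recommendations": False}  # flipped without a code deploy

def recommendation_service(user):
    raise ConnectionError("service not finished yet")  # stand-in dependency

def get_recommendations(user):
    if not FEATURE_FLAGS["new_recommendations"]:
        return []  # dark-launched: deployed to production but hidden
    try:
        return recommendation_service(user)
    except ConnectionError:
        return []  # degrade gracefully instead of failing the page

print(get_recommendations("alice"))          # [] -- flag off
FEATURE_FLAGS["new_recommendations"] = True  # release without redeploying
print(get_recommendations("alice"))          # [] -- dependency down, degrades
```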

Dark Launch benefits:

  • Deploy all the functionality into production and then perform testing of that functionality while it’s still invisible to customers.
  • Safely simulate production-like loads, providing confidence that the service will perform as expected.

From The Pipeline v32.0

This entry is part 32 of 34 in the series From the Pipeline

The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

THE LEGENDS OF RUNETERRA CI/CD PIPELINE

In a look into the game industry, a software engineer at Riot shared details on how they build, test, and deploy “Legends of Runeterra”, an online card game. The team switched from Perforce to Git with a hierarchical branch-based workflow because they were breaking the main branch build too often with trunk-based development. They also create new test environments as needed for each branch so developers have isolated sandboxes to test in. Riot also uses HTTP servers on debug builds of the game for direct control during functional automated testing. Another cool feature Riot has developed is a custom Git GUI tool so non-technical contributors can more easily use Git.

X-ray Vision and Exploratory Testing

“Imagine you have X-ray vision. Instead of seeing through walls, you can see the inner structure of programs, the bugs lying inside, and how to expose them. Anyone could execute the steps you gave them to reproduce the bugs. The difficulty in testing, then, is not in executing steps; it is figuring out what steps to take. How do you find those hidden bugs? We need to be the X-ray vision.”

Tips for engineering managers learning to lead remotely

GitLab team members share how they managed the shift from in-person, co-located work to working and managing teams remotely at GitLab, to help others make the transition to remote work more easily. Clear communication is key, especially when looking for a quick answer as opposed to a formal meeting. There is a bias toward over-communication when working remotely. Another challenge is building connected and engaged teams. To help, teams should proactively build interpersonal connections with activities such as coffee chats, sharing non-work hobbies, and team-building activities.

Building CI/CD Pipeline with Jenkins, Kubernetes & GitHub: Part 2

This article is the second in a series on implementing a CI/CD pipeline that will cover multibranch pipelines and GitHub Organization pipelines. Give this article a read if you’re interested in learning how to build from the ground up, starting with credential management, configuring pipelines, and using Kubernetes. The article also links to other training materials on fundamentals of Kubernetes and deploying with Kubernetes.

Test Flakiness – One of the main challenges of automated testing (Part II)

The Google Testing Blog has posted part two of their series on test flakiness. In this edition, they explore the four conditions that can cause flakiness, offer advice on triaging those failures, and explain how to remedy the problems at their source. The tests themselves can introduce flakiness through test data, test workflows, initial setup of test prerequisites, and the initial state of other dependencies. An unreliable test-running framework can also introduce flakiness, as can the application and the underlying services and libraries that the testing framework depends upon. Lastly, the OS and hardware that the application and testing framework depend upon can cause flakiness.

Book Club: The DevOps Handbook (Chapter 11. Enable and Practice Continuous Integration)

This entry is part 12 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 11

The ability to “branch” in version control systems enables developers to work on different parts of the software system in parallel, without the risk of individual developers checking in changes that could destabilize or introduce errors into trunk. Integration problems result in rework to get the application into a deployable state, including conflicting changes that must be manually merged or merges that cause test failures, which can require multiple developers to resolve.

Development Practices to Support Innovation Time:

  • Continuous integration and trunk-based development
  • Investment in test automation
  • Creation of a hardware simulator so tests could be run on a virtual platform
  • The reproduction of test failures on developer workstations
  • Architecture to support running off a common build and release

“Without automated testing, continuous integration is the fastest way to get a big pile of junk that never compiles or runs correctly.”

DevOps Handbook, Chapter 11

Small Batch Development and What Happens When Code is Committed to Trunk Infrequently

Significant problems result when developers work in long-lived private branches (also known as “feature branches”), only merging back into trunk sporadically, resulting in a large batch size of changes.

Jeff Atwood, co-founder of the Stack Overflow site and author of the Coding Horror blog, observes that while there are many branching strategies, they can all be placed on the following spectrum:

  • Optimize for individual productivity – Every single person on the project works in their own private branch. Everyone works independently, and nobody can disrupt anyone else’s work; however, merging becomes a nightmare.
  • Optimize for team productivity – Everyone works in the same common area. There are no branches, just a long, unbroken straight line of development. There’s nothing to understand, so commits are simple, but each commit can break the entire project and bring all progress to a screeching halt.

When merging is difficult, teams become less able and motivated to improve and refactor code because refactorings are more likely to cause rework for everyone else. When this happens, teams are more reluctant to modify code that has dependencies throughout the codebase, which is where they could have the highest payoffs.

“When we do not aggressively refactor our codebase, it becomes more difficult to make changes and to maintain over time, slowing down the rate at which we can add new features.”

Ward Cunningham on Technical Debt

Solving the merge problem was one of the primary reasons behind the creation of continuous integration and trunk-based development practices — to optimize for team productivity over individual productivity.

Adopt Trunk-Based Development Practices

One countermeasure to large batch size merges is to institute continuous integration and trunk-based development practices, where all developers check in their code to trunk at least once per day.

Frequent code commits to trunk means each team can run all automated tests on their application as a whole and receive alerts when a change breaks some other part of the application or interferes with the work of another developer.

Gated Commits – the deployment pipeline first confirms that the submitted change will successfully merge, build as expected, and pass all the automated tests before actually being merged into trunk.
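
A gated commit can be sketched as a script the CI server runs in a scratch checkout of trunk before the real merge is allowed; the git commands are standard, while the `make build` and `make test` targets are assumptions about the project layout:

```python
import subprocess

def run(*cmd):
    return subprocess.run(cmd, capture_output=True, text=True).returncode == 0

def gate(change_branch):
    steps = [
        ("merge cleanly", ("git", "merge", "--no-commit", "--no-ff", change_branch)),
        ("build", ("make", "build")),
        ("pass all tests", ("make", "test")),
    ]
    for name, cmd in steps:
        if not run(*cmd):
            print(f"gate failed: change does not {name}")
            return False
    return True  # only now is the change actually merged into trunk

# Typical use in CI, inside a throwaway workspace checked out at trunk:
# if gate("feature/my-change"): run("git", "commit", "-m", "Merge my-change")
```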

From The Pipeline v31.0

This entry is part 31 of 34 in the series From the Pipeline

The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

How to build a CI/CD pipeline with examples

Every team will set up their CI/CD pipeline differently based on available infrastructure and release policies. The components and tools in any CI/CD pipeline depend on the team’s needs and workflow. At a high level, pipelines have a common structure, which is explained in this article by Deepak Dhami. The four core stages of a pipeline are: source (a change made in an application’s source code, configuration, environment, or data that triggers the pipeline), build, test, and deploy.

The Test Data Bottleneck and How to Solve It

“Test data is one of the major bottlenecks in testing processes. By simplifying test data, we can solve this bottleneck by tackling four major challenges.” Provides a solid, high-level view of different types of test data and how each can be leveraged.

Running k3d and Istio locally

This is an excellent how-to on getting k3d and Istio set up locally. The choice of tooling suits a local setup, since k3d is a wrapper for k3s in Docker. Additionally, the author leverages Keptn for application orchestration. A simple walkthrough for anyone curious.

Introducing Developer Velocity Lab – A Research Initiative to Amplify Developer Work and Well-Being

Microsoft and GitHub have launched the Developer Velocity Lab (DVL), a joint research initiative led by Dr. Nicole Forsgren. The DVL aims to discover, improve, and amplify developer work by focusing on productivity, community, and well-being. Its first publication is “The SPACE of Developer Productivity.”

GitLab and Jira integration: the final steps

The final article of a three-part series on GitLab and Jira integration. It provides multiple walkthroughs on referencing Jira issues by ID in GitLab branch names, commit messages, and merge request titles, as well as on moving Jira issues along via commit messages.

Book Club: The DevOps Handbook (Chapter 10. Enable Fast and Reliable Automated Testing)

This entry is part 11 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 10

Teams are likely to get undesired outcomes if they find and fix errors in a separate test phase, executed by a separate QA department only after all development has been completed. Instead, teams should continuously build, test, and integrate code and environments.

“Without automated testing, the more code we write, the more time and money is required to test our code—in most cases, this is a totally unscalable business model for any technology organization.”

– Gary Gruver, The DevOps Handbook (Chapter 10)

Google’s own success story on automated testing:

  • 40,000 code commits/day
  • 50,000 builds/day (on weekdays, this may exceed 90,000)
  • 120,000 automated test suites
  • 75 million test cases run daily
  • 100+ engineers working on the test engineering, continuous integration, and release engineering tooling to increase developer productivity (making up 0.5% of the R&D workforce)

Continuously Build, Test, and Integrate Code & Environments

Create automated test suites that increase the frequency of integration and testing of the code and the environments from periodic to continuous.

The deployment pipeline, first defined by Jez Humble and David Farley in their book Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, ensures that all code checked in to version control is automatically built and tested in a production-like environment.

Create automated build and test processes that run in dedicated environments. This is critical for the following reasons:

  • The build and test process can run all the time, independent of the work habits of individual engineers.
  • A segregated build and test process ensures that teams understand all the dependencies required to build, package, run, and test the code.
  • Packaging the application to enable repeatable installation of code and configurations into an environment.
  • Instead of putting code in packages, teams may choose to package applications into deployable containers.
  • Environments can be made more production-like in a way that is consistent and repeatable.

The deployment pipeline validates, after every change, that the code successfully integrates into a production-like environment. It becomes the platform through which testers request and certify builds during acceptance testing, and it runs automated performance and security validations.

Adapted from The DevOps Handbook

The goal of a deployment pipeline is to provide everyone in the value stream the fastest possible feedback on whether a change is successful or not. Changes could be to the code, any environment, automated tests, or the deployment pipeline infrastructure.

A continuous integration practice requires three capabilities:

  1. A comprehensive and reliable set of automated tests that validate the application is in a deployable state.
  2. A culture that “stops the entire production line” when the validation tests fail.
  3. Developers working in small batches on trunk rather than long-lived feature branches.

Build a Fast and Reliable Automated Validation Test Suite

Unit Tests: These test a single method, class, or function in isolation, providing assurance to the developer that their code operates as designed. Unit tests often “stub out” databases and other external dependencies.

Acceptance Tests: These test the application as a whole to provide assurance that a higher level of functionality operates as designed and that regression errors have not been introduced.

Humble and Farley define the difference between unit and acceptance testing as, “The aim of a unit test is to show that a single part of the application does what the programmer intends it to… the objective of acceptance tests is to prove that our application does what the customer meant it to, not that it works the way its programmers think it should.”

After a build passes unit tests, the deployment pipeline runs the build against acceptance tests. Any build that passes acceptance tests is then typically made available for manual testing.

Integration Tests: Integration tests ensure that the application correctly interacts with other production applications and services, as opposed to calling stubbed out interfaces.

As Humble and Farley observe, “Much of the work in the SIT environment involves deploying new versions of each of the applications until they all cooperate. In this situation the smoke test is usually a fully fledged set of acceptance tests that run against the whole application.”

Integration tests are performed on builds that have passed both the unit and acceptance test suites. Since integration tests are often brittle, teams should minimize the number of integration tests and find defects during unit & acceptance testing. The ability to use virtual or simulated versions of remote services when running acceptance tests becomes an essential architectural requirement.
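
The standard library's unittest.mock is one way to simulate such a remote service so acceptance-style tests stay fast and deterministic. A minimal sketch with an illustrative shipping-rate service:

```python
from unittest import mock
import unittest

def quote_shipping(order_total, carrier_api):
    rate = carrier_api.get_rate("standard")  # a remote call in production
    return round(order_total * rate, 2)

class QuoteShippingTest(unittest.TestCase):
    def test_quote_uses_carrier_rate(self):
        fake_carrier = mock.Mock()
        fake_carrier.get_rate.return_value = 0.05  # simulated remote response
        self.assertEqual(quote_shipping(100.0, fake_carrier), 5.0)
        fake_carrier.get_rate.assert_called_once_with("standard")

if __name__ == "__main__":
    unittest.main()
```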

Catch Errors As Early In Automated Testing As Possible

For the fastest feedback, it’s important to run faster-running automated tests (unit tests) before slower-running automated tests (acceptance and integration tests), which are both run before any manual testing. Another corollary of this principle is that any errors should be found with the fastest category of testing possible.

Adapted from The DevOps Handbook.

Ensure Tests Run Quickly

Design tests to run in parallel.

Adapted from The DevOps Handbook

Write Automated Tests Before Writing The Code

Use techniques such as test-driven development (TDD) and acceptance test-driven development (ATDD): developers begin every change to the system by first writing an automated test that validates the expected behavior (and fails, because the code does not yet exist), and then writing the code that makes the test pass.

Test-Driven Development:

  1. Ensure the tests fail. “Write a test for the next bit of functionality to add.” Check in.
  2. Ensure the tests pass. “Write the functional code until the test passes.” Check in.
  3. “Refactor both new and old code to make it well structured.” Ensure the tests pass. Check in again.
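
A compact red-green pass through that loop using Python's unittest (the `slugify` example is illustrative): the test is written first and fails until just enough functional code exists to make it pass, after which both can be refactored under the test's protection:

```python
import unittest

def slugify(title):
    # Step 2: the minimal functional code that makes the test pass.
    return title.strip().lower().replace(" ", "-")

class SlugifyTest(unittest.TestCase):
    # Step 1: written first; it fails until slugify() behaves as specified.
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("  DevOps Handbook "), "devops-handbook")

if __name__ == "__main__":
    unittest.main()
```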

Automate As Many Of The Manual Tests As Possible

Automating all the manual tests may create undesired outcomes – teams should not have automated tests that are unreliable or generate false positives (tests that should have passed because the code is functionally correct but failed due to problems such as slow performance causing timeouts, an uncontrolled starting state, or unintended state from using database stubs or shared test environments).

A small number of reliable automated tests is preferable to a large number of manual or unreliable automated tests. Start with a small number of reliable automated tests and add to them over time, creating an increasing level of assurance that any change that takes the application out of a deployable state is detected.

Integrate Performance Testing Into The Test Suite

All too often, teams discover that their application performs poorly during integration testing or after it has been deployed to production. The goal is to write and run automated performance tests that validate the performance across the entire application stack (code, database, storage, network, virtualization, etc.) as part of the deployment pipeline to detect problems early when the fixes are cheapest and fastest.

By understanding how the application and environments behave under a production-like load, the team can improve at capacity planning as well as detecting conditions such as:

  • When the database query times grow non-linearly.
  • When a code change causes the number of database calls, storage use, or network traffic to increase.
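
The first condition above can be checked mechanically in the pipeline: time the same query at two data sizes and flag growth that is worse than linear. A sketch, with an intentionally quadratic stand-in for the query:

```python
import time

def timed(query, n_rows):
    start = time.perf_counter()
    query(n_rows)
    return time.perf_counter() - start

def check_scaling(query, small=10_000, factor=4, slack=1.5):
    t_small = timed(query, small)
    t_large = timed(query, small * factor)
    # Linear scaling predicts t_large ~= factor * t_small; allow some slack.
    if t_large > factor * t_small * slack:
        raise AssertionError(
            f"query time grew {t_large / t_small:.1f}x for {factor}x the data")

try:  # quadratic stand-in that the check catches
    check_scaling(lambda n: sum(i * j for i in range(n // 100)
                                      for j in range(n // 100)))
except AssertionError as err:
    print("caught regression:", err)
```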

Integrate Non-Functional Requirements Testing Into The Test Suite

In addition to testing that the code functions as designed and performs under production-like loads, teams should validate every other attribute of the system. These are often called non-functional requirements, which include availability, scalability, capacity, security, etc.

Many nonfunctional requirements rely upon:

  • Supporting applications, databases, libraries, etc.
  • Language interpreters, compilers, etc.
  • Operating systems

Pull The Andon Cord When The Deployment Pipeline Breaks

In order to keep the deployment pipeline in a green state the team should create a virtual Andon Cord, similar to the physical one in the Toyota Production System. Whenever someone introduces a change that causes the build or automated tests to fail, no new work is allowed to enter the system until the problem is fixed.

When the deployment pipeline is broken, at a minimum notify the entire team of the failure, so anyone can either fix the problem or roll back the commit. Every member of the team should be empowered to roll back the commit to get back into a green state.

Why Teams Need To Pull The Andon Cord

Failing to pull the Andon cord and immediately fix any deployment pipeline issue makes it progressively more difficult to bring applications and environments back into a deployable state.

Consider the following situation:

  1. Someone checks in code that breaks the build or fails automated tests, but no one fixes it.
  2. Someone else checks in another change onto the broken build, which also doesn’t pass the automated tests; however, no one sees the failing test results which would have enabled the team to see the new defect, let alone fix it.
  3. The existing tests don’t run reliably, so the team is unlikely to build new tests.
  4. The negative feedback cycle continues and application quality continues to degrade.

Book Club: The DevOps Handbook (Chapter 9. Create the Foundations of our Deployment Pipeline)

This entry is part 10 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 9

The goal is to create the technical practices and architecture required to enable and sustain the fast flow of work from Development into Operations without causing chaos and disruption to the production environment or customers.

Continuous Delivery (CD) includes:

  • Creating the foundations of the automated deployment pipeline.
  • Ensuring that the team has automated tests that constantly validate the application is in a deployable state.
  • Having developers integrate their code into trunk daily.
  • Architecting the environments and code to enable low-risk releases.

Outcomes of CD:

  • Reduces the lead time to get production-like environments.
  • Enables continuous testing that gives everyone fast feedback on their work.
  • Enables small teams to safely and independently develop, test, and deploy their code into production.
  • Makes production deployments and releases a routine part of daily work.

Ensure the team always uses production-like environments at every stage of the value stream. The environments must be created in an automated manner, ideally on demand from scripts and configuration information stored in version control, and be entirely self-serviced.

Enable On-Demand Creation of Dev, Test, and Production Environments

Instead of documenting the specifications of the production environment in a document or on a wiki page, the organization should create a common build mechanism that creates all environments, such as for development, test, and production. By doing this, any team can get production-like environments in minutes, without opening up a ticket, let alone having to wait weeks.

Automation can help in the following ways:

  • Copying a virtualized environment
  • Building an automated environment creation process
  • Using “infrastructure as code” configuration management tools
  • Using automated operating system configuration tools
  • Assembling an environment from a set of virtual images or containers
  • Spinning up a new environment in a public cloud, private cloud, or other PaaS (platform as a service)
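
As a sketch of the container-based items in the list above, here is on-demand environment creation with the Docker SDK for Python; the image, port, and names are illustrative, and in practice this configuration would itself live in version control:

```python
import docker

def create_environment(name, image="nginx:1.25-alpine", port=8080):
    client = docker.from_env()
    return client.containers.run(
        image,                   # the same image production is built from
        name=f"{name}-web",
        ports={"80/tcp": port},  # map container port 80 to the host
        detach=True,
    )

# Any team member can self-serve a production-like environment in minutes:
env = create_environment("feature-login-test")
print(env.name, env.status)
```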

By providing developers an environment they fully control, teams are enabled to quickly reproduce, diagnose, and fix defects, safely isolated from production services and other shared resources. Teams can also experiment with changes to the environments, as well as to the infrastructure code that creates it (e.g., configuration management scripts), further creating shared knowledge between Development and Operations.

Create a Single Repository of Truth For The Entire System

Use of version control has become a mandatory practice of individual developers and development teams. A version control system records changes to files or sets of files stored within the system. This can be source code, assets, or other documents that may be part of a software development project.

Version Control Recommendations:

  • All the environment creation tools and artifacts described in the previous step
  • Any file used to create containers
  • All supporting automated tests and any manual test scripts
  • Any script that supports code packaging, deployment, database migration, and environment provisioning
  • Any script used to create database schemas or application reference data
  • All project artifacts
  • All cloud configuration files
  • Any other script or configuration information required to create infrastructure that supports multiple services

Make Infrastructure Easier To Rebuild Than Repair

Bill Baker, a distinguished engineer at Microsoft, quipped that we used to treat servers like pets: “You name them and when they get sick, you nurse them back to health. [Now] servers are [treated] like cattle. You number them and when they get sick, you shoot them.”

The DevOps Handbook, Chapter 9

Instead of manually logging into servers and making changes, make changes in a way that ensures all changes are replicated everywhere automatically and that all changes are put into version control.

Teams can rely on automated configuration systems to ensure consistency, or they can create new virtual machines or containers from an automated build mechanism and deploy them into production, destroying the old ones or taking them out of rotation. This is known as immutable infrastructure.

Modify The Definition of Development “Done” To Include Running in Production-Like Environments

By having developers write, test, and run their own code in a production-like environment, the majority of the work to successfully integrate code and environments happens during daily work, instead of at the end of the release.

Ideally, teams will use the same tools, such as monitoring, logging, and deployment tooling, in pre-production environments as they do in production.