Featured

Welcome to Red Green Refactor

We officially welcome you to the start of Red Green Refactor, a technology blog about automation and DevOps. We are a group of passionate technologists who care about learning and sharing our knowledge. Information Technology is a huge field, and even though we’re a small part of it, we wanted another outlet for collaborating with the community.

Why Red Green Refactor?

Red Green Refactor is a term commonly used in Test Driven Development to support a test-first approach to software design. Kent Beck is generally credited with discovering, or “rediscovering” in his words, Test Driven Development. The mantra for the practice is red-green-refactor, where the colors refer to the status of the tests driving the development code.

The Red is writing a small piece of test code before the development code is implemented. The test should fail upon execution – a red failure. The Green is writing just enough development code to get the test code to pass. The test should pass upon execution – a green pass. The Refactor is making small improvements to the development code without changing its behavior. The quality of the code is improved according to team standards: addressing “code smells” (improving readability and maintainability, removing duplication) or applying simple design patterns. The point of the practice is to make the code more robust by catching mistakes early, with an eye on code quality from the beginning. Working in small batches helps the practitioner think about the design of their program consistently.
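
To make the cycle concrete, here is a minimal sketch of one loop using Python and pytest; the shipping-cost function and its rules are invented purely for illustration, not taken from any real project.

```python
# red: write a failing test first (shipping_cost does not exist yet).
# green: implement just enough code to make the tests pass.
# refactor: clean up the implementation while the tests stay green.

import pytest

def shipping_cost(weight_kg: float) -> float:
    """Flat rate up to 1 kg, then a per-kg surcharge (the 'green' step)."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    base, per_kg = 5.00, 2.50
    extra = max(weight_kg - 1.0, 0.0)
    return round(base + extra * per_kg, 2)

def test_minimum_charge_applies_up_to_one_kilogram():
    assert shipping_cost(0.5) == 5.00

def test_heavier_parcels_pay_a_per_kilogram_surcharge():
    assert shipping_cost(3.0) == 10.00

def test_non_positive_weight_is_rejected():
    with pytest.raises(ValueError):
        shipping_cost(0)
```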

“Refactoring is a controlled technique for improving the design of an existing codebase.”

Martin Fowler

The goal of Red Green Refactor is similar to the practice of refactoring: to make small yet cumulative positive changes, in this case to learning, by helping educate the community about automation and DevOps. The act of publishing also encourages our team to refine our materials in preparation for a larger audience. Many of the writers on Red Green Refactor speak at conferences, professional groups, and the occasional webinar. The learning at Red Green Refactor will be bi-directional: to the readers and to the writers.

Who Are We?

The writers on Red Green Refactor come from varied backgrounds but all of us made our way into information technology, some purposefully and some accidentally. Our primary focus was on test automation, which has evolved into DevOps practices as we expanded our scope into operations. Occasionally we will invite external contributors to post on a subject of interest. We have a few invited writers lined up and ready to contribute.

“Automation Team” outing with some of the Red Green Refactor authors

As for myself, I have a background in Physics & Biophysics, with over a decade spent in research science studying fluorescence spectroscopy and microscopy before joining IT. I’ve worked as a requirements analyst, developer, and tester before joining the ranks of pointy-headed management. That doesn’t stop me from exploring new tech at home, though, or from posting about it here.

What Can You Expect From Red Green Refactor?

Technology

Some companies are in the .NET stack, some are Java shops, but everyone needs some form of automation. The result is many varied implementations of both test & task automation. Our team has supported almost all the application types under the sun (desktop, web, mobile, database, API/services, mainframe, etc.). We’ve also explored many tools, both open-source and commercial, at companies running everything from ancient tech to the bleeding edge. Our posts will be driven by prior experience as well as exploration of the unknown.

We’ll be exploring programming languages and tools in the automation space. Readers can expect to learn about frameworks, cloud solutions, CI/CD, design patterns, code reviews, refactoring, metrics, implementation strategies, performance testing, etc. – it’s open-ended.

Continuous Improvement

We aim to keep our readers informed about continuous improvement activities in the community. One of the great things about this field is there is so much to learn and it’s ever-changing. It can be difficult at times with the firehose of information coming at you since there are only so many hours in the day. We tend to divide responsibility among our group to perform “deep dives” into certain topics and then share that knowledge with a wider audience (for example: Docker, Analytics, or Robotic Process Automation). In the same spirit we plan to share information on Red Green Refactor about continuous improvement. Posts about continuous improvement will include: trainings, conference recaps, professional groups, aggregated articles, podcasts, tech book summaries, career development, and even the occasional job posting.

Once again welcome to Red Green Refactor. Your feedback is always welcome.

From The Pipeline v34.0

This entry is part 34 of 34 in the series From the Pipeline

The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

How Much Testing is Enough?

The Google Testing Blog recently posted details on the scope of their internal testing practice. In addition to defining core terms, they briefly outline their guidance: (1) Document your process or strategy; (2) Have a solid base of unit tests; (3) Don’t skimp on integration testing; (4) Perform end-to-end testing for Critical User Journeys; (5) Understand and implement the other tiers of testing; (6) Understand your coverage of code and functionality; (7) Use feedback from the field to improve your process.

Test Automation Strategy Guide

Julia Pottinger posted some excellent thoughts on the approach teams should take for test automation. Every team should start with their goal in mind. Once that is determined, teams should identify the tools and techniques for automation. This is followed by identifying who is writing the automation, when it will be executed, and the environments to be used.

What Makes a Good Automated Test?

Kristin has provided a set of guidelines for what makes a good automated test. First, tests should be meaningful, because each test you write is also an investment in maintaining that test. Tests should also be maintainable: the automated checks should be readable and well-organized. Finally, tests should run quickly in order to provide fast feedback for teams.
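
As a hedged illustration of those qualities, here is a small Python/pytest example written in an arrange-act-assert style; the tiny Cart class exists only to make the test self-contained.

```python
# A deliberately small, self-contained example: Cart stands in for whatever
# production code a real check would exercise.
class Cart:
    def __init__(self):
        self._items = []

    def add(self, name: str, price: float, quantity: int = 1) -> None:
        self._items.append((name, price, quantity))

    @property
    def total(self) -> float:
        return sum(price * qty for _, price, qty in self._items)

def test_cart_total_reflects_price_times_quantity():
    # Arrange: a cart with two distinct line items.
    cart = Cart()
    cart.add("notebook", price=3.50, quantity=2)
    cart.add("pen", price=1.25)

    # Act / Assert: one behavior per test keeps failures easy to diagnose.
    assert cart.total == 8.25
```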

The Test Data Bottleneck and How to Solve It

Test data is one of the major bottlenecks in testing processes. By simplifying test data, we can solve this bottleneck by tackling four major challenges: Time, People, Size, and Money.

GitHub’s Engineering Team has moved to Codespaces

Recently the GitHub development team shifted to Codespaces for the majority of GitHub development. They made this change because local development was brittle: any change to a local environment could render it useless and cost hours of development time to recover. Collaborating on multiple branches across multiple projects was painful. Now Codespaces performs a shallow clone and then fetches the repository history in the background, which reduces clone time. They also created a GitHub Action that runs nightly, clones the repository, bootstraps dependencies, then builds & pushes a Docker image of the result.

Book Club: The DevOps Handbook (Conclusion)

This entry is part 25 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Conclusion: A Call to Action

DevOps offers a solution at a time when every technology leader is challenged with enabling security, reliability, agility, handling security breaches, improving time to market, and massive technology transformations.

An inherent conflict can exist between Development and Operations that creates worsening problems, which results in slower time to market for new products and features, poor quality, increased outages and technical debt, reduced engineering productivity, as well as increased employee dissatisfaction and burnout. DevOps principles and patterns enable teams to break this core, chronic conflict.

DevOps requires potentially new cultural and management norms, and changes in technical practices and architecture. This results in maximizing developer productivity, organizational learning, high employee satisfaction, and the ability to win in the marketplace.

DevOps is not just a technology imperative, but also an organizational imperative. DevOps is applicable and relevant to any and all organizations that must increase flow of planned work through the technology organization, while maintaining quality, reliability, and security for customers.

“The call to action is this: no matter what role you play in your organization, start finding people around you who want to change how work is performed.”

The DevOps Handbook

This concludes the book club summary for “The DevOps Handbook.” Other book club summaries are available for “The Phoenix Project” and “BDD: Discovery“. Stay subscribed for more book club summaries and other great content on automation & DevOps.

Book Club: The DevOps Handbook (Chapter 23. Protecting the Deployment Pipeline and Integrating Into Change Management and Other Security and Compliance Controls)

This entry is part 24 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 23

Almost any IT organization, regardless of size, will have existing change management processes, which are the primary controls to reduce operations and security risks. The goal is to successfully integrate security and compliance into any existing change management process.

ITIL breaks changes down into three categories:

Standard Changes: lower-risk changes that follow an established and approved process but can also be pre-approved. They can include monthly updates of application tax tables or country codes, website content & styling changes, and certain types of application or operating system patches that have a well-understood impact. The change proposer does not require approval before deploying the change, and change deployments can be completely automated and should be logged so there is traceability.

Normal Changes: higher-risk changes that require review or approval from the agreed upon change authority. In many organizations, this responsibility is inappropriately placed on the change advisory board (CAB) or emergency change advisory board (ECAB), which may lack the required expertise to understand the full impact of the change, often leading to unacceptably long lead times. Large code deployments may contain hundreds of thousands of lines of new code, submitted by hundreds of developers. In order for normal changes to be authorized, the CAB will almost certainly have a well-defined request for change (RFC) form that defines what information is required for the go/no-go decision.

Urgent Changes: These are emergency and potentially high-risk changes that must be put into production immediately. These changes often require senior management approval but allow documentation to be performed after the fact. A key goal of DevOps practices is to streamline the normal change process such that it is also suitable for emergency changes.

Recategorize The Majority of Lower Risk Changes as Standard Changes

One way to support an assertion that changes are low risk is to show a history of changes over a significant time period and provide a complete list of production issues during that same period. Ideally, deployments will be performed automatically by configuration management and deployment pipeline tools and the results will be automatically recorded.

Creating this traceability and context should be easy and should not create an overly onerous or time-consuming burden for engineers. Linking to user stories, requirements, or defects is almost certainly sufficient.
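
One lightweight way to create that traceability, sketched below under the assumption of a git-based pipeline, is to have the deployment job emit a structured record that links the deployed commit to its work items. The field names and output location are hypothetical, not a prescribed format.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def record_deployment(environment: str, work_items: list[str]) -> Path:
    """Write a small JSON audit record linking a deployment to its work items."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    record = {
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "environment": environment,
        "commit": commit,
        "work_items": work_items,  # e.g. user story or defect IDs
    }
    out = Path("deployments") / f"{commit[:12]}.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
    return out

if __name__ == "__main__":
    print(record_deployment("production", ["STORY-1234", "DEFECT-88"]))
```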

What To Do When Changes are Categorized as Normal Changes

The goal is to ensure that the change can be deployed quickly, even if it is not fully automated. Ensure that any submitted change requests are as complete and accurate as possible, giving the CAB everything they need to properly evaluate the change.

Because the submitted changes will be manually evaluated by people, it is even more important that the context of the change is described. The goal is to share the evidence and artifacts that give confidence that the change will operate in production as designed.

Reduce Reliance on Separation of Duties

For decades, developers have used separation of duty as one of the primary controls to reduce the risk of fraud or mistakes in the software development process. As complexity and deployment frequency increase, performing production deployments successfully increasingly requires everyone in the value stream to quickly see the outcomes of their actions.

Separation of duty often can impede this by slowing down and reducing the feedback engineers receive on their work. Instead, choose controls such as pair programming, continuous inspection of code check-ins, and code review.

Ensure Documentation and Proof For Auditors and Compliance Officers

As technology organizations increasingly adopt DevOps patterns, there is more tension than ever between IT and audit. These new DevOps patterns challenge traditional thinking about auditing, controls, and risk mitigation.

“DevOps is all about bridging the gap between Dev and Ops. In some ways, the challenge of bridging the gap between DevOps and auditors and compliance officers is even larger. For instance, how many auditors can read code and how many developers have read NIST 800-37 or the Gramm-Leach-Bliley Act? That creates a gap of knowledge, and the DevOps community needs to help bridge that gap.”

Bill Shinn, a principal security solutions architect at Amazon Web Services

Instead, teams work with auditors in the control design process. Assign a single control for each sprint to determine what is needed in terms of audit evidence. Send all the data into the telemetry systems so the auditors can get what they need, completely self-serviced.

“In audit fieldwork, the most commonplace methods of gathering evidence are still screenshots and CSV files filled with configuration settings and logs. Our goal is to create alternative methods of presenting the data that clearly show auditors that our controls are operating and effective.”

The DevOps Handbook

Case Study: Relying on Production Telemetry for ATM Systems

Information security, auditors, and regulators often put too much reliance on code reviews to detect fraud. Instead, they should be relying on production monitoring controls in addition to using automated testing, code reviews, and approvals, to effectively mitigate the risks associated with errors and fraud.

“Many years ago, we had a developer who planted a backdoor in the code that we deploy to our ATM cash machines. They were able to put the ATMs into maintenance mode at certain times, allowing them to take cash out of the machines. We were able to detect the fraud very quickly, and it wasn’t through a code review. These types of backdoors are difficult, or even impossible, to detect when the perpetrators have sufficient means, motive, and opportunity.”

“However, we quickly detected the fraud during our regular operations review meeting when someone noticed that ATMs in a city were being put into maintenance mode at unscheduled times. We found the fraud even before the scheduled cash audit process, when they reconcile the amount of cash in the ATMs with authorized transactions.”

The DevOps Handbook

Book Club: The DevOps Handbook (Chapter 22. Information Security as Everyone’s Job, Every Day)

This entry is part 23 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 22

The goal is to create & integrate security controls into the daily work of Development and Operations, so that security is part of everyone’s job, every day. Ideally the work will be automated and put into a deployment pipeline. Manual processes, acceptances, and approvals should be replaced with automated controls, relying less on separation of duties and change approval.

To integrate security, compliance, and change management:

  • Make security a part of everyone’s job
  • Integrate preventative controls into the shared source code repository
  • Integrate security with the deployment pipeline
  • Integrate security with telemetry to better enable detection and recovery
  • Protect the deployment pipeline
  • Integrate the deployment activities with the change approval processes
  • Reduce reliance on separation of duty

Integrate Security Into Development Iteration Demonstrations

One of the goals is to have feature teams engaged with Infosec as early as possible, as opposed to primarily engaging at the end of the project. Invite Infosec to the product demonstrations at the end of each development interval so that they can better understand the team goals in the context of organizational goals, observe implementations as they are being built, and provide guidance and feedback at the earliest stages of the project, when there is the most time and freedom to make corrections.

“By having Infosec involved throughout the creation of any new capability, we were able to reduce our use of static checklists dramatically and rely more on using their expertise throughout the entire software development process.”

Justin Arbuckle, Chief Architect at GE Capital

Integrate Security Into Defect Tracking and Postmortems

Track all open security issues in the same work tracking system that Development and Operations are using, ensuring the work is visible and can be prioritized against all other work. InfoSec has traditionally stored security vulnerabilities in a GRC (governance, risk, and compliance) tool instead.

Integrate Preventive Security Controls Into Shared Source Code Repositories and Shared Services

Add to the shared source code repository any mechanisms or tools that help ensure applications and environments are secure: for example, authentication and encryption libraries and services. Version control also serves as an omni-directional communication mechanism to keep all parties aware of changes being made.

Items to include in version control related to Security:

  • Code libraries and their recommended configurations
  • Secret management using tools such as Vault, credstash, Trousseau, Red October, etc.
  • OS packages and builds

Integrate Security Into Deployment Pipelines

Prior state: security reviews were started after development was completed. The resulting documentation was handed to Development and Operations, where it often went completely unaddressed because of project deadline pressure or because problems were found too late in the SDLC.

Goal state: automate as many information security tests as possible so they run as part of the deployment pipeline. Security should provide both Dev and Ops with fast feedback on their work.

Ensure Security of the Application

Development testing focuses on the correctness of functionality or happy path, which validates user journeys where everything goes as expected, with no exceptions or error conditions. QA, Infosec, and Fraud practitioners will often focus on the sad paths, which happen when things go wrong, especially in relation to security-related error conditions.

Static analysis: this is testing performed in a non-runtime environment, ideally in the deployment pipeline. Typically, a static analysis tool will inspect program code for all possible run-time behaviors and seek out coding flaws, back doors, and potentially malicious code. Examples of tools include Brakeman, Code Climate, and searching for banned code functions.
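
The “searching for banned code functions” idea in particular is easy to sketch as a pipeline step. The example below is a toy Python check, and the banned list is purely illustrative; real teams would tune it to their own standards.

```python
import ast
import sys
from pathlib import Path

# Illustrative ban list: calls a team might flag for manual review.
BANNED_CALLS = {"eval", "exec", "pickle.loads", "yaml.load"}

def banned_calls_in(path: Path) -> list[tuple[int, str]]:
    """Return (line, call_name) pairs for banned calls found in a Python file."""
    findings = []
    tree = ast.parse(path.read_text(), filename=str(path))
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id
            elif isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
                name = f"{func.value.id}.{func.attr}"
            else:
                continue
            if name in BANNED_CALLS:
                findings.append((node.lineno, name))
    return findings

if __name__ == "__main__":
    failed = False
    for path in Path(".").rglob("*.py"):
        for line, name in banned_calls_in(path):
            print(f"{path}:{line}: banned call {name}")
            failed = True
    sys.exit(1 if failed else 0)
```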

Dynamic analysis: dynamic analysis consists of tests executed while a program is in operation. Dynamic tests monitor items such as system memory, functional behavior, response time, and overall performance of the system. Ideally, automated dynamic testing is executed during the automated functional testing phase of a deployment pipeline.

Dependency scanning: Another type of static testing, normally performed at build time inside a deployment pipeline, involves inventorying the dependencies for binaries and executables and ensuring that these dependencies are free of vulnerabilities or malicious binaries.
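
Dedicated scanners are the usual answer here, but purely as an illustration of the inventory step, the sketch below lists installed Python packages and compares them against a hypothetical in-house advisory feed (the KNOWN_BAD entries are invented).

```python
from importlib.metadata import distributions

# Hypothetical advisory feed: in practice this would come from a vulnerability
# database, not a hard-coded dictionary.
KNOWN_BAD = {
    ("exampledep", "1.0.1"): "EXAMPLE-2021-0001: remote code execution",
}

def inventory() -> dict[str, str]:
    """Map installed distribution names to their versions."""
    return {
        (dist.metadata["Name"] or "").lower(): dist.version
        for dist in distributions()
    }

def findings() -> list[str]:
    installed = inventory()
    return [
        f"{name}=={version} -> {advisory}"
        for (name, version), advisory in KNOWN_BAD.items()
        if installed.get(name) == version
    ]

if __name__ == "__main__":
    problems = findings()
    print("\n".join(problems) if problems else "no known-vulnerable pins found")
```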

Source code integrity and code signing: All developers should have their own PGP key, perhaps created and managed in a system such as keybase.io. All commits to version control should be signed, which is straightforward to configure with open source tooling such as git and gpg. All packages created by the CI process should be signed, and their hashes recorded in the centralized logging service for audit purposes.
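
As a hedged sketch of verifying commit signatures in a pipeline stage, the snippet below shells out to git’s built-in verify-commit for the most recent commits; the commit count and failure behavior are arbitrary choices for the example.

```python
import subprocess

def unsigned_commits(ref: str = "HEAD", count: int = 20) -> list[str]:
    """Return the SHAs among the last `count` commits whose signatures do not verify."""
    shas = subprocess.run(
        ["git", "rev-list", f"--max-count={count}", ref],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    bad = []
    for sha in shas:
        # `git verify-commit` exits non-zero when a commit is unsigned
        # or its signature cannot be verified.
        result = subprocess.run(
            ["git", "verify-commit", sha], capture_output=True, text=True
        )
        if result.returncode != 0:
            bad.append(sha)
    return bad

if __name__ == "__main__":
    failures = unsigned_commits()
    if failures:
        raise SystemExit("unsigned or unverifiable commits: " + ", ".join(failures))
    print("all inspected commits have verifiable signatures")
```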

Ensure Security of Software Supply Chain

“We are no longer writing customized software—instead, we assemble what we need from open source parts, which has become the software supply chain that we are very much reliant upon.”

Josh Corman

Using commercial or open source libraries brings in vulnerabilities along with their functionality.

The 2015 Sonatype State of the Software Supply Chain Report had some noteworthy findings on vulnerabilities. For one, the typical organization relied upon 7,601 build artifacts and used 18,614 different versions. 7.5% of those components had known vulnerabilities, with over 66% of those vulnerabilities more than two years old and still unresolved. For open source projects with known vulnerabilities registered in the National Vulnerability Database, only 41% were ever fixed, and those fixes took an average of 390 days to publish. For vulnerabilities labeled at the highest severity, fixes took 224 days.

Ensure Security of the Environment

Environments should be in a hardened, risk-reduced state.

One approach is to generate automated tests to ensure that all appropriate settings have been correctly applied for configuration hardening, database security settings, key lengths, etc.
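
A minimal sketch of that approach, assuming a Linux host and pytest-style checks against sshd_config; the two hardening rules shown are examples, not a complete baseline.

```python
from pathlib import Path
from typing import Optional

SSHD_CONFIG = Path("/etc/ssh/sshd_config")  # assumed location on the host under test

def _setting(name: str) -> Optional[str]:
    """Return the last uncommented value for an sshd_config directive, if any."""
    value = None
    for line in SSHD_CONFIG.read_text().splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            parts = stripped.split(None, 1)
            if len(parts) == 2 and parts[0].lower() == name.lower():
                value = parts[1].strip()
    return value

def test_password_authentication_is_disabled():
    assert _setting("PasswordAuthentication") == "no"

def test_root_login_is_not_permitted():
    assert _setting("PermitRootLogin") in {"no", "prohibit-password"}
```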

Integrate Information Security Into Production Telemetry

Internal security controls are often ineffective in successfully detecting breaches in a timely manner, either because of blind spots in monitoring or because no one in the organization is examining the relevant telemetry in their daily work.

Deploy the monitoring, logging, and alerting required to fulfill information security objectives throughout applications and environments, as well as ensure that it’s adequately centralized to facilitate easy and meaningful analysis and response.

Creating Security Telemetry in Applications

In order to detect problematic user behavior that could be an indicator or enabler of fraud and unauthorized access, create the relevant telemetry in applications (a minimal logging sketch follows the list):

  • Successful and unsuccessful user logins
  • User password resets
  • User email address resets
  • User credit card changes
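
Here is a hedged sketch of emitting one of those events with Python’s standard logging module as structured JSON lines that a centralized log platform could consume; the event and field names are illustrative.

```python
import json
import logging
from datetime import datetime, timezone

security_log = logging.getLogger("security")
security_log.setLevel(logging.INFO)
security_log.addHandler(logging.StreamHandler())  # swap for a centralized handler in practice

def emit_security_event(event: str, user_id: str, success: bool, **details) -> None:
    """Log one structured security event (login attempt, password reset, etc.)."""
    security_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "user_id": user_id,
        "success": success,
        **details,
    }))

# Example usage inside an authentication handler:
emit_security_event("user_login", user_id="u-1024", success=False, source_ip="198.51.100.7")
```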

Creating Security Telemetry in Environments

Create sufficient telemetry in our environments to detect early indicators of unauthorized access. Monitoring opportunities include:

  • OS changes (in production or in build infrastructure)
  • Security group changes
  • Changes to configurations
  • Cloud infrastructure changes
  • Cross-site scripting attempts
  • SQL Injection attempts
  • Web server errors

“Nothing helps developers understand how hostile the operating environment is than seeing their code being attacked in real-time.”

Nick Galbreath, Director of Engineering at Etsy

Security Telemetry at Etsy:

  • Abnormal production program terminations
  • Database syntax errors
  • Indications of SQL injection attacks

Protect The Deployment Pipeline

If someone compromises the servers running the deployment pipeline, which hold credentials for the version control system, they could steal source code. If those credentials have write access, an attacker could also inject malicious changes into the version control repository and, from there, into applications and services.

Risks to CI/CD pipelines include:

  • Developers introducing code that enables unauthorized access (mitigate through controls such as code testing, code reviews, and penetration testing)
  • Unauthorized users gaining access to the code or environment (mitigated via controls such as ensuring configurations match known, good states, and effective patching)

In order to protect our continuous build, integration, or deployment pipeline, a mitigation strategy may include:

  • Hardening continuous build and integration servers and ensuring they can be reproduced in an automated manner
  • Reviewing all changes introduced into version control, either through pair programming at commit time or by a code review process between commit and merge into trunk, to prevent continuous integration servers from running uncontrolled code
  • Instrumenting the repository to detect when checked-in test code contains suspicious API calls, perhaps quarantining it and triggering an immediate code review
  • Ensuring every CI process runs on its own isolated container or VM
  • Ensuring the version control credentials used by the CI system are read-only

Book Club: The DevOps Handbook (Chapter 21. Reserve Time to Create Organizational Learning and Improvement)

This entry is part 22 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 21

One of the practices that forms part of the Toyota Production System is called the improvement blitz (kaizen), defined as a dedicated and concentrated period of time to address a particular issue, often over the course of several days.

“…blitzes often take this form: A group is gathered to focus intently on a process with problems…The blitz lasts a few days, the objective is process improvement, and the means are the concentrated use of people from outside the process to advise those normally inside the process.”

The DevOps Handbook

Institutionalize Rituals To Pay Down Technical Debt

Teams should schedule rituals that help enforce the practice of reserving Dev and Ops time for improvement work, such as non-functional requirements, automation, etc. One of the easiest ways to do this is to schedule and conduct day- or week-long improvement blitzes, where everyone on a team self-organizes to fix problems they care about—no feature work is allowed.

The technique of dedicated rituals for improvement work has also been called spring or fall cleanings. Other terms have also been used, such as: hack days, hackathons, and innovation time. The goal during these blitzes is not to simply experiment and innovate for the sake of testing out new technologies, but to improve daily work.

The improvement practice reinforces a culture in which engineers work across the entire value stream to solve problems. What makes improvement blitzes so powerful is empowering those closest to the work to continually identify and solve their own problems.

Enable Everyone To Teach and Learn

A dynamic culture of learning creates conditions so that everyone can not only learn, but also teach, whether through traditional didactic methods (attending training) or more experiential or open methods (conferences).

“We have five thousand technology professionals, who we call ‘associates.’ Since 2011, we have been committed to create a culture of learning—part of that is something we call Teaching Thursday, where each week we create time for our associates to learn. For two hours, each associate is expected to teach or learn. The topics are whatever our associates want to learn about—some of them are on technology, on new software development or process improvement techniques, or even on how to better manage their career. The most valuable thing any associate can do is mentor or learn from other associates.”

Steve Farley, VP of Information Technology at Nationwide Insurance

Organizations can further help teach skills through daily work by jointly performing code reviews that include both parties, so that developers learn by doing, and by having Development and Operations work together to solve small problems.

Share Your Experiences From DevOps Conferences

In many cost-focused organizations, engineers are often discouraged from attending conferences and learning from their peers. To help build a learning organization, companies should instead encourage engineers (from both Development and Operations) to attend conferences, give talks at them, and create and organize internal or external conferences themselves. For instance, Nationwide, Target, and Capital One have internal tech conferences.

Create Internal Consulting and Coaches To Spread Practices

Creating an internal coaching and consulting organization is a method commonly used to spread expertise across an organization. Google’s Testing on the Toilet (or TotT) is one example: a weekly testing periodical published in nearly every bathroom in nearly every Google office worldwide.

“The goal was to raise the degree of testing knowledge and sophistication throughout the company. It’s doubtful an online-only publication would’ve involved people to the same degree.”

Mike Bland, Google

Book Club: The DevOps Handbook (Chapter 20. Convert Local Discoveries into Global Improvements)

This entry is part 21 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 20

Use Chat Rooms and Chat Bots to Automate and Capture Organizational Knowledge

Having work performed by automation in a chat room has numerous benefits, including:

  • Everyone sees everything that is happening
  • Engineers on their first day of work can see what daily work looks like and how it’s performed
  • People are more apt to ask for help when they see others helping each other
  • Rapid organizational learning is enabled and accumulated

Automate Standardized Processes in Software Re-use

Instead of putting our expertise into Word documents, teams need to transform these documented standards and processes, which encompass the sum of our organizational learnings and knowledge, into an executable form that makes them easier to reuse.

One of the best ways to make this knowledge re-usable is by putting it into a centralized source code repository, making it available for everyone to search and use.

ArchOps: “enables our engineers to be builders, not bricklayers. By putting our design standards into automated blueprints that were able to be used easily by anyone, we achieved consistency as a byproduct.”

Justin Arbuckle

Create a Single, Shared Source Code Repository For The Entire Organization

A firm-wide, shared source code repository is one of the most powerful mechanisms used to integrate local discoveries across the entire organization.

Put into the shared source code repository not only source code, but also other artifacts that encode knowledge and learning, including:

  • Configuration standards for libraries, infrastructure, and environments
  • Deployment tools
  • Testing standards and tools, including security
  • Deployment pipeline tools
  • Monitoring and analysis tools
  • Tutorials and standards

Spread Knowledge By Using Automated Tests As Documentation and Communities of Practice

When teams have shared libraries being used across the organization, they can enable rapid propagation of expertise and improvements. Ensuring each of these libraries has significant amounts of automated testing included means the libraries become self-documenting and show other engineers how to use them.

The benefit will be nearly automatic if teams practice test-driven development (TDD), where automated tests are written before the code. This discipline turns test suites into a living, up-to-date specification of the system.
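
As a hedged example of a test doubling as documentation, the snippet below pairs an invented shared helper with a test whose name and assertion spell out exactly what the helper guarantees to its consumers.

```python
def normalize_hostname(raw: str) -> str:
    """Shared helper: trim whitespace, lower-case, and strip a trailing dot."""
    return raw.strip().lower().rstrip(".")

def test_normalize_hostname_documents_the_expected_canonical_form():
    # Reading this test tells a new consumer exactly what the helper guarantees.
    assert normalize_hostname("  Web-01.Example.COM. ") == "web-01.example.com"
```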

Design For Operations Through Codified Non-Functional Requirements

Examples of non-functional requirements include:

  • Sufficient production telemetry in applications and environments
  • The ability to accurately track dependencies
  • Services that are resilient and degrade gracefully
  • Forward and backward compatibility between versions
  • The ability to archive data to manage the size of the production data set
  • The ability to easily search and understand log messages across services
  • The ability to trace requests from users through multiple services
  • Simple, centralized runtime configuration using feature flags and so forth
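
As a minimal sketch of the last item above, the snippet below reads feature flags from a single JSON document that a central configuration store is assumed to publish; the file location and flag names are invented for the example.

```python
import json
import os
from pathlib import Path

# Assumed convention: one JSON document of flags, fetched from a central store
# (config service, object storage, etc.) and cached locally by the platform.
FLAG_FILE = Path(os.environ.get("FEATURE_FLAG_FILE", "feature_flags.json"))

def flag_enabled(name: str, default: bool = False) -> bool:
    """Return the state of a named feature flag, falling back to a safe default."""
    try:
        flags = json.loads(FLAG_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return default
    return bool(flags.get(name, default))

if flag_enabled("new_checkout_flow"):
    print("routing traffic to the new checkout flow")
else:
    print("serving the existing checkout flow")
```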

Build Reusable Operations User Stories Into Development

Instead of manually building servers and then putting them into production according to manual checklists, automate as much of this work as possible. Ideally, for all recurring Ops work teams will know the following: what work is required, who is needed to perform it, and what the steps to complete it are.

“We know a high availability rollout takes fourteen steps, requiring work from four different teams, and the last five times we performed this, it took an average of three days.”

The DevOps Handbook

Ensure Technology Choices Help Achieve Organizational Goals

When expertise for a critical service resides only in one team, and only that team can make changes or fix problems, this creates a bottleneck.

The goal is to identify the technologies that:

  • Impede or slow down the flow of work
  • Disproportionately create high levels of unplanned work
  • Disproportionately create large numbers of support requests
  • Are most inconsistent with the desired architectural outcomes (e.g. throughput, stability, security, reliability, business continuity)

Book Club: The DevOps Handbook (Chapter 19. Enable and Inject Learning into Daily Work)

This entry is part 20 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 19

Institutionalize rituals that increase safety, continuous improvement, and learning by doing the following:

  • Establish a just culture to make safety possible
  • Inject production failures to create resilience
  • Convert local discoveries into global improvements
  • Reserve time to create organizational improvements and learning

When teams work within a complex system, it’s impossible to predict all the outcomes for the actions they take. To enable teams to safely work within complex systems, organizations must become ever better at diagnostics and improvement activities. They must be skilled at detecting problems, solving them, and multiplying the effects by making the solutions available throughout the organization.

Resilient organizations are “skilled at detecting problems, solving them, and multiplying the effect by making the solutions available throughout the organization.” These organizations can heal themselves. “For such an organization, responding to crises is not idiosyncratic work. It is something that is done all the time. It is this responsiveness that is their source of reliability.”

Dr. Steven Spear

Chaos Monkey: a Netflix tool that simulates failures in the system to help build resiliency.

When Netflix first ran Chaos Monkey in their production environments, services failed in ways they never could have predicted or imagined – by constantly finding and fixing these issues, Netflix engineers quickly and iteratively created a more resilient service, while simultaneously creating organizational learnings.

Establish a Just, Learning Culture

When accidents occur (which they undoubtedly will), the response to those accidents should be seen as “just” by the people involved.

“When responses to incidents and accidents are seen as unjust, it can impede safety investigations, promoting fear rather than mindfulness in people who do safety-critical work, making organizations more bureaucratic rather than more careful, and cultivating professional secrecy, evasion, and self-protection.”

Dr. Sidney Dekker

Dr. Dekker calls this notion of eliminating error by eliminating the people who caused the errors the Bad Apple Theory. He asserts this notion is invalid, because “human error is not our cause of troubles; instead, human error is a consequence of the design of the tools that we gave them.” Instead of “naming, blaming, and shaming” the person who caused the failure, our goal should always be to maximize opportunities for organizational learning.

If teams punish the engineer who caused a failure, everyone is deterred from providing the necessary details to understand the mechanism and operation of the failure, which guarantees that the failure will occur again. Two effective practices that help create a just, learning-based culture are blameless post-mortems and the controlled introduction of failures into production.

Schedule Blameless Post-Mortem Meetings After Accidents Occur

To conduct a blameless post-mortem, the process should include:

(1) Construct a timeline and gather details from multiple perspectives on failures, ensuring teams don’t punish people for making mistakes.
(2) Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures.
(3) Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future.
(4) Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight.
(5) Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-ups.

To enable teams to gain this understanding, the following stakeholders need to be present at the meeting:

  • The people involved in decisions that may have contributed to the problem
  • The people who identified the problem
  • The people who responded to the problem
  • The people who diagnosed the problem
  • The people who were affected by the problem
  • Anyone else who is interested in attending the meeting

The first task in the blameless post-mortem meeting is to record the best understanding of the timeline of relevant events as they occurred. During the meeting and the subsequent resolution, we should explicitly disallow the phrases “would have” or “could have,” as they are counterfactual statements that result from our human tendency to create possible alternatives to events that have already occurred. In the meeting, teams must reserve enough time for brainstorming and deciding which countermeasures to implement.

Publish Post-Mortems As Widely As Possible

Teams should widely announce the availability of the meeting notes and any associated artifacts (e.g., timelines, IRC chat logs, external communications). This information should be placed in a centralized location where the entire organization can access it and learn from the incident. Doing this helps us translate local learnings and improvements into global learnings and improvements.

Etsy’s Morgue (their tool for recording post-mortems) captures:

  • Whether the problem was due to a scheduled or an unscheduled incident
  • The post-mortem owner
  • Relevant IRC chat logs (especially important for 2 a.m. issues when accurate note-taking may not happen)
  • Relevant JIRA tickets for corrective actions and their due dates (information particularly important to management)
  • Links to customer forum posts (where customers complain about issues)

Decrease Incident Tolerances to Find Ever-Weaker Failure Signals

As organizations learn how to see and solve problems efficiently, they need to decrease the threshold of what constitutes a problem in order to keep learning.

Organizations are often structured in one of two models: (1) a standardized model, where routine and systems govern everything, including strict compliance with timelines and budgets; or (2) an experimental model, where every day every exercise and every piece of new information is evaluated and debated in a culture that resembles a research and design laboratory.

Redefine Failure and Encourage Calculated Risk Taking

To reinforce a culture of learning and calculated risk-taking, teams need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures. “DevOps must allow this sort of innovation and the resulting risks of people making mistakes. Yes, you’ll have more failures in production. But that’s a good thing, and should not be punished.” – Roy Rapoport of Netflix

Inject Production Failures to Enable Resilience and Learning

As Michael Nygard, author of “Release It! Design and Deploy Production-Ready Software”, writes, “Like building crumple zones into cars to absorb impacts and keep passengers safe, you can decide what features of the system are indispensable and build in failure modes that keep cracks away from those features. If you do not design your failure modes, then you will get whatever unpredictable—and usually dangerous—ones happen to emerge.”

Resilience requires that teams first define failure modes and then perform testing to ensure that these failure modes operate as designed. One way to accomplish this is by injecting faults into the production environment and rehearsing large-scale failures to build confidence in recovering from accidents when they occur, ideally without impacting customers.
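
A toy illustration of the idea (not Chaos Monkey itself): a Python decorator that makes a dependency call fail at a configurable rate, so a team can verify the caller degrades gracefully. The names and failure rate are invented for the example.

```python
import functools
import random

def inject_faults(failure_rate: float, exception: type = ConnectionError):
    """Wrap a callable so a fraction of calls raise, simulating an unreliable dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exception(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.2)
def fetch_recommendations(user_id: str) -> list[str]:
    return ["item-1", "item-2"]  # stand-in for a real downstream call

def homepage(user_id: str) -> list[str]:
    try:
        return fetch_recommendations(user_id)
    except ConnectionError:
        return []  # degrade gracefully: render the page without recommendations

if __name__ == "__main__":
    results = [len(homepage("u-42")) for _ in range(10)]
    print("recommendation counts across 10 requests:", results)
```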

Institute Game Days to Rehearse Failures

The concept of Game Days comes from the discipline of resilience engineering. Jesse Robbins defines resilience engineering as “an exercise designed to increase resilience through large-scale fault injection across critical systems.” The goal of a Game Day is to help teams simulate and rehearse accidents so they can practice recovery.

The Game Day process involves:

(1) Schedule a catastrophic event, such as the simulated destruction of an entire data center, to happen at some point in the future.
(2) Give teams time to prepare, to eliminate all the single points of failure and to create the necessary monitoring procedures, failover procedures, etc.
(3) The Game Day team defines and executes drills, such as conducting database failovers or turning off an important network connection to expose problems in the defined processes.
(4) Any problems or difficulties that are encountered are identified, addressed, and tested again.

By executing Game Days, teams progressively create a more resilient service and a higher degree of assurance that they can resume operations when inopportune events occur, as well as create more learnings and a more resilient organization.

Some of the learnings gained during these disasters included:

  • When connectivity was lost, the failover to the engineer workstations didn’t work.
  • Engineers didn’t know how to access the conference call bridge, or the bridge only had capacity for fifty people, or they needed a new conference call provider that would allow them to kick off engineers who had subjected the entire conference to hold music.
  • When the data centers ran out of diesel for the backup generators, no one knew the procedures for making emergency purchases through the supplier, resulting in someone using a personal credit card to purchase $50,000 worth of diesel.

Latent defects are problems that surface only because faults were injected into the system.

Book Club: The DevOps Handbook (Chapter 18. Create Review and Coordination Processes to Increase Quality of Our Current Work)

This entry is part 19 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 18

The theme of this section is enabling Development and Operations to reduce the risk of production changes before they are made.

The peer review process at GitHub is an example of how inspection can increase quality, make deployments safe, and be integrated into the flow of everyone’s daily work. They pioneered the process called the “pull request”, one of the most popular forms of peer review spanning Dev and Ops. Once a pull request is sent, interested parties can review the set of changes, discuss potential modifications, and even push follow-up commits if necessary.

At GitHub, pull requests are the mechanism used to deploy code into production through a collective set of practices called “GitHub Flow”. The process is how engineers request code reviews, integrate feedback, and declare that code will be deployed to production.

GitHub Flow consists of five steps:

  1. To work on something new, the engineer creates a descriptively named branch off of master.
  2. The engineer commits to that branch locally, regularly pushing their work to the same named branch on the server.
  3. When they need feedback or help, or when they think the branch is ready for merging, they open a pull request.
  4. When they get their desired reviews and get any necessary approvals of the feature, the engineer can then merge it into master.
  5. Once the code changes are merged and pushed to master, the engineer deploys them into production.

The Dangers of the Change Approval Process

When high-profile deployment incidents occur, there are typically two responses. The first narrative is that the accident was due to a change control failure, which seems valid because we can imagine a situation where better change control practices would have detected the risk earlier and prevented the change from going into production. The second narrative is that the accident was due to a testing failure.

The reality is that in environments with low-trust, command-and-control cultures, the outcomes of these types of change control and testing countermeasures often result in an increased likelihood that problems will occur again.

Potential Dangers of “Overly Controlling Changes”

Traditional change controls can lead to unintended outcomes, such as contributing to long lead times, and reducing the strength and immediacy of feedback from the deployment process.

Common controls include:

  • Adding more questions that need to be answered to the change request form.
  • Requiring more authorizations, such as one more level of management approval or more stakeholders.
  • Requiring more lead time for change approvals so that change requests can be properly evaluated.
Adapted from The DevOps Handbook

Enable Coordination and Scheduling of Changes

Whenever multiple groups work on systems that share dependencies, changes will likely need to be coordinated to ensure that they don’t interfere with each other. For more complex organizations and organizations with more tightly-coupled architectures, teams may need to deliberately schedule changes, where representatives from the teams get together, not to authorize changes, but to schedule and sequence their changes in order to minimize accidents.

Enable Peer Review of Changes

Instead of requiring approval from an external body prior to deployment, require engineers to get peer reviews of their changes. The goal is to find errors by having fellow engineers close to the work scrutinize changes.

This review improves the quality of changes, which also creates the benefits of cross-training, peer learning, and skill improvement. A logical place to require reviews is prior to committing code to trunk in source control, where changes could potentially have a team-wide or global impact.

The principle of small batch sizes also applies to code reviews. The larger the size of the change that needs to be reviewed, the longer it takes to understand and the larger the burden on the reviewing engineer.

“There is a non-linear relationship between the size of the change and the potential risk of integrating that change—when you go from a ten line code change to a one hundred line code change, the risk of something going wrong is more than ten times higher, and so forth.”

Randy Sharp

“Ask a programmer to review ten lines of code, he’ll find ten issues. Ask him to do five hundred lines, and he’ll say it looks good.”

Giray Özil

Guidelines for Code Reviews include:

  • Everyone must have someone to review their changes before committing to trunk (see the sketch after this list).
  • Everyone should monitor the commit stream of their fellow team members so that potential conflicts can be identified and reviewed.
  • Define which changes qualify as high risk and may require review from a designated subject matter expert.
  • If someone submits a change that is too large to reason about easily, then it should be split up into multiple, smaller changes that can be understood at a glance.
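
One way to make the first guideline enforceable is to protect the trunk branch so merges require an approving review. The sketch below uses GitHub's branch protection endpoint for that purpose; the repository name, branch name, and minimum approval count are illustrative assumptions, and other hosts offer equivalent settings.

```python
# Minimal sketch of requiring peer review before changes reach trunk,
# using GitHub branch protection. Repo, branch, and GITHUB_TOKEN are
# hypothetical values for illustration.
import os
import requests

API = "https://api.github.com"
OWNER, REPO, TRUNK = "example-org", "example-service", "master"  # hypothetical

def require_peer_review(min_approvals: int = 1) -> None:
    """Require at least one approving review before merging to trunk."""
    response = requests.put(
        f"{API}/repos/{OWNER}/{REPO}/branches/{TRUNK}/protection",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "required_status_checks": None,
            "enforce_admins": True,
            "required_pull_request_reviews": {
                "required_approving_review_count": min_approvals,
            },
            "restrictions": None,
        },
        timeout=10,
    )
    response.raise_for_status()

if __name__ == "__main__":
    require_peer_review(min_approvals=1)
```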

Code Review Formats:

  • Pair programming: programmers work in pairs.
  • “Over-the-shoulder”: One developer looks over the author’s shoulder as the latter walks through the code.
  • Email pass-around: A source code management system emails code to reviewers automatically after the code is checked in.
  • Tool-assisted code review: Authors and reviewers use specialized tools designed for peer code review or facilities provided by the source code repositories.
Adapted from The DevOps Handbook

Potential Danger of Doing More Manual Testing and Change Freezes

When testing failures occur, the typical reaction is to do more testing. This reaction is especially counterproductive when the testing is performed manually, because manual testing is naturally slower and more tedious than automated testing.

Additional manual testing often means testing takes significantly longer, which leads to deploying less frequently and thus increases the deployment batch size. Instead of testing large batches of changes scheduled around change freeze periods, fully integrate testing into daily work as part of the smooth and continual flow into production.

Enable Pair Programming to Improve Changes

Pair programming is when two engineers work together at the same workstation, a method popularized by Extreme Programming and Agile in the early 2000s.

In one common pattern of pairing, one engineer fills the role of the driver, the person who actually writes the code, while the other engineer acts as the navigator, observer, or pointer, the person who reviews the work as it is being performed. The driver focuses their attention on the tactical aspects of completing the task, using the observer as a safety net and guide.

Dr. Laurie Williams performed a study in 2001 that showed “paired programmers are 15% slower than two independent individual programmers, while ‘error-free’ code increased from 70% to 85%.”

“Pairs typically consider more design alternatives than programmers working alone and arrive at simpler, more maintainable designs; they also catch design defects early.”

Dr. Laurie Williams

Pair programming has the additional benefit of spreading knowledge throughout the organization and increasing information flow within the team.

Evaluating the Effectiveness of Pull Request Process

One method to evaluate the effectiveness of peer review is to look at production outages and examine the peer review process for any relevant changes.
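
A minimal sketch of that evaluation might pull the pull requests merged shortly before an outage and surface how much context each one carried. The repository, the incident timestamp, and the four-hour window below are illustrative assumptions.

```python
# Minimal sketch: list PRs merged into master shortly before a production
# outage so their review quality can be examined. Repo, timestamps, and
# GITHUB_TOKEN are hypothetical.
import os
from datetime import datetime, timedelta, timezone
import requests

API = "https://api.github.com"
OWNER, REPO = "example-org", "example-service"  # hypothetical repository
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def prs_merged_before(outage: datetime, window: timedelta) -> list[dict]:
    """Return PRs merged into master within `window` before the outage."""
    response = requests.get(
        f"{API}/repos/{OWNER}/{REPO}/pulls",
        headers=HEADERS,
        params={"state": "closed", "base": "master",
                "sort": "updated", "direction": "desc"},
        timeout=10,
    )
    response.raise_for_status()
    suspects = []
    for pr in response.json():
        if not pr["merged_at"]:
            continue
        merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
        if outage - window <= merged <= outage:
            suspects.append(pr)
    return suspects

if __name__ == "__main__":
    outage_at = datetime(2021, 6, 1, 14, 30, tzinfo=timezone.utc)  # hypothetical incident
    for pr in prs_merged_before(outage_at, window=timedelta(hours=4)):
        context = len(pr["body"] or "")  # how much "why/how/risk" detail was given
        print(f"PR #{pr['number']} '{pr['title']}' - description length: {context} chars")
```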

Ryan Tomayko, CIO and co-founder of GitHub:

  • “A bad pull request is one that doesn’t have enough context for the reader, having little or no documentation of what the change is intended to do.”
  • “A great pull request has sufficient detail on why the change is being made, how the change was made, as well as any identified risks and resulting countermeasures.”

Fearlessly Cut Bureaucratic Process

Many companies still have long-standing processes for approval that require months to navigate. These approval processes can significantly increase lead times, not only preventing teams from delivering value quickly to customers, but potentially increasing the risk to our organizational objectives.

“A great metric to publish widely is how many meetings and work tickets are mandatory to perform a release—the goal is to relentlessly reduce the effort required for engineers to perform work and deliver it to the customer.”

Adrian Cockcroft

Lessons Learned

By implementing feedback loops, teams can enable everyone to work together toward shared goals, see problems as they occur, and ensure that features not only operate as designed in production but also achieve organizational goals and foster organizational learning.

Book Club: The DevOps Handbook (Chapter 17. Integrate Hypothesis-Driven Development and A/B Testing into Our Daily Work)

This entry is part 18 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 17

All too often in software projects, developers work on features for months or years, spanning multiple releases, without ever confirming whether the desired business outcomes are being met, such as whether a particular feature is achieving the desired results or even being used at all.

Before building a feature, teams should ask themselves: “Should we build it, and why?”

A Brief History of A/B Testing

A/B testing techniques were pioneered in direct response marketing, which is one of the two major categories of marketing strategies. The other is called mass marketing or brand marketing; it relies on placing as many ad impressions in front of people as possible to influence buying decisions.

In previous eras, before email and social media, direct response marketing meant sending thousands of postcards or flyers via postal mail, and asking prospects to accept an offer by calling a telephone number, returning a postcard, or placing an order.

Integrating A/B Testing Into Feature Testing

The most commonly used A/B technique in modern UX practice involves a website where visitors are randomly selected to be shown one of two versions of a page, either a control (“A”) or a treatment (“B”).

A/B tests are also known as online controlled experiments and split tests. Performing meaningful user research and experiments ensures that development efforts help achieve customer and organizational goals.
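
A minimal sketch of that random assignment is shown below, assuming a deterministic hash-based 50/50 split so a returning visitor always sees the same variant. The experiment name and page layouts are hypothetical.

```python
# Minimal sketch of an A/B (split) test: each visitor is deterministically
# assigned to the control ("A") or treatment ("B") version of a page.
import hashlib

def assign_variant(visitor_id: str, experiment: str = "booking-page-images") -> str:
    """Hash the visitor ID so the same visitor always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def render_booking_page(visitor_id: str) -> str:
    variant = assign_variant(visitor_id)
    # Control keeps the current layout; treatment shows the larger hotel images.
    return "large-image-layout" if variant == "B" else "standard-layout"

if __name__ == "__main__":
    for visitor in ("visitor-101", "visitor-102", "visitor-103"):
        print(visitor, assign_variant(visitor), render_booking_page(visitor))
```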

Integrate A/B Testing Into Releases

Fast and iterative A/B testing is made possible by being able to quickly and easily do production deployments on demand, using feature toggles and potentially delivering multiple versions of our code simultaneously to customer segments.

Integrate A/B Testing Into Feature Planning

Product owners should think about each feature as a hypothesis and use production releases as experiments with real users to prove or disprove that hypothesis.

Hypothesis-Driven Development (a worked evaluation follows the example):

  • We Believe that increasing the size of hotel images on the booking page
  • Will Result in improved customer engagement and conversion.
  • We Will Have Confidence To Proceed When we see a 5% increase in customers who review hotel images who then proceed to book within forty-eight hours.
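
A minimal sketch of evaluating such a hypothesis is shown below. The visitor and booking counts are made-up numbers, and the check treats the 5% threshold as a relative lift; a real experiment would also test for statistical significance before declaring confidence to proceed.

```python
# Minimal sketch of checking the hypothesis above against experiment results.
# All counts are illustrative, not data from the book.

def conversion_rate(bookings: int, viewers: int) -> float:
    return bookings / viewers

def hypothesis_confirmed(control_rate: float, treatment_rate: float,
                         required_lift: float = 0.05) -> bool:
    """Proceed only if the treatment shows at least a 5% relative increase."""
    lift = (treatment_rate - control_rate) / control_rate
    return lift >= required_lift

if __name__ == "__main__":
    control = conversion_rate(bookings=480, viewers=10_000)    # standard images
    treatment = conversion_rate(bookings=530, viewers=10_000)  # larger images
    print(f"control={control:.3%} treatment={treatment:.3%}")
    print("Confidence to proceed:", hypothesis_confirmed(control, treatment))
```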

Book Club: The DevOps Handbook (Chapter 16. Enable Feedback So Development and Operations Can Safely Deploy Code)

This entry is part 17 of 25 in the series DevOps Handbook

The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick DeBois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 16

The goal is to catch errors in the deployment pipeline before they get into production. However, there will still be errors teams don’t detect, and so they must rely on production telemetry to quickly restore service.

Solutions available to teams:

  • Turn off broken features with feature toggles (see the sketch after this list)
  • Fix forward (make code changes to fix the defect that are pushed into production through the deployment pipeline)
  • Roll back (switch back to the previous release by taking broken servers out of rotation using the blue-green or canary release patterns)
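
A minimal sketch of the first option follows: a feature toggle read at request time, so a broken feature can be switched off without a new deployment. The JSON flag file and feature name are illustrative assumptions; production systems typically use a dedicated flag service or configuration store.

```python
# Minimal sketch of a feature toggle acting as a kill switch for a broken feature.
import json
from pathlib import Path

FLAG_FILE = Path("feature_flags.json")  # hypothetical flag store (could be Redis, Consul, etc.)

def feature_enabled(name: str, default: bool = False) -> bool:
    """Read the current toggle state; fall back to the default if unset."""
    try:
        flags = json.loads(FLAG_FILE.read_text())
    except FileNotFoundError:
        flags = {}
    return bool(flags.get(name, default))

def recommendations_widget(user_id: str) -> str:
    if not feature_enabled("new-recommendations"):
        return render_old_recommendations(user_id)   # safe, known-good path
    return render_new_recommendations(user_id)       # new feature under toggle

def render_old_recommendations(user_id: str) -> str:
    return f"classic recommendations for {user_id}"

def render_new_recommendations(user_id: str) -> str:
    return f"new recommendations for {user_id}"

if __name__ == "__main__":
    print(recommendations_widget("user-42"))
```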

Since production deployments are one of the top causes of production issues, each deployment and change event is overlaid onto our metric graphs to ensure that everyone in the value stream is aware of relevant activity, enabling better communication and coordination, as well as faster detection and recovery.
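
One lightweight way to overlay deployment events onto metric graphs is to post an annotation to the monitoring system at deploy time. The sketch below assumes a Grafana-style annotations endpoint and hypothetical GRAFANA_URL and GRAFANA_TOKEN environment variables; the same idea applies to other monitoring tools.

```python
# Minimal sketch of recording a deployment event so it appears on dashboards.
import os
import time
import requests

def annotate_deployment(service: str, version: str) -> None:
    """Post a 'service deployed' annotation to the monitoring system."""
    response = requests.post(
        f"{os.environ['GRAFANA_URL']}/api/annotations",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        json={
            "time": int(time.time() * 1000),   # epoch milliseconds
            "tags": ["deployment", service],
            "text": f"Deployed {service} {version}",
        },
        timeout=10,
    )
    response.raise_for_status()

if __name__ == "__main__":
    annotate_deployment(service="checkout-service", version="2021.06.01-3")
```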

Developers Share Production Duties With Operations

Even when production deployments and releases go flawlessly, any complex service will still have unexpected problems, such as incidents and outages that happen at inopportune times. Even if the problem results in a defect being assigned to the feature team, it may be prioritized below the delivery of new features.

As Patrick Lightbody, SVP of Product Management at New Relic, observed in 2011, “We found that when we woke up developers at 2 a.m., defects were fixed faster than ever.” This practice helps Development management see that business goals are not achieved simply because features have been marked as “done”. Instead, the feature is only done when it is performing as designed in production, without causing excessive escalations or unplanned work for either Development or Operations.

When developers get feedback on how their applications perform in production, which includes fixing it when it breaks, they become closer to the customer.

Have Developers Follow Work Downstream

One of the most powerful techniques in interaction and user experience design (UX) is contextual inquiry. This is when the product team watches a customer use the application in their natural environment, often working at their desk. Doing so often uncovers ways that customers struggle with the application, such as:

  • Requiring scores of clicks to perform simple tasks in their daily work
  • Cutting and pasting text from multiple screens
  • Writing down notes on paper

Developers should follow their work downstream, so they can see how downstream work centers must interact with their product to get it running in production. Doing so creates feedback on the non-functional aspects of the code and identifies ways to improve deployability, manageability, operability, etc.

Have Developers Initially Self-Manage Their Production Service

Even when Developers are writing and running their code in production-like environments in their daily work, Operations may still experience disastrous production releases because it’s the first time the application is under true production conditions. This result occurs because operational learnings often occur too late in the software life cycle.

One potential countermeasure is to do what Google does, which is have Development groups self-manage their services in production before they become eligible for a centralized Ops group to manage. By having developers be responsible for deployment and production support, teams are more likely to have a smooth transition to Operations.

Teams could define launch requirements that must be met in order for services to interact with real customers and be exposed to real production traffic.

Launch Guidance (a minimal automated check follows the list):

  • Defect counts and severity: Does the application actually perform as designed?
  • Type/frequency of pager alerts: Is the application generating an unsupportable number of alerts in production?
  • Monitoring coverage: Is the coverage of monitoring sufficient to restore service when things go wrong?
  • System architecture: Is the service loosely-coupled enough to support a high rate of changes and deployments in production?
  • Deployment process: Is there a predictable, deterministic, and sufficiently automated process to deploy code into production?
  • Production hygiene: Is there evidence of enough good production habits that would allow production support to be managed by anyone else?
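
The guidance above can be codified as an automated readiness check that runs against service telemetry. The thresholds and fields in the sketch below are illustrative assumptions, not criteria from the book.

```python
# Minimal sketch of an automated launch readiness check built from the
# guidance above. Thresholds and telemetry values are hypothetical.
from dataclasses import dataclass

@dataclass
class ServiceTelemetry:
    open_severity_1_defects: int
    pager_alerts_per_week: int
    monitoring_coverage: float   # fraction of endpoints with alerts/dashboards
    deploys_fully_automated: bool

def launch_ready(t: ServiceTelemetry) -> list[str]:
    """Return the list of failed checks; an empty list means ready to launch."""
    failures = []
    if t.open_severity_1_defects > 0:
        failures.append("outstanding severity-1 defects")
    if t.pager_alerts_per_week > 10:
        failures.append("unsupportable pager alert volume")
    if t.monitoring_coverage < 0.9:
        failures.append("insufficient monitoring coverage")
    if not t.deploys_fully_automated:
        failures.append("deployment process is not automated")
    return failures

if __name__ == "__main__":
    telemetry = ServiceTelemetry(0, 4, 0.95, True)  # hypothetical service
    print(launch_ready(telemetry) or "ready for production traffic")
```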

Google’s Service Handback Mechanism

When a production service becomes sufficiently fragile, Operations has the ability to return production support responsibility back to Development. When a service goes back into a developer-managed state, the role of Operations shifts from production support to consultation, helping the team make the service production-ready.

Adapted from The DevOps Handbook

Google created two sets of safety checks for two critical stages of releasing new services: the Launch Readiness Review (LRR) and the Hand-Off Readiness Review (HRR). The LRR must be performed and signed off on before any new Google service is made publicly available to customers and receives live production traffic. The HRR is performed when the service is transitioned to an Ops-managed state; it is far more stringent and has higher acceptance standards.

The practice of SREs helping product teams early is an important cultural norm that is continually reinforced at Google. Helping product teams is a long-term investment that will pay off many months later when it comes time to launch. It is a form of ‘good citizenship’ and ‘community service’ that is valued and is routinely considered when evaluating engineers for SRE promotions.

Adapted from The DevOps Handbook

Common Regulatory Concerns to Answer

  • Does the service generate a significant amount of revenue?
  • Does the service have high user traffic or have high outage/impairment costs?
  • Does the service store payment cardholder information, such as credit card numbers, or personally identifiable information, such as Social Security numbers or patient care records? Are there other security issues that could create regulatory, contractual obligation, privacy, or reputation risk?
  • Does the service have any other regulatory or contractual compliance requirements associated with it, such as US export regulations, PCI-DSS, HIPAA, and so forth?