
Welcome to Red Green Refactor

We officially welcome you to the start of Red Green Refactor, a technology blog about automation and DevOps. We are a group of passionate technologists who care about learning and sharing our knowledge. Information Technology is a huge field, and even though we’re only a small part of it, we wanted another outlet to collaborate with the community.

Why Red Green Refactor?

Red Green Refactor is a term commonly used in Test Driven Development to describe its test-first approach to software design. Kent Beck is generally credited with developing, or in his words “rediscovering”, Test Driven Development. The mantra for the practice is red-green-refactor, where the colors refer to the status of the test driving the development code.

The Red is writing a small piece of test code before the development code is implemented. The test should fail upon execution – a red failure. The Green is writing just enough development code to get the test code to pass. The test should pass upon execution – a green pass. The Refactor is making small improvements to the development code without affecting its behavior. The quality of the code is improved according to team standards by addressing “code smells” (making the code readable and maintainable, removing duplication) or applying simple design patterns. The point of the practice is to make the code more robust by catching mistakes early, with an eye on code quality from the beginning. Writing in small batches helps the practitioner think about the design of their program consistently.
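
To make the cycle concrete, here is a minimal sketch of one pass through red-green-refactor, using a hypothetical shopping-cart function and a Jest-style test; the file names, function, and numbers are illustrative rather than taken from a real project.

```typescript
// cart.test.ts -- RED: the test is written first; calculateTotal does not exist yet, so this fails.
import { calculateTotal } from "./cart";

test("applies a 10% discount to orders over $100", () => {
  expect(calculateTotal([60, 50], { discountOver: 100, rate: 0.1 })).toBeCloseTo(99);
});
```

```typescript
// cart.ts -- GREEN: the simplest implementation that makes the test pass.
export interface DiscountPolicy {
  discountOver: number; // subtotal threshold that triggers the discount
  rate: number;         // discount rate, e.g. 0.1 for 10%
}

export function calculateTotal(prices: number[], policy: DiscountPolicy): number {
  const subtotal = prices.reduce((sum, price) => sum + price, 0);
  // REFACTOR: once the test is green, this conditional could be extracted into a
  // named helper without changing behavior, re-running the test after each small step.
  return subtotal > policy.discountOver ? subtotal * (1 - policy.rate) : subtotal;
}
```

Each loop stays small: a failing test, the minimum code to make it pass, then a behavior-preserving cleanup.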

“Refactoring is a controlled technique for improving the design of an existing codebase.”

Martin Fowler

The goal of Red Green Refactor is similar to the practice of refactoring: to make small-yet-cumulative positive changes, but applied to learning, helping to educate the community about automation and DevOps. The act of publishing also encourages our team to refine our materials in preparation for a larger audience. Many of the writers on Red Green Refactor speak at conferences, professional groups, and the occasional webinar. The learning at Red Green Refactor will be bi-directional: to the readers and to the writers.

Who Are We?

The writers on Red Green Refactor come from varied backgrounds but all of us made our way into information technology, some purposefully and some accidentally. Our primary focus was on test automation, which has evolved into DevOps practices as we expanded our scope into operations. Occasionally we will invite external contributors to post on a subject of interest. We have a few invited writers lined up and ready to contribute.

“Automation Team” outing with some of the Red Green Refactor authors

As for myself, I have a background in Physics & Biophysics, with over a decade spent in research science studying fluorescence spectroscopy and microscopy before joining IT. I worked as a requirements analyst, developer, and tester before joining the ranks of pointy-headed management. That doesn’t stop me from exploring new tech at home, though, or from posting about it on a blog.

What Can You Expect From Red Green Refactor?

Technology

Some companies are in the .NET stack, some are Java shops, but everyone needs some form of automation. The result is many varied implementations of both test & task automation. Our team has supported almost all the application types under the sun (desktop, web, mobile, database, API/services, mainframe, etc.). We’ve also explored many tools, both open-source and commercial, at companies ranging from ancient tech to the bleeding edge. Our posts will be driven by both prior experience and exploration of the unknown.

We’ll be exploring programming languages and tools in the automation space. Readers can expect to learn about frameworks, cloud solutions, CI/CD, design patterns, code reviews, refactoring, metrics, implementation strategies, performance testing, etc. – it’s open-ended.

Continuous Improvement

We aim to keep our readers informed about continuous improvement activities in the community. One of the great things about this field is there is so much to learn and it’s ever-changing. It can be difficult at times with the firehose of information coming at you, since there are only so many hours in the day. We tend to divide responsibility among our group to perform “deep dives” into certain topics and then share that knowledge with a wider audience (for example: Docker, Analytics, or Robotic Process Automation). In the same spirit we plan to share information on Red Green Refactor about continuous improvement. Posts about continuous improvement will include: trainings, conference recaps, professional groups, aggregated articles, podcasts, tech book summaries, career development, and even the occasional job posting.

Once again welcome to Red Green Refactor. Your feedback is always welcome.

From The Pipeline v37.0

The following is a regular feature where we share articles, podcasts, and webinars of interest from the web. We have a numeric theme with the items chosen for this “From the Pipeline”.

10 Best Test Data Management Tools in 2022

Recently I’ve been exploring Test Data Management (TDM) tooling after spending a year working with some tricky data sets for automated testing. This list is a good starting point because TDM is not only about creating synthetic data, but also subsetting, masking, obfuscation, and structuring. Given the enormity of the challenge, it’s no wonder there are many commercial vendors offering solutions.

What’s New in Cypress 10

The latest release of Cypress has a number of changes. First, component testing is updated to test directly in the browser. There are also project structure updates to configuration and plugins. Naming conventions are recommended to differentiate between end-to-end tests and component tests. Lastly, a migration assistant helps with making the directory changes to your project.
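
As a rough illustration of those changes (the paths, baseUrl, and dev-server options below are placeholders, not recommendations), the Cypress 10 cypress.config.ts replaces cypress.json, and plugin wiring moves into setupNodeEvents:

```typescript
// cypress.config.ts -- in Cypress 10 this file replaces cypress.json and cypress/plugins/index.js
import { defineConfig } from "cypress";

export default defineConfig({
  e2e: {
    // End-to-end specs follow the *.cy.ts naming convention, separating them from component tests.
    specPattern: "cypress/e2e/**/*.cy.ts",
    baseUrl: "http://localhost:3000", // illustrative value
    setupNodeEvents(on, config) {
      // Logic that previously lived in the plugins file is registered here.
      return config;
    },
  },
  component: {
    // Component tests run directly in the browser against a dev server.
    specPattern: "src/**/*.cy.ts",
    devServer: { framework: "react", bundler: "vite" }, // example framework/bundler pairing
  },
});
```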

15 Top Load Testing Tools for 2022

Joe Colantonio updates his list of Load Testing tools each year in an easy-to-use guide. He has interspersed recorded interviews and reviews of those tools in his guide to make it excellent reference material if you’re looking to brush up on the best tools in the industry.

18 virtual team building activities and games for 2022

A handy guide from Atlassian on some remote team building activities. Aside from the enumerated list, I really enjoy the table reference, which lets you quickly identify the best circumstance for each activity. Useful for any team lead or Scrum Master.

16 ways software testability can assist manual testing

Ashley Graf has posted a solid article on testability for manual testing. She provides some valuable ideas to use in manual testing practices, which should be seen as just as important as automated tests.

From The Pipeline v36.0


The following is a regular feature where we share articles, podcasts, and webinars of interest from the web.

The top 7 advanced features of Cypress to know about

Eran Kinsbruner and Gleb Bahmutov posted some features of Cypress, one of the fastest growing test automation frameworks. They enumerate many of the benefits, chief of which is running the tests inside the browser. Additional benefits of this approach include execution speed and debugging capabilities, along with CI server execution, visual testing capabilities, and a growing list of plugins to support team needs.

What’s the Problem with User Stories?

In an article associated with an Agile DevOps Virtual Conference, Adam Sandman recommends making the following considerations when writing user stories: (1) Understand the drivers and goals; (2) Break down the problem into broad themes; (3) Write the user stories; and, (4) Update the stories and write acceptance tests as you develop the system.

STP & STPCon: Changing Times, Moving Forward

Software Test Professionals Conference is shifting post-pandemic under the InflectraCon banner. STPCon was a long-running series of in-person conferences across the U.S. along with online webinars and workshops. The pandemic damaged the ability of STP to continue operating, so working with InflectraCon will ensure the events continue.

Code Colocation is King

Koen van Gilst has posted a short article on the principle of proximity in a codebase. He argues that you should keep code that changes together close together. This includes where to place one-off pieces of functionality, where tests should live, and other code-structuring recommendations. A good read for consideration.

Risk Coverage: A New Currency for Testing

“In the era of agile and DevOps, release decisions need to be made rapidly—preferably, even automatically and instantaneously. Test results that focus solely on the number of test cases leave you with a huge blind spot. If you want fast, accurate assessments of the risks associated with promoting the latest release candidate to production, you need a new currency in testing: Risk coverage needs to replace test coverage.”

From The Pipeline v35.0


The following is a regular feature where we share articles, podcasts, and webinars of interest from the web.

Avoiding Storms with BDD & Automated Testing

Adam Cogan provides a decent introduction to Behavior-Driven Development in a post that highlights Microsoft’s shift to using Playwright as their primary test automation tool. While I don’t agree with his assessment that Playwright has overtaken Selenium just yet, having more supported tools to assist testers is a good thing.

Pillars of a Good Test Automation Framework

Marie Drake takes the reader through her pillars of a good automation framework, based on a discussion from an online panel the prior month. Her must-haves for a framework are: reusability, maintainability, scalability, framework extensions, ease of use, and community support.

10 Experiments to Improve Your Exploratory Testing Note Taking

From the archives, this gem from Alan Richardson provides guidance on how to enhance your Exploratory Testing sessions with several easy-to-remember experiments. This is recommended reading for testers looking to refresh their skills or anyone just starting to learn Exploratory Testing.

What Is Mobile Device Testing? Strategies for Testing on Devices

Similar to the above advice on Exploratory Testing, the following article from Perfecto Mobile provides some excellent advice on the approach a team or organization should take when testing with mobile devices.

Demystifying Differential and Incremental Analysis for Static Code Analysis within DevOps

“The DevOps Movement has many recommended practices for automation of processes and testing. However, across different market verticals, the requirements, practices, and cadence of releases vary widely. For example, in the more security or safety relevant software markets, development processes often also include compliance to coding standards, or other security and safety practices that must be “baked” into the process in order to achieve compliance.”

From The Pipeline v34.0


The following will be a regular feature where we share articles, podcasts, and webinars of interest from the web.

How Much Testing is Enough?

The Google Testing Blog recently posted details on the scope of their internal testing practice. In addition to defining core terms, they briefly outline their testing guidance: (1) Document your process or strategy; (2) Have a solid base of unit tests; (3) Don’t skimp on integration testing; (4) Perform end-to-end testing for Critical User Journeys; (5) Understand and implement the other tiers of testing; (6) Understand your coverage of code and functionality; (7) Use feedback from the field to improve your process.

Test Automation Strategy Guide

Julia Pottinger posted some excellent thoughts on the approach teams should take for test automation. Every team should start with their goal in mind. Once that is determined, teams should identify the tools and techniques for automation. This is followed by identifying who is writing the automation, when it will be executed, and the environments to be used.

What Makes a Good Automated Test?

Kristin has provided a set of guidelines for determining a good automated test. First, tests should be meaningful. This is because each test you write is an investment in the maintenance of that test. Tests should also be maintainable – the automated checks should be readable and well-organized. Tests should also run quickly in order to provide fast feedback for teams.
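
As a small, hypothetical illustration of those qualities (the builder and names are invented for the example), a meaningful, readable, fast check might look like this:

```typescript
// Jest-style unit check: the name states the behavior under test (meaningful), setup is
// hidden behind a named builder (maintainable and readable), and there are no slow
// external calls, so feedback stays fast.
import { buildCheckout } from "./testBuilders"; // hypothetical test-data builder

test("rejects checkout when the cart is empty", () => {
  const checkout = buildCheckout({ items: [] }); // arrange
  const result = checkout.submit();              // act
  expect(result.ok).toBe(false);                 // assert
  expect(result.error).toBe("EMPTY_CART");
});
```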

The Test Data Bottleneck and How to Solve It

Test data is one of the major bottlenecks in testing processes. By simplifying test data, we can solve this bottleneck by tackling four major challenges: Time, People, Size, and Money.

GitHub’s Engineering Team has moved to Codespaces

Recently the GitHub development team shifted to Codespaces for the majority of GitHub development. They’re making this change because local development was brittle — any changes to a local environment could make it useless and require hours of development time to recover. Collaborating on multiple branches across multiple projects was painful. Now, Codespaces executes a shallow clone and then fetches the repository history in the background, which reduces the time to clone. They also created a GitHub Action that runs nightly, clones the repository, bootstraps dependencies, then builds & pushes a Docker image of the result.

Book Club: The DevOps Handbook (Conclusion)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Conclusion: A Call to Action

DevOps offers a solution at a time when every technology leader is challenged with enabling security, reliability, agility, handling security breaches, improving time to market, and massive technology transformations.

An inherent conflict can exist between Development and Operations that creates worsening problems, which results in slower time to market for new products and features, poor quality, increased outages and technical debt, reduced engineering productivity, as well as increased employee dissatisfaction and burnout. DevOps principles and patterns enable teams to break this core, chronic conflict.

DevOps requires potentially new cultural and management norms, and changes in technical practices and architecture. This results in maximizing developer productivity, organizational learning, high employee satisfaction, and the ability to win in the marketplace.

DevOps is not just a technology imperative, but also an organizational imperative. DevOps is applicable and relevant to any and all organizations that must increase flow of planned work through the technology organization, while maintaining quality, reliability, and security for customers.

“The call to action is this: no matter what role you play in your organization, start finding people around you who want to change how work is performed.”

The DevOps Handbook

This concludes the book club summary for “The DevOps Handbook.” Other book club summaries are available for “The Phoenix Project” and “BDD: Discovery”. Stay subscribed for more book club summaries and other great content on automation & DevOps.

Book Club: The DevOps Handbook (Chapter 23. Protecting the Deployment Pipeline and Integrating Into Change Management and Other Security and Compliance Controls)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 23

Almost any IT organization of any size will have existing change management processes, which are the primary controls to reduce operations and security risks. The goal is to successfully integrate security and compliance into any existing change management process.

ITIL breaks changes down into three categories:

Standard Changes: lower-risk changes that follow an established and approved process but can also be pre-approved. They can include monthly updates of application tax tables or country codes, website content & styling changes, and certain types of application or operating system patches that have a well-understood impact. The change proposer does not require approval before deploying the change, and change deployments can be completely automated and should be logged so there is traceability.

Normal Changes: higher-risk changes that require review or approval from the agreed upon change authority. In many organizations, this responsibility is inappropriately placed on the change advisory board (CAB) or emergency change advisory board (ECAB), which may lack the required expertise to understand the full impact of the change, often leading to unacceptably long lead times. Large code deployments may contain hundreds of thousands of lines of new code, submitted by hundreds of developers. In order for normal changes to be authorized, the CAB will almost certainly have a well-defined request for change (RFC) form that defines what information is required for the go/no-go decision.

Urgent Changes: These are emergency and potentially high-risk changes that must be put into production immediately. These changes often require senior management approval but allow documentation to be performed after the fact. A key goal of DevOps practices is to streamline the normal change process such that it is also suitable for emergency changes.

Recategorize The Majority of Lower Risk Changes as Standard Changes

One way to support an assertion that changes are low risk is to show a history of changes over a significant time period and provide a complete list of production issues during that same period. Ideally, deployments will be performed automatically by configuration management and deployment pipeline tools and the results will be automatically recorded.

Creating this traceability and context should be easy and should not create an overly onerous or time consuming burden for engineers. Linking to user stories, requirements, or defects is almost certainly sufficient.

What To Do When Changes are Categorized as Normal Changes

The goal is to ensure that the change can be deployed quickly, even if it is not fully automated. Ensure that any submitted change requests are as complete and accurate as possible, giving the CAB everything they need to properly evaluate the change.

Because the submitted changes will be manually evaluated by people, it is even more important that the context of the change is described. The goal is to share the evidence and artifacts that give confidence that the change will operate in production as designed.

Reduce Reliance on Separation of Duties

For decades, developers have used separation of duty as one of the primary controls to reduce the risk of fraud or mistakes in the software development process. As complexity and deployment frequency increase, performing production deployments successfully increasingly requires everyone in the value stream to quickly see the outcomes of their actions.

Separation of duty often can impede this by slowing down and reducing the feedback engineers receive on their work. Instead, choose controls such as pair programming, continuous inspection of code check-ins, and code review.

Ensure Documentation and Proof For Auditors and Compliance Officers

As technology organizations increasingly adopt DevOps patterns, there is more tension than ever between IT and audit. These new DevOps patterns challenge traditional thinking about auditing, controls, and risk mitigation.

“DevOps is all about bridging the gap between Dev and Ops. In some ways, the challenge of bridging the gap between DevOps and auditors and compliance officers is even larger. For instance, how many auditors can read code and how many developers have read NIST 800-37 or the Gramm-Leach-Bliley Act? That creates a gap of knowledge, and the DevOps community needs to help bridge that gap.”

Bill Shinn, a principal security solutions architect at Amazon Web Services

Instead, teams work with auditors in the control design process. Assign a single control for each sprint to determine what is needed in terms of audit evidence. Send all the data into the telemetry systems so the auditors can get what they need, completely self-serviced.

“In audit fieldwork, the most commonplace methods of gathering evidence are still screenshots and CSV files filled with configuration settings and logs. Our goal is to create alternative methods of presenting the data that clearly show auditors that our controls are operating and effective.”

The DevOps Handbook

Case Study: Relying on Production Telemetry for ATM Systems

Information security, auditors, and regulators often put too much reliance on code reviews to detect fraud. Instead, they should be relying on production monitoring controls in addition to using automated testing, code reviews, and approvals, to effectively mitigate the risks associated with errors and fraud.

“Many years ago, we had a developer who planted a backdoor in the code that we deploy to our ATM cash machines. They were able to put the ATMs into maintenance mode at certain times, allowing them to take cash out of the machines. We were able to detect the fraud very quickly, and it wasn’t through a code review. These types of backdoors are difficult, or even impossible, to detect when the perpetrators have sufficient means, motive, and opportunity.”

“However, we quickly detected the fraud during our regular operations review meeting when someone noticed that ATMs in a city were being put into maintenance mode at unscheduled times. We found the fraud even before the scheduled cash audit process, when they reconcile the amount of cash in the ATMs with authorized transactions.”

The DevOps Handbook

Book Club: The DevOps Handbook (Chapter 22. Information Security as Everyone’s Job, Every Day)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 22

The goal is to create & integrate security controls into the daily work of Development and Operations, so that security is part of everyone’s job, every day. Ideally the work will be automated and put into a deployment pipeline. Manual processes, acceptances, and approvals should be replaced with automated controls, relying less on separation of duties and change approval.

To integrate security, compliance, and change management:

  • Make security a part of everyone’s job
  • Integrate preventative controls into the shared source code repository
  • Integrate security with the deployment pipeline
  • Integrate security with telemetry to better enable detection and recovery
  • Protect the deployment pipeline
  • Integrate the deployment activities with the change approval processes
  • Reduce reliance on separation of duty

Integrate Security Into Development Iteration Demonstrations

One of the goals is to have feature teams engaged with Infosec as early as possible, as opposed to primarily engaging at the end of the project. Invite Infosec to the product demonstrations at the end of each development interval so that they can better understand the team goals in the context of organizational goals, observe their implementations as they are being built, and provide guidance and feedback at the earliest stages of the project, when there is the most amount of time and freedom to make corrections.

“By having Infosec involved throughout the creation of any new capability, we were able to reduce our use of static checklists dramatically and rely more on using their expertise throughout the entire software development process.”

Justin Arbuckle, Chief Architect at GE Capital

Integrate Security Into Defect Tracking and Postmortems

Track all open security issues in the same work tracking system that Development and Operations are using, ensuring the work is visible and can be prioritized against all other work. Traditionally, Infosec has stored security vulnerabilities in a GRC (governance, risk, and compliance) tool.

Integrate Preventive Security Controls Into Shared Source Code Repositories and Shared Services

Add to the shared source code repository any mechanisms or tools that help ensure applications and environments are secure; examples include authentication and encryption libraries and services. Version control also serves as an omni-directional communication mechanism to keep all parties aware of changes being made.

Items to include in version control related to Security:

  • Code libraries and their recommended configurations
  • Secret management using tools such as Vault, credstash, Trousseau, Red October, etc.
  • OS packages and builds

Integrate Security Into Deployment Pipelines

Prior state: security reviews were started after development was completed. The resulting documentation would be handed to Development and Operations and often went completely unaddressed due to project due-date pressure or because problems were found too late in the SDLC.

Goal state: automate as many information security tests as possible so they run as part of the deployment pipeline. Security should provide both Dev and Ops with fast feedback on their work.

Ensure Security of the Application

Development testing focuses on the correctness of functionality, or the happy path, which validates user journeys where everything goes as expected, with no exceptions or error conditions. QA, Infosec, and Fraud practitioners will often focus on the sad paths, which happen when things go wrong, especially in relation to security-related error conditions.

Static analysis: this is testing performed in a non-runtime environment, ideally in the deployment pipeline. Typically, a static analysis tool will inspect program code for all possible run-time behaviors and seek out coding flaws, back doors, and potentially malicious code. Examples of tools include Brakeman, Code Climate, and searching for banned code functions.
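
As one hedged sketch of the last idea, a pipeline step that searches for banned code functions can be a small script that fails the build when a disallowed call appears; the banned list and source directory below are assumptions for illustration:

```typescript
// scan-banned.ts -- fail the pipeline if any source file calls a disallowed API.
import { readdirSync, readFileSync, statSync } from "fs";
import { join } from "path";

// Illustrative ban list; a real project would agree on its own.
const BANNED = ["eval(", "document.write(", "child_process.exec("];

function scan(dir: string): string[] {
  const hits: string[] = [];
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      hits.push(...scan(path));
    } else if (path.endsWith(".ts") || path.endsWith(".js")) {
      const source = readFileSync(path, "utf8");
      for (const fn of BANNED) {
        if (source.includes(fn)) hits.push(`${path}: ${fn}`);
      }
    }
  }
  return hits;
}

const findings = scan("src"); // assumed source directory
if (findings.length > 0) {
  console.error("Banned function calls found:\n" + findings.join("\n"));
  process.exit(1); // non-zero exit fails the deployment pipeline stage
}
```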

Dynamic analysis: dynamic analysis consists of tests executed while a program is in operation. Dynamic tests monitor items such as system memory, functional behavior, response time, and overall performance of the system. Ideally, automated dynamic testing is executed during the automated functional testing phase of a deployment pipeline.

Dependency scanning: another type of static testing, normally performed at build time inside a deployment pipeline, involves inventorying dependencies for binaries and executables and ensuring that these dependencies are free of vulnerabilities or malicious binaries.

Source code integrity and code signing: All developers should have their own PGP key, perhaps created and managed in a system such as keybase.io. All commits to version control should be signed — that is straightforward to configure using the open-source tool git. All packages created by the CI process should be signed, and their hash recorded in the centralized logging service for audit purposes.

Ensure Security of Software Supply Chain

“We are no longer writing customized software—instead, we assemble what we need from open source parts, which has become the software supply chain that we are very much reliant upon.”

Josh Corman

Using commercial or open source libraries brings in vulnerabilities along with their functionality.

The 2015 Sonatype State of the Software Supply Chain Report on vulnerability had some noteworthy findings. For one, the typical organization relied upon 7,601 build artifacts and used 18,614 different versions. 7.5% of those components had known vulnerabilities, with over 66% of those vulnerabilities being over two years old without having been resolved. For open source projects with known vulnerabilities registered in the National Vulnerability Database, only 41% were ever fixed, and those fixes took an average of 390 days to publish. For vulnerabilities labeled at the highest severity, fixes required 224 days.

Ensure Security of the Environment

Environments should be in a hardened, risk-reduced state.

One approach is to generate automated tests to ensure that all appropriate settings have been correctly applied for configuration hardening, database security settings, key lengths, etc.
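
One hedged example of such a check, with an invented config file and thresholds, is an automated test asserting that a hardening setting is actually applied:

```typescript
// Jest-style hardening check; the file path, field names, and thresholds are
// assumptions for this sketch, not a standard.
import { readFileSync } from "fs";

test("TLS configuration enforces hardened settings", () => {
  const config = JSON.parse(readFileSync("config/tls.json", "utf8"));
  expect(config.minVersion).toBe("TLSv1.2");                  // no legacy protocol versions
  expect(config.keyLengthBits).toBeGreaterThanOrEqual(2048);  // minimum key length
});
```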

Integrate Information Security Into Production Telemetry

Internal security controls are often ineffective in successfully detecting breaches in a timely manner, either because of blind spots in monitoring or because no one in the organization is examining the relevant telemetry in their daily work.

Deploy the monitoring, logging, and alerting required to fulfill information security objectives throughout applications and environments, as well as ensure that it’s adequately centralized to facilitate easy and meaningful analysis and response.

Creating Security Telemetry in Applications

In order to detect problematic user behavior that could be an indicator or enabler of fraud and unauthorized access, create the relevant telemetry in applications (a minimal sketch follows the list):

  • Successful and unsuccessful user logins
  • User password resets
  • User email address resets
  • User credit card changes
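
A minimal sketch of what emitting that telemetry could look like in application code (the event shapes and transport are assumptions; in practice the events would flow into the centralized logging and monitoring system):

```typescript
// Structured security events emitted from application code so that failed logins,
// password resets, and similar actions are visible in centralized telemetry.
type SecurityEvent =
  | { type: "login_failed"; userId: string }
  | { type: "login_succeeded"; userId: string }
  | { type: "password_reset"; userId: string }
  | { type: "email_changed"; userId: string };

function emitSecurityEvent(event: SecurityEvent): void {
  // Stand-in transport: write structured JSON to stdout for a log agent to ship.
  console.log(JSON.stringify({ ...event, timestamp: new Date().toISOString() }));
}

// Example usage inside an authentication handler:
function onLoginAttempt(userId: string, success: boolean): void {
  emitSecurityEvent({ type: success ? "login_succeeded" : "login_failed", userId });
}
```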

Creating Security Telemetry in Environments

Create sufficient telemetry in test environments to detect early indicators of unauthorized access. Monitoring opportunities include:

  • OS changes (in production or in build infrastructure)
  • Security group changes
  • Changes to configurations
  • Cloud infrastructure changes
  • Cross-site scripting attempts
  • SQL Injection attempts
  • Web server errors

“Nothing helps developers understand how hostile the operating environment is more than seeing their code being attacked in real-time.”

Nick Galbreath, Director of Engineering at Etsy

Security Telemetry at Etsy:

  • Abnormal production program terminations
  • Database syntax error
  • Indications of SQL injection attacks

Protect The Deployment Pipeline

If someone compromises the servers running the deployment pipeline, which hold the credentials for the version control system, they could steal source code. If the deployment pipeline has write access, an attacker could also inject malicious changes into the version control repository and, from there, into applications and services.

Risks to CI/CD pipelines include:

  • Developers introducing code that enables unauthorized access (mitigate through controls such as code testing, code reviews, and penetration testing)
  • Unauthorized users gaining access to the code or environment (mitigated via controls such as ensuring configurations match known, good states, and effective patching)

In order to protect our continuous build, integration, or deployment pipeline, a mitigation strategy may include:

  • Hardening continuous build and integration servers and ensuring they can be reproduced in an automated manner
  • Reviewing all changes introduced into version control, either through pair programming at commit time or by a code review process between commit and merge into trunk, to prevent continuous integration servers from running uncontrolled code
  • Instrumenting the repository to detect when test code containing suspicious API calls is checked in, perhaps quarantining it and triggering an immediate code review
  • Ensuring every CI process runs on its own isolated container or VM
  • Ensuring the version control credentials used by the CI system are read-only

Book Club: The DevOps Handbook (Chapter 21. Reserve Time to Create Organizational Learning and Improvement)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 21

One of the practices that forms part of the Toyota Production System is called the improvement blitz (kaizen), defined as a dedicated and concentrated period of time to address a particular issue, often over the course of several days.

“…blitzes often take this form: A group is gathered to focus intently on a process with problems…The blitz lasts a few days, the objective is process improvement, and the means are the concentrated use of people from outside the process to advise those normally inside the process.”

The DevOps Handbook

Institutionalize Rituals To Pay Down Technical Debt

Teams should schedule rituals that help enforce the practice of reserving Dev and Ops time for improvement work, such as non-functional requirements, automation, etc. One of the easiest ways to do this is to schedule and conduct day- or week-long improvement blitzes, where everyone on a team self-organizes to fix problems they care about—no feature work is allowed.

The technique of dedicated rituals for improvement work has also been called spring or fall cleanings. Other terms have also been used, such as: hack days, hackathons, and innovation time. The goal during these blitzes is not to simply experiment and innovate for the sake of testing out new technologies, but to improve daily work.

The improvement practice reinforces a culture in which engineers work across the entire value stream to solve problems. What makes improvement blitzes so powerful is empowering those closest to the work to continually identify and solve their own problems.

Enable Everyone To Teach and Learn

A dynamic culture of learning creates conditions so that everyone can not only learn, but also teach, whether through traditional didactic methods (attending training) or more experiential or open methods (conferences).

“We have five thousand technology professionals, who we call ‘associates.’ Since 2011, we have been committed to create a culture of learning—part of that is something we call Teaching Thursday, where each week we create time for our associates to learn. For two hours, each associate is expected to teach or learn. The topics are whatever our associates want to learn about—some of them are on technology, on new software development or process improvement techniques, or even on how to better manage their career. The most valuable thing any associate can do is mentor or learn from other associates.”

Steve Farley, VP of Information Technology at Nationwide Insurance

Organizations can further help teach skills through daily work by jointly performing code reviews that include both parties so that developers learn by doing, as well as by having Development and Operations work together to solve small problems.

Share Your Experiences From DevOps Conferences

In many cost-focused organizations, engineers are often discouraged from attending conferences and learning from their peers. To help build a learning organization, companies should instead encourage engineers (both from Development and Operations) to attend conferences, give talks at them, and create & organize internal or external conferences themselves. For instance, Nationwide, Target, and Capital One have internal tech conferences.

Create Internal Consulting and Coaches To Spread Practices

Creating an internal coaching and consulting organization is a method commonly used to spread expertise across an organization. Google’s Testing on the Toilet (or TotT) was a weekly testing periodical. Each week, they published a newsletter in nearly every bathroom in nearly every Google office worldwide.

“The goal was to raise the degree of testing knowledge and sophistication throughout the company. It’s doubtful an online-only publication would’ve involved people to the same degree.”

Mike Bland, Google

Book Club: The DevOps Handbook (Chapter 20. Convert Local Discoveries into Global Improvements)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 20

Use Chat Rooms and Chat Bots to Automate and Capture Organizational Knowledge

Having work performed by automation in a chat room has numerous benefits, including:

  • Everyone sees everything that is happening
  • Engineers on their first day of work can see what daily work looks like and how it’s performed
  • People are more apt to ask for help when they see others helping each other
  • Rapid organizational learning is enabled and accumulated

Automate Standardized Processes in Software Re-use

Instead of putting our expertise into Word documents, teams need to transform these documented standards and processes, which encompass the sum of our organizational learnings and knowledge, into an executable form that makes them easier to reuse.

One of the best ways to make this knowledge re-usable is by putting it into a centralized source code repository, making the tool available for everyone to search and use.

ArchOps: “enables our engineers to be builders, not bricklayers. By putting our design standards into automated blueprints that were able to be used easily by anyone, we achieved consistency as a byproduct.”

Justin Arbuckle

Create a Single, Shared Source Code Repository For The Entire Organization

A firm-wide, shared source code repository is one of the most powerful mechanisms used to integrate local discoveries across the entire organization.

Put into the shared source code repository not only source code, but also other artifacts that encode knowledge and learning, including:

  • Configuration standards for libraries, infrastructure, and environments
  • Deployment tools
  • Testing standards and tools, including security
  • Deployment pipeline tools
  • Monitoring and analysis tools
  • Tutorials and standards

Spread Knowledge By Using Automated Tests As Documentation and Communities of Practice

When teams have shared libraries being used across the organization, they should enable rapid propagation of expertise and improvements. Ensuring each of these libraries has significant amounts of automated testing included means the libraries become self-documenting and show other engineers how to use them.

The benefit will be nearly automatic if teams practice test-driven development (TDD), where automated tests are written before the code. This discipline turns test suites into a living, up-to-date specification of the system.

Design For Operations Through Codified Non-Functional Requirements

Examples of non-functional requirements include:

  • Sufficient production telemetry in applications and environments
  • The ability to accurately track dependencies
  • Services that are resilient and degrade gracefully
  • Forward and backward compatibility between versions
  • The ability to archive data to manage the size of the production data set
  • The ability to easily search and understand log messages across services
  • The ability to trace requests from users through multiple services
  • Simple, centralized runtime configuration using feature flags and so forth (a sketch follows this list)
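
As one hedged example of the last item, centralized runtime configuration can start as a typed helper that reads feature flags from a single source; the flag names and environment-variable source are invented for illustration:

```typescript
// Minimal feature-flag helper: flags are resolved from one central source at runtime,
// so behavior can change without a redeploy. The env-var source and flag names are
// illustrative assumptions; a real system might call a flag service instead.
const FLAG_DEFAULTS: Record<string, boolean> = {
  "checkout.newPaymentFlow": false,
  "search.useNewRanker": false,
};

export function isEnabled(flag: string): boolean {
  const envKey = "FLAG_" + flag.replace(/\./g, "_").toUpperCase();
  const fromEnv = process.env[envKey]; // e.g. FLAG_CHECKOUT_NEWPAYMENTFLOW=true
  if (fromEnv !== undefined) return fromEnv === "true";
  return FLAG_DEFAULTS[flag] ?? false;
}

// Usage: if (isEnabled("checkout.newPaymentFlow")) { /* new code path */ }
```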

Build Reusable Operations User Stories Into Development

Instead of manually building servers and then putting them into production according to manual checklists, automate as much of this work as possible. Ideally, for all recurring Ops work teams will know the following: what work is required, who is needed to perform it, and what the steps to complete it are.

“We know a high availability rollout takes fourteen steps, requiring work from four different teams, and the last five times we performed this, it took an average of three days.”

The DevOps Handbook

Ensure Technology Choices Help Achieve Organizational Goals

When expertise for a critical service resides only in one team, and only that team can make changes or fix problems, this creates a bottleneck.

The goal is to identify the technologies that:

  • Impede or slow down the flow of work
  • Disproportionately create high levels of unplanned work
  • Disproportionately create large numbers of support requests
  • Are most inconsistent with the desired architectural outcomes (e.g. throughput, stability, security, reliability, business continuity)

Book Club: The DevOps Handbook (Chapter 19. Enable and Inject Learning into Daily Work)


The following is a chapter summary for “The DevOps Handbook” by Gene Kim, Jez Humble, John Willis, and Patrick Debois for an online book club.

The book club is a weekly lunchtime meeting of technology professionals. As a group, the book club selects, reads, and discusses books related to our profession. Participants are uplifted via group discussion of foundational principles & novel innovations. Attendees do not need to read the book to participate.

Background on The DevOps Handbook

More than ever, the effective management of technology is critical for business competitiveness. For decades, technology leaders have struggled to balance agility, reliability, and security. The consequences of failure have never been greater―whether it’s the healthcare.gov debacle, cardholder data breaches, or missing the boat with Big Data in the cloud.

And yet, high performers using DevOps principles, such as Google, Amazon, Facebook, Etsy, and Netflix, are routinely and reliably deploying code into production hundreds, or even thousands, of times per day.

Following in the footsteps of The Phoenix Project, The DevOps Handbook shows leaders how to replicate these incredible outcomes, by showing how to integrate Product Management, Development, QA, IT Operations, and Information Security to elevate your company and win in the marketplace.

The DevOps Handbook

Chapter 19

Institutionalize rituals that increase safety, continuous improvement, and learning by doing the following:

  • Establish a just culture to make safety possible
  • Inject production failures to create resilience
  • Convert local discoveries into global improvements
  • Reserve time to create organizational improvements and learning

When teams work within a complex system, it’s impossible to predict all the outcomes for the actions they take. To enable teams to safely work within complex systems, organizations must become ever better at diagnostics and improvement activities. They must be skilled at detecting problems, solving them, and multiplying the effects by making the solutions available throughout the organization.

Resilient organizations are “skilled at detecting problems, solving them, and multiplying the effect by making the solutions available throughout the organization.” These organizations can heal themselves. “For such an organization, responding to crises is not idiosyncratic work. It is something that is done all the time. It is this responsiveness that is their source of reliability.”

Dr. Steven Spear

Chaos Monkey – a Netflix tool that simulates failures in the system to help build resiliency.

When Netflix first ran Chaos Monkey in their production environments, services failed in ways they never could have predicted or imagined – by constantly finding and fixing these issues, Netflix engineers quickly and iteratively created a more resilient service, while simultaneously creating organizational learnings.
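
In the same spirit, and purely as an illustrative sketch rather than Netflix’s actual implementation, controlled fault injection can start as a wrapper that fails a configurable fraction of calls when a chaos flag is enabled:

```typescript
// Illustrative fault-injection wrapper: when CHAOS_MODE is enabled, a small fraction of
// calls fail with a simulated dependency outage so teams can verify graceful degradation.
// The handler type and environment flag are assumptions for this sketch.
type Handler<Req, Res> = (req: Req) => Promise<Res>;

export function withFaultInjection<Req, Res>(
  handler: Handler<Req, Res>,
  failureRate = 0.01 // 1% of calls fail while chaos mode is on
): Handler<Req, Res> {
  return async (req: Req): Promise<Res> => {
    if (process.env.CHAOS_MODE === "on" && Math.random() < failureRate) {
      throw new Error("injected failure: simulated dependency outage");
    }
    return handler(req);
  };
}
```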

Establish a Just, Learning Culture

When accidents occur (which they undoubtedly will), the goal is for the response to those accidents to be seen as “just.”

“When responses to incidents and accidents are seen as unjust, it can impede safety investigations, promoting fear rather than mindfulness in people who do safety-critical work, making organizations more bureaucratic rather than more careful, and cultivating professional secrecy, evasion, and self-protection.”

Dr. Sidney Dekker

Dr. Dekker calls this notion of eliminating error by eliminating the people who caused the errors the Bad Apple Theory. He asserts this notion is invalid, because “human error is not our cause of troubles; instead, human error is a consequence of the design of the tools that we gave them.” Instead of “naming, blaming, and shaming” the person who caused the failure, our goal should always be to maximize opportunities for organizational learning.

If teams punish the engineer who made a mistake, everyone is deterred from providing the necessary details to get an understanding of the mechanism and operation of the failure, which guarantees that the failure will occur again. Two effective practices that help create a just, learning-based culture are blameless post-mortems and the controlled introduction of failures into production.

Schedule Blameless Post-Mortem Meetings After Accidents Occur

To conduct a blameless post-mortem, the process should include:

(1) Construct a timeline and gather details from multiple perspectives on failures, ensuring teams don’t punish people for making mistakes.
(2) Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures.
(3) Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future.
(4) Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight.
(5) Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-ups.

To enable teams to gain this understanding, the following stakeholders need to be present at the meeting:

  • The people involved in decisions that may have contributed to the problem
  • The people who identified the problem
  • The people who responded to the problem
  • The people who diagnosed the problem
  • The people who were affected by the problem
  • Anyone else who is interested in attending the meeting

The first task in the blameless post-mortem meeting is to record the best understanding of the timeline of relevant events as they occurred. During the meeting and the subsequent resolution, we should explicitly disallow the phrases “would have” or “could have,” as they are counterfactual statements that result from our human tendency to create possible alternatives to events that have already occurred. In the meeting, teams must reserve enough time for brainstorming and deciding which countermeasures to implement.

Publish Post-Mortems As Widely As Possible

Teams should widely announce the availability of the meeting notes and any associated artifacts (e.g., timelines, IRC chat logs, external communications). This information should be placed in a centralized location where the entire organization can access it and learn from the incident. Doing this helps us translate local learnings and improvements into global learnings and improvements.

Etsy’s post-mortem tool, Morgue, records:

  • Whether the problem was due to a scheduled or an unscheduled incident
  • The post-mortem owner
  • Relevant IRC chat logs (especially important for 2 a.m. issues when accurate note-taking may not happen)
  • Relevant JIRA tickets for corrective actions and their due dates (information particularly important to management)
  • Links to customer forum posts (where customers complain about issues)

Decrease Incident Tolerances to Find Ever-Weaker Failure Signals

As organizations learn how to see and solve problems efficiently, they need to decrease the threshold of what constitutes a problem in order to keep learning.

Organizations are often structured in one of two models: (1) a standardized model, where routine and systems govern everything, including strict compliance with timelines and budgets; or (2) an experimental model, where every day every exercise and every piece of new information is evaluated and debated in a culture that resembles a research and design laboratory.

Redefine Failure and Encourage Calculated Risk Taking

To reinforce a culture of learning and calculated risk-taking, teams need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures. “DevOps must allow this sort of innovation and the resulting risks of people making mistakes. Yes, you’ll have more failures in production. But that’s a good thing, and should not be punished.” – Roy Rapoport of Netflix

Inject Production Failures to Enable Resilience and Learning

As Michael Nygard, author of “Release It! Design and Deploy Production-Ready Software”, writes, “Like building crumple zones into cars to absorb impacts and keep passengers safe, you can decide what features of the system are indispensable and build in failure modes that keep cracks away from those features. If you do not design your failure modes, then you will get whatever unpredictable—and usually dangerous—ones happen to emerge.”

Resilience requires that teams first define failure modes and then perform testing to ensure that these failure modes operate as designed. One way to accomplish this is by injecting faults into the production environment and rehearsing large-scale failures to build confidence in recovering from accidents when they occur, ideally without impacting customers.

Institute Game Days to Rehearse Failures

The concept of Game Days comes from the discipline of resilience engineering. Jesse Robbins defines resilience engineering as “an exercise designed to increase resilience through large-scale fault injection across critical systems.” The goal for a Game Day is to help teams simulate and rehearse accidents to give them the ability to practice.

The Game Day process involves:

(1) Schedule a catastrophic event, such as the simulated destruction of an entire data center, to happen at some point in the future.
(2) Give teams time to prepare, to eliminate all the single points of failure and to create the necessary monitoring procedures, failover procedures, etc.
(3) The Game Day team defines and executes drills, such as conducting database failovers or turning off an important network connection to expose problems in the defined processes.
(4) Any problems or difficulties that are encountered are identified, addressed, and tested again.

By executing Game Days, teams progressively create a more resilient service and a higher degree of assurance that they can resume operations when inopportune events occur, as well as creating more learnings and a more resilient organization.

Some of the learnings gained during these disasters included:

  • When connectivity was lost, the failover to the engineer workstations didn’t work.
  • Engineers didn’t know how to access a conference call bridge or the bridge only had capacity for fifty people or they needed a new conference call provider who would allow them to kick off engineers who had subjected the entire conference to hold music.
  • When the data centers ran out of diesel for the backup generators, no one knew the procedures for making emergency purchases through the supplier, resulting in someone using a personal credit card to purchase $50,000 worth of diesel.

Latent defects are the problems that appear only because faults were injected into the system.