Site Reliability Engineering (SRE) has been gaining popularity as a way to improve system reliability and as a prescriptive approach to implementing DevOps.
Site Reliability Engineers (or SREs) use techniques such as Service Level Objectives (SLOs) and Error Budgets (EBs) to quantify the risk tolerance for systems and services, as well as to balance the needs of velocity and system stability and reliability.
Similarly, testers play a key role in balancing the needs of velocity with overall system quality. We feel that these approaches can be synergized through better collaboration between developers, testers, and SREs, and by leveraging each other’s practices.
In this blog we will explore (a) the synergies between SREs and testers and (b) how testers can work with SREs and development teams to balance the needs of velocity and quality.
Error Budgets Overview
An error budget is the amount of unreliability that a service’s SLO permits over a given window: a 99.9% availability SLO, for example, leaves a 0.1% budget for failures. Error budgets allow SREs to balance velocity against stability. As long as there is sufficient room in the error budget, teams prioritize new feature development and frequent deployments. However, as the error budget nears exhaustion, teams slow down (or stop) new feature development and deployment, and focus more on system hardening and testing.
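To make the mechanics concrete, here is a minimal sketch of a request-based error budget and the policy it drives. The class name, the 25% "slow down" threshold, and the three policy labels are illustrative assumptions, not something prescribed by SRE practice or by this article.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed so far in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # Rounding avoids float noise such as 1000.0000000000009.
        return round((1.0 - self.slo_target) * self.total_requests, 6)

    @property
    def remaining_fraction(self) -> float:
        # No budget exists (100% target or no traffic yet): report it as empty.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget, slow_down_below: float = 0.25) -> str:
    """Map the remaining budget to a hypothetical release policy."""
    remaining = budget.remaining_fraction
    if remaining <= 0:
        return "freeze"   # budget exhausted: stop feature releases, harden
    if remaining < slow_down_below:
        return "slow"     # budget nearly spent: release less often
    return "normal"       # plenty of budget: prioritize features


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(release_policy(budget))
```

With a 99.9% target and 1,000,000 requests, 1,000 failures are allowed, so 400 failures leave about 60% of the budget and the policy stays at "normal".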
Synergies Between SREs and Testers
The EB approach used by SREs is analogous to how testers use overall application quality and release risk to modulate velocity. Since reliability is only one of many overall quality metrics, the EB approach is in fact a special case of modulating velocity based on quality. Therefore, this approach can (and should) be synergized with overall QA modulation.
Before we discuss the unified approach, let’s first discuss the synergies between the roles of SREs and testers, especially software (development) engineers in test (SDET or SET). Both have their roots in software development, and therefore have much in common.
Some of the key points of commonality include the following:
- Software Engineering approach: Both SREs and SETs are software engineers. They bring core software engineering approaches to their domain and work closely with the application development team. These include practices such as “everything-as-code” (such as configurations or tests), version control of assets, and white-box focus.
- Shift-Left: Both SREs and SETs help to shift their respective disciplines left, to earlier in the lifecycle, so that reliability and quality are built in. This includes architecture quality, configuration quality, and early monitoring (see Figure below).
- Overlap on Shift-Right activities: Increasingly, testers are also practicing “Shift-Right”, which overlaps with SRE activities. These include chaos engineering, canary testing, A/B testing, and extraction of insights from operational data.
- Toil reduction through automation: Both SREs and SETs actively drive automation efforts in their domain to remove waste and reduce cycle time and errors.
- Technical debt ownership: An error budget can be viewed as a form of software technical debt. Like technical debt, it is used to trigger decisions on hardening and velocity.
- Velocity with safety: Finally, both roles help to balance the needs of velocity with reliability and quality. As we have discussed, reliability is a subset of overall system quality.
Synergizing Error Budgets and Release Quality to Modulate Velocity
Testers and QA professionals use a variety of techniques and measures to assess release quality and risk. These include code quality, batch size, functional and non-functional requirements coverage (through tests), defect detection and removal, UX/CX, compliance, supportability, technical debt, etc.
Various approaches exist to quantify release risk based on the measures from the above techniques. Organizations make business decisions to proceed with releases despite risk. However, deficiencies in each of these measures add up to the quality debt of an application. As quality debt increases, the risk of releasing software progressively increases. At some point, the risk threshold is crossed, and releases are slowed or halted to allow time for remediation or hardening.
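As one hypothetical way to quantify this, quality debt could be modeled as a weighted sum of per-measure deficiencies, with thresholds that slow or halt releases. The measure names, weights, and thresholds below are assumptions chosen purely for illustration; real values would come from each organization's own risk appetite.

```python
from typing import Mapping

# Hypothetical weights per quality measure (assumed for this sketch; sum to 1).
WEIGHTS = {
    "code_quality": 0.25,
    "test_coverage": 0.30,
    "open_defects": 0.30,
    "compliance": 0.15,
}


def quality_debt(deficiencies: Mapping[str, float]) -> float:
    """Weighted sum of deficiencies, each in [0, 1] where 0 means no gap."""
    for name, value in deficiencies.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown quality measure: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Measures that are not reported are treated as having no deficiency.
    return sum(w * deficiencies.get(name, 0.0) for name, w in WEIGHTS.items())


def release_decision(debt: float, slow_at: float = 0.4, halt_at: float = 0.7) -> str:
    """Map accumulated quality debt to a release decision."""
    if debt >= halt_at:
        return "halt"     # risk threshold crossed: remediate before releasing
    if debt >= slow_at:
        return "slow"     # debt is building: reduce release cadence
    return "proceed"      # risk is acceptable: release as planned


debt = quality_debt({"code_quality": 0.2, "test_coverage": 0.5, "open_defects": 0.1})
print(round(debt, 2), release_decision(debt))
```

A weighted sum is the simplest aggregation; an organization might instead let any single severe deficiency (for example, a compliance gap) halt a release on its own.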
Clearly, this is analogous to how SREs use error budgets. Therefore, it makes sense to use them in a combined, synergistic manner.
In this unified approach, velocity is modulated by a combination of error budget and release risk. This provides a more holistic view of balancing velocity and quality, and essentially subsumes reliability as one measure of overall quality.
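The combination could be sketched as follows, assuming the remaining error budget is expressed as a fraction and release risk as a score in [0, 1]. The function name, the risk threshold, and the rule that the more constrained signal wins are all assumptions of this sketch, not an established model.

```python
def velocity_mode(
    error_budget_remaining: float,
    release_risk: float,
    risk_threshold: float = 0.7,
    slow_down_below: float = 0.25,
) -> str:
    """Pick a release mode from two signals (hypothetical policy).

    error_budget_remaining: fraction of the error budget left (<= 0 if exhausted).
    release_risk: aggregate release risk score in [0, 1].
    """
    # Express release risk as headroom too, so both signals are comparable:
    # 1.0 means no risk accumulated, 0.0 means the risk threshold is reached.
    risk_headroom = 1.0 - release_risk / risk_threshold

    # The more constrained of the two signals decides the mode.
    headroom = min(error_budget_remaining, risk_headroom)

    if headroom <= 0:
        return "halt"
    if headroom < slow_down_below:
        return "slow"
    return "normal"


print(velocity_mode(error_budget_remaining=0.6, release_risk=0.2))  # normal
print(velocity_mode(error_budget_remaining=0.6, release_risk=0.6))  # slow
```

Taking the minimum means a healthy error budget cannot mask a risky release, and a low-risk release cannot mask an exhausted error budget; a weighted blend of the two would be a softer alternative.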
Summary and Looking Forward
Hopefully, this article has provided readers with some insight into the synergies between SREs and testers and how an integrated approach can be used for modulating velocity. We don’t yet have well-defined models for “release risk budgets”; however, we can define those along the lines of SRE error budgets. Stay tuned for more posts on this subject.