Microservices and Risk
Microservices are trendy and everyone’s using them, but are they just a passing fad, a fashion? Are they here to stay? Probably.
Why? Well, beyond all the headline reasons to move towards microservices, the compelling reason is that they change the nature of risk in software releases in a way that is compatible with the very high speed of development and release that is a driving necessity for businesses that revolve around software. Which is pretty much any business that plans on surviving the age of digital transformation.
Microservices architecture, together with agile development practices, contains risk in any given release, even at relatively small scale (10s or 100s of services), by ensuring that components are small, simple, well understood, and isolated from the greater ecosystem by their encapsulation behind interfaces or contracts.
In theory, the practice of working with small units of change made to small units of well-understood and strongly encapsulated code means that:
- the change itself can be easily and quickly acceptance tested
- system testing is unnecessary for this (or indeed any) release, because the risk of software change is fully contained in the component.
The first of these statements is usually pretty much true. The second statement bears further analysis though. What do we mean by system testing here? Well, there is
- E2E functional smoke test – happy paths, most important functions are not broken
- Full functional regression tests – happy and negative paths, nothing is broken
- Component interop robustness tests – components are well behaved citizens of the service ecosystem – they don’t break their consumers when they are upgraded, they handle properly all the ways in which the components they depend on may respond to their requests, and they are resilient to those dependencies changing in unpredictable and badly behaved ways, failing gracefully if and when they need to fail.
The first piece can and should be done in the CD pipeline: before, as part of, and/or after deployment. And/or continuously, in the form of API monitoring / right-shifted testing.
The second piece is the one that really is made unnecessary by strong functional encapsulation. It is necessary to run full regression, including negative tests, at the component level, and this should be done in a fully automated way in CI on every PR. But it is not necessary to repeat this level of testing at the service mesh level, downstream of component change acceptance. The responsibility for it can and should be fully left-shifted.
The third piece is the interesting one. I call it component interop robustness testing, because that is what it is. The term “contract testing” is starting to be used for it.
What is Contract Testing?
The intent of contract testing is to test that a release of a component does not break the contract between it and its consumers, i.e. that it is safe to release into the mesh. I go a little further with this and add that the component should also be resilient to its downstream dependencies changing the contract on it without notice, i.e. it should not assume that everyone else is as well behaved as it is. If you learn nothing else from driving a car in California, it should be this.
Also, you aren’t really testing the contract. What you are testing is that the component does not break the contract, and that the component is resilient if it encounters fellow citizens on the mesh that are less law-abiding with regards to the contracts they have with it, so that civil society as a whole is resilient to occasional acts of criminality in its midst. Components can of course change the contract that they implement, but well-behaved components do it in a well-behaved way that ensures backward compatibility for all consumers that may not yet have been upgraded to understand the new terms of the updated contract.
What does Contract Testing test for?
- It does not break contract with its consumers. The contract may evolve, and that’s fine, but it must not break consumers who are unaware of the change.
- It fully and completely handles all possible contractually defined responses that may come back from services that it sends requests to.
- It does not break if its downstream dependencies change their contract, regardless of whether they do it in a well-behaved way or not. It is either resilient to unexpected responses or fails gracefully. It does not roll over and stick its feet in the air.
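To make those three properties concrete, here is a minimal Python sketch of a consumer that handles every contractually defined response from a dependency and fails gracefully on everything else. The service, the `get_price` function, and the response schema are all hypothetical illustrations, not taken from any real component:

```python
# Hypothetical sketch: a consumer of a "pricing" dependency that handles
# every contractually defined response and degrades gracefully otherwise.
# The contract shape below is an assumption made for illustration:
#   {"status": "ok", "price": <number>}  -> price found
#   {"status": "not_found"}              -> unknown SKU
#   anything else                        -> off-contract; fail gracefully

def get_price(pricing_service, sku):
    """Return a price, or None, without ever crashing the caller."""
    try:
        response = pricing_service(sku)
    except Exception:
        return None  # dependency is down or misbehaving: degrade, don't die

    if not isinstance(response, dict):
        return None  # off-contract response shape
    if response.get("status") == "ok" and isinstance(response.get("price"), (int, float)):
        return response["price"]
    if response.get("status") == "not_found":
        return None
    # Any other response is outside the contract: fail gracefully,
    # don't roll over and stick your feet in the air.
    return None
```

The point of the sketch is the shape of the error handling: every branch of the contract is explicit, and everything outside it maps to graceful degradation rather than an unhandled exception.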
The question then is how and when to do this testing. You have three choices for this.
1. Left-shifted component interop testing. The responsibility for ensuring that components released into the service mesh are well behaved with regards to their interop with other components rests with the agile team producing the component.
2. Traditional system & integration testing. The responsibility for ensuring the application / service mesh as a whole is working lives with a team downstream of the component producing agile team.
3. Make like an ostrich. Pretend there is no problem and don’t make anyone own this.
Option 3 is particularly popular (but stupid). Option 2 is possible at low scale, as it was with monolith-based apps. But at high scale it is simply impossible: you potentially have 1000s or tens of 1000s of services in the mesh, constantly changing. You cannot test the interop of that many things that won’t stand still long enough to be tested.
Option 1 is the only rational approach that scales.
A Closer Look at Contract Testing
Let’s say I have a simple mesh of services that interop with each other in the way shown to the right. We’ll concentrate on A, C and D for the discussion, but know that A is not the only consumer of C’s contract, and D is not C’s only downstream dependency.
Now, some of the discussion around contract testing is line-centric: trying to test the lines on this picture. This, in my view, is wrong-headed, because you don’t actually release lines. You release boxes. So you really need to test boxes, and the fact that they are well behaved with regards to the lines that are relevant to them. Hence I take a box-centric approach and lens to this.
Now, A, C and D are owned and maintained by three separate teams, helpfully named Team A, Team C and Team D. They are good little agile teams that release new versions of their cute little components at a frequency unheard of in the bygone age of the monolith. However, some are more frequent than others; they don’t release on the same dates or at the same frequency. Team A releases every 2 weeks, religiously. Team C releases every week, equally religiously, and is threatening a religious war with Team A. Team D releases whenever it feels it has something ready to be released, which can be several times a day at times, and then nothing for a month. You never know with them.
Their release trains look something like this, where the circles represent their releases, the upward arrows represent their testing of their upstream interop with their consumers (that they are not breaking their contract), and the downward arrows represent their testing of their downstream interop with their dependencies (that they fully handle all the conditions that may arise under the relevant contracts, and that they are resilient to those dependencies breaking contract).
Now, the upward arrows are API tests. Team C needs API tests that fully exercise the contract that component C implements and check that it implements it properly. This is not just happy-path testing; it includes negative testing, of course. Part of the contract is how C handles badly formed requests of various natures. Further, if Team C knows how its consumers (A and B) actually call it, i.e. which bits of the interface it exposes are actually used by its consumers, then it can clearly prioritize those use cases above others that are less used or never used. If they don’t know this, they have to fully regression test the interface with API tests.
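The upward arrows might look like the following sketch: API tests for Team C that exercise both the happy path and the contractually defined rejection of malformed requests. `handle_request` and its request/response shapes are hypothetical stand-ins for component C’s real endpoint, chosen purely for illustration:

```python
# Hypothetical stand-in for component C's endpoint handler. The assumed
# contract: a request must carry a non-empty string 'order_id';
# anything else gets a contractually defined 400 response.

def handle_request(request: dict) -> dict:
    order_id = request.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        return {"status": 400, "error": "order_id is required"}
    return {"status": 200, "order_id": order_id, "state": "accepted"}

# The API tests Team C runs on every release of C: happy path plus
# negative paths, because how C rejects bad requests is part of the contract.

def test_happy_path():
    resp = handle_request({"order_id": "A-123"})
    assert resp["status"] == 200 and resp["state"] == "accepted"

def test_missing_field_is_rejected_per_contract():
    resp = handle_request({})
    assert resp["status"] == 400 and "error" in resp

def test_wrong_type_is_rejected_per_contract():
    resp = handle_request({"order_id": 42})
    assert resp["status"] == 400
```

If Team C knows which parts of the interface A and B actually use, these tests would be prioritized around those call patterns first.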
The downward arrows are unit tests wrapped around virtual services. They have to be virtual services, not real services. Why? Because real services are hard enough to manipulate into providing all of their possible responses under the contract, and they definitely don’t misbehave on demand in ways that violate it. To really test downstream interop robustness you must be able to simulate your downstream dependencies acting in all the ways defined in the contract, and in some that are outside of it. So Team A needs a virtual service that can pretend to be component C and be manipulated to respond in whatever way the interop tests require.
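A minimal sketch of what such a virtual service could look like, assuming in-process calls rather than real HTTP for brevity. `VirtualServiceC`, `component_a_handler` and the response shapes are all hypothetical names invented for this illustration:

```python
# Hypothetical programmable stub standing in for component C. Team A can
# force it into any contract-defined response, or make it misbehave
# off-contract, to test A's downstream interop robustness.

class VirtualServiceC:
    def __init__(self):
        # Default: a well-behaved, contract-conforming response (assumed shape).
        self._canned = {"status": "ok", "payload": {}}

    def respond_with(self, response):
        """Manipulate the stub: subsequent calls return exactly this response."""
        self._canned = response
        return self

    def misbehave(self, exc=TimeoutError):
        """Simulate off-contract behavior: the dependency blows up on call."""
        def _raise(_request):
            raise exc()
        self.call = _raise  # shadow the method on this instance
        return self

    def call(self, _request):
        return self._canned

# Component A's (illustrative) logic, which must survive whatever C does.
def component_a_handler(dependency_c, request):
    try:
        resp = dependency_c.call(request)
    except Exception:
        return {"status": "degraded"}  # graceful failure, not a crash
    if isinstance(resp, dict) and resp.get("status") == "ok":
        return {"status": "ok"}
    return {"status": "degraded"}
```

The unit tests then simply wrap `component_a_handler` around differently manipulated instances of the stub: one per contract-defined response, plus a few for off-contract misbehavior.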
Getting Smart About This
The observant will note that for any consumer-producer pair in the mesh, let’s stick with the A-C pair for now, the transactions that flow between them define the “de facto contract”. And that if you know those, you can use them to generate both:
- the API Tests for C to use to test its upstream interop whenever they release a new version of C
- the virtual services for A to use to test its handling of the contract with C whenever they release a new version of A.
And that, as long as all these teams do this, the relevant contracts will be continuously tested from both ends, across time.
So with a shared transaction repository, populated by recording the interactions in test or production, or from API specs, the tests for the up arrows and the virtual services for the down arrows can be auto-generated.
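A toy sketch of that auto-generation idea, assuming a very simple record shape for the shared transaction repository. The record format, `generate_api_tests` and `generate_virtual_service` are all invented for illustration; real tooling would be considerably richer:

```python
# Hypothetical shared repository of recorded A<->C transactions
# (the de facto contract). The record shape is an assumption.
RECORDED_TRANSACTIONS = [
    {"request": {"order_id": "A-1"},
     "response": {"status": 200, "state": "accepted"}},
    {"request": {},
     "response": {"status": 400, "error": "order_id is required"}},
]

def generate_api_tests(transactions, component_under_test):
    """Up arrows, for Team C: replay each recorded request against the new
    build of C and collect any responses that drift from the de facto contract."""
    failures = []
    for t in transactions:
        actual = component_under_test(t["request"])
        if actual != t["response"]:
            failures.append((t["request"], t["response"], actual))
    return failures

def generate_virtual_service(transactions):
    """Down arrows, for Team A: a stub of C that answers recorded requests
    with recorded responses, so A can test its handling of the contract."""
    table = {tuple(sorted(t["request"].items())): t["response"]
             for t in transactions}
    def stub(request):
        return table.get(tuple(sorted(request.items())), {"status": 404})
    return stub
```

Both ends of the A-C contract are then exercised from the same source of truth: C replays the requests, A replays the responses.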
Scale is the Kicker
For any given service pair, this is all pretty cool.
But if you take it to the real world, where the service dependency graph is orders of magnitude more complex than this example and ever changing, it’s not a case of being cool, it’s basic hygiene. It is a critical competency to have if you want to competently manage risk in a scaled, microservices-based service ecosystem.
And if you want to do it efficiently, consistently and collaboratively across many engineering teams in the organization, you will absolutely need tooling to manage the transactions and contracts, and auto-generate component interop tests and virtual services to support this type of testing.