Shadow testing saved us from a silent rollout disaster
We rewrote a service with zero documentation and passed all the tests. The tests were not enough.
We inherited a service nobody fully understood. The original authors had moved on, documentation was sparse, and the rules engine underneath (Drools) had reached end of life. The team spent two quarters reverse-engineering the behavior, designing a replacement, and writing tests against everything they could observe.
The automated tests passed. The team was confident. The rollout review was scheduled.
I had just joined the team, and something about the confidence level made me uneasy.
The problem with testing against what you know
When you rewrite a service that has no documented business requirements, your tests only cover the behavior you discovered. Every test is an assertion about something an engineer noticed and wrote down. The gap between “what we tested” and “what the service actually does in production” is unknowable until production traffic hits it.
I proposed shadow testing: deploy the new implementation alongside the old one, fork production requests asynchronously to both, and log every difference. No customer impact, no risk, just data.
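Concretely, the fork I was proposing looks something like this. A minimal sketch, assuming a Python front end and the `requests` library; the internal URLs and the `handle` entry point are hypothetical stand-ins, not our actual service:

```python
import json
import logging
import threading

import requests  # assumed available in the service's environment

log = logging.getLogger("shadow")

LEGACY_URL = "http://legacy-service.internal/evaluate"   # hypothetical endpoints
REWRITE_URL = "http://rewrite-service.internal/evaluate"

def handle(request_body: dict) -> dict:
    """Serve from the legacy implementation; fork a copy to the rewrite."""
    live = requests.post(LEGACY_URL, json=request_body, timeout=2).json()

    def shadow():
        try:
            candidate = requests.post(REWRITE_URL, json=request_body, timeout=2).json()
            if candidate != live:
                # Log the full pair so discrepancies can be triaged offline.
                log.warning("shadow-diff %s", json.dumps(
                    {"request": request_body, "live": live, "shadow": candidate}))
        except Exception:
            # A shadow failure must never touch the live response path.
            log.exception("shadow call failed")

    threading.Thread(target=shadow, daemon=True).start()
    return live
```

The fire-and-forget thread is the point: the shadow call cannot slow down or fail the customer-facing response.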
The team pushed back. They had spent months on this. Shadow testing would delay the rollout by weeks. The manager assured me the test coverage was comprehensive.
Building the case without the argument
I did not want to make this adversarial. I asked to run a quick spot check: deploy the new service to production without routing traffic to it, crawl through logs for recent request/response pairs, replay them against the new implementation, and compare outputs.
One week of work. A DynamoDB table to store the samples. A Lambda function to replay and record. Crude but effective.
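The replay piece really can be that small. A sketch under stated assumptions: samples are stored as JSON strings in DynamoDB, and the table names, `sample_id` key, and rewrite URL are all hypothetical:

```python
import json
import urllib.request

import boto3

dynamodb = boto3.resource("dynamodb")
samples = dynamodb.Table("shadow_samples")   # hypothetical table of logged pairs
results = dynamodb.Table("shadow_results")   # hypothetical table for recorded diffs

REWRITE_URL = "http://rewrite-service.internal/evaluate"  # hypothetical endpoint

def lambda_handler(event, context):
    """Replay sampled production requests against the rewrite and record diffs."""
    scan = samples.scan(Limit=500)  # crude single-page scan is fine for a spot check
    diffs = 0
    for item in scan["Items"]:
        request_body = json.loads(item["request"])
        expected = json.loads(item["response"])

        req = urllib.request.Request(
            REWRITE_URL,
            data=json.dumps(request_body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            actual = json.loads(resp.read())

        if actual != expected:
            diffs += 1
            results.put_item(Item={
                "sample_id": item["sample_id"],
                "expected": json.dumps(expected),
                "actual": json.dumps(actual),
            })
    return {"replayed": len(scan["Items"]), "diffs": diffs}
```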
The comparison found that just under 2% of responses differed between the old and new implementations. But that 2% represented real production scenarios no automated test covered: edge cases in how certain payment instrument types were validated, locale-specific formatting differences, a handful of error codes that mapped differently.
The full shadow test
That 2% was enough to justify the full shadow testing framework. We deployed it to production, soaking the new implementation for several sprints while collecting every discrepancy. We addressed the ones with customer impact, then toggled traffic to the new service while keeping the old one running in shadow mode as a safety net.
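Reverse shadow mode is the same fork with the roles swapped, plus a dial for the traffic split. A hedged sketch; `ROLLOUT_PCT` and the URLs are illustrative, and a real toggle would more likely live in routing config than in application code:

```python
import random
import threading

import requests  # same assumption as the forking sketch above

LEGACY_URL = "http://legacy-service.internal/evaluate"   # hypothetical endpoints
REWRITE_URL = "http://rewrite-service.internal/evaluate"
ROLLOUT_PCT = 10  # hypothetical dial, raised sprint over sprint

def route(request_body: dict) -> dict:
    """Serve a growing slice from the rewrite; shadow whichever side is not live."""
    if random.uniform(0, 100) < ROLLOUT_PCT:
        primary_url, shadow_url = REWRITE_URL, LEGACY_URL
    else:
        primary_url, shadow_url = LEGACY_URL, REWRITE_URL

    live = requests.post(primary_url, json=request_body, timeout=2).json()

    def shadow():
        try:
            requests.post(shadow_url, json=request_body, timeout=2)
        except Exception:
            pass  # a shadow failure must never affect the live path

    threading.Thread(target=shadow, daemon=True).start()
    return live
```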
The approach I now recommend for any service rewrite without complete documentation:
- Deploy new implementation dark (no live traffic)
- Replay sampled production requests against both implementations
- Diff the outputs automatically, categorize discrepancies by severity (a triage sketch follows this list)
- Fix the high-impact differences, document the acceptable ones
- Toggle traffic gradually while maintaining reverse shadow testing
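The diff-and-categorize step is the one worth getting right, because it decides what blocks the rollout. A minimal sketch of the triage, assuming JSON responses; the field names and rules are illustrative, not the ones we actually used:

```python
from enum import Enum

class Severity(Enum):
    ACCEPTABLE = "acceptable"   # known, documented difference
    LOW = "low"                 # cosmetic, e.g. locale formatting
    HIGH = "high"               # changes what the caller actually sees

# Illustrative allowlists of fields whose differences we would tolerate.
ACCEPTABLE_FIELDS = {"timestamp", "request_id"}
COSMETIC_FIELDS = {"display_amount"}

def categorize(old: dict, new: dict) -> dict:
    """Return a severity for every field that differs between the two responses."""
    verdicts = {}
    for field in old.keys() | new.keys():
        if old.get(field) == new.get(field):
            continue
        if field in ACCEPTABLE_FIELDS:
            verdicts[field] = Severity.ACCEPTABLE
        elif field in COSMETIC_FIELDS:
            verdicts[field] = Severity.LOW
        else:
            verdicts[field] = Severity.HIGH
    return verdicts
```

Defaulting every unexplained field to HIGH is the important choice: an unrecognized difference should block the toggle until a human sorts it into one of the other buckets.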
The total delay was about three sprints. The alternative was shipping a rewrite that would have silently broken 2% of production cases, the kind of failure that erodes trust slowly because it does not trigger alerts.