September 05, 2019
by Dan Abel
Here's a real-life tale of how we used Application Monitoring, Observability and Graceful Degradation to ship fast while still catching and fixing mistakes without letting our users down.
Along the way we look at safe failure states, at complementing metrics with supporting data, and at how we used both to solve a real issue.
Everything was fine and normal on Tes.com. Things got shipped. But then, things always get shipped - we ship a lot.
The team that looks after the AAA services got an alert in their chat room. Something was up with our log-in service. Indications were that there was a fault on the log-in page.
The first question we always have to ask is: 'how is this issue affecting our users?' The good news was that in this case, it wasn't affecting them at all.
Our log-in pages are the gateway to our products, and they need to be reliable. Not only do they need to keep people out who shouldn’t have access, they need to be consistent in letting people in, even when we have issues. Because of this, we’ve put some thinking into their operation, and have fallbacks in place.
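The fallback pattern described above can be sketched in a few lines. This is a minimal illustration, not Tes's actual code: the function and metric names (`render_login_page`, `login.html_fallback`) are assumptions made up for the example. The idea is simply that when the enhanced log-in path fails, we serve a bare HTML form instead of turning users away, and count the fallback so monitoring can see it.

```python
def render_login_page(render_enhanced, render_basic_html, record_metric):
    """Try the rich log-in experience; fall back to plain HTML on error."""
    try:
        return render_enhanced()
    except Exception:
        # Count each fallback so monitoring can alert on a sharp rise.
        record_metric("login.html_fallback")
        return render_basic_html()


def broken_enhanced():
    # Stand-in for a rich log-in widget that fails to render.
    raise RuntimeError("widget failed to load")


metrics = []
page = render_login_page(
    broken_enhanced,
    lambda: "<form method='post'>…</form>",
    metrics.append,
)
```

The key design point is that the failure state is safe: the user still gets a working log-in form, and the metric tick is what tells us something went wrong.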
We absolutely test our code before we ship it, but we acknowledge that we get more valuable feedback from putting software into users' hands than from testing every permutation and waiting to ship.
Some behaviour only really gets tested when software comes into contact with users, so we need monitoring to tell us about problems. Monitoring is not enough, though: we also need safe failure states, so that we can keep offering a service even when things go wrong.
Our approach to monitoring is usually to check that users are able to reach their goal. To do that we monitor that the rate of log-ins does not drop significantly. If our degraded service works, then this should not change.
So how did we know we had a problem? We expected something like this to happen, so we also monitored the rate of use of the bare HTML log-in - and alerted when there was a sharp rise.
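Together, those two signals make a simple pair of alert rules: fire when the overall log-in rate drops sharply, or when the HTML fallback rate spikes. Here is an illustrative sketch - the thresholds and function name are assumptions for the example, not Tes's real alerting config.

```python
def check_login_health(total_per_min, baseline_total, html_per_min, baseline_html):
    """Return a list of alerts based on log-in rates.

    Fires when overall log-ins drop well below baseline (users can't get
    in), or when bare-HTML fallback log-ins rise sharply above baseline
    (the rich log-in page is broken, but the fallback is absorbing it).
    """
    alerts = []
    if total_per_min < 0.5 * baseline_total:      # significant drop in log-ins
        alerts.append("login rate dropped")
    if html_per_min > 3 * max(baseline_html, 1):  # sharp rise in HTML fallback use
        alerts.append("HTML fallback rate spiked")
    return alerts
```

In the incident described here, only the second rule fired: total log-ins held steady because the fallback worked, while fallback use spiked.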
The second question we ask is: 'how is this issue arising from our software (and what's the fix)?'
From our instrumentation and logs, I could see when the issue started, and that it coincided with a release. This made it simple to find the code change, but I still did not know what was broken. It was only a problem for a percentage of our users, so I needed to be crafty to find it.
I don't like bare metrics - they might indicate what is occurring, but they don't provide any of the 'why'. I like to log additional data alongside each metric tick, providing the supporting information that lets the metrics drive us forward.
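As a sketch of that idea: instead of emitting a bare counter, emit a structured log line carrying the metric name plus context. The helper name and fields here are illustrative assumptions, not a real library API.

```python
import json
import time


def record_login(metric_name, log, **context):
    """Emit a metric tick as a structured log line that also carries the
    'why': context such as user agent and log-in method that a bare
    counter cannot give you."""
    log.append(json.dumps({
        "metric": metric_name,
        "ts": time.time(),
        **context,  # e.g. user_agent, login_method
    }))


log_lines = []
record_login(
    "login.success",
    log_lines,
    login_method="html_fallback",
    user_agent="Mozilla/5.0 (X11; Linux x86_64)",
)
```

The counter still drives dashboards and alerts, but when an alert fires, the attached context is already there to dig into.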
The great news was that we record browser fingerprints when people log in. I could filter those for users on the bare HTML log-in, then parse the user agent, and from there it was a small step to reproducing the broken page in the right browser and getting a fix shipped.
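That filter-and-tally step looks something like this. It is a sketch under assumed field names (`login_method`, `user_agent`), not the real pipeline: filter the structured logs down to the fallback path, then count user agents to see which browser family is hitting the bug.

```python
import json
from collections import Counter


def top_user_agents(log_lines, method="html_fallback"):
    """Filter structured log-in logs to the given log-in method and tally
    user agents, most common first."""
    agents = Counter()
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("login_method") == method:
            agents[entry.get("user_agent", "unknown")] += 1
    return agents.most_common()


# Illustrative sample log lines.
sample = [
    json.dumps({"login_method": "html_fallback", "user_agent": "Safari 9"}),
    json.dumps({"login_method": "html_fallback", "user_agent": "Safari 9"}),
    json.dumps({"login_method": "enhanced", "user_agent": "Chrome 76"}),
]
```

With the offending browser at the top of the list, reproducing the bug locally becomes straightforward.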
Shipping small changes often has many advantages, but testing in Production ain't for free.
Use these techniques and you can keep fixing problems as you go. Keep shipping, and keep your users happy.