Testing in Production isn’t free - a real-life tale

by Dan Abel

Here’s a real-life tale of how we used Application Monitoring, Observability and Graceful Degradation to ship fast while still catching and fixing mistakes without letting our users down.

In it we look at safe failure states, at complementing metrics with supporting data, and at how we use both to solve real issues.

Let me take you back to 6th June.

Everything was fine and normal on Tes.com. Things got shipped. But then, things always get shipped - we ship a lot.

The team that looks after the AAA services got an alert in their chat room. Something was up with our log-in service. Indications were that there was a fault on the log-in page.

[Image: an alert that there were too many HTML sign-ins]

The first question we ask

The first question we always have to ask is: ‘how is this issue affecting our users?’ The good news was that in this case it wasn’t affecting them at all.

Our log-in pages are the gateway to our products, and they need to be reliable. Not only do they need to keep out people who shouldn’t have access, they need to be consistent in letting people in, even when we have issues. Because of this, we’ve put some thought into their operation, and have fallbacks in place.

We absolutely test our code before we ship it, but we acknowledge that we value the feedback from getting software into users’ hands more than the assurance we gain from testing every permutation and waiting to ship.

When software is only really tested once it comes into contact with users, we need monitoring to tell us about problems. Monitoring is not enough, though; we also need safe failure states, so we can keep offering services even when things go wrong.

When the JavaScript fails on log-in, we degrade the service and offer users a plain HTML log-in page instead. This gets them in and on to the product they want to use.
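To make that concrete, here’s a minimal sketch of the idea (illustrative only - not our actual code; the #login-form id and the /login and /api/login endpoints are assumptions). The plain HTML form works on its own, and the script only enhances it, so a JavaScript failure drops the user back to an ordinary form post.

```typescript
// Sketch of a progressively-enhanced log-in (browser-side TypeScript).
// The underlying HTML form is assumed to look something like:
//
//   <form id="login-form" method="post" action="/login">
//     <input name="username" type="email" required />
//     <input name="password" type="password" required />
//     <button type="submit">Log in</button>
//   </form>
//
// With no JavaScript at all, that form still posts to /login and gets the user in.

const form = document.querySelector<HTMLFormElement>('#login-form');

if (form) {
  form.addEventListener('submit', async (event) => {
    event.preventDefault();
    try {
      // Enhanced path: submit the credentials as JSON (assumed endpoint).
      const response = await fetch('/api/login', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(Object.fromEntries(new FormData(form))),
      });
      if (!response.ok) throw new Error(`log-in failed: ${response.status}`);
      window.location.assign('/');
    } catch {
      // Degraded path: hand back to the plain HTML form submission.
      // form.submit() bypasses this handler, so there is no loop.
      form.submit();
    }
  });
}
```

The important property is that the fallback path is the browser’s own form submission, which needs no JavaScript at all to succeed.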

Wait. But if it doesn’t break… how do you know it’s broken?

Our approach to monitoring is usually to check that users are able to reach their goal. To do that, we monitor that the rate of log-ins does not drop significantly. If our degraded service works, that rate should not change.

So how did we know we had a problem? We expected something like this to happen, so we also monitored the rate of HTML log-ins - and alerted when there was a sharp rise.
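As a rough illustration (the metric names and the hot-shots StatsD client are my assumptions, not necessarily what we run), the two signals look something like this: one counter per log-in path, an alert on a significant drop in the overall rate, and another on a sharp rise in the HTML fallback rate.

```typescript
// Illustrative server-side counters for the two log-in paths (assumed names).
import { StatsD } from 'hot-shots'; // any StatsD-style client would do

const metrics = new StatsD({ prefix: 'auth.' });

export function recordLogin(path: 'js' | 'html'): void {
  // Ticks auth.login.js or auth.login.html.
  metrics.increment(`login.${path}`);
}

// The alerting rules then sit on top of these counters, roughly:
//   - page if rate(auth.login.js) + rate(auth.login.html) drops well below its baseline
//   - alert if rate(auth.login.html) rises sharply, even while total log-ins look healthy
```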

The second question we ask

The second question we ask is: how is this issue arising from our software (and what’s the fix)?

From our instrumentation and logs, I could see when the issue started, and that it coincided with a release. This made it simple to find the code change, but I still did not know what was broken. It was only a problem for a percentage of our users, so I needed to be crafty to find it.

Metrics + supporting data let you observe your system

I don’t like bare metrics - they might indicate what is occurring, but they don’t provide any of the ‘why’. I like to log additional data whenever I send a metric tick, providing the supporting information that lets the metrics drive us forward.
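In practice that means writing a structured log line at the same moment the counter ticks. A hedged sketch, with pino standing in for whatever logger you use and the field names invented for illustration:

```typescript
// One event, two outputs: a bare counter for dashboards and alerts,
// plus a structured log line carrying the context (the "why") for later.
import { StatsD } from 'hot-shots';
import pino from 'pino';

const metrics = new StatsD({ prefix: 'auth.' });
const log = pino();

export function recordHtmlLogin(userAgent?: string, fingerprint?: string): void {
  metrics.increment('login.html'); // the metric tick
  log.info(
    { event: 'login.html', userAgent, fingerprint }, // the supporting data
    'html fallback log-in'
  );
}
```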

The great news was that we record browser fingerprints when people log in. I could filter that data down to users of the bare HTML log-in, then parse the user agent, and from there it was a small step to reproduce the broken page in the right browser and get a fix shipped.
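That triage step amounts to filtering the log lines down to the HTML fallback events and counting user agents. A sketch of that, assuming newline-delimited JSON logs shaped like the illustrative recordHtmlLogin above (the logins.ndjson filename is made up):

```typescript
// Count user agents across HTML fallback log-ins from an NDJSON log export.
import { readFileSync } from 'node:fs';

type LoginEvent = { event?: string; userAgent?: string };

const lines = readFileSync('logins.ndjson', 'utf8').split('\n').filter(Boolean);

const counts = new Map<string, number>();
for (const line of lines) {
  const entry = JSON.parse(line) as LoginEvent;
  if (entry.event !== 'login.html' || !entry.userAgent) continue;
  counts.set(entry.userAgent, (counts.get(entry.userAgent) ?? 0) + 1);
}

// The most common user agents among fallback log-ins point straight at the
// browser (and version) to reproduce the broken page in.
const topOffenders = [...counts.entries()]
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5);

console.table(topOffenders.map(([userAgent, count]) => ({ userAgent, count })));
```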

Move fast and pay attention

Shipping small changes often has many advantages, but testing in Production ain’t free.

  • Build fallback states that degrade gracefully to a point where your systems still work. This helps your users, and it helps you react without panic.

  • Don’t monitor everything, but be sure to monitor what’s critical, both for the user and for the team managing the product.

  • Remember to instrument the detail when you set up a metric - it will help you work out what’s at fault when you have an issue.

Use these and you can keep fixing your problems as you go. Keep shipping, and keep your users happy.