Watching the Watchmen

by Dan Abel

Monitoring lets our teams know when something unexpected is happening in our live environment. Our monitors keep watch, so we don’t have to.

However, systems change and so do teams. Information is forgotten, and products and services may be handed over to new teams.

The biggest concerns of a year ago may be yesterday’s news. How should we keep our watchmen relevant?

Taking ownership

I work in the EngSec team at Tes. We took ownership of some existing systems a few months ago. Last week we took the time to look at the monitors that help us take care of the user registration and log-in parts of tes.com.

We discussed our approach and decided to review all of our key monitors.

Monitors should do more than tell us something is broken. They must inform and call us to action. Each monitor should make sense and contain information that the team can act on. If a monitor was no longer giving the team useful information, we would remove it.

We consider monitors to be tests we write against our live environment. Understanding where we have great ‘coverage’ over our live systems is key to letting us move fast, safely.
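
To make the "monitor as a test" idea concrete, here is a minimal sketch in Python. It is an illustration only and does not describe any real monitoring tool: the `Monitor` type, the `evaluate` function, the `login_error_rate` metric, the 5% threshold and the first-step text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Monitor:
    """A hypothetical monitor: an assertion about one live metric."""
    name: str          # what it protects, in the team's own language
    metric: str        # the metric being watched
    threshold: float   # values above this count as a failure
    first_step: str    # where the team should start investigating


def evaluate(monitor: Monitor, observed: float) -> Optional[str]:
    """Return an alert message if the 'test' fails, or None if it passes."""
    if observed <= monitor.threshold:
        return None
    return (
        f"{monitor.name}: {monitor.metric} is {observed:.1%}, "
        f"above the {monitor.threshold:.1%} limit. "
        f"Start here: {monitor.first_step}"
    )


# Illustrative values only, not real figures.
login_errors = Monitor(
    name="Users can log in",
    metric="login_error_rate",
    threshold=0.05,
    first_step="check the log-in service's recent deploys and error logs",
)
```

Calling `evaluate(login_errors, 0.01)` returns `None` because the assertion holds, while a reading above the threshold produces a message that names what is protected and where to start looking.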

We checked each monitor by asking these questions:

  1. Is it clear?
    What is it there to do?
    Does it say clearly what it is protecting, in a language the team understands?

  2. Is it informative?
    Does it give the team useful sources of information to drive the investigation?

  3. Is it actionable?
    Does it help the team know where to start to solve the issue?

  4. Is it useful?
    Is it a key metric for the team and product?
    Is it testing, and ready to alert us about, something that we care about?
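
The four questions above can be recorded per monitor and turned into an outcome. The sketch below is one possible way to do that in Python, not the process we actually followed: the `Review` type, the `decide` function and the exact decision rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Answers to the four review questions for one monitor (hypothetical type)."""
    clear: bool
    informative: bool
    actionable: bool
    useful: bool


def decide(review: Review) -> str:
    """Map the answers to 'remove', 'update' or 'keep'.

    Assumed rule: a monitor that is no longer useful is removed; a useful one
    that fails any other question is updated; otherwise it is kept.
    """
    if not review.useful:
        return "remove"
    if not (review.clear and review.informative and review.actionable):
        return "update"
    return "keep"
```

Under this assumed rule, usefulness is checked first, because there is little value in clarifying a monitor that no longer watches something the team cares about.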

How did it go?

Going through this review was a learning experience. Our team talked a lot and learned from each other. We shared what we knew and discussed which parts of our codebase the alerts support, and how that helps us ship with confidence. We learned new things about the services by seeing the monitoring left behind by the previous teams. In our experience, the monitors were a form of system documentation by example.

We updated and tweaked quite a few monitors, and removed a couple.

Just like deleting someone else’s code from a codebase, removing a monitor feels worrying. No one wants to be the one who decides a monitor is not important, and then misses a critical bug in production that the monitor would have caught!

We believe it is important for a team to own its monitoring and have it set up to work for them. Review sessions helped broaden our understanding and build the courage to take further ownership.

It turned out that the team had a lot to talk about when it comes to watching the Watchmen.