Tes Engineering Blog

Musings of the Tes Engineering Team


June 20, 2019

Exceptional Alerts - Instrumenting and Observing Part 3

by Dan Abel

Observing what happens when your users interact with your software keeps you from disaster, allowing your users to keep working and you to keep shipping.

At Tes we capture what happens when our users interact with our services. We set expectations on outcomes. This means we know when our users can't reach their goals. It also means we can act fast to fix problems.

In this blog post I'll show how alerts help with this and how we make this a team-focused activity.

A call to action
A bat-signal (source https://www.flickr.com/photos/blakta2/)

Observing is your safety net

Observability is key to supporting your applications and users if you want to be shipping often, experimenting and learning. Mistakes will be made, networks are not perfect, and your code, integrations and users will surprise you.

Dashboards can be really useful for deep dives, problem solving and managing launch days, but long term, they don't cut it. Your team will need monitors and alerting.

But alerts can also let you down if you don't treat them right.

Alerts should be exceptional

You can monitor many different things, such as HTTP status codes and application events, but an alert should be there to highlight something unexpected or unwanted. More than that, it should be unusual. The system has failed and it's time for action!

Alerts are a call to action

If your alerts are firing all the time, your team will either be doing nothing but fixing issues or will begin to treat the alerts as noise. It's well worth understanding the root causes of your issues so you can get some quiet to do new things.

However, if some alerts are becoming noise, you risk critical issues being ignored. Your alerts need to be thinned out and managed.

Alerts need tending

Alerts can be like an unloved garden. Untended, they overgrow their space and crowd out what's important. Alerts that have fulfilled their purpose need grubbing up, those taking up too much space need pruning, and confusing alerts need weeding out.

When an alert fires, it's well worth evaluating its importance and value, and tuning it so that it tells you the right thing at the right time.

  • Do you often ignore alerts until there clearly is an issue?
  • Do you get many false alarms?
  • Do alerts you no longer care about clutter up your chat channel?

If so, you likely need to review how sensitive your alerts are, and whether they are right for the team.
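One common way to make an alert less sensitive is to require the problem to persist before anyone is notified. The sketch below is illustrative only and is not how Tes monitors are built: the `ErrorRateMonitor` class, the 5% threshold and the three-window rule are all assumptions chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: fires only when the error rate stays above
    a threshold for several consecutive time windows."""

    def __init__(self, threshold: float, windows_required: int) -> None:
        self.threshold = threshold
        self.windows_required = windows_required
        # Keep only the most recent windows needed for the decision.
        self.recent = deque(maxlen=windows_required)

    def observe(self, errors: int, requests: int) -> bool:
        """Record one time window; return True if the alert should fire."""
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.threshold)
        return len(self.recent) == self.windows_required and all(self.recent)


# Assumed numbers: (errors, requests) per window, 5% threshold, 3 windows in a row.
monitor = ErrorRateMonitor(threshold=0.05, windows_required=3)
windows = [(1, 100), (9, 100), (2, 100), (8, 100), (7, 100), (9, 100)]
fired = [monitor.observe(errors, requests) for errors, requests in windows]
print(fired)
```

With these numbers the single spike in the second window stays quiet, and the alert fires only once the error rate has been high for three windows running. Raising or lowering `windows_required` is the kind of tuning a team can revisit each time the alert fires.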

Alerts should communicate well

We want to work in a world where alerts fire infrequently. So how does a team use what it learns to become more effective in dealing with issues?

The answer can lie in the alert message where a team can leave notes for the future.

Anatomy of an alert

Don’t expect everyone on your team to be an expert in every part of the system they look after. For action to be taken, an alert needs to be clear about:

  1. What has happened, and how critical it is
  2. What should be done
  3. Where to find information that will help

The example below first makes a statement of what the alert is about. It then gives us links to filtered log queries to get more information.

Importantly it contains a link to a wiki page for further info. This page may start brief, but it will often get filled in as people learn more about the alert and we build tools to help us fix issues, leading to a knowledge base on the root causes of the problem.

Critically it has a call to action, making clear who it might affect and what we might need to do.
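As a rough illustration of that anatomy, here is a minimal sketch of an alert message assembled from those parts. It is a hypothetical stand-in rather than a real Tes alert: the `Alert` class, the service name, the wording and the example.com URLs are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Alert:
    """Hypothetical alert: what happened, how critical, what to do, where to look."""

    summary: str   # what has happened
    severity: str  # how critical it is
    action: str    # the call to action, including who is affected
    links: Dict[str, str] = field(default_factory=dict)  # label -> URL

    def render(self) -> str:
        """Build the plain-text message that would be posted to the team channel."""
        lines = [
            f"[{self.severity.upper()}] {self.summary}",
            f"Action: {self.action}",
        ]
        lines += [f"{label}: {url}" for label, url in self.links.items()]
        return "\n".join(lines)


# All values below are made up for illustration.
alert = Alert(
    summary="Checkout error rate above 5% for 15 minutes",
    severity="critical",
    action="Users may be unable to pay. Check the payment provider, then update the team.",
    links={
        "Logs": "https://logs.example.com/search?q=service:checkout+level:error",
        "Wiki": "https://wiki.example.com/alerts/checkout-errors",
    },
)
message = alert.render()
print(message)
```

Keeping the wiki link as part of the alert itself means that whoever picks the alert up next starts from what the team has already learned.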

Teams are the unit of operability

An alert is a placeholder for a conversation

The job of caring for our services and our users should be done as a team. Your team might have a rota for user issues or something less formal, but whoever is 'on call' should be taking point, not working in isolation. Their job is to look into the issue and report back to the team, where that new knowledge can be put to best use. It might be a short update, or it might be a discussion, but it should lead to new team understanding and tweaks to the monitor. It may also drive future work that removes the need for the alert altogether.

Alerts are not chores

Managing the live service should not be a chore. It's a key point of experience, learning, and assistance to help your users solve their problem or get a thing done. It should be at the heart of what you do. A great team attitude to monitoring can really help with that.
