June 20, 2019
by Dan Abel
Observing what happens when your users interact with your software keeps you from disaster, allowing your users to keep working and you to keep shipping.
At Tes we capture what happens when our users interact with our services. We set expectations on outcomes. This means we know when our users can't reach their goals. It also means we can act fast to fix problems.
In this blog post I'll show how alerts help with this and how we make this a team-focused activity.
Observability is key to supporting your applications and users if you want to be shipping often, experimenting and learning. Mistakes will be made, networks are not perfect, and your code, integrations and users will surprise you.
Dashboards can be really useful for deep dives, problem solving and managing launch days, but long term, they don't cut it. Your team will need monitors and alerting.
But alerts can also let you down if you don't treat them right.
You can monitor many different things, such as HTTP status codes and application events, but an alert should be there to highlight something unexpected or unwanted. More than that, it should be unusual: the system has failed and it's time for action!
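As a sketch of the idea (not how any particular monitoring stack is wired up), a monitor over HTTP status codes might stay quiet on the odd failure and only fire when server errors in a window cross a threshold:

```python
from collections import Counter

# Toy monitor: alert only when server errors (5xx) in a window of
# recent responses exceed a threshold, not on every single failure.
# The threshold value here is made up for illustration.
def should_alert(status_codes, threshold=5):
    counts = Counter(code // 100 for code in status_codes)
    return counts[5] > threshold

window = [200, 200, 503, 200, 500, 502, 200]
print(should_alert(window))      # a few blips: stay quiet
print(should_alert([500] * 10))  # sustained failure: time for action
```

The point of the threshold is exactly the "unusual" test above: occasional failures are normal in an imperfect network, and the alert should mark the departure from normal.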
If your alerts are firing all the time, your team will either be doing nothing but fixing issues or will begin to treat them as noise. It's well worth understanding the root causes of your issues so you can get some quiet to do new things.
However, if some alerts are becoming noise, you risk critical issues being ignored. Your alerts need to be thinned out and managed.
Alerts can be like an unloved garden. Untended, they can overgrow their space and crowd out what's important. Alerts that have fulfilled their purpose need grubbing up, those taking up too much space need pruning, and confusing alerts need weeding out.
When an alert fires, it's well worth evaluating its importance and value, and tuning it so it's telling you the right thing at the right time.
You'll likely need to review how sensitive your alerts are, and whether they are right for the team.
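One common way to tune sensitivity (a sketch, with made-up numbers) is to require a check to fail several times in a row before firing, so a transient blip doesn't page anyone:

```python
# Toy sensitivity tuning: only fire after N consecutive failing
# checks, so one transient blip doesn't interrupt the team.
def fire(check_results, consecutive_required=3):
    streak = 0
    for ok in check_results:
        streak = 0 if ok else streak + 1
        if streak >= consecutive_required:
            return True
    return False

print(fire([True, False, True, False, False]))  # isolated blips: no alert
print(fire([True, False, False, False, True]))  # sustained failure: alert
```

Dialling `consecutive_required` up or down is one of the levers a team can pull when reviewing whether an alert is firing at the right moments.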
We want to work in a world where alerts fire infrequently. So how does a team use what it learns to become more effective in dealing with issues?
The answer can lie in the alert message where a team can leave notes for the future.
Don’t expect everyone on your team to be experts in every part of the system they look after. For action to be taken, an alert needs to be clear about what has happened, where to find more information, and what action needs taking and by whom.
The example below first makes a statement of what the alert is about. It then gives us links to filtered log queries to get more information.
Importantly it contains a link to a wiki page for further info. This page may start brief, but it will often get filled in as people learn more about the alert and we build tools to help us fix issues, leading to a knowledge base on the root causes of the problem.
Critically it has a call to action, making clear who it might affect and what we might need to do.
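A hypothetical alert in that shape might be composed like this. All service names, URLs and queries below are illustrative, not real Tes systems:

```python
# Sketch of composing an alert message with the parts described
# above: a statement, log links, a wiki link, and a call to action.
# Every name and URL here is invented for illustration.
def build_alert_message() -> str:
    return "\n".join([
        # A statement of what the alert is about
        "payment-service: order confirmation emails are failing.",
        # Links to filtered log queries for more information
        "Errors: https://logs.example.com/search?service=payment-service&level=error",
        # A wiki page that can grow into a knowledge base for this alert
        "Runbook: https://wiki.example.com/alerts/order-confirmation-emails",
        # A call to action: who is affected, and what to do
        "Buyers are not receiving confirmations."
        " Check the email provider status, then retry the failed jobs.",
    ])

print(build_alert_message())
```

Because the message is built in code, the runbook link and log queries stay alongside the monitor definition, so updating one prompts the team to update the others.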
The job of caring for our services and our users should be done as a team. Your team might have a rota for user issues or something less formal, but whoever is 'on call' should be taking point, not working in isolation. Their job is to look into the issue and report back to the team, where that new knowledge can be put to best use. It might be a short update, or it might be a discussion, but it should lead to new team understanding, tweaks to the monitor, and perhaps further work so that the alert is no longer needed at all.
Managing the live service should not be a chore. It's a key source of experience and learning, and a chance to help your users solve their problems and get things done. It should be at the heart of what you do. A great team attitude to monitoring can really help with that.