November 03, 2017
by Dan Abel
A friend of mine tells a great story of a team avoiding a great deal of grief. All of their system health checks were green, but the live graph of purchases dropped to zero and stayed there. Despite the many positive system indicators, the team were able to see they had a problem and were able to react quickly to find and fix it.
It turned out that user purchases was a key indicator of success. Observing it saved them from an embarrassing day of support calls and explanations.
Knowing what's happening when your users interact with your software can save your ass, allowing your users to keep working and you to keep shipping. All the metrics, logs and test automation don’t matter if the customer can’t get their thing done.
At Tes, we instrument our services, reporting what happens as our users interact with the software on our live environment. We then set expectations on this record of behaviour to check for successful outcomes. This means we know when our users can't reach their goals. It also means we can act fast to fix problems.
At Tes, the teams that write the code also take care of the service in the Live environment (where our customers are). Each of the microservices we deploy has a job to do: processing data, producing a web page, sending email or generating notifications. We like to verify that each service is doing what it's supposed to, so if something goes wrong, we know.
Our services send low-level data: HTTP errors (404s, 500s etc), queue states, as well as disk or CPU usage. These indicators are great for problem solving.
We want to know what's happening inside each microservice, so we instrument the code in the service to send out signals for both application errors and events. These can then be used to alert the team when problems occur.
In this post I'll show you a simple way to instrument a microservice that allows us to observe, set expectations, and act on issues.
When our code catches and handles an error in a microservice, we often want to keep an eye on how often it's happening. We do that by letting the metrics service know when errors were caught with code that looks like this:
applicationError() is part of a module that wraps a couple of lower-level libraries. A call to it causes the metrics service to increment a counter. These 'metrics' are clever counters that know when each increment arrives. They also save tags that are sent, which allow us to break down the cause of a problem: ‘failedtofetchjobinfomation’ is a tag in the example above.
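As a rough illustration of the behaviour described here, the sketch below uses an in-memory stand-in for the metrics service. It is not the Tes module: `application_error`, `increment`, `count_by_tag` and `fetch_job_information` are hypothetical names, and a real implementation would send the increment over the network rather than store it locally.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stand-in for the metrics service.
# Each metric name maps to a list of (timestamp, tag) increments.
_metrics = defaultdict(list)


def increment(metric, tag, now=None):
    """Record one increment, remembering when it arrived and its tag."""
    _metrics[metric].append((time.time() if now is None else now, tag))


def application_error(tag, now=None):
    """Report a handled error (assumed shape of the applicationError() call)."""
    increment("application_error", tag, now)


def count_by_tag(metric):
    """Break a metric's total down by tag."""
    counts = defaultdict(int)
    for _, tag in _metrics[metric]:
        counts[tag] += 1
    return dict(counts)


def fetch_job_information(job_id, fetch):
    """Handle a failed lookup and let the metrics stand-in know about it."""
    try:
        return fetch(job_id)
    except LookupError:
        application_error("failedtofetchjobinfomation")
        return None
```

The error is still handled in the normal way; the extra call only records that it happened, so the counter can be observed later.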
Over on the metrics service, we can observe and act on the data we send. Our key step with errors is raising alerts to the people who need to hear them.
Once we have a microservice that emits errors, we can monitor what happens - setting expectations that will send alerts to the team. For simple service management, a starter approach is to expect no errors from the service - monitoring that the error count remains at zero.
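A minimal sketch of that starter expectation, assuming the error counts have already been read from the metrics service into a plain dictionary keyed by tag. The function name and the message wording are hypothetical; a hosted monitoring service would normally express this as a configured rule rather than code.

```python
def check_no_errors(error_counts):
    """Return one alert message per tag whose error count is above zero."""
    return [
        f"ALERT: {count} '{tag}' error(s) reported, expected 0"
        for tag, count in sorted(error_counts.items())
        if count > 0
    ]
```

An empty list means the expectation holds and nobody needs to be interrupted.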
When an alert triggers, we get a message in our team chat room, allowing us to work as a group, talk and examine the issue, and act on it. I'll dive deeper into the details of alerts and how these insights can improve how a team works in Part 3.
Not all errors are equal. There are errors we don't expect to ever happen and we want to know if they do. There are often error levels that are acceptable. There will always be some 'item not found' errors from typing mistakes - we can ignore a few, but if we get a flood of them, it's likely that something's wrong and we need to know. Setting thresholds in our expectations can help manage our priorities.
Focusing on a single error tag, we can set a tolerance so that only a limited number are allowed in any 5-minute period.
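The tolerance check can be sketched as a count over a sliding window. This assumes the arrival times of one tag's increments are available as Unix timestamps; `exceeds_tolerance` and its parameters are illustrative names, not part of any particular monitoring product.

```python
WINDOW_SECONDS = 5 * 60  # the 5-minute period


def exceeds_tolerance(timestamps, tolerance, now, window=WINDOW_SECONDS):
    """True when more than `tolerance` errors arrived in the last `window` seconds."""
    recent = [t for t in timestamps if now - window < t <= now]
    return len(recent) > tolerance
```

Setting `tolerance` to zero gives the stricter "no errors at all" expectation for errors that should never happen.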
Application events are triggered in a similar way. When an important action happens in an application, we emit an application event with code like:
This call increments the application_submitted count on the monitoring service and adds the type as a tag. We don't group application actions together so we can differentiate what's happening. Each action has a separate count that we increment and monitor. We also consume and act on each one of them a little differently.
Instead of alerting based on counts or thresholds, application events are consumed in several different ways.
Capturing application events allows the team to view and explore what has been happening with our service. We've built service health dashboards that give us a helpful overview when we put new services live. They also let us explore patterns of usage and problem states.
We set expectations on what should happen when a service is working as expected, alerting when the service is not behaving as we want.
Similar to how we reason when we write automated tests, in our monitoring service, we state what behaviours we expect from the application. For example, we could expect that the count of application submissions does not drop to zero.
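That example expectation might be sketched as follows, assuming submission timestamps (Unix seconds) are available and that "drops to zero" is judged over a chosen window. The window length is an assumption that would need tuning to a service's normal traffic, and `submissions_have_stopped` is a hypothetical name.

```python
def submissions_have_stopped(timestamps, now, window_seconds):
    """True when no submission arrived in the last `window_seconds`."""
    return not any(now - window_seconds < t <= now for t in timestamps)
```

Like an assertion in an automated test, it states a behaviour we expect, but it is evaluated continuously against live traffic rather than once before release.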
We can also set more complex expectations, reasoning about connected events. Thinking about this has become critical to how we measure, and I’ll cover it in the next blog post.
Rather than writing additional code paths for cases we aren’t sure could ever happen, or worse, dismissing them and not accounting for them, we set expectations on corner cases. We place a metric on them and check for the unlikely outcome. This allows us to act on real information rather than conjecture and deal with changes that are out of our control.
Setting expectations gives us confidence that our services continue to work as we keep releasing. If a service starts to fail our users, we'll know.
This checking complements the tests that run before we ship a change, as long as we keep listening and are able to move fast to react to an issue. Having dashboards allows us to diagnose problems that we weren't expecting, once we know something has gone wrong.
In a microservice environment, knowing when something has gone wrong requires a more complex view of what's going on. It means observing and setting expectations to detect problems happening between collaborating microservices. That's what I'll be looking at in my next post.
Charity Majors' writing has really helped me put words to what I care about and challenged how I work.
Monitoring is for operating software/systems
Instrumentation is for writing software
Observability is for understanding systems
Charity Majors (@mipsytipsy), September 23, 2017
Our CTO Clifton Cunningham summarises how we work with microservices at Tes in this video: