If you want to be confident that your users can achieve their goals using your service, there’s more to do than monitoring the health of individual micro-services.
You need assurance that your micro-services are working well together and, when they aren’t, the information necessary to fix any problems as soon as you can.
This blog follows one Tes team’s mission to better identify and diagnose problems, enabling them to move fast and ship with confidence.
First, a little background.
What we do
As part of its services for educators, Tes advertises education vacancies and supports the hiring process online. Teachers apply via browser-based application forms and uploaded documents. Schools manage their applications in our Applicant Tracking tool.
When an applicant submits their completed application, there’s a lot of work done behind the scenes: generating documents, sending emails, updating profiles, and of course informing the school they have applied to.
The user doesn’t have to wait for this all to happen; they carry on with their next task whilst we are acting on their behalf. It’s our job as engineers to ensure everything happens right for every application made.
How we get this done
Tes.com runs on a large set of collaborating micro-services, and several teams each look after a group of them. Around 10 services are involved in delivering a completed job application to a school.
When an application is submitted, event messages are posted that trigger the relevant services to do the work needed.
Observing the behaviour of any single service can’t tell us if a Job Application submission has been processed correctly. We need to know more about how our services are working together.
Our early cross-service checks
We started by instrumenting our most critical services and collating their data. With this data alone we were able to set assertions in our monitoring tool that crossed service boundaries.
Our first cross-service alert set the expectation that for every application form a user submitted, a PDF of the application data was created.
It was quick to set up a check that these two counts had been equal during the last 15 minutes.
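As a rough illustration of that kind of check, here is a minimal sketch in Python. The event names (`application_submitted`, `pdf_rendered`) and the in-memory list of events are assumptions made for the example; the real check was an expectation configured in our monitoring tool, not code like this.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def count_in_window(events, event_type, now):
    """Count events of one type with a timestamp in the last 15 minutes."""
    start = now - WINDOW
    return sum(
        1 for e in events
        if e["type"] == event_type and start <= e["at"] <= now
    )

def counts_match(events, now):
    """True when submissions and rendered PDFs are equal in the window."""
    submitted = count_in_window(events, "application_submitted", now)
    rendered = count_in_window(events, "pdf_rendered", now)
    return submitted == rendered

# Hypothetical sample data: two submissions, only one PDF so far.
now = datetime(2020, 1, 1, 12, 0)
events = [
    {"type": "application_submitted", "at": now - timedelta(minutes=10)},
    {"type": "pdf_rendered", "at": now - timedelta(minutes=9)},
    {"type": "application_submitted", "at": now - timedelta(minutes=4)},
]
print(counts_match(events, now))
```

With this sample data the counts differ, so the check would raise an alert.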
Checking chains of events
Our next step was to add a check for the subsequent event. Submitted applications should have PDFs rendered and emails sent to schools. So we set an expectation for that chain that would tell us when and where we had a problem in a critical area.
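One way to picture a chained expectation is to compare the counts of neighbouring steps and report the first pair that disagrees. The sketch below assumes three hypothetical event names and a plain dictionary of counts; it shows the idea rather than the configuration we actually used.

```python
# Hypothetical event names for the chain, in the order they should occur.
CHAIN = ["application_submitted", "pdf_rendered", "school_email_sent"]

def first_broken_link(counts):
    """Return the first (previous, next) pair whose counts differ, or None."""
    for prev, nxt in zip(CHAIN, CHAIN[1:]):
        if counts.get(prev, 0) != counts.get(nxt, 0):
            return (prev, nxt)
    return None

# Five submissions and five PDFs, but only four emails reached schools.
counts = {"application_submitted": 5, "pdf_rendered": 5, "school_email_sent": 4}
print(first_broken_link(counts))
```

Because the pair names the two neighbouring steps, the result says where in the chain the problem sits as well as that there is one.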
These monitors gave us a lot of feedback and confidence. With little extra effort we moved from knowing what our services were doing to knowing that our systems were working together.
The bad news
It wasn’t completely smooth sailing: we got false alarms. Our expectations were simple comparisons of event counts from the last 15 minutes.
If the first event in a chain fell before the start of the 15-minute window, the sums would be wrong and it would look as if we had more PDFs than application submissions. In the same way, the check could not take into account that the chain of events takes a little time to complete. When the expectation was checked, chains that had not yet finished looked like errors, even though the work was just in progress.
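Both kinds of false alarm are easy to reproduce with a couple of timestamps. This is a small sketch with invented times and assumed event names: in the first case the submission falls just outside the window while its PDF falls inside; in the second the submission is only seconds old and its PDF has not been rendered yet.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def window_counts(events, now):
    """Count each event type with a timestamp inside the last 15 minutes."""
    start = now - WINDOW
    counts = {"application_submitted": 0, "pdf_rendered": 0}
    for kind, at in events:
        if start <= at <= now:
            counts[kind] += 1
    return counts

now = datetime(2020, 1, 1, 12, 0)

# Submission just before the window opened, PDF just after: more PDFs than submissions.
headless = [
    ("application_submitted", now - timedelta(minutes=15, seconds=30)),
    ("pdf_rendered", now - timedelta(minutes=14, seconds=50)),
]

# Submission ten seconds ago, work still in progress: more submissions than PDFs.
in_flight = [
    ("application_submitted", now - timedelta(seconds=10)),
]

print(window_counts(headless, now))
print(window_counts(in_flight, now))
```

In both cases nothing has gone wrong, yet the counts disagree and a simple equality check would alert.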
To make things worse, the monitoring system could see the values were wrong but neither it nor we could see why. The data didn’t provide the depth necessary for us to solve problems or know whether there had been a false alarm.
We needed a way to dig into the details quickly whenever we got an alert. We built an audit log, then scripts to query it.
This led to our next step — writing our own alerting code.
Using our domain knowledge to get monitoring reliability
Building an audit log
We needed more detail than our monitoring system provided, so we built an audit log. It recorded the critical events that followed each job application, which proved super useful for supporting users and solving problems.
Our audit log collated actions from the core job application micro-service and confirmations of successful processing from the collaborating micro-services. This allowed us to hold more detail about the actions that each micro-service took.
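To make that concrete, here is a minimal sketch of what one audit entry might hold and how entries could be grouped per application. The field names, event names and service names are all assumptions for illustration; the real log's schema is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AuditEntry:
    # All field names are hypothetical.
    application_id: str
    event: str       # e.g. "application_submitted", "pdf_rendered"
    service: str     # the micro-service that reported the event
    at: datetime

def by_application(entries):
    """Group entries by application id, each group in time order."""
    grouped = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e.at):
        grouped[entry.application_id].append(entry)
    return dict(grouped)

entries = [
    AuditEntry("app-1", "pdf_rendered", "pdf-service", datetime(2020, 1, 1, 12, 1)),
    AuditEntry("app-1", "application_submitted", "applications", datetime(2020, 1, 1, 12, 0)),
    AuditEntry("app-2", "application_submitted", "applications", datetime(2020, 1, 1, 12, 2)),
]
trail = by_application(entries)
print([e.event for e in trail["app-1"]])
```

Keying every entry on the application is what lets a script answer "what happened to this submission?" rather than only "how many events were there?".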
The scripts that we built to examine our audit log quickly became more precise than the monitor expectations that sent us alerts. We changed gear to use these scripts to drive our alerts — to see if we could stop the false alarms.
Alerting from the audit trail
We wanted to use the audit log to assert that all applications recently submitted had been processed correctly.
We started by selecting the application submissions made in the last 15 minutes. This avoided false alarms from counting ‘headless’ events — those where the initial event fell outside the assessing window.
We dropped all application submissions that were less than 5 minutes old. This allowed enough time for all pending events to be completed and stopped the false alarms from in-process submissions.
We then asserted that all the application submissions had a complete set of subsequent events, using the monitoring to alert us to any problems.
This reduced the false alarms to zero, giving us a reliable system that told us when an application submission was not getting processed correctly by one of our micro-services.
Even better, it could tell us the details of the fault whenever it caught a problem.
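The selection and assertion steps in this section can be sketched in a few lines of Python. The required event names, the dictionary inputs and the function name are assumptions made for the example, not our production code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)   # only look at submissions this recent
SETTLE = timedelta(minutes=5)    # ignore submissions younger than this
# Hypothetical set of events every submission must be followed by.
REQUIRED = {"pdf_rendered", "school_email_sent"}

def incomplete_applications(submissions, events_by_app, now):
    """Map each settled, in-window application id to its missing events.

    submissions: {application_id: submitted_at}
    events_by_app: {application_id: set of event names seen so far}
    """
    problems = {}
    for app_id, submitted_at in submissions.items():
        age = now - submitted_at
        if age > WINDOW or age < SETTLE:
            continue  # outside the window, or possibly still in progress
        missing = REQUIRED - events_by_app.get(app_id, set())
        if missing:
            problems[app_id] = sorted(missing)
    return problems

now = datetime(2020, 1, 1, 12, 0)
submissions = {
    "app-1": now - timedelta(minutes=10),  # settled and complete
    "app-2": now - timedelta(minutes=8),   # settled, email missing
    "app-3": now - timedelta(minutes=2),   # too young to judge
    "app-4": now - timedelta(minutes=20),  # outside the window
}
events_by_app = {
    "app-1": {"pdf_rendered", "school_email_sent"},
    "app-2": {"pdf_rendered"},
}
print(incomplete_applications(submissions, events_by_app, now))
```

Only `app-2` is reported, along with the event it is missing. With these two thresholds each submission is eligible for checking for ten minutes, so a scheduler (not shown) would need to run the check more often than that to cover every submission.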
Gain confidence via data to ship and move fast
We learnt that to spot problems accurately and react quickly to unexpected situations, we needed to instrument our software in enough detail to reason and write code about individual user events.
We also learnt that we didn’t need to do all that at the start; we could build a noisier, simpler system and finesse it as we shipped and learnt more. By focusing on the problems we had, we progressively reduced our support workload, chunk by chunk.
This solution has replaced big parts of what would typically be done by suites of integration or system tests, which I’ll write about in Part 3.