On Tuesday evening, after the launch of the new home page, we had a second set of performance problems that impacted the entire tes.com site: first at around 6pm for 40 minutes, and then during two further periods at 9pm and 12am. The root cause turned out to be a misconfigured Redis caching server that we had moved to in response to the issues on the 24th of April.
During the post-mortem of the issues the day before, we had agreed that a key action was to upgrade and improve the monitoring of the part of our platform that composes the shared fragments (e.g. the masthead) into the pages across the site. We call this our ‘page composition service’, or page composer. This is a crucial shared part of our platform that does work on almost every single page request to tes.com. To ensure it stays up we run 18 instances of it, distributed across 3 locations in AWS (6 in each). To put this in context, we have only 6 instances of the application that serves the tes.com home page.
We made the call to migrate the storage behind page composer to a recently created shared server that was being used by a range of other services (e.g. to manage the session state of logged-in users), as this server had a significant amount of spare capacity, ran a more modern version and was (so we thought) well monitored.
The previous version of the page composer database had been up and stable for 923 days, giving an indication of just how reliable this service had been up to this point. The configuration change was made in the morning, but not deployed live. The intent was to do some final checks and testing later in the day before pushing it to live.
When the decision was taken to release the home page later in the day, the change was inadvertently merged in and pushed live along with the other changes, which modified the service that serves the home page.
Immediately after the deployment, the site and page composer were operating as expected. Then, at 17:53, we observed a complete failure of page composer to serve any requests, effectively making most of tes.com unavailable.
We took immediate action, re-starting page composer (to ensure it wasn’t a short term issue) and also re-deploying the tier of services above it (the nginx web server layer) that distributes requests to all of the services around our platform. Neither of these actions made any difference, and we were getting no alerts from any other parts of our platform. We then stepped through a series of actions to attempt to pinpoint the problem, confirming that the services could all talk to each other (ruling out a network issue), and that there were no broader issues in AWS impacting us.
We then checked the storage for page composer, and discovered that it was completely unresponsive - this was the root cause.
The storage server that we had re-pointed it to had been misconfigured in two ways: 1) it had no memory limits, so it had filled up during the time after the page composer deployment and then locked up when it ran out of memory, and 2) the monitoring that had been set up on that box to alert us to this wasn’t actually working.
The storage server was quickly restarted and service resumed to tes.com. The server was then re-configured to match the old service, and the monitoring and alerting agents re-installed and configured correctly.
During the night we then saw further instability whenever this server began to reach the memory limits we had set (with alerts notifying us as expected). Rather than ‘evicting’ old data, which is the expected behaviour, it crashed again, bringing the site down. The decision was made at 1am to migrate back to the old storage, and service resumed.
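The two settings involved here, a memory limit and an eviction policy, lend themselves to an automated check. The following is a minimal sketch, not the tooling used during the incident. It assumes the server's settings have been read into a plain dict of strings (for example from the output of Redis's `CONFIG GET`), with `maxmemory` expressed in bytes. The function name `audit_cache_config` is hypothetical. In Redis, a `maxmemory` of 0 means no limit, and the `noeviction` policy rejects writes rather than evicting old keys.

```python
from typing import Dict, List

# Policies under which Redis evicts keys once maxmemory is reached.
# "noeviction" is deliberately absent: it returns errors on writes instead.
EVICTING_POLICIES = {
    "allkeys-lru", "allkeys-lfu", "allkeys-random",
    "volatile-lru", "volatile-lfu", "volatile-random", "volatile-ttl",
}


def audit_cache_config(config: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems found in a cache config.

    `config` is assumed to map Redis setting names to string values,
    with `maxmemory` given in bytes. Missing keys fall back to the
    Redis defaults (no limit, no eviction).
    """
    problems: List[str] = []

    maxmemory = int(config.get("maxmemory", "0"))
    if maxmemory == 0:
        problems.append("maxmemory is 0: no memory limit is set")

    policy = config.get("maxmemory-policy", "noeviction")
    if policy not in EVICTING_POLICIES:
        problems.append(
            f"maxmemory-policy is '{policy}': old keys will not be evicted"
        )
    elif policy.startswith("volatile-"):
        # volatile-* policies only consider keys that have an expiry set.
        problems.append(
            f"maxmemory-policy is '{policy}': only keys with a TTL can be evicted"
        )

    return problems


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    unsafe = {"maxmemory": "0", "maxmemory-policy": "noeviction"}
    for problem in audit_cache_config(unsafe):
        print(problem)
```

A check like this could run against every cache server before a service is re-pointed to it. It only verifies that the settings are present. It would not explain why a server crashes at its limit despite them, which is what we saw overnight.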
We are now monitoring the full stack closely as we plan next steps.
What actions are we taking?
- More time and care should have been taken over the decision to move the storage service for a critical piece of infrastructure, especially given its importance and its reliability up to that point. The team are now re-evaluating the change and will move the storage to a managed service within the AWS environment, under a more managed process.
- The platform team are reviewing every piece of infrastructure across all environments to ensure that every server has monitoring agents installed and that they are configured correctly.
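One way to catch monitoring that is installed but not actually reporting is to alert on silence itself. The sketch below is an assumption about how such a check might look, not a description of our monitoring stack. It takes the last time each host's agent reported and flags any host that has been quiet for too long or has never reported. The function name `find_silent_hosts` and the host names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_silent_hosts(
    last_seen: Dict[str, Optional[datetime]],
    now: datetime,
    max_silence: timedelta,
) -> List[str]:
    """Return hosts whose monitoring agent has not reported recently.

    `last_seen` maps a host name to the timestamp of its most recent
    report, or None if the agent has never reported at all.
    """
    silent: List[str] = []
    for host, seen in sorted(last_seen.items()):
        if seen is None or now - seen > max_silence:
            silent.append(host)
    return silent


if __name__ == "__main__":
    now = datetime(2018, 1, 1, 18, 0, tzinfo=timezone.utc)  # arbitrary example time
    reports = {
        "cache-shared": None,  # agent installed but never reported
        "page-composer-1": now - timedelta(minutes=1),
        "web-1": now - timedelta(minutes=30),
    }
    print(find_silent_hosts(reports, now, max_silence=timedelta(minutes=5)))
```

The design choice is to treat "no data" as a failure rather than as a healthy quiet period, so that a broken agent produces an alert instead of going unnoticed.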
We take the stability and uptime of our platform very seriously within the engineering team, and we recognise that this type of outage on the site impacts not only Tes but those that are building businesses on our platform. Rest assured that we will learn from these incidents and do what we can to ensure they do not recur.
For any further updates, or to see the live status of tes.com, please remember you can visit https://trust.tes.com at any time.