Outage on 24th April - how one inefficient piece of code impacted tes.com

by Clifton Cunningham

We had a number of site-related performance issues on Monday 24th April that impacted the entirety of tes.com. The fix (as is so often the case) was deceptively simple, and resulted in average response times dropping from 100ms to 10ms and CPU usage on the server reducing by almost 400%.

[Graph: slowdown in response times across tes.com]

As part of the rebrand we have been rebuilding the services that supply shared assets to all parts of our platform, which include core styles, images and the fragments of HTML for the masthead, footer and left-hand navigation rail.

When a page is rendered on tes.com, the part of our stack that is responsible for that section (e.g. the resource details page) calls out to get hold of these assets, keeps a copy locally in its cache for a short time (as the content doesn’t change often) and then renders the page. This architecture has been remarkably scalable and resilient for the past ~3 years.
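
As a rough sketch of that pattern (the cache TTL, asset name and fetch function below are illustrative only, not our actual implementation):

// Minimal sketch of the local-caching pattern described above.
// fetchAsset and the 60-second TTL are hypothetical.
var assetCache = {};
var CACHE_TTL_MS = 60 * 1000; // shared assets change rarely, so a short TTL is enough

function getSharedAsset(name, fetchAsset, callback) {
  var cached = assetCache[name];
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
    return callback(null, cached.value); // serve the local copy
  }
  fetchAsset(name, function (err, value) {
    if (err) { return callback(err); }
    assetCache[name] = { value: value, fetchedAt: Date.now() };
    callback(null, value);
  });
}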

We changed the way the navigation is rendered within the masthead, primarily to support the huge number of permutations that exist within our navigation (both naming and URL targets) for each of the geographies and sections on tes.com.

The developers delivered this feature with a deceptively simple piece of code that allowed the navigation items to be declared as ‘templates’, with parts of the text and URL replaced at runtime by the actual properties of the request - e.g. ensuring that a US user saw the navigation intended for the US.

The code as originally written had a helper function - the ‘variable replacer’:

function variableReplacer(reg, existingObj) {
  return function(object, args) {
    var data = variables.registryToObject(reg, args, existingObj);
    return mapObj(object, function (value) {
      if (typeof value === 'string') {
        // Compile the string as a handlebars template and render it with the
        // request data - note this compiles afresh for every string value,
        // on every request.
        return handlebars.compile(value)(data);
      } else {
        return value;
      }
    });
  };
}
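
To illustrate how this was used, here is a made-up navigation item of the kind the replacer would process (the item and the request data are hypothetical, and the real registry/args plumbing is omitted):

// Illustrative only: a made-up navigation 'template' item and the kind of
// per-request data it would be filled with.
var handlebars = require('handlebars');

var navItem = {
  label: 'Jobs in {{countryName}}',
  url: '/{{countryCode}}/jobs'
};

var requestData = { countryName: 'the United States', countryCode: 'en-us' };

// Each string value is compiled and rendered against the request data,
// which is what the mapObj callback above does for every value:
var rendered = {
  label: handlebars.compile(navItem.label)(requestData),
  url: handlebars.compile(navItem.url)(requestData)
};
// => { label: 'Jobs in the United States', url: '/en-us/jobs' }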

This code worked fine under test, and was fine in operation for over 8 weeks. On Monday the 24th, as we began to see the usual increase in traffic at the end of the school holidays, we observed from around 8am that the overall platform was starting to show signs of degraded performance (not yet noticeable to users), peaking at 11am with the entire site beginning to respond very slowly.

We traced the root cause back to this service, and in particular to the piece of code responsible for rendering the masthead. It was rendering significantly slower than usual, which triggered a cascade of slowdowns across all services behind the site.

The immediate action was to scale up the service (increasing available processors and memory), and to scale up the service that composes the masthead into the pages. This alleviated the short-term issue, but the service was still responding slowly over the course of the day.

We also observed that the new masthead code was being called significantly more often than the old code, which we pinned down to the fact that we were sending “full” URLs (including all possible combinations of URL parameters) rather than just the base path of the URL, which was all that the service really needed. This meant many more permutations of the masthead were being requested and served. We changed the calls to pass only the base path, which brought down the number of requests and the pressure on the service.
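
As a sketch of the kind of normalisation involved (illustrative only - the actual change was made in the composition layer):

// Illustrative only: strip the query string so every variant of a page URL
// maps to the same masthead request (and the same cache entry).
var url = require('url');

function basePath(fullUrl) {
  return url.parse(fullUrl).pathname;
}

basePath('/teaching-resources?sortBy=newest&page=3'); // => '/teaching-resources'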

[Diff: code change that removed query strings from the cache]

On the morning of the 25th, a code review of the service that serves the site assets highlighted that it was using an inordinate amount of memory rendering the masthead (10MB per request), with the bulk of this sitting in the two large orange sections below (under the Map). These pieces of code call the above helper function - the variable replacer - between 50 and 100 times per request.

[Flame graph: highlighting the inefficient map over handlebars.compile]

In particular, this code had a very expensive operation - handlebars.compile(value)(data) - that was both memory- and CPU-heavy, and was being called repeatedly for the same values. We quickly modified the above code to keep a copy of each template in memory after it was compiled (as there is a limited number of possible permutations), and to re-use it on subsequent requests.

With this one small change we were able to bring the memory usage down dramatically.

The new code:

// Cache of compiled handlebars templates, keyed by the template string itself.
var handlebarsVariableCache = {};

function variableReplacer(reg, existingObj) {
  return function(object, args) {
    var data = variables.registryToObject(reg, args, existingObj);
    return mapObj(object, function (value) {
      if (typeof value === 'string') {
        if (value.indexOf('{{') >= 0) {
          // Only compile each distinct template string once, then re-use it.
          if (!handlebarsVariableCache[value]) {
            handlebarsVariableCache[value] = handlebars.compile(value);
          }
          return handlebarsVariableCache[value](data);
        } else {
          return value;
        }
      } else {
        return value;
      }
    });
  };
}
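
To get a feel for why this matters, a rough micro-benchmark along these lines (the template and iteration count are made up, and absolute numbers will vary by machine) shows the cost of repeated compilation versus re-using a compiled template:

// Illustrative micro-benchmark: compiling on every call vs compiling once.
var handlebars = require('handlebars');
var template = 'Jobs in {{countryName}}';
var data = { countryName: 'the United States' };

console.time('compile every call');
for (var i = 0; i < 10000; i++) {
  handlebars.compile(template)(data); // what the old code did per string, per request
}
console.timeEnd('compile every call');

var compiled = handlebars.compile(template); // compile once, as the cache now ensures
console.time('reuse compiled template');
for (var j = 0; j < 10000; j++) {
  compiled(data);
}
console.timeEnd('reuse compiled template');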

The new memory flamegraph:

[Screenshot, 25 April: memory flame graph after the change]

Deploying this change then saw average response times drop from 100ms to 10ms, and CPU usage on the server fall by almost 400%.

[Screenshot, 26 April: response times after the change was deployed]

The pressure on the stack was removed, and the site functioned as normal again.

Actions

  1. The team re-building the masthead should have picked up the fact that this code was inefficient within our usual code review process. We will be using this as an example within our knowledge-share process across the engineering team to ensure everyone understands why reviewing for performance is important.

  2. We recognise that this service is a single point of failure, and we are re-visiting the way we serve pages to see if we can reduce our exposure to these sorts of issues in future - for example, by further separating the segments of our site so that any issue impacts only one area rather than all of them, and by reviewing the site-assets and page-composition services and their storage architecture.

  3. We are conducting a full review of all core services to understand whether any recent features or new development have pushed them closer to performance boundaries that could cause issues for users or potential downstream outages.

Final Points

We take the stability and uptime of our platform very seriously within the engineering team, and we recognise that this type of outage on the site impacts not only Tes but those that are building businesses on our platform. Rest assured that we will learn from these incidents and do what we can to ensure they do not recur.

For any further updates or to see the live status of tes.com, please remember you can visit: https://trust.tes.com at any time.