If you want to be confident that your users can achieve their goals using your service, there’s more to do than monitoring the health of individual micro-services. You need assurance that your micro-services are working well together, and when they aren’t, you need the information necessary to fix problems as quickly as possible. This blog follows one Tes team’s mission to better identify and diagnose problems, enabling them to move fast and ship with confidence.
A friend of mine tells a great story of a team avoiding a lot of grief. All of their system health checks were green, but the live graph of purchases dropped to zero and stayed there. Despite the many positive system indicators, the team could see they had a problem and were able to react quickly to find and fix it. It turned out that user purchases were a key indicator of success.
Hack days are an interesting idea with roots in 3M’s “15% time”, introduced in 1948. The fundamental goal of hack days is to empower engineers to solve problems that they see, however they want to. Without the day-to-day delivery pressure, they can solve problems that other people don’t even realize they have, or solve them in incredibly creative ways.
On Tuesday evening, following the launch of the new home page, we had a second set of performance problems that impacted the entire tes.com site for 40 minutes around 6pm, and then again during two periods at 9pm and 12am. The root cause turned out to be a misconfigured Redis caching server that we had moved to in response to the issues on the 24th of April. During the post-mortem of the previous day’s issues we had agreed that a key action was to upgrade and improve the monitoring of the part of our platform that composes the shared fragments (e.g.
We had a number of site-wide performance issues on Monday 24th April that impacted the entirety of tes.com. The fix (as is so often the case) was deceptively simple, and resulted in average response times dropping from 100ms to 10ms and CPU usage on the server falling almost fourfold. As part of the rebrand we have been rebuilding the services that supply shared assets to all parts of our platform, including core styles, images and the fragments of HTML for the masthead, footer and left-hand navigation rail.
Debugging allows us developers to assume the role of detective, and like any good detective, we need to consult all of our sources to understand what’s going on. If your application uses MongoDB for persistence, one source available to you is the oplog. What is the oplog? The MongoDB oplog, or operations log, is a standard capped MongoDB collection. Each document in the collection is a record of a write operation (an insertion, update or deletion) that has resulted in data being changed.
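To make the shape of those records concrete, here is a minimal sketch (not from the post) of inspecting oplog-style documents. The field names follow MongoDB’s real oplog format — `op` is the operation type (`i` = insert, `u` = update, `d` = delete), `ns` is the namespace (`db.collection`), and `o` is the document or change applied — but the `describe_oplog_entry` helper and the sample entries are hypothetical, written so the idea can be tried without a live replica set:

```python
# Sketch of reading oplog-style documents. Field names match MongoDB's
# oplog format: 'op' = operation type ('i' insert, 'u' update, 'd' delete),
# 'ns' = namespace ('db.collection'), 'o' = the document or change applied.

OP_NAMES = {"i": "insert", "u": "update", "d": "delete"}

def describe_oplog_entry(entry):
    """Return a human-readable summary of a single oplog document."""
    op = OP_NAMES.get(entry.get("op"), "other")
    return f"{op} on {entry.get('ns')}: {entry.get('o')}"

# Example entries shaped like real oplog documents (values are made up):
entries = [
    {"op": "i", "ns": "shop.orders", "o": {"_id": 1, "total": 25}},
    {"op": "d", "ns": "shop.orders", "o": {"_id": 1}},
]

for entry in entries:
    print(describe_oplog_entry(entry))
```

In a real deployment the same documents live in the `local` database’s `oplog.rs` collection, and tools typically follow it with a tailable cursor to react to writes as they happen.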
This is the second in a series of posts about improving page performance. Part 1 discussed what we're measuring and how. A video of me talking about the performance issues discussed in this post.

The problem

For the job details page, we accept banners supplied by schools which aren't compressed as well as they could be. Large images don't block the rendering of the main content, but they do hog bandwidth, especially on mobile.
A video of me talking about the performance issues discussed in this post. How are we defining 'page load speed'? How quickly the user can see and interact with core page content after they navigate. Non-core content might be adverts, the user's avatar, or recommended links. It's important that these appear as quickly as possible, but they're not the main reason the user navigated to the page. Where are the biggest gains to be made?