If you want to be confident that your users can achieve their goals using your service, there’s more to do than monitoring the health of individual micro-services. You need assurance that your set of micro-services is working well together, and when it isn’t, you need the information necessary to fix any problems as soon as you can. This blog post follows one Tes team’s mission to better identify and diagnose problems, enabling them to move fast and ship with confidence.
A friend of mine tells a great story of a team avoiding a great deal of grief. All of their system health checks were green, but the live graph of purchases dropped to zero and stayed there. Despite the many positive system indicators, the team could see they had a problem and were able to react quickly to find and fix it. It turned out that the number of user purchases was a key indicator of success.
Hack days are an interesting idea based on the 15% time introduced by 3M in 1948. The fundamental goal of hack days is to empower engineers to solve the problems they see, however they want to. Without the day-to-day delivery pressure, they can solve problems that other people don’t even realize they have, or solve them in ways that are incredibly creative.
Rachel is a full stack engineer at Tes and works from her home office in Scotland. When she is not working, you will find Rachel either training for a race (she loves to run!), or exploring the outdoors. Want to get her attention? Just say the word “challenge”.
On Tuesday evening, after the launch of the new home page, we had a second set of performance problems that impacted the entire tes.com site for 40 minutes around 6pm, and then again during two periods at 9pm and 12am. The root cause turned out to be a misconfigured Redis caching server that we had moved to in response to the issues on the 24th of April. During the post-mortem of the previous day’s issues we had agreed that a key action was to upgrade and improve the monitoring of the part of our platform that does the composition of the shared fragments (e.g.
We had a number of site-related performance issues on Monday 24th April that impacted the entirety of tes.com. The fix was (as always) deceptively simple, and resulted in average response times dropping from 100ms to 10ms and CPU usage on the server falling by almost 400%. As part of the rebrand we have been rebuilding the services that supply shared assets to all parts of our platform, which include core styles, images and the fragments of HTML for the masthead, footer and left-hand navigation rail.
Debugging allows us developers to assume the role of detective, and like any good detective, we need to consult all of our sources to understand what’s going on. If your application uses MongoDB for persistence, one source you have available is the oplog. What is the oplog? The MongoDB oplog, or operations log, is a capped MongoDB collection. Each document in the collection is a record of a write operation (a delete, update or insertion) that has resulted in data being changed.
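To make the idea of "each document is a record of a write" a little more concrete, here is a minimal Python sketch that summarises oplog-style documents. It assumes the commonly seen oplog fields `op` (a one-letter operation code), `ns` (the `database.collection` namespace), `o` (the operation document) and `o2` (the selector for updates). The sample entries and the helper names `describe_entry` and `count_writes` are hypothetical, made up purely for illustration.

```python
from collections import Counter

# One-letter operation codes found in the "op" field of oplog documents.
OP_NAMES = {"i": "insert", "u": "update", "d": "delete", "c": "command", "n": "no-op"}


def describe_entry(entry):
    """Return a one-line, human-readable summary of a single oplog document."""
    op = OP_NAMES.get(entry.get("op"), "unknown")
    ns = entry.get("ns", "")
    # For updates, "o2" holds the selector (usually the _id) and "o" the change;
    # for inserts and deletes, the _id is found in "o".
    target = entry.get("o2") or entry.get("o") or {}
    return f"{op} on {ns} _id={target.get('_id')}"


def count_writes(entries):
    """Count inserts, updates and deletes per (namespace, operation) pair."""
    counts = Counter()
    for entry in entries:
        if entry.get("op") in ("i", "u", "d"):
            counts[(entry["ns"], OP_NAMES[entry["op"]])] += 1
    return counts


# Hypothetical sample entries shaped like oplog documents (not real data).
SAMPLE = [
    {"op": "i", "ns": "shop.orders", "o": {"_id": 1, "status": "new"}},
    {"op": "u", "ns": "shop.orders", "o2": {"_id": 1}, "o": {"$set": {"status": "paid"}}},
    {"op": "d", "ns": "shop.orders", "o": {"_id": 1}},
    {"op": "n", "ns": "", "o": {"msg": "periodic noop"}},
]
```

Against a real replica set, the same helpers could be fed documents read from the `oplog.rs` collection in the `local` database (for example with a driver such as PyMongo); the exact shape of the `o` field for updates varies between MongoDB versions, so treat this as a starting point rather than a complete parser.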
Denis Fernandez joined Tes in April 2015 as a front end software developer and currently works from Barcelona, Spain. He grew up in Havana, Cuba and went to the Higher Design Institute, where he graduated as a Graphic Designer. As if designing and creating amazing visual experiences for our digital education platform were not enough, Denis also plays the bass guitar and is a member of two metal bands. Below are a few questions we asked Denis to get to know him better!