Incident report: June 13, 2019
Around 02.05 UTC on June 13, our on-call staff member started receiving notifications about a potential problem with the ad management console service. This in itself is not unusual, as we have deliberately made the uptime check for the console quite sensitive, and such alerts normally clear on their own after a few minutes.
However, about an hour later, at 03.12 UTC, another notification came in, this time of a less common kind. After an initial inspection confirmed that it was not another false positive, and despite the very early hour – it was around 5.25 AM local time – the on-call staff member decided to phone and wake up our lead engineer.
Upon closer inspection, he determined that there was a problem with the storage volumes attached to several database servers: for reasons unknown at the time, the database software had lost its connection to the databases on storage. We immediately activated our recovery procedures for severe incidents.
From 03.35 UTC onward, the affected database servers were restarted one after the other. This was only partially successful: after the reboot the database servers could once again reach their storage clusters, but the application software was still unable to access the data. A check showed that the databases had been closed improperly, most likely at the time of the first set of notifications and probably as a result of a connectivity issue in our internal network.
Fortunately, our platform is designed to continue ad delivery and the collection of delivery statistics even when the console database servers are non-functional. We decided it would be prudent to run a full check-and-repair sequence over all affected database tables. There are a few thousand of them, and their total size is considerable, so the sequence was expected to take a long time. To complete it as quickly as possible, the database server involved was resized with more CPU power and more memory, so that the sequence could run at the maximum speed the storage cluster would permit.
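For readers curious about what such a pass can look like, the snippet below is a minimal sketch, assuming a MySQL-compatible database reached through PyMySQL; the host, credentials, the schema name "console", and the decision to auto-repair are illustrative assumptions rather than details of our actual setup or runbook.

```python
# Minimal sketch of a bulk check-and-repair pass, assuming a MySQL-compatible
# database accessed through PyMySQL. Host, credentials, and the schema name
# "console" are placeholders for illustration only.
import pymysql

conn = pymysql.connect(host="db-maintenance.internal", user="maint",
                       password="***", database="information_schema")
try:
    with conn.cursor() as cur:
        # List every table in the (hypothetical) application schema.
        cur.execute(
            "SELECT table_schema, table_name FROM tables WHERE table_schema = %s",
            ("console",),
        )
        for schema, table in cur.fetchall():
            # CHECK TABLE reports "OK" in the last result row when the table is healthy.
            cur.execute(f"CHECK TABLE `{schema}`.`{table}`")
            if cur.fetchall()[-1][-1] != "OK":
                # Table was closed improperly or is damaged: attempt a repair
                # and log the outcome for the engineer watching the run.
                cur.execute(f"REPAIR TABLE `{schema}`.`{table}`")
                print(schema, table, cur.fetchall())
finally:
    conn.close()
```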
At 04.28 UTC, the check-and-repair sequence was started, with one of the engineers monitoring the process continuously so that no unexpected warnings would go unnoticed. While this was ongoing, the console front-end servers were temporarily taken offline so that customers could not accidentally log in in the meantime. Support staff were on hand to respond to support tickets, and they also kept the Twitter account at https://twitter.com/StatusAquaPform updated with status and progress reports.
The entire check-and-repair sequence took a little under six hours to complete, after which the engineers manually ran the maintenance procedure to import the ad delivery statistics that had been accumulating on the delivery platform all that time. This took another 40 minutes, partly because of extra checks to make sure nothing was missing. A few minutes after 11.00 UTC, the engineering team reported that the service was back online and the incident was resolved. This was communicated over the Twitter account about 30 minutes later, after we had closely monitored the entire platform for some time to make sure everything was running smoothly.
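To give an idea of the kind of completeness check mentioned above, here is a minimal sketch that verifies every hourly statistics bucket in the affected window is present after an import; the hourly granularity, the function, and all values in it are illustrative assumptions and not taken from our actual maintenance procedure.

```python
# Minimal sketch of a completeness check over an imported backfill: every hourly
# bucket in the outage window should be present. The hour granularity and the
# in-memory sets are illustrative assumptions, not our actual import procedure.
from datetime import datetime, timedelta

def missing_hours(imported_hours, start, end):
    """Return the hourly buckets in [start, end) that are absent from the import."""
    expected = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        expected.append(t)
        t += timedelta(hours=1)
    return sorted(set(expected) - set(imported_hours))

# Example: verify the window between the first alert and the restore.
outage_start = datetime(2019, 6, 13, 2, 0)
outage_end = datetime(2019, 6, 13, 11, 0)
imported = {outage_start + timedelta(hours=h) for h in range(9)}
print(missing_hours(imported, outage_start, outage_end))  # [] means nothing is missing
```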
When we contacted the technical support team of our cloud provider about this incident, asking what had happened shortly after 02.00 UTC, they reported back that “One of the block storage cluster nodes had a minor blip and it was recovering at the time you noticed this issue.” While it may be true that from their point of view it was indeed “a minor blip”, it caused a major problem for us and for our customers. We are in close contact with our provider to find out more about what happened and, more importantly, how something like this can be prevented from happening again.
One lesson from this incident is that our internal communication can be improved by escalating from the on-call team to the main team more quickly. We also observed that the platform continued ad delivery as if nothing had happened, a direct result of the forward thinking that went into its design and architecture. Obviously, we would have loved to get the console component back up and running much faster than the nine hours it took from first alert to full resolution. We are going to investigate whether the database systems can be made more resilient to connectivity issues, and whether the recovery process can be sped up in the event of another incident, however unlikely a recurrence may be.
Our sincere apologies to our customers and users for any inconvenience caused by this unfortunate situation. We are grateful for the kind words and encouragement we received from customers who contacted our support team and replied to our initial responses on their tickets. Many of you clearly understand the complexities involved in running a platform of this size, and appreciate the effort that goes into a complicated recovery operation like the one on June 13th.
If you think your account shows any residual effects from the incident, please do not hesitate to open a support case so that one of our team members can investigate.