Degraded performance for databases on LUNA cluster
Incident Report for Technolutions
Resolved
There have been no further issues observed with the LUNA cluster since the degraded performance cleared on 12/1. We are closing this incident and will be sharing our postmortem review.
Posted Dec 02, 2020 - 23:45 EST
Update
Databases on LUNA have been much more performant and stable since the updates 20+ hours ago, and further mitigations performed overnight have continued to show positive results. Our infrastructure has made it through nearly all backlogged background processes, with the exception of the reporting of email opens/clicks across all databases; as a result, some open/click data from prior to this morning may not be visible until later today or overnight while that backlog is worked through. We're continuing to monitor all systems, and we'll share a postmortem in a few days, after we have had time to deconstruct the LUNA event.
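For illustration only (this is not a description of Slate's actual implementation), a deferred backlog like this is commonly drained in small, throttled batches so that live traffic keeps priority. The sketch below assumes hypothetical fetch_batch and record_open_click helpers:

    # Illustrative only: drain a backlog of deferred events at a throttled
    # rate so that live, user-facing queries keep priority. All names
    # (fetch_batch, record_open_click, BATCH_SIZE) are hypothetical.
    import time

    BATCH_SIZE = 500          # events pulled from the backlog per iteration
    PAUSE_SECONDS = 0.25      # breathing room between batches for live traffic

    def drain_backlog(fetch_batch, record_open_click):
        """Work through backlogged email open/click events in small batches."""
        while True:
            batch = fetch_batch(BATCH_SIZE)   # oldest deferred events first
            if not batch:
                break                          # backlog is fully drained
            for event in batch:
                record_open_click(event)       # replay the deferred write
            time.sleep(PAUSE_SECONDS)          # throttle to avoid saturating the cluster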
Posted Dec 02, 2020 - 13:51 EST
Update
We are continuing to monitor database activity on the LUNA cluster. Throughout the day, we have investigated, developed, and tested various mitigations to address the multiple underlying causes of the intermittent performance issues on this single cluster. We have identified components of several underlying systems that were not meeting the performance expectations required for an unprecedented level of activity, which has included more than 250 million requests to our direct and distributed infrastructure in the past 24 hours. We understand and appreciate the significance of 11/30 and 12/1 as application deadlines, of today as #GivingTuesday, and of this week as a return from a holiday for many, and we have worked actively throughout yesterday and today to minimize any performance degradation to the best of our capabilities. We will continue to monitor these systems overnight as we ramp backend processing activity back up; in some cases this activity was temporarily delayed as we gave priority to forward-facing functionality. We will provide further updates as they become available.
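Purely as an illustrative sketch (not Slate's actual architecture), giving forward-facing requests strict priority over backend processing can be modeled with a small priority queue; the task names below are hypothetical:

    # Illustrative only: user-facing work is served strictly ahead of
    # backend processing, the trade-off described above. Task names
    # are hypothetical.
    import heapq

    USER_FACING, BACKGROUND = 0, 1   # lower number = served first

    class PriorityWorkQueue:
        def __init__(self):
            self._heap = []
            self._counter = 0   # tie-breaker keeps FIFO order within a priority class

        def put(self, priority, task):
            heapq.heappush(self._heap, (priority, self._counter, task))
            self._counter += 1

        def get(self):
            return heapq.heappop(self._heap)[2] if self._heap else None

    queue = PriorityWorkQueue()
    queue.put(BACKGROUND, "recalculate reports")
    queue.put(USER_FACING, "render applicant status page")
    assert queue.get() == "render applicant status page"  # user-facing work runs first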
Posted Dec 01, 2020 - 16:41 EST
Update
We have continued to observe intermittent performance degradation on LUNA, which suggests broader network connectivity issues associated with this cluster. We have therefore decided to transfer all LUNA traffic to secondary infrastructure. We will begin this transfer process shortly, once we confirm that the secondary infrastructure has fully synchronized with all databases.
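As an illustration of the kind of safety gate involved (not Slate's actual tooling), a cutover like this typically waits until every secondary reports zero replication lag; get_replication_lag_seconds and switch_traffic_to below are hypothetical:

    # Illustrative only: confirm replication has fully caught up before
    # transferring traffic to secondary infrastructure. The helper names
    # are hypothetical.
    MAX_ACCEPTABLE_LAG = 0.0   # require full synchronization before cutover

    def cut_over_when_synchronized(secondaries, get_replication_lag_seconds,
                                   switch_traffic_to):
        """Transfer traffic only once every secondary reports zero lag."""
        for node in secondaries:
            if get_replication_lag_seconds(node) > MAX_ACCEPTABLE_LAG:
                return False               # not safe yet; keep serving from primary
        switch_traffic_to(secondaries)     # all caught up: begin the transfer
        return True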
Posted Dec 01, 2020 - 12:08 EST
Update
Databases on LUNA have been stable for the past 30+ minutes, and caches are re-warming to restore full performance. We're continuing to monitor elevated processor activity on LUNA, as requests continue to be served directly without the aid of secondaries. Out of an abundance of caution, we will hold off on bringing the secondaries fully back online during the day, to avoid any potential recurrence of the connectivity issue that arose as the secondaries were brought back into service; that issue had not been encountered previously and is still being investigated.
Posted Dec 01, 2020 - 09:26 EST
Update
Databases on the LUNA cluster just began experiencing issues connecting to secondaries upon the completion of synchronization. We're actively working to restore access as we simultaneously investigate the cause of issues on this cluster.
Posted Dec 01, 2020 - 08:48 EST
Monitoring
Emergency maintenance has been completed, and we are monitoring the resynchronization of the secondaries so that we can begin offloading read-only traffic to them per normal operation. We will provide further updates when we have completed the monitoring phase of this event.
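For illustration only (not Slate's actual implementation), offloading read-only traffic usually amounts to routing reads to synchronized secondaries while all writes stay on the primary; the connection objects below are hypothetical:

    # Illustrative only: once secondaries are resynchronized, read-only
    # traffic can be routed to them while writes stay on the primary.
    # The connection objects and is_read_only flag are hypothetical.
    import itertools

    def make_router(primary, secondaries):
        """Return a connection chooser that offloads reads to secondaries."""
        rotation = itertools.cycle(secondaries)   # simple round-robin over replicas

        def route(is_read_only):
            # Writes must go to the primary; reads can be served by any
            # synchronized secondary to relieve load on the primary.
            return next(rotation) if is_read_only and secondaries else primary

        return route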
Posted Dec 01, 2020 - 07:35 EST
Identified
We have identified an issue with a component in the underlying storage infrastructure that supports LUNA, and we will be performing emergency maintenance this evening to restore system performance. There will be a brief service interruption for databases on LUNA while the transfer to secondary servers occurs, and all services are intended to remain online during this maintenance. We are continuing to monitor performance on LUNA and will continue to offload traffic, where possible, to help improve performance in advance of this emergency maintenance.

We have separately identified some backlogging of inbound messages into Slate Inbox. Although this backlog is a direct result of the decreased throughput to databases on LUNA, it may delay inbound delivery to databases on other clusters as well. Any remaining backlog of inbound message processing should clear once emergency maintenance begins this evening.

We thank you for your patience and understanding as we complete this emergency maintenance.
Posted Nov 30, 2020 - 16:22 EST
Investigating
We are investigating and working to remedy the cause of degraded performance for databases on the LUNA cluster. We continue to offload certain activities to secondary servers to relieve the increased load. Databases on other clusters are unaffected at this time, apart from periodic spillover of queueing at the web nodes, which may result in brief periods of slightly degraded performance.
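As an illustrative sketch only (not Slate's web tier), the "spillover of queueing" described here can be pictured as a bounded queue shared at a web node: when one cluster's slow requests occupy the queue, traffic for other clusters briefly waits. The names and sizes below are hypothetical:

    # Illustrative only: requests from every cluster pass through shared web
    # nodes. If LUNA requests complete slowly, the shared queue backs up and
    # other clusters see brief waits. The queue size and names are hypothetical.
    import queue

    web_node_queue = queue.Queue(maxsize=1000)   # bounded request queue on one web node

    def accept_request(request):
        """Queue a request at the web node; a full queue means callers wait briefly."""
        try:
            # Blocks for up to 2 seconds when the queue is saturated, which a
            # user experiences as slightly degraded (but not failed) performance.
            web_node_queue.put(request, timeout=2)
            return True
        except queue.Full:
            return False   # sustained spillover would surface as degraded service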
Posted Nov 30, 2020 - 14:52 EST
This incident affected: Slate.