Increased latency for databases on LUNA
Incident Report for Technolutions
Resolved
Since the downgrade and rollback of the Microsoft cumulative update, no further issues have been observed. We will forgo future cumulative updates until we can confirm that Microsoft has addressed the stability issues observed in this one. We were able to mitigate much of the impact: approximately 0.3% of requests yesterday failed as a result of this issue, and the remaining 99.7% succeeded without errors. The sporadic nature of the issue was disruptive nonetheless.

As part of our conservative, multi-stage roll-out strategy, cumulative updates are first installed onto test servers, and test environments have been running without issue under this cumulative update for the past month. The update has also been running on secondary read-replica servers for several weeks, similarly without issues. This suggests that the issue is triggered by a relatively rare set of circumstances, which we encountered soon after our incremental roll-out to a limited set of production servers.

We will continue to offload additional databases from LUNA, bringing it further in line with the database distribution practices we have developed to contain the impact of such issues should they occur in the future. Thank you again for your patience as we worked to restore normal operations.
Posted Mar 19, 2021 - 12:54 EDT
Update
We have successfully completed the downgrade and rollback of the Microsoft cumulative update, and we will continue to monitor to ensure that this resolves the periodic stability issues observed earlier on Thursday for databases on the LUNA cluster. Separately, we are continuing to migrate additional databases away from this cluster out of an abundance of caution.
Posted Mar 19, 2021 - 01:11 EDT
Update
Thank you for your patience as we continue to work towards both an immediate and a long-term solution to the impact caused by the intermittent stack dumps. We are actively preparing to downgrade from the most recent Microsoft cumulative update. At present, thanks to additional mitigation measures we have taken, databases on LUNA are impacted only periodically and briefly, and a downgrade during the day would likely have a larger impact on availability; therefore, the present plan is to perform the downgrade overnight. Meanwhile, we are actively migrating databases from LUNA to a new database server, which will also help to reduce any impact to databases still on LUNA. This migration changes direct SQL connections, and Slate Captains will be automatically notified of the new connection instructions immediately following the migration of their database.
Posted Mar 18, 2021 - 14:31 EDT
Update
We have just observed additional stack dumps on LUNA, and we are again investigating what may be triggering them. The cause appears to be a bug introduced in the latest SQL Server cumulative update.
Posted Mar 18, 2021 - 12:06 EDT
Monitoring
We are no longer seeing stack dumps on LUNA following the server restart. Stack dumps are caused by bugs and unhandled exceptions in the third-party database server software, and we will be preparing a bug report to submit to Microsoft based upon the log data collected this morning. At this time, all databases on LUNA are stable and operating within normal parameters. We will continue to monitor this cluster to ensure that there are no continued stability impacts from the third-party software update. We will also migrate a number of databases on the LUNA cluster to a new cluster this weekend to further reduce the impact and recovery time should any performance or stability issues occur again.
Posted Mar 18, 2021 - 11:25 EDT
Update
We have determined that the database server needs to be restarted to attempt to resolve a stability issue associated with a recent third-party update to the database server. Only databases on the LUNA cluster will be affected; they will be momentarily offline while the server restarts and they are brought back online.
Posted Mar 18, 2021 - 10:37 EDT
Investigating
We are investigating increased transaction latency for databases running on the LUNA cluster. Updates will be provided as our investigation continues and remediation steps are taken.
Posted Mar 18, 2021 - 10:29 EDT
This incident affected: Slate.