Service Incident in AWS us-east-1
Incident Report for Technolutions
Resolved
AWS has not yet provided root cause details on their service incident last night. The following recounts our observations and experience.

At approximately 10:22pm Eastern Time, our automated monitoring detected a service interruption affecting test environments. These test environments are not designed to achieve the same levels of high availability as production databases, and our attempts to bring them back online were unsuccessful because disk IO to the underlying storage infrastructure continued to hang. Approximately one hour later, AWS acknowledged that they were having issues with their storage infrastructure in their us-east-1 region.

The event then spread to other volumes, where disk IO began to hang. For some production databases, synchronization to the high availability secondary servers, which are located in different AWS availability zones, stopped, while other databases remained in sync. As the storage volumes for some of these production servers became further impacted, the clusters were unable to fail over to the high availability infrastructure: because some databases had entered a "not synchronizing" state, a failover could not occur without the potential loss of recent transactional activity. To minimize the overall duration of impact, we failed over the infrastructure that could fail over smoothly and suspended background processing jobs that were likely to fail during the service incident. By 5:32am Eastern Time, all services, including test environments, had fully recovered, and the suspended jobs were resumed. Over the course of the morning, we applied additional mitigations to address the overnight impacts of the AWS service incident.
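
For readers unfamiliar with the constraint described above: in a synchronous high availability configuration, a failover is only safe when the secondary copy of every database has caught up with the primary. The sketch below illustrates that kind of pre-failover check. It assumes a SQL Server Always On availability group queried over pyodbc; this post does not name the database platform or tooling involved, so the connection string, availability group name, and query here are illustrative assumptions rather than the actual procedure used.

```python
# Hypothetical pre-failover check: only report a failover as safe when every
# database on the availability group is fully synchronized on the secondary,
# since failing over with a database in a NOT SYNCHRONIZING state risks
# losing recent transactions. The connection string and availability group
# name are illustrative assumptions.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=primary.example.internal;"
    "DATABASE=master;"
    "Trusted_Connection=yes;"
)

def safe_to_fail_over(ag_name: str) -> bool:
    """Return True only if every secondary database replica in the
    availability group reports a SYNCHRONIZED state."""
    query = """
        SELECT dbcs.database_name, drs.synchronization_state_desc
        FROM sys.dm_hadr_database_replica_states AS drs
        JOIN sys.dm_hadr_database_replica_cluster_states AS dbcs
            ON drs.replica_id = dbcs.replica_id
           AND drs.group_database_id = dbcs.group_database_id
        JOIN sys.availability_groups AS ag
            ON drs.group_id = ag.group_id
        WHERE ag.name = ?
          AND drs.is_primary_replica = 0
    """
    with pyodbc.connect(CONN_STR) as conn:
        rows = conn.cursor().execute(query, ag_name).fetchall()
    # Any database that is not fully SYNCHRONIZED blocks a failover of the
    # whole group without the potential for data loss.
    return bool(rows) and all(state == "SYNCHRONIZED" for _, state in rows)

if __name__ == "__main__":
    if safe_to_fail_over("AG-Production-01"):
        print("All databases synchronized; failover can proceed.")
    else:
        print("Mixed synchronization state; holding failover to avoid data loss.")
```

In a mixed state like the one described above, a check of this kind returns False, and the cluster is left on its primary rather than risking the loss of recent transactional activity.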

While most production environments remained fully online and operational, some were impacted during this event by the unavailability of their storage infrastructure and by the mixed synchronization status of databases on the cluster, which prevented a failover to high availability infrastructure. Although this is an exceptionally rare event for AWS, we are continuing to look into ways to provide more robust failover capabilities should a future event cause a similar mix of synchronization states and service interruption.
Posted Sep 27, 2021 - 13:22 EDT
Monitoring
All services are fully online and operational at this point. Background services are currently processing the backlog of automated and scheduled jobs that accumulated during the AWS service outage, and jobs that experienced incident-related errors are being requeued. We are continuing to await further information from AWS regarding the root cause and any additional mitigations being taken. In the meantime, we will continue to monitor systems as the backlogged jobs complete.
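
As a rough illustration of the requeue step mentioned above, the sketch below marks jobs that errored during the incident window as queued again so background workers can retry them. The table name, column names, status values, and use of SQLite are hypothetical and do not reflect the actual job infrastructure.

```python
# Hypothetical requeue pass for jobs that failed during the incident window.
# The background_jobs table, its columns, and the status values are invented
# for illustration only.
import sqlite3

# Incident window taken from this report (Eastern Time), stored as ISO strings.
INCIDENT_START = "2021-09-26T22:22:00"
INCIDENT_END = "2021-09-27T05:32:00"

def requeue_incident_errors(conn: sqlite3.Connection) -> int:
    """Return errored jobs from the incident window to the queue.
    Returns the number of jobs requeued."""
    cur = conn.execute(
        """
        UPDATE background_jobs
        SET status = 'queued', error_message = NULL
        WHERE status = 'error'
          AND last_run_at BETWEEN ? AND ?
        """,
        (INCIDENT_START, INCIDENT_END),
    )
    conn.commit()
    return cur.rowcount
```
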
Posted Sep 27, 2021 - 06:04 EDT
Update
Several volumes are still pending recovery, which is taking longer than AWS had expected. Most services are online at this time, although some background processes remain suspended while the affected volumes are brought back online. Further updates will be provided as AWS completes their recovery.
Posted Sep 27, 2021 - 04:26 EDT
Update
AWS is in the process of rolling out a mitigation that they believe will restore access to affected volumes within the next hour. We are continuing to monitor systems to ensure full recovery.
Posted Sep 27, 2021 - 02:58 EDT
Update
AWS is continuing to work towards resolving an issue with a subsystem of their EBS service, which is impairing access to databases on several clusters, including test environments. We will update this status as further information is provided by AWS.
Posted Sep 27, 2021 - 00:24 EDT
Investigating
AWS is currently experiencing a networking incident in us-east-1 that is impairing access to several database volumes. We will provide updates as further information becomes available from AWS.
Posted Sep 26, 2021 - 23:37 EDT
This incident affected: Slate.