Service degradation (SFO)

This report is a post-mortem of the incident that occurred on July 20, 2018. The incident impacted all deployments in the SFO region, including deployment creation, aliasing, and access.

Degradation Factors

  • Deployments and aliases that had enabled the CDN were largely not affected
  • Many deployments saw elevated latencies but not complete loss of availability
  • Of the APIs, only those associated with creation and scaling were affected
  • We failed over our APIs and control panel quickly, allowing users to keep making changes and retain control and visibility

Timeline

At 05:15 UTC one of our systems received a spike in requests 20 times higher than its usual volume. Within a few moments, our on-call engineer received a pager notification due to the spike in requests and elevated error rates.
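
For context, a pager alert of this kind typically fires when request volume jumps well above its recent baseline while the error rate climbs at the same time. The sketch below is purely illustrative and not our production monitoring code; the `MetricWindow` shape, the 20x spike factor, and the error-rate threshold are assumptions.

```ts
// Illustrative sketch only -- not our production monitoring code.
// Assumes metrics are sampled into fixed windows; names and thresholds are hypothetical.
interface MetricWindow {
  requests: number; // requests observed in this window
  errors: number;   // 5xx responses observed in this window
}

// Page the on-call engineer when traffic jumps far above its recent
// baseline and the error rate rises at the same time.
function shouldPage(
  baseline: MetricWindow[],   // e.g. the previous hour, one window per minute
  current: MetricWindow,
  spikeFactor = 20,           // the incident saw roughly 20x normal request volume
  errorRateThreshold = 0.05,  // assumed 5% error-rate threshold
): boolean {
  const avgRequests =
    baseline.reduce((sum, w) => sum + w.requests, 0) / baseline.length;
  const errorRate = current.errors / Math.max(current.requests, 1);

  return current.requests > avgRequests * spikeFactor &&
    errorRate > errorRateThreshold;
}
```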

The sudden spike in load overwhelmed the system, leading to cascading failures and degraded unfreeze performance in our SFO1 datacenter. Response times suffered as our APIs dealt with the additional load.

At 06:16 UTC, after initial attempts to triage the cascading failures caused by the elevated load and to revive SFO1 were unsuccessful, we chose to perform a full failover to the BRU1 datacenter to reduce the impact. All traffic subsequently began flowing to BRU1.

With service partially restored by routing requests to BRU1, our engineers continued to triage the underlying issue and to bring SFO1 back online.

At 06:24 UTC we saw the first deployments successfully fail over to BRU1. Our engineers monitored BRU1 closely following the failover to make sure that deployments were responding to requests successfully.

At 07:20 UTC we identified the root cause and began working on a fix to bring SFO1 back online.

At 14:28 UTC our engineers completed a fix for SFO1 and brought it back online. At that point, we began gradually moving traffic back from BRU1.

At 14:39 UTC more than 25% of the traffic that had been redirected during the failover had been moved back to SFO1.

At 15:53 UTC 75% of the redirected traffic had been successfully migrated back to SFO1.

At 16:33 UTC our engineers confirmed that 100% of traffic had moved back to SFO1 and the service was fully restored.
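
The gradual shift described above follows a standard weighted-routing pattern: the share of traffic sent to the recovering datacenter is increased in steps, with error rates observed at each step before continuing. The sketch below illustrates the idea only; the `Router` interface, the soak time, and the rollback threshold are assumptions, not our actual traffic-management code.

```ts
// Illustrative sketch of step-wise traffic migration -- not our actual
// traffic-management code. The Router interface and step values are assumed.
interface Router {
  setWeights(weights: { sfo1: number; bru1: number }): Promise<void>;
  errorRate(datacenter: "sfo1" | "bru1"): Promise<number>;
}

// Shift traffic back to SFO1 in increasing steps, pausing between steps
// and rolling back if SFO1 starts returning too many errors.
async function migrateBack(router: Router): Promise<void> {
  const steps = [0.25, 0.75, 1.0]; // mirrors the 25% / 75% / 100% milestones
  const maxErrorRate = 0.01;       // assumed rollback threshold
  const soakMs = 10 * 60 * 1000;   // observe each step for 10 minutes (assumed)

  for (const share of steps) {
    await router.setWeights({ sfo1: share, bru1: 1 - share });
    await new Promise((resolve) => setTimeout(resolve, soakMs));

    if ((await router.errorRate("sfo1")) > maxErrorRate) {
      // Roll everything back to BRU1 and stop the migration.
      await router.setWeights({ sfo1: 0, bru1: 1 });
      throw new Error("SFO1 error rate too high; migration aborted");
    }
  }
}
```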