Service degradation (SFO1)

This report details the incident that occurred on February 9, 2018, impacting the availability of the ZEIT APIs and user deployments in the sfo1 data center.

As a critical part of our infrastructure, all load balancer nodes are extensively monitored. For example, one mechanism tests over 50 production deployments every 60 seconds from over 30 locations around the world.
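In spirit, each probing location runs a loop like the following minimal sketch (the deployment URLs, timeout, and alert action are hypothetical placeholders; the real system probes from many locations rather than a single host):

```python
# Minimal sketch of an external uptime probe. The URLs below are
# placeholders, not real test deployments.
import time
import urllib.request

DEPLOYMENT_URLS = [
    "https://test-deployment-1.example.com",
    "https://test-deployment-2.example.com",
]

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the deployment answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

while True:
    for url in DEPLOYMENT_URLS:
        if not probe(url):
            # In production this would page the on-call rotation.
            print(f"ALERT: {url} failed its health check")
    time.sleep(60)  # one probing round every 60 seconds
```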

For the deployment phase, we have procedures in place to ensure that the rollout of load balancer nodes is progressive and zero-downtime:

  • Load balancer nodes can be easily staged in a development and/or production environment so that they can be verified before production traffic hits them
  • Instances can be replaced one at a time, without dropping traffic, and verified independently
  • The deployment process can be halted and a rollback to the previous version can be initiated and completed in a few minutes, if needed
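To make the sequence above concrete, here is a minimal sketch of the one-node-at-a-time flow, assuming hypothetical replace/verify/rollback helpers in place of our actual tooling:

```python
# Illustrative stand-ins for the real provisioning and verification tooling.
def replace_node(node: str, version: str) -> None:
    print(f"draining {node} and installing {version}")

def verify_node(node: str) -> bool:
    print(f"verifying {node} before it receives production traffic")
    return True  # the real check runs the synthetic tests against this node

def rollback_node(node: str, version: str) -> None:
    print(f"rolling {node} back to {version}")

def rollout(nodes: list[str], new_version: str, old_version: str) -> bool:
    """Replace nodes one by one; halt and roll back if any verification fails."""
    replaced: list[str] = []
    for node in nodes:
        replace_node(node, new_version)
        if not verify_node(node):
            # Halt the rollout: restore this node and every node already replaced.
            for done in [node, *reversed(replaced)]:
                rollback_node(done, old_version)
            return False
        replaced.append(node)
    return True

rollout(["lb-sfo1-1", "lb-sfo1-2"], "v2", "v1")  # example node names
```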

However, this process has so far required significant discipline from the operator, leaving room for human error.

On this occasion, we made such an error while deploying a set of improvements: verification failed before the new nodes accepted production traffic, but we mistakenly proceeded with the production rollout regardless.

Roughly one and a half minutes after the faulty rollout began, at 21:40:18 UTC, the first alert was triggered, reporting that traffic to one of the monitored deployments mentioned above was being dropped. Roughly two minutes of investigation allowed us to identify that 50% of the load balancer pool was in an unhealthy state.

At 21:43:26 UTC we cancelled the ongoing deployment process and started rolling back to the previous version.

At 21:47:57 UTC we received the first notification of a passing test, meaning that at least one load balancer node was healthy and able to route traffic.

At 21:51:13 UTC we received the last notification of a passing test, meaning that all load balancer nodes were healthy and able to route traffic.

Time to detection was approximately one and a half minutes.

Traffic was partially restored in 7 minutes and 39 seconds. Total time to complete restoration was 10 minutes and 55 seconds (21:40:18 UTC to 21:51:13 UTC).
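For reference, both durations follow directly from the timestamps above:

```python
# Recomputing the restoration times reported in this section.
from datetime import datetime

fmt = "%H:%M:%S"
first_alert   = datetime.strptime("21:40:18", fmt)  # first failing-test alert
first_passing = datetime.strptime("21:47:57", fmt)  # first passing test
last_passing  = datetime.strptime("21:51:13", fmt)  # last passing test

print(first_passing - first_alert)  # 0:07:39 -> partial restoration
print(last_passing - first_alert)   # 0:10:55 -> complete restoration
```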

The primary takeaway from this incident is that we will further automate the deployment process for load balancer nodes, removing human error and variability from the process.
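As a rough sketch of what that automation could enforce, the production rollout would refuse to start unless staging verification has passed, with no operator override. The function names below are illustrative, not our actual tooling:

```python
# Illustrative only: a hard gate between staging verification and the
# production rollout, removing the manual "proceed anyway" decision.
class VerificationFailed(Exception):
    pass

def staging_tests_passed(version: str) -> bool:
    # The real check would run the synthetic tests against staged nodes.
    return False

def deploy_to_production(version: str) -> None:
    print(f"rolling out {version} to the production load balancers")

def gated_deploy(version: str) -> None:
    """Deploy to production only after staging verification passes."""
    if not staging_tests_passed(version):
        raise VerificationFailed(f"{version} failed staging verification")
    deploy_to_production(version)

try:
    gated_deploy("lb-v2")  # example version label
except VerificationFailed as err:
    print(f"deployment blocked: {err}")
```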