Investigating Issues with an Upstream Infrastructure Provider

This report is a post-mortem of the incident that occurred on the 6th of August, 2018. The incident impacted all API endpoints in the SFO1 and BRU1 regions and, by extension, the zeit.co dashboard.

Timeline

First Outage

At 06:53 UTC our engineers were alerted that the ZEIT APIs were not responding. The on-call engineers immediately investigated and found the cause to be the closure of ZEIT's account with an upstream infrastructure provider.

At 07:15 UTC we were in contact with the upstream provider to resolve the issue. It was determined that the access block had been triggered automatically and was a mistake.

At 07:45 UTC our support contact at the provider restored access to ZEIT's account. At this point, we began bringing our API endpoints back up across all datacenters.

Second Outage

At 19:22 UTC a support engineer at our infrastructure provider, while attempting to rectify problems from the earlier outage, inadvertently reinstated the same account closure that had caused it.

At 19:44 UTC our support contact at the infrastructure provider informed us that the block had again been applied automatically, as a side effect of the mitigations made for the previous outage. Access to our account was restored once more, and we began bringing our API endpoints back up.

Conclusion

Although creating and accessing deployments through our APIs was degraded, live dynamic deployments were unaffected, as were the CDN, the routing layer, and caches.

We maintain very active communication channels with all of our major infrastructure providers (Amazon, Google, and Microsoft). This incident was a combination of automation issues and human mistakes on their side. We are actively working with them to design processes that will eliminate this category of incident in the future, for all users of the cloud.