Upstream DNS Outage
This report is a post-mortem of the outage that occurred on January 28th and 29th, 2019 (UTC). The incident impacted users who point their domains to Cloudflare or who have the Cloudflare integration enabled, and only in the SFO1 region.
At 19:00 UTC we received alerts from our monitoring systems of elevated error rates for connections to domains that use Cloudflare. The on-call response team was notified and began an investigation immediately.
We immediately reached out to Cloudflare, who responded that our upstream DNS provider was causing the issue.
At 20:09 UTC we disabled Cloudflare integrations temporarily to restore availability in the SFO1 region.
At 20:26 UTC we received confirmation from Cloudflare that DNS issues were the root cause and started offering users help to move away from Cloudflare.
At 22:20 UTC we were in active communication with both Cloudflare and our upstream DNS provider, helping them investigate connectivity and peering issues between the two services.
At 03:30 UTC on the 29th we saw DNS resolution errors decline and service was fully restored.
We are still in close contact with both providers to investigate what happened exactly and to stop this from happening again in the future.
Some facts of the situation, as we move forward:
- ZEIT uses one of the most reliable and trusted (100% uptime) DNS networks for our load balancing (`alias.zeit.co`).
- Many of our customers use Cloudflare and manually configure it to route to ZEIT directly, by configuring the origin as `alias.zeit.co`.
- Many of our customers have enabled the Cloudflare integration via our Domains settings, which makes the above step automatic.
Customers using Cloudflare might have seen intermittent traffic routing errors in the SFO1 region (primarily impacting traffic from California).
The root cause was an IPv6 connectivity error between Cloudflare and the nameservers for `alias.zeit.co`. The issue was limited to the IPv6 addresses of the nameservers, and only in one specific geographical region.
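As an illustration of how a failure limited to one address family can be observed, here is a minimal Python sketch that sends the same DNS query to a nameserver's IPv4 and IPv6 addresses and reports which ones answer. It is not the tooling used during the incident. The function names (`build_query`, `group_by_family`, `probe`, `report`) are hypothetical, and the addresses you pass in are assumed to be the nameserver IPs you want to compare. Results depend on the vantage point the script runs from, which matters for a problem confined to one region.

```python
import ipaddress
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the A record of `hostname`."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    # Each label is encoded as a length byte followed by its ASCII bytes.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def group_by_family(addresses):
    """Split IP address strings into IPv4 and IPv6 groups."""
    groups = {"IPv4": [], "IPv6": []}
    for addr in addresses:
        version = ipaddress.ip_address(addr).version
        groups[f"IPv{version}"].append(addr)
    return groups


def probe(nameserver_ip: str, hostname: str, timeout: float = 2.0) -> bool:
    """Return True if the nameserver answers a UDP DNS query in time."""
    is_v6 = ipaddress.ip_address(nameserver_ip).version == 6
    family = socket.AF_INET6 if is_v6 else socket.AF_INET
    query = build_query(hostname)
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (nameserver_ip, 53))
            reply, _ = sock.recvfrom(512)
    except OSError:
        # Covers timeouts and "network unreachable" errors alike.
        return False
    # A valid reply has a full header and echoes the query id.
    return len(reply) >= 12 and reply[:2] == query[:2]


def report(nameserver_ips, hostname: str):
    """Map each address family to {address: reachable} results."""
    return {
        family: {addr: probe(addr, hostname) for addr in addrs}
        for family, addrs in group_by_family(nameserver_ips).items()
    }
```

Calling `report([...], "alias.zeit.co")` with both the IPv4 and IPv6 addresses of a nameserver would show `True` for one family and `False` for the other if only one path is broken, which is the pattern described above.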
Due to the nature of the bug and the number of actors involved, debugging and resolving the routing problem was not straightforward. For those customers whose domain we control because they point their nameservers to us, we rolled out a mitigation immediately. It's worth noting the mitigation is still active as we continue to prioritize uptime and availability.
For a very small percentage of customers who use Cloudflare directly and not through the integration provided by us, the exposure to downtime in the region was more prolonged.
As always, we encourage our customers to use our nameservers directly, which allows our tools and team to roll out updates efficiently at scale and to prioritize uninterrupted worldwide access to your deployments.