Elevated routing latency and errors

This report is a post-mortem of the incident that occurred on the 2nd of January, 2019. The incident impacted incoming traffic in all regions.

Timeline

All Now customers are protected by a firewall layer that defends against bad actors, including DDoS attacks and other forms of abuse. This layer is tightly integrated into the routing component that makes deployments work (both custom domains and the `*.now.sh` namespace).

At 05:59 UTC we received alerts from our monitoring systems about elevated error rates. The on-call response team was notified and began investigating immediately.

As the outage progressed, the packet counter metrics (network traffic coming in and out of the routing tier) showed that more and more traffic was being dropped.

This turned our attention to the firewall component, which works by blocking ranges of IPs and gossiping them to other nearby clusters. Because a whitelist was not correctly applied, the routing layer blocked a critical IP range that included legitimate traffic.
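
To illustrate the kind of check involved, here is a minimal Python sketch of a guard that refuses to block any range overlapping a whitelisted one. The function name `should_block` and the example ranges (taken from address space reserved for documentation) are hypothetical; this is not the actual firewall code.

```python
import ipaddress
from typing import Iterable


def should_block(candidate: str, whitelist: Iterable[str]) -> bool:
    """Return True only if the candidate range overlaps no whitelisted range."""
    block = ipaddress.ip_network(candidate)
    for entry in whitelist:
        protected = ipaddress.ip_network(entry)
        # Any overlap with a protected range means the block must be rejected.
        if block.overlaps(protected):
            return False
    return True


# Hypothetical whitelist of a single protected range.
whitelist = ["198.51.100.0/24"]
print(should_block("203.0.113.0/24", whitelist))     # unrelated range
print(should_block("198.51.100.128/25", whitelist))  # inside the protected range
```

If the whitelist passed to such a check is empty or stale, every candidate range is accepted for blocking, which matches the failure mode described above.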

Upon discovering the malfunction in the firewall logs, we proceeded to propagate a fix to the whitelist.

At 06:27 UTC the routing functionality was fully restored in all regions.

Conclusion

As further remediation, we are ensuring that the critical ranges of the whitelists are hardcoded and cannot be altered over time.
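
One way to picture this remediation is the following Python sketch, in which the critical ranges live in an immutable constant and are always merged into whatever whitelist is configured at runtime. The names `CRITICAL_RANGES` and `effective_whitelist` and the ranges themselves are assumptions made for illustration, not the real implementation.

```python
import ipaddress

# Hypothetical critical ranges, stored as an immutable tuple in the source code.
CRITICAL_RANGES = (
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
)


def effective_whitelist(dynamic_entries):
    """Combine runtime-configured entries with the hardcoded critical ranges."""
    dynamic = tuple(ipaddress.ip_network(entry) for entry in dynamic_entries)
    # The critical ranges are always present, whatever the dynamic part contains.
    return CRITICAL_RANGES + dynamic


# Even with no dynamic configuration, the critical ranges are still protected.
print(effective_whitelist([]))
```

With this shape, a missing or misapplied dynamic whitelist can no longer remove protection from the critical ranges.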