VPC Instance Unavailability
All systems are normal.
This concludes the outage. We will continue to update you on the improvements we are making to prevent similar issues from arising.
Availability has been restored. We are closely monitoring all services for potential regressions or interruptions.
We have determined the root cause of the unhealthy nodes and patched the issue. The unfreeze backlog is currently being processed, and new deployments for paid accounts will resume within the hour. OSS deployments will follow once the paid deployment backlog has stabilized.
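To make the ordering above concrete, here is a minimal, purely illustrative sketch of draining a deployment backlog with paid-account work ahead of OSS work. The `job` type, its fields, and the `drain` helper are hypothetical stand-ins, not our actual scheduler.

```go
package main

import (
	"fmt"
	"sort"
)

type job struct {
	id   string
	paid bool // paid-account deployments are processed before OSS ones
}

func drain(backlog []job) {
	// Stable sort keeps submission order within each tier.
	sort.SliceStable(backlog, func(i, j int) bool {
		return backlog[i].paid && !backlog[j].paid
	})
	for _, j := range backlog {
		fmt.Printf("processing %s (paid=%v)\n", j.id, j.paid)
	}
}

func main() {
	drain([]job{
		{id: "oss-1", paid: false},
		{id: "acct-7", paid: true},
		{id: "acct-3", paid: true},
	})
}
```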
We encountered several failures, ultimately resulting in our largest outage to date. The post-mortem will include the lessons learned, along with the fixes we applied both reactively (to resolve the outage) and preventively (to guard against another outage of this nature in the future).
The system is still unstable but improving. We will provide another status update once we re-enable deployment-related API access (new deployments).
Capacity has been increased to an acceptable level; we have begun processing the backlog for new deployments.
We will re-enable new deployments once we determine the system has returned to a more stable state.
The upstream service has lifted its connection limits.
An upstream service began dropping connections due to a misconfigured connection count upper limit, surfacing a bug within our infrastructure that caused a portion of our SFO cluster to become unavailable.
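As a rough illustration of this failure mode, the sketch below assumes a proxy-style cap on concurrent connections: when the configured upper limit is too low for actual traffic, connections beyond the cap are closed immediately, which is how the drops would surface downstream. The limit value and server shape are assumptions for illustration, not the upstream provider's actual configuration.

```go
package main

import (
	"io"
	"log"
	"net"
)

const maxConns = 64 // a misconfigured (too low) upper limit reproduces the drops

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	sem := make(chan struct{}, maxConns) // counting semaphore for live connections

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println("accept:", err)
			continue
		}
		select {
		case sem <- struct{}{}: // capacity available: handle the connection
			go func(c net.Conn) {
				defer func() { c.Close(); <-sem }()
				io.Copy(io.Discard, c) // placeholder for real proxy work
			}(conn)
		default: // over the limit: the connection is dropped
			conn.Close()
		}
	}
}
```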
Attempts to flush the backlog have been overloading the system, so we have been incrementally increasing cluster-wide capacity.
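A hedged sketch of this incremental ramp-up, assuming a simple worker pool: rather than flushing the whole backlog at once (which overloaded the system), worker concurrency is raised in steps with a pause between increases. The step size, interval, and queue here are illustrative stand-ins, not our real tooling.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func worker(id int, backlog <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for item := range backlog {
		fmt.Printf("worker %d processed %s\n", id, item)
		time.Sleep(10 * time.Millisecond) // placeholder for real deployment work
	}
}

func main() {
	backlog := make(chan string, 100)
	for i := 0; i < 100; i++ {
		backlog <- fmt.Sprintf("deployment-%d", i)
	}
	close(backlog)

	var wg sync.WaitGroup
	const maxWorkers, step = 16, 4
	for started := 0; started < maxWorkers; started += step {
		for id := started; id < started+step; id++ {
			wg.Add(1)
			go worker(id, backlog, &wg)
		}
		time.Sleep(200 * time.Millisecond) // hold before the next capacity increase
	}
	wg.Wait()
}
```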
New deployments, as well as unfreezing existing deployments, are currently disabled. Deployments that are currently unfrozen (and that remain unfrozen) should still be available.