At 2:48 PM PST on Wednesday, December 9th, an incorrectly generated release manifest was deployed to the BitGo production environment. This reconfigured a number of platform components in a way that prevented them from communicating with several service components.
(All times stated here are Pacific Standard Time)
2:48 PM - Release manifest is deployed to production.
2:50 PM - Automated monitoring alerts on service disruptions.
2:54 PM - An internal incident is announced and remediation efforts commence.
3:40 PM - Service health is completely restored.
Services that support our wallet platform were primarily impacted, which had a cascading effect on ancillary services (i.e., blockchain APIs were unavailable).
The service disruption was caused solely by a deployment misconfiguration and did not result in any data loss or corruption. This outage did not interact with or impact any systems that process funds or currency.
A generated release manifest was missing a set of production configurations. When deployed, this manifest caused a number of services to update with incorrect settings, which disrupted their communication with other services.
Our incident response team was immediately alerted to the issues with these services and began assessing the problem. Within minutes we identified the cause and began taking steps to resolve it:
Immediately, we are refactoring a portion of our release process that requires human intervention so that we remove the possibility of incorrectly generating our production release manifests. This semi-manual aspect of our process is temporary and we’re actively working on improvements to include more automation as part of our Q1 engineering roadmap.
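As an illustration of the kind of automated safeguard that can catch a manifest with missing production configurations before it is deployed, here is a minimal sketch. It is not BitGo's actual release tooling: the manifest layout (a `services` mapping), the required setting names, and the function names are all hypothetical and chosen only for this example.

```python
from typing import Dict, List, Set

# Hypothetical settings that every production service must define.
# The real list would come from the release process itself.
REQUIRED_PRODUCTION_KEYS: Set[str] = {
    "environment",
    "service_endpoints",
    "credentials_ref",
}


def find_missing_settings(
    manifest: Dict[str, dict],
    required_keys: Set[str] = REQUIRED_PRODUCTION_KEYS,
) -> Dict[str, List[str]]:
    """Return {service name: sorted missing keys} for incomplete services."""
    problems: Dict[str, List[str]] = {}
    for service, settings in manifest.get("services", {}).items():
        missing = sorted(required_keys - set(settings))
        if missing:
            problems[service] = missing
    return problems


def assert_deployable(manifest: Dict[str, dict]) -> None:
    """Raise ValueError instead of letting an incomplete manifest ship."""
    problems = find_missing_settings(manifest)
    if problems:
        raise ValueError(f"manifest rejected, missing settings: {problems}")
```

Run as a gate in a deployment pipeline, a check like this would stop the release at the validation step rather than after services have already picked up incorrect settings.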