Platform Issues
Incident Report for BitGo
Postmortem

Overview

At 2:48 PM PST on Wednesday, December 9th, an incorrectly generated release manifest was deployed to the BitGo production environment. This reconfigured a number of platform components in a way that prevented them from communicating with several service components.

Timeline

(All times stated here are Pacific Standard Time)

2:48 PM - Release manifest is deployed to production.

2:50 PM - Automated monitoring alerts on service disruptions.

2:54 PM - An internal incident is announced and remediation efforts commence.

3:40 PM - Service health is completely restored.

Impact

Services that support our wallet platform were primarily impacted, resulting in a cascading effect on ancillary services (for example, blockchain APIs became unavailable).

The service disruption was caused solely by a deployment misconfiguration and did not result in any data loss or corruption. This outage did not interact with or impact any systems that process funds or currency.

Root Cause Analysis

A generated release manifest was missing a set of production configurations. When deployed, this manifest caused a number of services to update with incorrect settings, which disrupted their communication with other services.

Mitigation

Our incident response team was immediately alerted to the issues with these services and began assessing the problem. Within minutes we identified the cause and began taking steps to resolve it:

  • A number of misconfigured service components were removed or rolled back
  • Release manifests were properly regenerated with production configuration and redeployed
  • System components were closely monitored until service health was fully restored

Future Remediation

As an immediate step, we are refactoring the portion of our release process that requires human intervention, so that production release manifests can no longer be generated incorrectly. This semi-manual aspect of our process is temporary, and we’re actively working on improvements that add more automation as part of our Q1 engineering roadmap.
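
As an illustration of the kind of automated safeguard this implies, the following is a minimal sketch of a pre-deploy check that rejects a manifest when any service is missing required production settings. It is not BitGo's actual tooling: the manifest layout (a "services" mapping with a per-service "config" mapping), the key names in REQUIRED_PRODUCTION_KEYS, and both function names are assumptions made purely for the example.

```python
# Hypothetical pre-deploy gate. The manifest shape and the required key
# names below are illustrative assumptions, not BitGo's real configuration.
REQUIRED_PRODUCTION_KEYS = {"service_discovery_url", "database_host", "tls_cert_path"}


def find_missing_config(manifest: dict) -> dict:
    """Return {service_name: sorted list of missing keys} for incomplete services."""
    problems = {}
    for name, service in manifest.get("services", {}).items():
        # Treat an absent or null "config" entry as an empty configuration.
        config = service.get("config") or {}
        missing = REQUIRED_PRODUCTION_KEYS - set(config)
        if missing:
            problems[name] = sorted(missing)
    return problems


def assert_deployable(manifest: dict) -> None:
    """Raise ValueError if any service lacks required production configuration."""
    problems = find_missing_config(manifest)
    if problems:
        details = "; ".join(
            f"{svc}: {', '.join(keys)}" for svc, keys in sorted(problems.items())
        )
        raise ValueError(f"manifest rejected, missing production config: {details}")
```

Run as a required step before deployment, a check of this kind would stop a manifest with missing production configuration from reaching production, regardless of how the manifest was generated.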

Posted Dec 10, 2020 - 17:44 PST

Resolved
This incident has been resolved.
Posted Dec 09, 2020 - 16:25 PST
Update
We are continuing to work on a fix for this issue.
Posted Dec 09, 2020 - 15:24 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Dec 09, 2020 - 15:22 PST
Investigating
We are currently investigating this issue.
Posted Dec 09, 2020 - 15:05 PST
This incident affected: API, Wallets, Settlement, Portfolio & Tax, Reporting and Digital Assets (Ripple).