Long-running database index causing degraded performance
Incident Report for BitGo
Postmortem

On January 6th at 9:30am PST, we deployed a new index to improve performance for one of our largest collections of data. Shortly after, our internal alerting notified us of slow performance in our wallet services. The index creation was intended to run in the background and have only a nominal impact on platform performance. However, we misjudged the impact of this change, and it began pushing many customer requests beyond their timeout limits.

Once we identified the cause, we quickly evaluated a few different paths to resolving the performance hit and decided to optimize the IOPS for the volume where the index was being created. We chose this path primarily because it avoided a complete platform outage and kept us safe from a number of unknowns, including the possibility of data corruption. The index was required for a number of ongoing and future projects on our roadmap for improved performance.

We do not take this impact on service lightly, and we have conducted a thorough post-mortem to improve how we review the impact of these indexes and when we apply them. BitGo serves customers all over the world, so our off-hours window is very small, but we assure you that we are taking steps to make sure this kind of operation does not have as large an impact on performance or affect as wide a customer base.

Posted Jan 08, 2021 - 08:57 PST

Resolved
Service has now been fully restored. We will continue to monitor for any further degradation.
Posted Jan 06, 2021 - 15:39 PST
Update
We are seeing performance improvements at this time, and are continuing to work toward full restoration. Please stay tuned here for further updates.
Posted Jan 06, 2021 - 15:08 PST
Update
We are continuing to monitor this situation, and will continue to provide updates here as they become available.
Posted Jan 06, 2021 - 14:37 PST
Update
We are still monitoring this situation as the DB index creation is still running. Please watch here for further updates.
Posted Jan 06, 2021 - 13:45 PST
Update
Currently we are seeing some increased degradation of performance. We are continuing to monitor this.
Posted Jan 06, 2021 - 12:53 PST
Monitoring
Performance has significantly improved and we are seeing far fewer 5xx errors. Index creation is still ongoing, but we seem to be getting progressively closer to full restoration of service.

We will keep you posted with further updates as they become available.
Posted Jan 06, 2021 - 12:09 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 06, 2021 - 11:32 PST
Update
We are continuing to investigate this issue, and will keep you posted here with updates. We do anticipate that we will continue to see degraded performance for a while, and will provide further updates as timeframes become clearer.
Posted Jan 06, 2021 - 10:51 PST
Investigating
We are currently investigating this issue.
Posted Jan 06, 2021 - 10:03 PST
This incident affected: API, Wallets, Settlement, Portfolio & Tax, Reporting and Digital Assets (Bitcoin (V1), Bitcoin, Litecoin, Bitcoin Cash, Bitcoin SV, Ethereum, Dash, ZCash, Ripple, Stellar, Algorand, EOS, Tron, Bitcoin Gold).