At 7:00 AM PST on December 15th, 2020, an internal BTC indexer hung while processing a network transaction, which caused it to fall behind chainhead. This temporarily prevented the BitGo platform from processing BTC transactions.
(All times stated here are Pacific Standard Time)
7:00 AM - Initial BTC indexer disruption.
7:27 AM - Status page incident announced.
8:58 AM - Platform service restored.
9:21 AM - Finished reprocessing transaction backlog.
9:26 AM - Status page incident resolution announced.
The outage affected our ability to index and process new BTC network transactions, but did not result in any data loss or corruption. It did not interact with or impact any systems that handle funds or currency.
An internal BTC indexer failed to process a network transaction, which caused an internal retry to back off exponentially while also deadlocking. The indexer ended up in a hung state and was unable to process additional network transactions.
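For illustration only, the failure mode described above (a retry that keeps backing off without ever giving up) is commonly avoided by giving the retry a fixed budget, so a persistent failure surfaces as an error instead of a hang. The sketch below is not the indexer's actual code; `retry_with_backoff`, `RetryBudgetExceeded`, and all parameter values are hypothetical names and defaults chosen for the example.

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when an operation keeps failing past its retry budget (hypothetical)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempt budget is spent.

    The delay doubles after each failure but is capped at `max_delay`,
    and the final failure is re-raised so the caller can alert on it
    rather than waiting indefinitely.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise RetryBudgetExceeded(
                    f"gave up after {attempt} attempts") from exc
            sleep(min(delay, max_delay))  # capped exponential backoff
            delay *= 2
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real waiting. A bounded retry does not prevent a deadlock by itself; it only ensures that a transaction that cannot be processed produces a visible failure.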
The BTC indexer was gracefully shut down, and a database snapshot was taken to expedite reindexing. The BTC indexer was then restarted and monitored while it finished reindexing, starting from where the snapshot left off and ending at chainhead. Once the indexer reached chainhead, several pending transactions were reprocessed.
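As a rough sketch of the catch-up step, assuming the snapshot records the last fully indexed block height, reindexing can be thought of as a loop from that height to chainhead. The function and callback names here (`catch_up`, `get_chainhead_height`, `index_block`) are hypothetical and do not describe the real indexer's interface.

```python
def catch_up(snapshot_height, get_chainhead_height, index_block):
    """Index blocks after `snapshot_height` until chainhead is reached.

    Chainhead is re-read on every iteration because new blocks can
    arrive while the backlog is still being processed.
    Returns the height of the last block indexed.
    """
    height = snapshot_height
    while height < get_chainhead_height():
        height += 1
        index_block(height)
    return height
```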
We have already added alerting on the internal process state that preceded this incident. Additionally, we are migrating to a much more robust and fault-tolerant indexer implementation.
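The report does not specify which internal process state the new alerting watches. As one possible shape for such a check, the sketch below flags an indexer that has fallen too far behind chainhead or has stopped making progress. `IndexerStatus`, `check_indexer`, and both thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class IndexerStatus:
    """Hypothetical snapshot of indexer health, gathered by a monitor."""
    indexed_height: int            # last block the indexer has processed
    chainhead_height: int          # current tip of the chain
    seconds_since_progress: float  # time since indexed_height last advanced


def check_indexer(status, max_lag_blocks=3, max_stall_seconds=1800.0):
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    lag = status.chainhead_height - status.indexed_height
    if lag > max_lag_blocks:
        alerts.append(f"indexer is {lag} blocks behind chainhead")
    # A stall only matters when there is still work left to index.
    if lag > 0 and status.seconds_since_progress > max_stall_seconds:
        alerts.append(
            f"no indexing progress for {status.seconds_since_progress:.0f}s")
    return alerts
```

Checking for stalled progress in addition to block lag is intended to catch a hung process early, before the backlog grows large enough to trip the lag threshold.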