On January 6th at 9:30am PST, we deployed a new index to improve performance for one of our largest collections of data. Shortly after, our internal alerting notified us that our wallet services were performing slowly. The index creation was intended to run in the background and have only a minimal impact on platform performance. However, we misjudged the impact of this change, and it began pushing many customer requests beyond their timeout limits.
Once we identified the cause, we evaluated a few different paths to resolving the performance hit and decided to optimize the IOPS for the volume where the index was being created. The primary reason we chose this path was that it would avoid a complete platform outage and keep us safe from a number of unknowns, including the possibility of data corruption. This index is required for a number of ongoing and future projects on our roadmap to improve performance.
We do not take this impact on service lightly, and we’ve conducted a thorough post-mortem to improve how we review the impact of these indexes and when we apply them. BitGo serves customers all over the world, so our off-hours window is very small, but we assure you we are taking steps to make sure this kind of operation does not have as large an impact on performance or affect as wide a customer base.