Date & Time: 2016-11-24 14:40 CST
2:23pm CST: We were alerted to a service interruption and immediately began diagnosing.
2:34pm CST: Service was partially restored, but intermittent service interruptions continued for the next hour.
3:20pm CST: Service was fully restored for all but one of our users, whose service remained only partially restored. Some of that user's traffic was rerouted to keep the rest of the system stable.
4:45pm CST: We determined that the "smoking gun" was a misdirection, but that led us to what we believed to be the true cause of the interruption: our main caching cluster. We began testing a scaled-up caching cluster and modifying our application code to better take advantage of the full cluster.
10:05pm CST: After fully testing the improvements, we were able to fully restore traffic to the account that remained at only partial service. Our determination was that a large traffic spike, combined with insufficient resources in one of our caching clusters, caused the incident earlier in the day.
We are continuing to monitor the situation, but the root cause was insufficient resources in our primary caching cluster. The problem was exacerbated by two others: First, one of our monitoring tools reported the problem as being in a different caching system, which sent us looking in the wrong direction. Second, the alerts we had configured on the main caching cluster were never triggered, because their thresholds were set higher than they should have been (at levels appropriate for smaller, lower-powered servers).
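To illustrate the second problem, here is a minimal, hypothetical sketch (the numbers and function names are illustrative, not our actual monitoring configuration): an alert whose threshold is sized for a different machine class can sit above anything the node can ever reach, so it stays silent even when the node is saturated.

```python
def should_alert(memory_used_gb: float, threshold_gb: float) -> bool:
    """Fire an alert when cache memory usage crosses the threshold."""
    return memory_used_gb >= threshold_gb

# Hypothetical numbers: a threshold sized for a larger machine class
# (48 GB) can never trigger on a node with only 32 GB of RAM, even
# when that node is completely out of memory.
NODE_RAM_GB = 32.0
MISCONFIGURED_THRESHOLD_GB = 48.0
RIGHT_SIZED_THRESHOLD_GB = 0.85 * NODE_RAM_GB  # alert at 85% of capacity

saturated = NODE_RAM_GB  # node fully saturated
print(should_alert(saturated, MISCONFIGURED_THRESHOLD_GB))  # never fires
print(should_alert(saturated, RIGHT_SIZED_THRESHOLD_GB))
```

The fix in a case like this is simply to re-derive thresholds from each node's actual capacity rather than carrying them over from a different server class.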
We've addressed the root cause by correcting the monitoring, improving our usage of the cache cluster to take advantage of all nodes, and scaling the cluster to better handle the traffic we're seeing from this high-traffic user.
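As a rough sketch of the kind of change described above (the node names and hashing scheme are illustrative, not our actual topology): instead of letting traffic pile onto a subset of the cluster, each cache key can be hashed to pick a node deterministically, spreading load across every node.

```python
import hashlib

# Hypothetical cluster topology for illustration only.
NODES = ["cache-1", "cache-2", "cache-3", "cache-4"]

def node_for_key(key: str, nodes=NODES) -> str:
    """Pick a cache node deterministically from the key's hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The same key always maps to the same node, so cached values stay
# findable, while different keys spread across the whole cluster.
used = {node_for_key(f"session:{i}") for i in range(1000)}
print(sorted(used))  # with enough keys, every node receives traffic
```

Simple modulo hashing like this reshuffles most keys when a node is added or removed; consistent hashing avoids that, but the basic idea of using all nodes is the same.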
We apologize for the service interruption. We take our uptime incredibly seriously, and will continue to improve our systems.