Monday, 2016-12-26

FoxyCart Service Caching Cluster Issue

4:37pm CST: We're currently investigating an outage that appears to be related to a failed cache cluster node. We'll update shortly.

4:46pm CST: We restored service.

5:00pm CST: We're now investigating the root cause of the problem.

5:12pm CST: We just received confirmation that the underlying AWS hardware failed. We'll continue working to determine what improvements can be made to our processes and infrastructure to prevent this in the future. The underlying hardware for one of our Redis read replicas was taken offline at 2016-12-26 22:39 UTC, at which point the node had already been experiencing problems. We're waiting on AWS to confirm a few things before we update further.

10:27pm CST: We have confirmed that one of our cache cluster nodes suffered a complete network failure and was automatically replaced. This was an emergency (unscheduled) replacement performed by AWS. Though the automatic replacement worked, service was restored, and we were able to monitor the situation throughout, we are still working to improve how we handle failures like this in the future.
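For those curious, one general way an application can protect cache reads against a failed node is to fall back from a read replica to the primary at the client level. The sketch below is purely illustrative and not our production code; it assumes the Python redis-py client, and the hostnames and key are hypothetical placeholders.

    import redis

    # Hypothetical endpoints (placeholders, not our actual infrastructure).
    PRIMARY = {"host": "redis-primary.example.internal", "port": 6379}
    REPLICA = {"host": "redis-replica.example.internal", "port": 6379}

    def cached_get(key):
        """Try the read replica first; fall back to the primary if the
        replica is unreachable (e.g. its underlying hardware fails)."""
        for endpoint in (REPLICA, PRIMARY):
            client = redis.Redis(socket_timeout=0.5,
                                 socket_connect_timeout=0.5, **endpoint)
            try:
                return client.get(key)
            except (redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError):
                continue  # this node is down; try the next endpoint
        raise RuntimeError("all cache endpoints unreachable")

    # Example usage: value = cached_get("session:abc123")

The short socket timeouts matter here: without them, a node with a dead network can leave requests hanging instead of failing over quickly.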