Thursday, 2015-11-26

FoxyCart Service Outage due to Traffic Spike

Thursday 2:15pm PST: Our systems experienced extraordinarily high load, resulting in extremely slow load times. We're currently working to restore service.

2:48pm PST: We've restored service to all of our users but one. This particular user experienced a massive spike in traffic, and we're currently routing their traffic away from our application environment via DNS until we can put additional infrastructure in place.

3:48pm PST: We have fully restored service to the one user who continues to send massive traffic. We're continuing to monitor.

4:30pm PST: Even with the additional resources in place, another traffic spike from this user caused problems again.

4:44pm PST: We removed the user from our production traffic and have restored service. We're exploring options.

7:20pm PST: We have restored full service to the one user's accounts, and have isolated all of their accounts on a new load balancer and web server. We've also identified a performance improvement that should make better use of this new infrastructure, and we're testing that change now.

Summary

Our systems saw a record number of sustained requests yesterday, starting early in the day and continuing throughout. Initially our systems handled the load fine, but at 2:15pm PST requests started to back up, and though we were still serving traffic, responses were so slow that service was effectively interrupted.

We have one user in particular who has multiple FoxyCart accounts, configured in such a way that their traffic spikes generate 5x the normal number of FoxyCart requests on our end. They also receive a lot of traffic in general. We made the call to blackhole their traffic, which restored service for all of our other users. We were then able to add back 2 of their 5 accounts while we tested some improvements.

After considerable discussion, we moved this user to their own dedicated, isolated webserver instance. This not only keeps their traffic from impacting other users, but also gives us more raw capacity to meet their needs.
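
To give a rough picture of what that isolation looks like, here's a minimal sketch (the account name and host names below are made up for illustration; this isn't our actual routing configuration). The idea is simply that accounts flagged as isolated get their own backend pool instead of the shared one:

    # Illustrative only: account names and pool addresses are hypothetical.
    SHARED_POOL = ["app-1.internal", "app-2.internal", "app-3.internal"]

    DEDICATED_POOLS = {
        # Accounts isolated onto their own load balancer / web server.
        "high-traffic-store": ["isolated-app-1.internal"],
    }

    def backend_pool(account):
        """Send isolated accounts to their dedicated pool so their traffic
        can't slow down requests for everyone else."""
        return DEDICATED_POOLS.get(account, SHARED_POOL)

    print(backend_pool("high-traffic-store"))  # ['isolated-app-1.internal']
    print(backend_pool("any-other-store"))     # shared pool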

Further, we discovered a webserver setting (a max worker limit) that wasn't properly tuned. When we migrated our infrastructure to AWS in September, we initially used smaller instance sizes, and the automation provider we use had a bug in their config generation that didn't scale this setting up as we moved to larger hardware. We've created our own fix for that while our automation partner works on their own bug fix. Though we haven't yet had a chance to fully stress test our environment, we believe this fix will be instrumental to our continued performance.
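
To illustrate the kind of fix we're talking about (we haven't named our webserver or automation tooling above, so the directive name and numbers here are assumptions rather than our production values), here's a minimal Python sketch of config generation that scales the max-worker limit with the size of the instance instead of leaving it pinned to a value tuned for smaller hardware:

    import multiprocessing

    # Hypothetical tuning values; real ones depend on the webserver and
    # on per-worker memory use.
    WORKERS_PER_CORE = 4
    HARD_CAP = 256

    def max_workers(cpu_count=None):
        """Scale the worker limit with the instance's CPU count."""
        cpus = cpu_count or multiprocessing.cpu_count()
        return min(cpus * WORKERS_PER_CORE, HARD_CAP)

    def render_config(cpu_count=None):
        """Emit an Apache-style MaxRequestWorkers line (assumed syntax)."""
        return "MaxRequestWorkers %d\n" % max_workers(cpu_count)

    print(render_config(4), end="")   # MaxRequestWorkers 16
    print(render_config(16), end="")  # MaxRequestWorkers 64

The point is simply that the generated value grows as the hardware grows, which is what the config generation had failed to do.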

Today is Black Friday, and as I write this we're seeing more traffic than at any point in our 8-year history. The changes we made yesterday are holding up very well so far, but we're continuing to monitor to make sure we stay ahead of any problems.