Friday, 2015-10-09

FoxyCart Service Slow Load Times, Service Interruption

6:07pm PDT: We experienced a very high traffic surge, and are working to restore normal performance. We will update here as soon as we have info.

6:28pm PDT: We're continuing to work on this issue. We anticipated high traffic (due to one of our users having a large promotion event) and tested with more traffic than we're seeing right now, but somehow our testing didn't catch a bottleneck. We're still working on it.

6:32pm PDT: We're back up but continuing to work on stabilizing performance.

6:32pm PDT: Performance has been restored. We'll update as we have more info. We identified a bottleneck in one of our caching layers, but are not sure yet why our earlier stress testing didn't reveal this bottleneck.

Summary of Tonight's Events

Preparation for Increased Traffic:

One of our users received a significant amount of traffic in a very small window this evening, starting at about 6:02pm PDT. This user alerted us weeks ahead of time, and we prepared by implementing a fresh round of load testing to ensure we were ready for the anticipated traffic.

As part of this load testing, we did the following:

  • Tested roughly 6x the anticipated peak traffic. (This testing didn't take advantage of our CDN, and hit the most resource-intensive requests. As such, though the traffic itself was only 6x the maximum anticipated traffic, the actual load placed on our systems was roughly 25x what we anticipated seeing.)
  • Added additional web servers to our cluster to further reduce response times, just in case our testing was insufficient.
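To make the 6x versus 25x arithmetic above concrete, here is a rough Python sketch. Only the 6x and 25x figures come from the testing described above. The CDN offload fraction and per-request cost ratio are hypothetical placeholders chosen so the numbers line up, not measured values, and the function names are ours for illustration.

```python
def load_amplification(traffic_multiple: float, load_multiple: float) -> float:
    """How much heavier the average test request was than an anticipated one."""
    return load_multiple / traffic_multiple


def effective_load_multiple(traffic_multiple: float,
                            cdn_offload: float,
                            cost_ratio: float) -> float:
    """Estimate load relative to the anticipated peak for a test that bypasses the CDN.

    cdn_offload: HYPOTHETICAL fraction of live requests the CDN would absorb (0 <= x < 1).
    cost_ratio:  HYPOTHETICAL cost of a tested request relative to an average
                 live request that reaches the origin servers.
    """
    if not 0 <= cdn_offload < 1:
        raise ValueError("cdn_offload must be in [0, 1)")
    return traffic_multiple * cost_ratio / (1 - cdn_offload)


# Figures from the post: 6x the traffic produced roughly 25x the load.
amplification = load_amplification(6, 25)

# One assumed combination that reproduces the 25x figure.
estimate = effective_load_multiple(6, cdn_offload=0.4, cost_ratio=2.5)

print(f"per-request amplification: {amplification:.2f}x")
print(f"estimated load multiple:   {estimate:.1f}x")
```

Each test request therefore counted for a little over four anticipated requests. Many other offload and cost combinations would give the same total.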

In all our testing, our servers performed beautifully, maintaining low load and reasonable response times under massive numbers of requests.

Live Traffic, Round 1:

But when the live traffic hit, our systems behaved differently than they did in testing. The number of requests was well under what we'd tested, even at its peak. Further, all our systems maintained low load even as requests were slow or failing.

We already had multiple team members monitoring the situation, and we immediately started working to identify the cause of the slowdown. We discovered a slowdown in one of our caching layers, and upon further examination realized this particular caching layer was set to use less memory than seemed reasonable. We bumped up that memory usage limit and restarted services across our web servers, and immediately saw our systems return to serving traffic at normal speeds.
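As a general illustration of how a cache memory limit can matter, the Python sketch below simulates a small cache that evicts its least recently used entry once it is full. This is not a reconstruction of the incident: the post does not name the caching software or its eviction policy, so the LRU policy, the key counts, and the function names are all assumptions.

```python
from collections import OrderedDict


def hit_rate(capacity: int, requests) -> float:
    """Fraction of requests served from an LRU cache holding `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True               # simulate fetching and storing the value
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / total if total else 0.0


def cyclic_requests(distinct_keys: int, rounds: int):
    """Request every key in order, repeated `rounds` times (a worst case for LRU)."""
    return [key for _ in range(rounds) for key in range(distinct_keys)]


# Hypothetical workload: 100 distinct keys, each requested 10 times.
workload = cyclic_requests(distinct_keys=100, rounds=10)

undersized = hit_rate(capacity=50, requests=workload)
right_sized = hit_rate(capacity=100, requests=workload)

print(f"hit rate with room for 50 entries:  {undersized:.0%}")
print(f"hit rate with room for 100 entries: {right_sized:.0%}")
```

In this toy model the undersized cache never serves a hit, while the larger one serves every request after the first pass. Requests that miss spend their time waiting on slower backends, which is one way response times can suffer while server load stays low.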

Live Traffic, Round 2:

Though we had recovered and were reasonably certain we'd fixed the problem, our user had another expected surge in traffic over two hours out (starting at roughly 9:02pm PDT). To prevent further service interruptions, we doubled the number of web servers (just to be safe; we had no evidence to suggest we had too few initially, but better safe than sorry at this point). We also doubled our web servers on our staging environment (a mirror of our production environment), and worked to test the caching memory fix.

We performed as much testing as we could in that two-hour window, but as with our previous load testing, we were unable to reproduce the issue.

Thankfully, though the second traffic surge saw considerable traffic, the fix we'd made to our caching settings proved effective, and response times weren't impacted at all.

Additional Analysis:

Though we're still working to get 100% confirmation, we do have a hunch as to why our (much heavier) load testing didn't reveal this bottleneck. We will work to improve our load testing procedures to see if we can replicate the failure (and thus confirm the fix).

In some situations, we immediately know what we could have done differently, and what we need to change moving forward. In this case, we don't yet have that clarity.

Though we're incredibly frustrated our preparation didn't catch this problem, we were able to identify and resolve the issue beyond what we've been able to accomplish in the past. Our new infrastructure performed beyond our expectations, and our logging and analysis tools worked as intended to help us identify the problem. We will continue to stress test our environment to ensure this doesn't happen in the future.

—Brett, on behalf of the entire team