9:50am PDT: We are working to restore connectivity after an unknown service interruption took us down. We'll update shortly.
9:57am PDT: We have restored service, and are working to determine the cause of the problem.
10:00am PDT: We are continuing to work to resolve issues.
10:20am PDT: We are currently waiting for additional capacity to be added to our webservers.
10:40am PDT: Additional capacity is being added, but service continues to be very, very slow.
10:55am PDT: We are back and we believe at this point we will remain up. We will continue to update and monitor.
11:10am PDT: Emails that were delayed have been sent.
11:30am PDT: SUMMARY: Shortly before 9:50am PDT we experienced sudden and intense traffic spikes. Though our systems should have handled these spikes, the already high traffic caused a runaway situation that cascaded and compounded, resulting in extremely long load times or timeouts. We immediately took action to bring service back online, which involved scaling up our webservers. Unfortunately, this scaling process took far longer than it should have, due in part to two reasons:
The way we handle environment variables and secrets prevented the new servers in this scaling event from loading secrets without manual intervention.
There is a sync process that happens during our scaling automation, whereby new servers attempt to sync from existing servers. In this case, due to the old servers being overloaded, this process took entirely too long. (We took action to fix this, but it did add significantly to the time before the new webservers were brought up and added to our load balancer to take traffic.)
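As a hypothetical sketch of the two snags described above (the function names, numbers, and steps are illustrative only, not our actual automation), a new server's boot sequence was blocked first on a manual secrets step and then on a sync whose duration grew with how overloaded the existing servers were:

```python
def load_secrets(operator_intervened: bool) -> dict:
    """Illustrative: secrets could not be loaded automatically during
    the scaling event, so boot stalled until an operator stepped in."""
    if not operator_intervened:
        raise RuntimeError("secrets require manual intervention")
    return {"DB_PASSWORD": "placeholder"}

def sync_from_peer(peer_load: float) -> float:
    """Illustrative: sync time scales with the load on the source
    server, so syncing from overloaded peers took far too long."""
    base_seconds = 60.0
    return base_seconds * (1.0 + peer_load)

def boot_new_server(operator_intervened: bool, peer_load: float) -> float:
    """Return the (illustrative) seconds before this server can be
    added to the load balancer: both steps run serially."""
    load_secrets(operator_intervened)     # step 1: blocks without an operator
    return sync_from_peer(peer_load)      # step 2: slow when peers are busy
```

For example, under these made-up numbers a sync from a peer at 9x normal load takes ten times the base sync time, and no server boots at all until someone manually unblocks the secrets step.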
We are taking the following steps:
We are increasing the number and size of webservers to ensure system load is kept lower.
We have been working on a new scaling setup that will radically decrease the time-to-new-servers-launched. We are devoting more resources to that effort so we can get that launched sooner than planned.
As part of that new setup, we also are changing the way we handle secrets, which will eliminate the potential snag there as well.
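A minimal sketch of what such a setup could look like, assuming a generic secrets store and a pre-built snapshot (none of these names reflect our actual infrastructure): secrets are fetched automatically at boot with no manual step, and state is restored from a snapshot at a fixed cost instead of syncing from live, possibly overloaded servers:

```python
def fetch_secrets(store: dict, keys: list) -> dict:
    """Pull required secrets from a secrets store at boot -- fails
    fast if a secret is missing, but needs no manual step."""
    missing = [k for k in keys if k not in store]
    if missing:
        raise KeyError(f"missing secrets: {missing}")
    return {k: store[k] for k in keys}

def restore_from_snapshot(snapshot_seconds: float) -> float:
    """Restore state from a pre-built snapshot; the cost is fixed,
    independent of load on the existing servers."""
    return snapshot_seconds

def boot_new_server(store: dict) -> float:
    """Return the (illustrative) seconds before this server can take
    traffic: automatic secrets, then a fixed-cost restore."""
    fetch_secrets(store, ["DB_PASSWORD", "API_KEY"])
    return restore_from_snapshot(30.0)
```

The design point is that neither step depends on the servers that are already struggling, so time-to-new-servers-launched stays constant even during a traffic spike.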
This was the biggest downtime we have suffered in many years. On the one hand that is a positive (our systems have improved to the point where downtime of any kind has become fairly rare), but it is obviously a negative in that we didn't adequately prepare for or handle this traffic spike. We take this extremely seriously, and are increasing our efforts to ensure we provide the most stable and reliable ecommerce service possible.