Wednesday, 2016-04-13

FoxyCart Service Unscheduled Downtime: Database Failure

17:25 PST: We are currently working on restoring service. We will post details shortly.

UPDATE 17:36 PST: We are promoting a slave database to master, and expect to be back up within a few minutes.

UPDATE 17:39 PST: Service has been restored.

Summary of the situation: While working to configure security-related services, our production database master instance was inadvertently stopped. We do have a slave backup ready to take over, but because of the way the master db was stopped, we wanted to confirm with our DBA before starting the database promotion.

The database promotion was initiated at 17:36 PST, and service was restored at 17:39 PST.

Though we do have safeguards to prevent this, the normal safeguards were bypassed due to the nature of the security work being done. We have identified three additional problems that contributed to this:

  1. The instance in question was named incorrectly. We have now corrected the names, and are working to ensure instance names are updated when instances are promoted.
  2. A UI quirk resulted in changes being applied to more than one instance at a time.
  3. The absence of a flag to prevent inadvertent termination. We are looking into correcting this as well.

This was entirely human error, which makes this downtime even more frustrating than a typical outage.

We do not anticipate an event like this ever happening again, but we have now updated our procedures to initiate a database promotion immediately. Had we done that this time, downtime would have been limited to under 5 minutes, instead of the 14 minutes we saw.

We are sincerely sorry for this downtime. We are already doing everything in our power to prevent this from ever happening again.