Saturday August 25, 2018: We experienced multiple issues related to subscription processing, as well as webhook processing. The two issues were distinct, but the subscription processing delay stemmed from the webhook problems.
Impacted users are being notified.
The root cause of the duplicate charges is still being investigated. We know it was caused by a timed-out request being automatically re-attempted, but the request should not have been re-attempted, and we are still working to diagnose why it was.
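To illustrate the failure mode: a retry of a timed-out charge request is only safe if the request is idempotent. One common guard is an idempotency key that the gateway uses to deduplicate. A minimal sketch (all names here are illustrative, not our actual code or payment gateway's API):

```python
import uuid

class PaymentGateway:
    """Toy gateway that records charges keyed by an idempotency key."""
    def __init__(self):
        self._charges = {}  # idempotency_key -> charge record

    def charge(self, idempotency_key, customer_id, amount_cents):
        # If this key was already processed, return the prior result
        # instead of charging again -- a retry becomes a safe no-op.
        if idempotency_key in self._charges:
            return self._charges[idempotency_key]
        record = {"customer": customer_id, "amount": amount_cents}
        self._charges[idempotency_key] = record
        return record

gateway = PaymentGateway()
key = str(uuid.uuid4())  # generated once per logical charge, reused on retry
first = gateway.charge(key, "cust_123", 1999)
retry = gateway.charge(key, "cust_123", 1999)  # the re-attempted request
assert retry is first              # retry returned the original charge
assert len(gateway._charges) == 1  # only one charge was recorded
```

With this pattern, the automatic re-attempt of a timed-out request cannot produce a duplicate charge, regardless of why the retry fired.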
An improvement to our webhook system required adding a new index to DynamoDB. Unfortunately, the index took longer to create than anticipated, and while it was being created, the auto-scaling that should have kicked in didn't. We are waiting on additional information from AWS support, but the result was throttling of requests to our webhook system. This slowed subscription processing, and also caused dropped requests from our primary application to the webhook system, which meant some transactions were never pushed through to the webhook system (and on to the configured webhook endpoints, Zapier, etc.).
We are notifying impacted users and re-feeding the affected transactions.
For part of the day, emails were delayed as well. All emails were delivered by 3pm PST.
We have identified two concrete steps that we are prioritizing:
2018-08-26 12:41pm PST: There were numerous issues we experienced yesterday, but we have been able to identify exactly what went wrong.
On Friday, we rolled out a change to our webhook system, which uses DynamoDB. The change involved adding a Global Secondary Index (GSI). Though the GSI was configured to auto-scale, auto-scaling didn't function while the GSI was being created. (We're still working with AWS to determine whether that's expected behavior.) Complicating matters, the alarms that would typically alert us weren't in place while the index was still being created, and the lack of auto-scaling made the creation itself take longer than anticipated, which drew out the impact.
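For context on the configuration involved: DynamoDB auto-scaling for a GSI is attached separately from the table, via Application Auto Scaling, as a scalable target plus a target-tracking policy on the index's capacity. The sketch below builds the two request payloads (table and index names are hypothetical; in production they would be passed to `boto3.client('application-autoscaling').register_scalable_target(...)` and `.put_scaling_policy(...)`):

```python
def gsi_write_autoscaling_requests(table, index, min_cap=5, max_cap=200,
                                   target_pct=70.0):
    """Build the Application Auto Scaling payloads that attach a
    target-tracking policy to a GSI's write capacity."""
    resource_id = f"table/{table}/index/{index}"
    dimension = "dynamodb:index:WriteCapacityUnits"
    register = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }
    policy = {
        "PolicyName": f"{index}-write-scaling",
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_pct,  # scale to hold ~70% utilization
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
            },
        },
    }
    return register, policy

register, policy = gsi_write_autoscaling_requests("Webhooks", "by-status")
assert register["ResourceId"] == "table/Webhooks/index/by-status"
```

Note that even with this configured, scaling actions only fire once the index is active, which is why over-provisioning fixed capacity during the backfill (or creating the index during a low-traffic window) is a common precaution.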
The lack of auto-scaling caused our webhook system's performance to slow significantly. Though the webhook system is asynchronous from our main application, the initial data transfer is synchronous, and the webhook delays caused subscription processing to slow. As a result, instead of all subscriptions processing around 5am PST as normal, they were still processing when a backup subscription processing job kicked in at 9am PST. At that point, some subscriptions were double-charged, though the impact was limited almost entirely to a single Foxy merchant.
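The double charges came from the backup job starting while the primary run was still in flight. A standard guard is a shared run lock (in practice, a lease via a conditional database write) that the backup job checks before starting. A minimal in-memory sketch of the idea (names are illustrative):

```python
import threading

class RunLock:
    """In-memory stand-in for a shared lock (e.g. a conditional write to a
    database row) that lets a backup job detect an in-progress run."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def acquire(self, job_name):
        # Non-blocking: a job either gets the lock or backs off entirely.
        if self._lock.acquire(blocking=False):
            self.holder = job_name
            return True
        return False

    def release(self):
        self.holder = None
        self._lock.release()

lock = RunLock()
assert lock.acquire("primary-5am")      # primary run starts
assert not lock.acquire("backup-9am")   # backup sees a run in progress, skips
lock.release()
assert lock.acquire("backup-9am")       # after release, a backup run may proceed
```

In a real deployment the lease would also carry an expiry so a crashed primary run can't block the backup forever.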
We do have alerts notifying us if there's a subscription processing error, and our team began exploring the issue. Unfortunately, we misdiagnosed what was happening. Normally we see the subscription queue decrease rapidly during processing; instead, the queue appeared to be unmoving. We now realize the queue was actually decreasing, but at a far slower rate than we expected. We did, however, realize that something was wrong, and halted the subscription processing.
We then restarted subscription processing and worked to diagnose why subscriptions were running so slowly. All subscriptions did process, and the one store with the most impacted customers was notified (we voided the duplicate charges for all double-charged customers). (A few additional stores may have had a small number of customers impacted; we are working to determine whether that's the case.)
We initially thought the double subscription processing was related to an entirely separate issue, caused by a recent switch of one of our subdomains from an Application Load Balancer to CloudFront. We now know, however, that the switch to CloudFront for this particular subdomain wasn't the cause, though a similar change had caused an issue once in the past.
We are making improvements to subscription processing a top priority. Our goal is to make a repeat of this event a technical impossibility.
We sincerely apologize for the issues. —Team Foxy