Thursday, 2017-09-21 Admin Data Lag, Database Replication Issues

We've had multiple reports that data in the admin, or available via the API, is not up to date. Examples include transactions taking minutes to show up in the transaction report, coupons not showing as updated immediately, and issues logging in. At this point, we believe this to be a database replication issue that, for some reason, didn't trigger alerting. We'll update here as we discover more.

As of 8:41am CST, things appear resolved, but we are continuing to explore to ensure that's the case, and to see what went wrong.

UPDATE: After intensive analysis, we discovered the cause of the issue. As part of routine operations, we have a scheduled job that cleans out old carts (i.e., carts that were abandoned). Though this process has been running successfully for years, increased volume triggered a different behavior, which resulted in the script running multiple times concurrently. That increased the load on our databases. When a nightly database backup then ran, it pushed CPU usage beyond normal ranges, which resulted in slow replication (and sporadic errors in the admin).

Note that this didn't impact any transactions. It would only have delayed the display of those transactions in the admin or via the API (as reads tend to be from the replicas, not the master).

We have improved the cleanup process to ensure it cannot run away in the future, and we will continue to monitor the situation.
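
For readers running similar scheduled jobs, one common way to keep a cleanup script from overlapping with itself is a non-blocking exclusive file lock. The sketch below is illustrative only and is not the actual change we made. The lock path and the `single_instance`, `clean_old_carts`, and `run_cleanup` names are hypothetical, and it assumes a Unix-like host where Python's `fcntl` module is available.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock location, chosen only for this illustration.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "cart_cleanup.lock")


@contextmanager
def single_instance(lock_path=LOCK_PATH):
    """Yield True if this run holds the lock, False if another run already does."""
    handle = open(lock_path, "w")
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting,
            # so an overlapping run can skip rather than pile up.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the file releases the lock if this run held it.
        handle.close()


def clean_old_carts():
    """Placeholder for the real cleanup work (assumed, not shown here)."""
    return 0


def run_cleanup():
    """Run the cleanup unless a previous run is still in progress."""
    with single_instance() as acquired:
        if not acquired:
            return "skipped"
        clean_old_carts()
        return "ran"
```

With a guard like this, a scheduler can start the job as often as it likes: a run that begins while an earlier one is still working returns "skipped" instead of adding more load to the database.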