The issue started after our BOD (Beginning of Day) batch sync between our ledger and our API servers completed. This job is responsible for pulling the latest position information for all accounts from the ledger system, which is the source of truth in our architecture. It is scheduled for 2:15am ET on the day after each trading day, which is why it also runs on Saturdays. When it completed at around 3:30am ET, the underlying database began using an unexpected index (out of multiple candidates) for the query executed on each API request to the /account
endpoints. Since our average load on this endpoint alone is in the 200-1,000 requests/sec range, this suboptimal index choice pushed CPU load on our replica database cluster to roughly 500x its usual level. This had a cascading effect across our services, since the database replicas became extremely slow for all queries.
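The planner behavior above is easy to reproduce in miniature. The following sketch (a toy schema and SQLite, not our production database or tables) shows that when a query has several candidate indexes, the chosen plan depends on table statistics, which is why a large batch load can flip the planner onto a different index until the indices or statistics are corrected:

```python
import sqlite3

# Toy illustration only: two candidate indexes for one query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (account_id INTEGER, as_of TEXT, qty REAL)")
conn.execute("CREATE INDEX idx_account ON positions (account_id)")
conn.execute("CREATE INDEX idx_as_of ON positions (as_of)")

query = "SELECT qty FROM positions WHERE account_id = ? AND as_of = ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1, "2024-01-06")).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH positions USING INDEX ..."

# After a bulk insert, refreshed statistics (ANALYZE) can change which
# index the planner prefers for the very same query.
conn.executemany("INSERT INTO positions VALUES (?, ?, ?)",
                 [(i % 10, "2024-01-06", 1.0) for i in range(1000)])
conn.execute("ANALYZE")
plan2 = conn.execute("EXPLAIN QUERY PLAN " + query, (1, "2024-01-06")).fetchall()
for row in plan2:
    print(row[-1])
```

Inspecting the plan this way (EXPLAIN in most databases) is how a regression like ours can be confirmed quickly during an incident.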
The team was able to adjust the indices on the table so that the service would again use an optimal one. It still took time to recover all the replica servers: we had to remove them from the cluster one by one, wait until replication caught up, and then add them back. Once this process was completed, the services were back in operation. We will be monitoring the replica cluster's performance closely over the coming days.
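The rolling recovery described above can be sketched as a simple loop. This is a hypothetical outline, not our actual tooling: `remove_from_pool`, `add_to_pool`, and `replication_lag` are assumed helpers wrapping whatever the load balancer and database expose.

```python
import time

def recover_replicas(replicas, remove_from_pool, add_to_pool, replication_lag,
                     max_lag_seconds=1.0, poll_interval=0.01):
    """Drain each replica, wait for replication to catch up, then re-add it."""
    for replica in replicas:
        remove_from_pool(replica)          # stop serving reads from it
        while replication_lag(replica) > max_lag_seconds:
            time.sleep(poll_interval)      # wait for replication to catch up
        add_to_pool(replica)               # return it to the cluster

# Toy usage with in-memory stand-ins for the real helpers.
events = []
lag = {"replica-1": 5.0, "replica-2": 0.0}
def fake_lag(r):
    lag[r] = max(0.0, lag[r] - 2.5)        # lag shrinks while we wait
    return lag[r]

recover_replicas(["replica-1", "replica-2"],
                 events.append,
                 lambda r: events.append(r + ":back"),
                 fake_lag)
print(events)
```

Handling one replica at a time, as in the incident, keeps the remaining replicas serving traffic while each drained node catches up.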