The issue started after our BOD (Beginning of Day) batch sync between our ledger and our API servers completed. This job is responsible for pulling the latest position information for all accounts from the ledger system, which is the source of truth in our architecture. It is scheduled for 2:15am ET on the day after each trading day, which is why it also runs on Saturdays. When it completed at around 3:30am ET, the underlying database began using an unexpected index (out of multiple candidates) for the query executed on each API request to the /account
endpoints. Since our average load on this endpoint alone is in the 200-1,000 requests/sec range, this suboptimal index choice pushed CPU load on our replica database cluster to roughly 500x its usual level. This had a cascading effect across our services, since the database replicas became extremely slow for all queries.
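The planner behavior above is easy to reproduce in miniature. The following sketch (a toy schema and SQLite, not our production database or tables) shows that when a query has several candidate indexes, the chosen plan depends on table statistics, which is why a large batch load can flip the planner onto a different index until the indices or statistics are corrected:

```python
import sqlite3

# Toy illustration only: two candidate indexes for one query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (account_id INTEGER, as_of TEXT, qty REAL)")
conn.execute("CREATE INDEX idx_account ON positions (account_id)")
conn.execute("CREATE INDEX idx_as_of ON positions (as_of)")

query = "SELECT qty FROM positions WHERE account_id = ? AND as_of = ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1, "2024-01-06")).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH positions USING INDEX ..."

# After a bulk insert, refreshed statistics (ANALYZE) can change which
# index the planner prefers for the very same query.
conn.executemany("INSERT INTO positions VALUES (?, ?, ?)",
                 [(i % 10, "2024-01-06", 1.0) for i in range(1000)])
conn.execute("ANALYZE")
plan2 = conn.execute("EXPLAIN QUERY PLAN " + query, (1, "2024-01-06")).fetchall()
for row in plan2:
    print(row[-1])
```

Inspecting the plan this way (EXPLAIN in most databases) is how a regression like ours can be confirmed quickly during an incident.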
The team was able to adjust the indices on the table so that the service would again use an optimal one. It still took time to recover all the replica servers: we had to remove them from the cluster one by one, wait until replication caught up, and then add them back. Once this process was completed, the services were back in operation. We will be monitoring the replica cluster's performance closely over the coming days.
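The rolling recovery described above can be sketched as a simple loop. This is a hypothetical outline, not our actual tooling: `remove_from_pool`, `add_to_pool`, and `replication_lag` are assumed helpers wrapping whatever the load balancer and database expose.

```python
import time

def recover_replicas(replicas, remove_from_pool, add_to_pool, replication_lag,
                     max_lag_seconds=1.0, poll_interval=0.01):
    """Drain each replica, wait for replication to catch up, then re-add it."""
    for replica in replicas:
        remove_from_pool(replica)          # stop serving reads from it
        while replication_lag(replica) > max_lag_seconds:
            time.sleep(poll_interval)      # wait for replication to catch up
        add_to_pool(replica)               # return it to the cluster

# Toy usage with in-memory stand-ins for the real helpers.
events = []
lag = {"replica-1": 5.0, "replica-2": 0.0}
def fake_lag(r):
    lag[r] = max(0.0, lag[r] - 2.5)        # lag shrinks while we wait
    return lag[r]

recover_replicas(["replica-1", "replica-2"],
                 events.append,
                 lambda r: events.append(r + ":back"),
                 fake_lag)
print(events)
```

Handling one replica at a time, as in the incident, keeps the remaining replicas serving traffic while each drained node catches up.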