API Intermittent Unavailability at Market Open (9:30 AM EST)

Incident Report for Alpaca

Postmortem

Time: Monday, November 8th, 2021 (9:30am-9:40am)

Duration: ~10 Minutes

Severity Level: High

Summary: The intermittent unavailability was triggered by an unusually heavy database load that led to a cascading failure and made our APIs unavailable for ~10 minutes right after market open. The issue was identified a few minutes after market open, and our engineering team localized it and started urgent remediation. After we restarted services and terminated backend database connections, all API services were fully restored. During the outage, our APIs initially responded with errors and then became unavailable, and clients faced widespread timeouts.

Impact: During the outage, our APIs initially responded with errors and then became unavailable, and clients faced timeouts. This affected about 1.8M API calls.

Remediation:

We identified the extremely high number of database connections from the service as the probable cause of the API errors and timeouts.
We identified that the majority of the connections were reported as “idle in transaction” by the Postgres database while the service’s internal connection pooler reported them as “in use”, indicating that connections were “leaking” from the service’s client-side connection pool.
The “idle in transaction” processes were manually terminated in Postgres to free up connection slots.
We rolled the problematic service deployment to reset the client-side connection pool.
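
The manual termination step can be illustrated with a minimal Python sketch. It is not the tooling used during the incident: the helper name `terminate_idle_in_transaction`, the 60-second threshold, and the `%s` placeholder style (as used by psycopg2) are assumptions made for illustration. The `pg_stat_activity` view and the `pg_terminate_backend()` function are standard Postgres.

```python
# Illustrative only: the threshold and helper name are assumptions.
FIND_AND_TERMINATE_SQL = """
SELECT pid, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - (%s * interval '1 second')
  AND pid <> pg_backend_pid()
"""


def terminate_idle_in_transaction(conn, older_than_s=60):
    """Terminate sessions that have been idle in a transaction too long.

    `conn` is any DB-API 2.0 connection to Postgres whose driver accepts
    `%s` placeholders (for example psycopg2). Returns the pids that
    Postgres reported as successfully signalled.
    """
    cur = conn.cursor()
    try:
        cur.execute(FIND_AND_TERMINATE_SQL, (older_than_s,))
        # Each row is (pid, terminated); keep only the pids that were signalled.
        return [pid for pid, terminated in cur.fetchall() if terminated]
    finally:
        cur.close()
```

Terminating a backend requires sufficient privileges on the database role. Postgres also provides the `idle_in_transaction_session_timeout` setting as a server-side guard; whether it suits this workload is not stated in this report.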

Follow up tasks:

We decreased one service’s maximum available connections so that it stays within our connection constraints.
We were able to identify and reproduce the connection leaking behavior and are working on a permanent solution.
We identified that HTTP requests from one of the main services weren’t timing out internally and that any associated database connections also didn’t receive any timeout propagation. This has been fixed and is undergoing testing.
We are expediting work to isolate read and write queries to increase our overall queries-per-second throughput and improve our write throughput.
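
The leak and timeout items above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the service's actual code: `FakeConnection`, `BoundedPool`, and `handle_request` are hypothetical names, and the pool is an in-memory stand-in for a real driver's pooler. It shows two ideas: a borrowed connection is always returned (and any open transaction rolled back) even when the handler fails, and a single request deadline also bounds the wait for a free connection.

```python
import queue
import time
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real database connection (hypothetical)."""

    def __init__(self):
        self.in_transaction = False

    def begin(self):
        self.in_transaction = True

    def rollback(self):
        self.in_transaction = False


class BoundedPool:
    """Tiny fixed-size pool; a real service would use its driver's pooler."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(FakeConnection())

    def available(self):
        return self._idle.qsize()

    @contextmanager
    def connection(self, deadline):
        """Borrow a connection; `deadline` is a time.monotonic() value."""
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("request deadline already passed")
        try:
            conn = self._idle.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError("no free connection before the deadline") from None
        try:
            yield conn
        finally:
            # Runs on success and on error, so the slot is never leaked and
            # no transaction is left open ("idle in transaction").
            if conn.in_transaction:
                conn.rollback()
            self._idle.put(conn)


def handle_request(pool, timeout_s, fail=False):
    """Hypothetical handler: one deadline covers the pool wait and the work."""
    deadline = time.monotonic() + timeout_s
    with pool.connection(deadline) as conn:
        conn.begin()
        if fail:
            raise RuntimeError("simulated handler error mid-transaction")
        conn.rollback()
        return "ok"
```

In this model a failing handler leaves the pool at full size, and a request that cannot get a connection before its deadline fails fast with `TimeoutError` instead of waiting indefinitely.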

Posted Nov 11, 2021 - 11:10 EST

Resolved

All API services and endpoints had degraded availability at market open due to database connection exhaustion caused by connections leaking from one of our primary microservices.

Posted Nov 08, 2021 - 09:30 EST