Time: Monday, November 9th, 2021 (9:30am-9:40am)
Duration: ~10 Minutes
Severity Level: High
Summary: The intermittent unavailability was triggered by an unusually heavy database load that led to a cascading failure, eventually making our APIs unavailable for ~10 minutes right after market open. The issue was identified within a few minutes of market open, and our engineering team localized it and began urgent remediation. After restarting services and terminating backend database connections, all API services were fully restored to operational status.
Impact: During the outage, our APIs initially responded with errors and then became unavailable, and clients faced widespread timeouts. Approximately 1.8M API calls were affected.
Root cause and remediation:
- We identified the extremely high number of database connections from the service as the probable cause of the API errors and timeouts
- We identified that the majority of the connections were reported as “idle in transaction” by the Postgres database while the service’s internal connection pooler reported them as “in use”, indicating that connections were “leaking” from the service’s client-side connection pool
- The “idle in transaction” processes were manually terminated in Postgres to free up connection slots
- We performed a rolling restart of the problematic service deployment to reset the client-side connection pool
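For reference, the diagnosis and cleanup steps above can be sketched as queries against Postgres’s pg_stat_activity view. The 5-minute threshold here is a hypothetical illustration, not the exact commands we ran:

```python
# Sketch of the diagnosis/cleanup queries run against Postgres.
# The "5 minutes" stuck-time threshold is a hypothetical example.

# Diagnose: list backends stuck "idle in transaction".
DIAGNOSE_SQL = """
SELECT pid, state, now() - state_change AS stuck_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes';
"""

# Remediate: terminate those backends to free up connection slots.
TERMINATE_SQL = """
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes';
"""
```

Terminating the backend (rather than just cancelling the query with pg_cancel_backend) is what actually releases the connection slot for a session that is idle inside an open transaction.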
Follow up tasks:
- We decreased the service’s maximum connection pool size so that it stays within our database connection limits.
- We were able to identify and reproduce the connection leaking behavior and are working on a permanent solution.
- We identified that HTTP requests from one of the main services weren’t timing out internally, and that timeouts were not being propagated to the underlying database connections. This has been fixed and is undergoing testing.
- We are expediting work to isolate read and write queries, which will increase our overall queries-per-second throughput and improve our write throughput.
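The pool-size reduction in the first follow-up task comes down to simple arithmetic: the pools of all service replicas combined must fit within the database’s connection limit. A minimal sketch, with hypothetical numbers:

```python
def max_pool_size(db_max_connections: int, reserved: int, replicas: int) -> int:
    """Largest per-replica pool size such that the total
    (replicas * pool size) stays within the database's connection
    limit, after reserving slots for superuser/maintenance sessions."""
    available = db_max_connections - reserved
    if available <= 0 or replicas <= 0:
        raise ValueError("no connection budget available")
    return available // replicas

# Hypothetical example: Postgres max_connections=100, 10 slots reserved,
# 6 service replicas -> each replica's pool is capped at 15 connections.
cap = max_pool_size(100, 10, 6)
```

Sizing from the database limit downward, rather than per service, is what prevents a single service from exhausting connection slots for everyone else.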
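The timeout-propagation fix in the third follow-up amounts to carrying one deadline across layers: the HTTP request’s remaining budget bounds the database’s statement_timeout, so a slow query cannot outlive the request that issued it. A minimal sketch, with hypothetical helper names:

```python
import time

def remaining_budget_ms(deadline: float) -> int:
    """Milliseconds left until the request deadline (monotonic clock);
    0 if the deadline has already passed."""
    return max(0, int((deadline - time.monotonic()) * 1000))

def db_session_options(deadline: float, floor_ms: int = 50) -> dict:
    """Derive Postgres session settings from the HTTP request deadline.
    Fails fast instead of issuing a query with no realistic budget."""
    budget = remaining_budget_ms(deadline)
    if budget < floor_ms:
        raise TimeoutError("request deadline exhausted before querying")
    # Applied as `SET statement_timeout = <ms>` on the session.
    return {"statement_timeout": budget}
```

With this in place, a request that has already burned most of its budget in upstream calls hands the database only the time that is actually left, instead of the default (unbounded) statement timeout.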
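The read/write isolation in the last follow-up is typically implemented as a query router that directs read-only statements to replicas and everything else to the primary. A toy sketch, with hypothetical DSNs:

```python
READ_DSN = "postgresql://replica.internal/app"   # hypothetical read replica
WRITE_DSN = "postgresql://primary.internal/app"  # hypothetical primary

def route(sql: str) -> str:
    """Route read-only statements to a replica; send writes, DDL, and
    anything ambiguous to the primary. Simplification: a WITH query
    containing data-modifying CTEs would still need the primary."""
    stripped = sql.lstrip()
    head = stripped.split(None, 1)[0].upper() if stripped else ""
    return READ_DSN if head in ("SELECT", "WITH") else WRITE_DSN
```

Offloading reads this way frees the primary’s capacity for writes, which is where the throughput gain described above comes from.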