Follow-up: Resolution of Monday Market Open Incident
As a follow-up to our communication regarding Monday’s service instability, we are providing a summary of our findings and the corrective actions taken.
Root Cause Analysis
Over the weekend, a planned change was implemented that included the rollout of the Istio service mesh into our production network. Following this deployment, we observed intermittent connectivity issues that resulted in the instability seen on Monday.
Our investigation confirmed that the Istio layer was intermittently failing to establish and maintain stable connections between services and components over the network. The issue was exacerbated at market open, when the surge in traffic amplified the connection instability and led to significant latency and degraded responsiveness. While this configuration had been present in our staging environments for some time, the issue only manifested in production because of the unique load profile of the live market open.
System Impact
We investigated why connection and memory issues within specific pods impacted critical trading functions. The analysis showed that the connection instability caused an excessive backlog of concurrent queries. This backlog drove a significant memory spike that exceeded typical thresholds, creating a cascading effect on service responsiveness during peak traffic.
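For illustration, the minimal Python sketch below shows how an unbounded backlog of concurrent queries translates into memory growth, and the general pattern that bounds it: capping the number of in-flight queries so a traffic burst queues lightweight waiters rather than accumulating large in-memory result buffers. The names used here (run_query, MAX_IN_FLIGHT, an asyncpg-style pool) are assumptions for the example, not our production code.

    import asyncio

    # Assumed concurrency budget per pod; waiters beyond the cap hold almost no memory.
    MAX_IN_FLIGHT = 200
    _query_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def run_query(pool, sql, *args):
        """Execute a query while holding one of a bounded set of slots."""
        async with _query_slots:
            # pool is assumed to be an asyncpg-style connection pool
            async with pool.acquire() as conn:
                return await conn.fetch(sql, *args)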
Remediation
To address this, we have removed the Istio layer from all critical services. All impacted services were restarted following this change to ensure a clean state.
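For reference, the minimal sketch below shows what this kind of change looks like with the official Kubernetes Python client: sidecar injection is disabled for a namespace by removing the istio-injection label, and the deployments are then rolling-restarted so pods come back without the mesh layer. The namespace name is a placeholder, and the actual rollback was executed through our standard change process.

    from datetime import datetime, timezone
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    NAMESPACE = "trading"  # placeholder namespace for this sketch

    # Setting the label to None removes it, so newly created pods skip sidecar injection.
    core.patch_namespace(NAMESPACE, {"metadata": {"labels": {"istio-injection": None}}})

    # Trigger a rolling restart of each deployment (the same mechanism used by
    # `kubectl rollout restart`): bump a pod-template annotation so the controller
    # replaces every pod.
    restarted_at = datetime.now(timezone.utc).isoformat()
    for dep in apps.list_namespaced_deployment(NAMESPACE).items:
        apps.patch_namespaced_deployment(
            dep.metadata.name,
            NAMESPACE,
            {"spec": {"template": {"metadata": {"annotations": {
                "kubectl.kubernetes.io/restartedAt": restarted_at}}}},
        )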
Current Status
Since the removal of the Istio layer, system performance has returned to baseline and connections remain stable. We are continuing to monitor the environment closely and are running enhanced load testing to ensure our infrastructure remains resilient during peak traffic.
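As an illustration of what that load testing exercises, the sketch below replays a market-open-style burst against a placeholder endpoint and reports the error rate and tail latency. The URL, concurrency, and request counts are assumptions for the example, not our actual test plan.

    import asyncio, time
    import aiohttp

    URL = "https://api.example.com/v1/orders"   # placeholder endpoint
    CONCURRENCY = 500
    TOTAL_REQUESTS = 20_000

    async def one_request(session, sem, latencies, errors):
        async with sem:                          # cap concurrent requests at CONCURRENCY
            start = time.perf_counter()
            try:
                async with session.get(URL) as resp:
                    await resp.read()
                    if resp.status >= 500:
                        errors.append(resp.status)
            except aiohttp.ClientError:
                errors.append("connection")
            latencies.append(time.perf_counter() - start)

    async def main():
        sem = asyncio.Semaphore(CONCURRENCY)
        latencies, errors = [], []
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(one_request(session, sem, latencies, errors)
                                   for _ in range(TOTAL_REQUESTS)))
        latencies.sort()
        p99 = latencies[int(len(latencies) * 0.99) - 1]
        print(f"error rate: {len(errors) / TOTAL_REQUESTS:.2%}, "
              f"p99 latency: {p99 * 1000:.0f} ms")

    asyncio.run(main())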
Incident Details
Shortly after market open on January 12, 2026 (approx. 9:36 AM EST), our monitoring systems detected a significant degradation in performance across our core APIs. This resulted in elevated error rates and latency for incoming API requests.
Our engineering team identified that the issue was caused by resource contention within our system. A combination of high market-open traffic and underlying system abnormalities triggered a significant memory spike in a single pod. The resulting resource exhaustion caused internal services to become unresponsive, leading to connection timeouts and a bottleneck between our API gateway and our database services.
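For illustration, the sketch below shows the kind of per-pod memory check that surfaces this condition, reading container usage from the Kubernetes metrics API (metrics-server). The namespace and the 6 GiB threshold are placeholder values; our production alerting runs through our monitoring stack.

    from kubernetes import client, config

    config.load_kube_config()
    metrics_api = client.CustomObjectsApi()

    NAMESPACE = "trading"              # placeholder
    THRESHOLD_BYTES = 6 * 1024 ** 3    # placeholder: alert above 6 GiB

    def to_bytes(quantity: str) -> int:
        """Convert a Kubernetes memory quantity such as '5312424Ki' to bytes."""
        units = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}
        for suffix, factor in units.items():
            if quantity.endswith(suffix):
                return int(quantity[: -len(suffix)]) * factor
        return int(quantity)

    # Pod-level usage is exposed by metrics-server under the metrics.k8s.io group.
    pod_metrics = metrics_api.list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", NAMESPACE, "pods")

    for pod in pod_metrics["items"]:
        usage = sum(to_bytes(c["usage"]["memory"]) for c in pod["containers"])
        if usage > THRESHOLD_BYTES:
            print(f"ALERT: {pod['metadata']['name']} is using {usage / 1024 ** 3:.1f} GiB")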
We understand that reliability is paramount for your operations. Below is a summary of the impact observed during the incident window (9:36 AM – 11:20 AM EST):
- 500 and 504 error responses on Order, Account, and Position endpoints.

Our team executed a series of mitigation strategies to restore stability. Immediate action was taken to isolate and restart the affected service instances to clear the connection backlog. We also deployed a hotfix to eliminate redundant metrics processing, reducing unnecessary overhead and lowering overall memory consumption.
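The exact contents of that hotfix are internal, but for illustration the sketch below shows the general shape of such a change: a common source of redundant metrics overhead is recording an unbounded per-request identifier as a metric label, which creates a new time series (and resident memory) for every unique value, whereas dropping the unbounded label keeps the series count, and therefore memory, fixed. The metric names and labels here are hypothetical.

    from prometheus_client import Histogram

    # Before (hypothetical): one series per order_id, so memory grows without bound.
    # LATENCY = Histogram("order_latency_seconds", "Order latency", ["endpoint", "order_id"])

    # After: only bounded labels, so the number of series stays fixed.
    LATENCY = Histogram("order_latency_seconds", "Order request latency", ["endpoint"])

    def observe_latency(endpoint: str, seconds: float) -> None:
        LATENCY.labels(endpoint=endpoint).observe(seconds)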
The system was fully stabilized by 11:20 AM EST. We have confirmed that error rates have returned to nominal levels and all backlog queues have been processed. We will continue to investigate any abnormalities.
We are committed to learning from every incident to strengthen our platform. We are prioritizing the following actions: