Elevated Error Rates Across Multiple APIs - 500 Errors

Incident Report for Alpaca

Postmortem

January 13th update

Follow-up: Resolution of Monday Market Open Incident

As a follow-up to our communication regarding Monday’s service instability, we are providing a summary of our findings and the corrective actions taken.

Root Cause Analysis

Over the weekend, a planned change was implemented which included the rollout of Istio into our production network. Following this deployment, we observed intermittent connectivity issues that resulted in the instability seen on Monday.

Our investigation confirmed that the Istio layer was unstable when establishing connections between services and components over the network. The issue was exacerbated at market open, when the high traffic volume led to significant latency. While this configuration had been present in our staging environments for some time, the issue only manifested in production due to the unique load profile of the live market open.

System Impact

We investigated why a connection and memory issue within specific pods impacted critical trading functions. The analysis showed that the connection instability caused an excessive backlog of concurrent queries, which in turn produced a memory spike that exceeded typical thresholds and had a cascading effect on service responsiveness during peak traffic.
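
As general context on this failure mode (not a description of our production code), a common safeguard is to bound the number of in-flight downstream queries so that connection instability produces back-pressure rather than an ever-growing, memory-hungry backlog. The sketch below is a minimal illustration in Python, assuming an asyncio service with an asyncpg-style connection pool; the names and limits are hypothetical.

```python
# Illustrative only: cap concurrent downstream queries so that connection
# instability sheds load instead of building an unbounded, memory-heavy backlog.
import asyncio

MAX_IN_FLIGHT = 200                      # hypothetical ceiling tuned to pod memory limits
_query_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def run_query(pool, sql, *args, timeout=2.0):
    """Run a database query while capping concurrent executions.

    If the cap is reached, callers wait briefly and then fail fast rather
    than piling up work that the connection layer cannot drain.
    """
    try:
        await asyncio.wait_for(_query_slots.acquire(), timeout=timeout)
    except asyncio.TimeoutError:
        raise RuntimeError("query backlog full; shedding load")
    try:
        # pool.fetch is an asyncpg-style call; substitute your own client here.
        return await asyncio.wait_for(pool.fetch(sql, *args), timeout=timeout)
    finally:
        _query_slots.release()
```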

Remediation

To address this, we have removed the Istio layer from all critical services. All impacted services were restarted following this change to ensure a clean state.
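
The exact steps depend on how Istio is installed; a common setup controls sidecar injection with the istio-injection namespace label, and rolling the deployments afterwards brings pods back without the sidecar. The sketch below, using the official Kubernetes Python client, illustrates what such a rollback can look like; it is not a record of the exact commands we ran, and the namespace name is hypothetical.

```python
# Illustrative sketch: disable Istio sidecar injection for a namespace and
# roll its deployments so pods restart without the sidecar.
from datetime import datetime, timezone
from kubernetes import client, config

def remove_sidecar_injection(namespace: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # A strategic-merge patch with a null value deletes the label, so newly
    # created pods in this namespace no longer receive the Envoy sidecar.
    core.patch_namespace(namespace, {"metadata": {"labels": {"istio-injection": None}}})

    # Equivalent of `kubectl rollout restart deploy -n <namespace>`:
    # bump a pod-template annotation to trigger a rolling restart.
    stamp = datetime.now(timezone.utc).isoformat()
    for deploy in apps.list_namespaced_deployment(namespace).items:
        apps.patch_namespaced_deployment(
            deploy.metadata.name,
            namespace,
            {"spec": {"template": {"metadata": {"annotations": {
                "kubectl.kubernetes.io/restartedAt": stamp}}}}},
        )

# remove_sidecar_injection("trading")  # hypothetical namespace name
```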

Current Status

Since the removal of the Istio layer, system performance has returned to its baseline and connections remain stable. We are continuing to monitor the environment closely and are utilizing enhanced load testing to ensure our infrastructure remains resilient during peak traffic.


What Happened

Shortly after market open on January 12, 2026 (approx. 9:36 AM EST), our monitoring systems detected a significant degradation in performance across our core APIs. This resulted in elevated error rates and latency for incoming API requests.

Our engineering team identified that the issue was caused by resource contention within our system. A combination of high market-open traffic and underlying system abnormalities triggered a significant memory spike in a single pod. This exhaustion of resources caused internal services to become unresponsive, resulting in connection timeouts and creating a bottleneck between our API gateway and the database services.

Impact

We understand that reliability is paramount for your operations. Below is a summary of the impact observed during the incident window (9:36 AM – 11:20 AM EST):

  • API Availability: Partners experienced intermittent 500 and 504 error responses on Order, Account, and Position endpoints.
  • Order Processing: A subset of orders experienced processing delays. In some cases, orders that timed out on the API response were successfully processed in the background (a client-side idempotency pattern that guards against this is sketched after this list).
  • Data Latency: There were short delays in position updates and trade confirmation events (SSE) for executed orders.
  • Critical Data Integrity: No data was lost during this incident. All funds and positions remain safe and secure. All transactions that appeared to time out but were executed have been reconciled.
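
For partners affected by the order-placement timeouts, one defensive client-side pattern is to attach a client_order_id to every submission and to look the order up by that ID before retrying, so a request that completed in the background is not duplicated. The sketch below is illustrative only and targets the Trading API order endpoints using the requests library; the base URL, credentials, and payload are placeholders, and Broker API paths differ.

```python
# Minimal idempotent-submit sketch (illustrative; adjust for your integration).
import uuid
import requests

BASE = "https://paper-api.alpaca.markets"   # placeholder environment
HEADERS = {
    "APCA-API-KEY-ID": "YOUR_KEY",          # placeholder credentials
    "APCA-API-SECRET-KEY": "YOUR_SECRET",
}

def submit_order_once(payload: dict) -> dict:
    """Submit an order; on a timeout or 5xx, check by client_order_id before retrying."""
    payload = {**payload,
               "client_order_id": payload.get("client_order_id", str(uuid.uuid4()))}
    try:
        resp = requests.post(f"{BASE}/v2/orders", json=payload, headers=HEADERS, timeout=5)
        resp.raise_for_status()
        return resp.json()
    except (requests.Timeout, requests.HTTPError):
        # The order may have been accepted even though the response failed,
        # so look it up by our own ID instead of blindly resubmitting.
        lookup = requests.get(
            f"{BASE}/v2/orders:by_client_order_id",
            params={"client_order_id": payload["client_order_id"]},
            headers=HEADERS,
            timeout=5,
        )
        if lookup.status_code == 200:
            return lookup.json()
        raise  # not accepted; safe to retry with the same client_order_id
```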

Resolution

Our team executed a series of mitigation strategies to restore stability. Immediate action was taken to isolate and restart the affected service instances to clear the connection backlog. We deployed a hotfix to eliminate redundant metrics processing, which reduced unnecessary overhead and lowered overall memory consumption.
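
As background on this class of fix (not the exact change we shipped), the goal is to keep metrics work off the request path and to bound how much of it can queue up, so it cannot compete with transaction handling for memory. A minimal sketch, with hypothetical names:

```python
# Illustrative only: push per-request metrics work into a bounded background
# queue so request handlers do O(1) work and memory cannot grow without bound
# if the metrics backend slows down.
import queue
import threading

_events: "queue.Queue[tuple[str, float]]" = queue.Queue(maxsize=10_000)

def record_latency(endpoint: str, seconds: float) -> None:
    """Called from the request path; drops the sample if the queue is full."""
    try:
        _events.put_nowait((endpoint, seconds))
    except queue.Full:
        pass  # shedding a metric sample is preferable to backing up requests

def _drain() -> None:
    while True:
        endpoint, seconds = _events.get()
        # ship_to_backend(endpoint, seconds)  # hypothetical exporter call
        _events.task_done()

threading.Thread(target=_drain, daemon=True).start()
```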

The system was fully stabilized by 11:20 AM EST. We have confirmed that error rates have returned to nominal levels and all backlog queues have been processed. We will continue to investigate any abnormalities.

Preventative Measures

We are committed to learning from every incident to strengthen our platform. We are prioritizing the following actions:

  • Hands-on Monitoring: Our team will actively monitor market opens throughout this week to enable quick intervention.
  • System Capacity Review: We are auditing our resource allocation thresholds to ensure our services can handle "perfect storm" scenarios where high volume coincides with complex queries.
  • Deployment Process Optimization: We are revising our release procedures to ensure that non-critical background processes (such as metrics collection) cannot impact core transaction performance during peak market hours.
  • Enhanced Monitoring: We are implementing stricter alerts on database connection locking to detect and auto-remediate similar contention issues faster in the future, as sketched below.
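
On the last point, a check of this kind is straightforward if the database exposes lock-wait information. The sketch below assumes a PostgreSQL backend (not stated here) and flags sessions that have been blocked on a lock beyond a threshold so an alert can fire before a backlog builds; the threshold and connection string are placeholders.

```python
# Illustrative lock-wait check, assuming a PostgreSQL backend.
import psycopg2

BLOCKED_THRESHOLD = "30 seconds"   # hypothetical alerting threshold

def find_blocked_sessions(dsn: str) -> list[tuple]:
    """Return sessions that have been waiting on a lock longer than the threshold."""
    query = """
        SELECT pid, now() - query_start AS blocked_for, left(query, 80)
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock'
          AND now() - query_start > %s::interval
        ORDER BY blocked_for DESC;
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (BLOCKED_THRESHOLD,))
        return cur.fetchall()

# Example: page the on-call if anything is returned.
# if find_blocked_sessions("dbname=trading"): alert_oncall(...)  # hypothetical hook
```
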
Posted Jan 12, 2026 - 17:12 EST

Resolved

Our team has mitigated the incident. All systems are operational and customer impact has ceased. We are continuing our root cause analysis to prevent recurrence.
Posted Jan 12, 2026 - 13:50 EST

Update

We have not observed any abnormalities since the system stabilized. We will close this incident after a further 45 minutes of monitoring. Root cause analysis will continue, and findings will be shared separately. Thank you for your patience.
Posted Jan 12, 2026 - 13:05 EST

Update

The system has stabilized and no issues are currently being observed. Our team continues to work on identifying the root cause and will provide an update once we have more information.
Posted Jan 12, 2026 - 12:48 EST

Update

We are still monitoring the system.
Posted Jan 12, 2026 - 12:35 EST

Update

All API endpoints are operating normally. We are continuing to investigate the root cause and will share findings once available. Thank you for your patience.
Posted Jan 12, 2026 - 12:20 EST

Update

Our investigation continues to make progress. We have analyzed system behavior during the affected periods and have narrowed our focus to connection management, which we believe may be contributing to the intermittent issues. We are encouraged that most impacted requests are ultimately completing successfully. The team remains fully engaged and is working toward a resolution. We appreciate your patience and will keep you informed as we learn more.
Posted Jan 12, 2026 - 12:08 EST

Update

Our investigation is progressing. We have identified several contributing factors and are actively analyzing traffic patterns, system behavior, and recent changes to determine the root cause. The team is working diligently toward a full resolution and will continue to provide updates.
Posted Jan 12, 2026 - 11:55 EST

Update

We continue to see intermittent timeouts on an order-handling service, occasionally impacting orders and account/position lookups—though most requests are completing successfully. Our engineers are actively isolating affected pods, capturing diagnostic data, and restarting them to restore stability. We are also investigating database behavior and traffic patterns to identify the underlying root cause.
Posted Jan 12, 2026 - 11:45 EST

Update

The system has largely stabilized and the majority of requests are completing successfully. We are still observing intermittent timeouts (~3-5% of requests) affecting position updates. Our engineering team is actively investigating the root cause related to high-volume data requests and working toward a full resolution. We will continue to provide updates as we make progress.
Posted Jan 12, 2026 - 11:22 EST

Update

We are continuing to address the issue. Clients may still experience intermittent 5xx errors.
Posted Jan 12, 2026 - 11:02 EST

Update

We are continuing to address the issue. Clients may still experience intermittent 5xx errors.
Posted Jan 12, 2026 - 10:50 EST

Update

We have identified another increase in 5xx errors. We are actively addressing it and will provide further updates. Clients may still experience intermittent timeouts on order placement while we fix the issue.
Posted Jan 12, 2026 - 10:33 EST

Update

We are no longer observing 500 errors across our API endpoints. All services are now accessible, and we are continuing to monitor for any further issues.
Posted Jan 12, 2026 - 10:20 EST

Monitoring

The team identified and addressed the underlying issue. We are seeing our APIs recovering, and the team is continuing to monitor performance. We will continue to provide updates.
Posted Jan 12, 2026 - 10:14 EST

Identified

Our team is actively addressing the issue. Clients may still experience intermittent timeouts on order placement while we work on a fix.
Posted Jan 12, 2026 - 10:08 EST

Investigating

We are currently investigating elevated error rates on multiple APIs. Some requests may fail with HTTP 500 responses. Our engineering team is actively working to identify and resolve the root cause.
Posted Jan 12, 2026 - 09:58 EST
This incident affected: Broker API (broker.accounts.get, broker.journals.get, Journal Events (SSE)), Funding (JNLC), and Live Trading API (Orders API).