Courier Multi-Service Outage

Incident Report for Courier (Postmortem)

RFO: February 2, 2026 — Service Interruption

Executive Summary

On February 2, 2026, a deployment introduced a configuration change that referenced a file not present in our production build artifacts, causing backend services to become unavailable. API endpoints, automations, and tracking functionality were impacted for 2 hours and 17 minutes in the US region and 2 hours and 42 minutes in the Ireland region. Recovery was prolonged by a concurrent outage at GitHub Actions, our CI/CD provider, which prevented our standard automated rollback. Our team identified the root cause, executed a manual rollback independent of the affected provider, and restored full service across all regions. We have defined targeted action items to prevent recurrence.

Incident Overview

  • Affected services: API endpoints (including message sending), automations, webhooks, tracking links, and authentication
  • Impact: Requests to backend services returned errors for the duration of the incident. Users were unable to send messages, trigger automations, or access tracking data. No data was lost — requests were rejected before ingestion, so no messages were partially processed or left in an inconsistent state.
  • Detection: Our monitoring systems flagged elevated error rates within minutes of the issue beginning.
  • Contributing factor: GitHub Actions, our CI/CD provider, experienced a complete outage from 10:35 to 16:30 PST. This overlapped with our incident window and prevented our standard automated rollback from executing, extending the time to resolution.

Timeline of Events

All times in PST.

Time Event
10:59 Deployment of latest release initiated through standard CI/CD pipeline
11:44 Deployment completed and went live; services immediately began experiencing errors due to a missing configuration dependency
11:49 GitHub Actions, our CI/CD provider, experienced a complete outage, preventing standard rollback procedures
11:52 Monitoring alerts triggered; engineering team engaged
12:00 Incident declared; rollback initiated; engineering team assembled
12:07 Status page updated — issue identified and rollback in progress; rollback subsequently blocked by the GitHub Actions outage
12:39 Team pivoted to an alternative manual rollback approach, which required testing and validation
13:50 Team executed the manual rollback independent of GitHub Actions after validating that the approach was safe
14:01 Manual rollback completed in US region; services confirmed operational
14:26 Ireland region deployment completed; services confirmed operational
15:13 All services verified stable across all regions; incident resolved

Root Cause Analysis

The disruption was traced to a configuration change included in the latest release. The change introduced a startup dependency on a utility file that was intended to be bundled with the deployment package. However, the file was not included in the production build artifacts. When backend services attempted to initialize, they were unable to locate the required file and could not start, resulting in all incoming requests being rejected.

This discrepancy was not caught prior to production deployment because the file was present and functioning correctly in the development environment. The difference in how build artifacts are assembled between development and production environments meant the issue only manifested in production.
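The build-time check described in our action items can be sketched as follows. This is a minimal illustration, not our actual tooling; the manifest format and file names are hypothetical, assuming each service declares the files it reads at startup.

```python
import json
import sys
from pathlib import Path

def validate_artifacts(package_dir: str, manifest_path: str) -> list[str]:
    """Return the startup dependencies declared in the manifest that are
    missing from the assembled deployment package."""
    package = Path(package_dir)
    # Hypothetical manifest shape:
    # {"startup_dependencies": ["config/feature_flags.json", ...]}
    manifest = json.loads(Path(manifest_path).read_text())
    return [
        dep for dep in manifest.get("startup_dependencies", [])
        if not (package / dep).is_file()
    ]

if __name__ == "__main__" and len(sys.argv) == 3:
    missing = validate_artifacts(sys.argv[1], sys.argv[2])
    if missing:
        print(f"Build failed: missing startup dependencies: {missing}")
        sys.exit(1)
```

Run as the final step of artifact assembly, a non-empty result fails the build before the package ever reaches production, which is exactly where this incident's gap sat.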

Mitigation and Resolution

  1. Upon identifying the root cause, the team immediately initiated a rollback to the prior known-good release through our standard CI/CD pipeline.
  2. A complete outage at GitHub Actions, our CI/CD provider, prevented the automated rollback from completing. The team identified this external dependency and pivoted to an alternative approach.
  3. The team executed a manual rollback by retrieving prior deployment artifacts from our backup storage and deploying them directly, bypassing the affected CI/CD pipeline entirely.
  4. Services were restored region by region, with the US region confirmed operational at 14:01 PST and the Ireland region at 14:26 PST.
  5. Extended monitoring was conducted across all services; after confirming more than 45 minutes of stability, the incident was declared fully resolved at 15:13 PST.
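Step 3's selection of a "prior known-good release" from backup storage can be sketched as below. The artifact naming scheme is an assumption for illustration; lexicographic comparison only works here because the hypothetical names embed zero-padded date stamps.

```python
def pick_rollback_target(artifacts: list[str], bad_release: str) -> str:
    """Given backup artifact names like 'release-20260201.2.tar.gz'
    (hypothetical naming), return the newest artifact strictly older
    than the bad release."""
    # Zero-padded date stamps make lexicographic order match release order
    # (an assumption of this sketch, not a general-purpose version compare).
    candidates = sorted(a for a in artifacts if a < f"release-{bad_release}")
    if not candidates:
        raise RuntimeError("no known-good artifact available in backup storage")
    return candidates[-1]
```

The chosen artifact is then deployed directly to each region, bypassing the CI/CD provider entirely.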

Action Items

  1. Require successful staging deployment and smoke tests before any production deployment (Owner: Engineering; Priority: P1; Status: In Progress)
  2. Improve the reliability of automated smoke tests (Owner: Engineering; Priority: P1; Status: In Progress)
  3. Add build-time validation to confirm all referenced startup dependencies are present in deployment packages (Owner: Engineering; Priority: P2; Status: Open)
  4. Adopt an expedited rollback process, independent of CI/CD provider availability, as the standard emergency procedure; estimated to reduce recovery time by approximately 35 minutes (Owner: Engineering; Priority: P2; Status: In Progress)
Posted Feb 12, 2026 - 10:12 PST

Resolved

All services operational. A public-facing RFO will be available in the coming days. Thank you for your understanding and continued partnership.
Posted Feb 02, 2026 - 15:13 PST

Update

EU Courier instance deployment has landed and services are operational.
Posted Feb 02, 2026 - 14:26 PST

Update

EU instance redeployment is around 10 minutes from landing in production.
Posted Feb 02, 2026 - 14:14 PST

Monitoring

The manual rollback has landed successfully, and all services are operational. The team is monitoring and testing the remaining services.
Posted Feb 02, 2026 - 14:01 PST

Update

The team has rolled back the deployment manually, bypassing the GH Actions outage. We're waiting for the deployment to land so we can confirm that all core functionality has been restored.
Posted Feb 02, 2026 - 13:50 PST

Update

The team is exploring alternate methods to redeploy the release until GH Actions recovers.
Posted Feb 02, 2026 - 13:23 PST

Update

The Courier team has a redeploy ready and is waiting on GH Actions to recover from its degradation before publishing the release.
Posted Feb 02, 2026 - 13:03 PST

Update

The outage still impacts the following services:
- webhooks
- login
- API endpoints
- front-end (FE) access
Our team is actively working on the redeployment.
Posted Feb 02, 2026 - 12:44 PST

Update

We are continuing to work on a fix for this issue. The following services are still impacted:
- Login
- Webapp
Posted Feb 02, 2026 - 12:29 PST

Update

Due to GH Actions experiencing an outage (https://www.githubstatus.com/), the redeploy is taking longer than expected; we are now cutting a release.
Posted Feb 02, 2026 - 12:17 PST

Identified

The Courier team has encountered an issue on our platform impacting several services. The team has initiated a redeploy to roll back the changes.
Posted Feb 02, 2026 - 12:07 PST
This incident affected: Web Application, API, Automations, and Courier Inbox.