Message send delays
Incident Report for Courier
Postmortem

Impact

Courier experienced delayed message delivery in its send pipeline, impacting 0.1% of messages from 12:50 to 21:50 PT on July 14, 2022. No messages were dropped as a result of the incident, and 99.9% of send calls experienced no delivery delay. For impacted messages, the average send delay was 3 hours and 20 minutes.

Root Cause

Courier uses feature flags to safely roll out new features. Due to a misconfigured flag, a larger-than-expected volume of send requests was included in a validation experiment meant to verify that a refactor of the send pipeline was safe to roll out. These validation requests added significant load on key stages of the send pipeline and caused regular, non-validation requests to queue.
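
As an illustration of this failure mode, the following is a minimal sketch of a percentage-based experiment flag that rejects out-of-range or unexpectedly large values before they take effect. It is not Courier's flag library: the names `validate_rollout_percent`, `in_experiment`, and `MAX_EXPERIMENT_PERCENT`, and the 5% cap, are all hypothetical and chosen only for this example.

```python
import hashlib

# Hypothetical safety cap: experiments above this share of traffic
# must be rejected (or approved through a separate path).
MAX_EXPERIMENT_PERCENT = 5.0


def validate_rollout_percent(percent, cap=MAX_EXPERIMENT_PERCENT):
    """Return the rollout percentage as a float, or raise if it is unsafe."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(percent, bool) or not isinstance(percent, (int, float)):
        raise TypeError(f"rollout percent must be a number, got {percent!r}")
    if not 0 <= percent <= 100:
        raise ValueError(f"rollout percent {percent} is outside 0-100")
    if percent > cap:
        raise ValueError(f"rollout percent {percent} exceeds the cap of {cap}")
    return float(percent)


def in_experiment(request_id, percent):
    """Deterministically decide whether a request joins the experiment."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the request into one of 10,000 buckets covering [0.00, 99.99].
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 100.0
    return bucket < percent
```

With a check like this, a value such as 50 where 0.5 was intended would fail at configuration time rather than silently enrolling half of all send requests.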

Remediation

Courier incrementally scaled up processing capacity in the send pipeline to work through the large accumulated backlog of messages. In addition, a hotfix was released to production to drop validation messages that had already entered the send pipeline.
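
The report does not describe how the hotfix identified validation messages. Purely as a sketch, assuming each queued message carries a hypothetical `is_validation` marker, a consumer could discard validation traffic while still delivering every regular message:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class QueuedMessage:
    message_id: str
    is_validation: bool  # hypothetical marker set when the experiment enqueues


def drain(
    queue: Iterable[QueuedMessage],
    deliver: Callable[[QueuedMessage], None],
) -> Tuple[int, int]:
    """Deliver regular messages and drop validation ones.

    Returns (delivered_count, dropped_count).
    """
    delivered = 0
    dropped = 0
    for message in queue:
        if message.is_validation:
            # Validation traffic is safe to discard: it is not customer-facing.
            dropped += 1
            continue
        deliver(message)
        delivered += 1
    return delivered, dropped
```

Filtering at the consumer frees capacity for the real backlog without touching messages that customers are waiting on.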

Follow-up actions

  • Courier has established a process to better validate flag configuration in the future, and has changed its feature flag helper library to make it less error-prone to use.
  • Courier has created an incident playbook to guide on-call engineers through options to quickly scale up message processing in the send pipeline.
Posted Jul 19, 2022 - 09:57 PDT

Resolved
The incident has been resolved.
Posted Jul 14, 2022 - 22:06 PDT
Monitoring
A fix has been implemented and we are monitoring system health. All backlogged messages are being processed.
Posted Jul 14, 2022 - 21:49 PDT
Update
We are continuing to work towards resolution of the issue. We are currently seeing delays of approximately 2 hours for some message deliveries.
Posted Jul 14, 2022 - 18:26 PDT
Identified
The issue has been identified and a resolution is being deployed to our production services.
Posted Jul 14, 2022 - 15:50 PDT
Investigating
We are currently investigating an issue that is affecting send times for some messages.
Posted Jul 14, 2022 - 14:34 PDT
This incident affected: Integrations: Outbound from Courier (Mailgun API, Mailgun Outbound Delivery, Mailjet REST API, Mailjet SEND API, Nexmo Outbound SMS, Plivo SMS API, Segment Data Ingestion (Tracking) API, Segment Cloud Sources, Slack Apps/Integrations/APIs, SparkPost SMTP API - USA, SparkPost SMTP API - EUROPE, SparkPost SMTP Delivery - USA, SparkPost SMTP Delivery - EUROPE, Twilio AUTOPILOT, Twilio SMS) and Courier (API).