System Outage
Incident Report for Flow XO
Postmortem

At 07:02 UTC, a critical component of Flow XO’s infrastructure, our Redis cache server, stopped communicating due to a failed load balancer at our service provider. Although our Redis deployment uses a redundant configuration with a primary and a secondary load balancer, the application did not fail over to the secondary in a timely manner, leading to an outage of our front-end application servers, which control all messaging orchestration and power the web application.
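
For illustration only, the sketch below shows the kind of client-side failover that should have happened automatically. The library (redis-py), endpoint names, and timeout are assumptions made for this example and do not describe our provider's actual load balancer setup.

```python
import redis

# Hypothetical endpoints; the real hostnames and the provider's load
# balancer configuration are not part of this report.
PRIMARY = {"host": "redis-primary.example.internal", "port": 6379}
SECONDARY = {"host": "redis-secondary.example.internal", "port": 6379}

def connect_with_failover(timeout_seconds=2):
    """Try the primary endpoint first and fall back to the secondary
    if the primary stops responding within the timeout."""
    for endpoint in (PRIMARY, SECONDARY):
        client = redis.Redis(
            host=endpoint["host"],
            port=endpoint["port"],
            socket_timeout=timeout_seconds,
            socket_connect_timeout=timeout_seconds,
        )
        try:
            client.ping()  # cheap health check
            return client
        except redis.exceptions.ConnectionError:
            continue  # endpoint unreachable, try the next one
    raise RuntimeError("no Redis endpoint is reachable")

```

In our deployment the application is expected to switch to the secondary load balancer in roughly this way; understanding why it did not is part of the root-cause investigation described below.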

At about 07:31 UTC, the load balancer was manually failed over, our application servers and workers quickly recovered, and service was restored.

During this half-hour outage, a backlog of scheduled work built up, delaying scheduled tasks and broadcasts for most users by up to an hour, although real-time messaging remained fully operational throughout this period.

We sincerely apologize for the disruption this may have caused your business or your users. We are still investigating why our redundant Redis configuration failed to take over automatically, and once we have confirmed the exact cause we will correct the defect.

Furthermore, over the medium term we plan to implement an additional safeguard to prevent a future outage in the case of a catastrophic Redis failure:

  • introducing a secondary vendor for Redis caching services, rather than simply providing redundancy within a single cloud (see the sketch below).
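
As a rough illustration of that plan, the endpoint list in the sketch above would span two independent vendors rather than two load balancers in one cloud. The vendor names and hostnames here are placeholders, not a commitment to any specific provider:

```python
# Placeholder endpoints across two independent vendors (hypothetical).
REDIS_ENDPOINTS = [
    {"host": "redis.vendor-a.example.com", "port": 6379},  # current provider
    {"host": "redis.vendor-b.example.com", "port": 6379},  # secondary vendor
]
```
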
Posted Aug 13, 2020 - 23:17 UTC

Resolved
Major outage - all messaging processing offline.
Posted Aug 12, 2020 - 07:30 UTC