Awell - [US] Orchestration - GraphQL API outage – Incident details

[US] Orchestration - GraphQL API outage

Resolved
Major outage
Started 5 months ago
Lasted about 1 hour

Affected

Production - US

Major outage from 8:23 AM to 9:17 AM

[US] Orchestration - GraphQL API

Major outage from 8:23 AM to 9:17 AM

Updates
  • Update

    We’re happy to let you know that we’ve wrapped up all the steps to fix the recent outage. Our team has identified and resumed 26 stuck care flows.

    Here’s what we’ve done:

    1. Fixing Bottlenecks: We identified the root cause of each bottleneck (a memory leak in the message broker and throttled CPU in the application database) and made the changes needed to remove them.

    2. Better Alerts: We’ve set up new alerting policies that will notify us proactively if something similar happens again.

    3. Improved Recovery: We’ve added more recovery strategies to make sure care flows don’t get stuck after incidents like this.

    Thank you for your patience and understanding as we worked through this. We’re committed to providing you with reliable service and will continue to improve our systems.

  • Update

    An unexpected increase in product usage put two key systems under stress: the message broker and the application database. We have identified the cause of the bottleneck in each system (a memory leak in the message broker, throttled CPU in the application database) and made the changes needed to remove them.

    In addition, we've created new alerting policies that will notify us early if a similar scenario develops, so that we can take mitigating action before systems fail.

    We are still investigating the impact of this incident and will post another update once it has been determined.

  • Update

    The team is investigating the root cause to ensure the issue does not reoccur. Further updates will be provided once the investigation is complete.

  • Resolved

    [US] Orchestration - GraphQL API is now operational! This update was created by an automated monitoring service.

  • Investigating

    [US] Orchestration - GraphQL API cannot be accessed at the moment. This incident was created by an automated monitoring service.