Awell - [Sandbox] Orchestration - GraphQL API outage – Incident details

[Sandbox] Orchestration - GraphQL API outage

Resolved
Major outage
Started 5 months ago
Lasted about 11 hours

Affected

Sandbox

Major outage from 4:26 AM to 10:57 AM, Operational from 10:57 AM to 3:22 PM

[Sandbox] Orchestration - GraphQL API

Major outage from 4:26 AM to 10:57 AM, Operational from 10:57 AM to 3:22 PM

[Sandbox] Design - GraphQL API

Operational from 4:26 AM to 10:38 AM, Major outage from 10:38 AM to 10:57 AM, Operational from 10:57 AM to 3:22 PM

[Sandbox] Care - Web App

Operational from 4:26 AM to 10:38 AM, Major outage from 10:38 AM to 10:57 AM, Operational from 10:57 AM to 3:22 PM

[Sandbox] Studio - Web App

Operational from 4:26 AM to 10:38 AM, Major outage from 10:38 AM to 10:57 AM, Operational from 10:57 AM to 3:22 PM

[Sandbox] Awell Platform - Web App

Operational from 4:26 AM to 10:38 AM, Major outage from 10:38 AM to 10:57 AM, Operational from 10:57 AM to 3:22 PM

Updates
  • Resolved

    Early this morning, our Sandbox environment experienced an outage due to a new data replication feature intended to improve the availability of the database cluster. During a routine maintenance operation, a database server failed to shut down gracefully because of the data replication configuration.

    Although the server eventually restarted, it could not synchronize with the other servers. The resulting continuous loop of failed synchronization attempts eventually made the cluster unresponsive.

    The Production environments are not at risk from this issue because they use a different data replication configuration. The Sandbox configuration has been aligned with the Production environments to eliminate the risk of a recurrence.

    The issue has now been fully resolved. We apologize for any inconvenience caused and appreciate your understanding.

  • Monitoring

    The database backup has been successfully restored. All services are back online. We will keep investigating the root cause of this issue and will post updates as we find more information.

  • Identified

    Our database cluster started experiencing issues around 4:15 AM UTC. Despite our best efforts, we were unable to return it to a healthy state. To restore service, we decided to restore the data from the latest backup. This operation is ongoing and should complete within the next hour. More information will be posted once service is restored.

  • Investigating

    [Sandbox] Orchestration - GraphQL API cannot be accessed at the moment. This incident was created by an automated monitoring service.