Degraded

Ongoing Service Disruption in NYC/EWR Due to Storage Backend Issue

Started 1 May at 10:19am EDT, resolved 2 May at 07:37am EDT.

nyc-vds-1 nyc-vds-2 nyc-vds-4 nyc-vds-5 nyc-vds-6 nyc-vds-7 nyc-vds-8 nyc-vds-9 nyc-vds-10 nyc-vds-11 nyc-vds-12 nyc-vds-13 nyc-vds-14 nyc-vds-15 nyc-vds-16 nyc-vds-17 nyc-vds-18 nyc-vds-19 nyc-pkvm-1 nyc-pkvm-2 nyc-pkvm-3 nyc-pkvm-4 nyc-pkvm-5 nyc-pkvm-6 nyc-pkvm-8 nyc-pkvm-9 nyc-pkvm-10 nyc-pkvm-11 nyc-pkvm-12 nyc-mkvm-1 nyc-mkvm-2 nyc-mkvm-3 nyc-mkvm-4 nyc-skvm-1 nyc-skvm-2 nyc-skvm-3 nyc-skvm-4 nyc-skvm-5 nyc-skvm-6 nyc-skvm-7 nyc-skvm-8
Resolved

We now have the all clear from our engineering team.
If you continue to experience any issues with your server, please don't hesitate to contact our support team.

Posted 2 May at 07:37am EDT.
Updated

We’ve made significant progress toward recovery. The previously unreachable storage node is back online. Our team has begun reintegrating drives into the Ceph cluster and is closely monitoring the process.

This is a critical step in restoring full functionality, and we’re proceeding carefully to ensure cluster stability and data integrity. Instance operations are working again, and we expect to see all services gradually recover as the cluster health improves.
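
For customers curious what this monitoring involves, here is a rough, illustrative sketch of the kind of health polling used while drives rejoin a Ceph cluster. It is not our internal tooling; it assumes shell access to an admin node with the `ceph` CLI and keyring available, and field names reflect recent Ceph releases.

```python
# Illustrative only: poll cluster health while drives are reintegrated.
# Assumes the `ceph` CLI and an admin keyring are available on this host.
import json
import subprocess
import time

def cluster_status():
    """Return `ceph status` as parsed JSON."""
    out = subprocess.run(
        ["ceph", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

while True:
    status = cluster_status()
    pgmap = status.get("pgmap", {})
    states = {s["state_name"]: s["count"] for s in pgmap.get("pgs_by_state", [])}
    clean = states.get("active+clean", 0)
    total = pgmap.get("num_pgs", 0)
    health = status["health"]["status"]
    print(f"health={health} pgs_clean={clean}/{total} "
          f"degraded_objects={pgmap.get('degraded_objects', 0)}")
    if health == "HEALTH_OK" and clean == total:
        break
    time.sleep(30)
```

In general, recovery is considered complete once every placement group reports active+clean and overall health returns to HEALTH_OK.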

Thank you again for your patience — we’ll continue to keep you updated as we move forward.

Posted 2 May at 04:28am EDT.
Updated

This remains our highest priority internally. We are continuing recovery work on the backend storage system and are taking every precaution to ensure data integrity throughout the process.

Following extensive discussions with our engineering team and storage vendor, we now have a more concrete estimate: we expect full restoration of functionality within the next 6 to 12 hours, barring any unforeseen complications. This estimate reflects the time required to complete filesystem recovery and verify the health of the affected Ceph cluster components.

We understand how disruptive this incident has been and we deeply appreciate your patience as we work toward a safe and stable resolution. We will provide a final update once recovery is complete or if there is any significant change in the timeline.

Thank you again for bearing with us.

Posted 1 May at 06:17pm EDT.
Updated

Recovery work is still in progress. While no major change has occurred yet, our team is engaged and working through several layers of the system to safely bring services back online. Thank you again for bearing with us — we will keep these updates coming regularly.

Posted 1 May at 04:01pm EDT.
Updated

We’re still working through the ongoing storage issue impacting instance management in this region. Our team is in continuous contact with the datacenter team and OpenStack engineers, ensuring all recovery efforts remain active. We’ll continue to share updates as we make progress.

Posted 1 May at 02:30pm EDT.
Updated

Our engineering team continues to work on restoring full functionality in the affected region. We remain focused on resolving the underlying storage issue and are closely monitoring system behavior as we proceed through recovery steps. We understand how disruptive this is and sincerely appreciate your ongoing patience.

Posted 1 May at 12:44pm EDT.
Created

We are currently experiencing a disruption in our NYC/EWR cloud region due to a failure in the storage backend following our ongoing datacenter migration (NYC to EWR). A portion of our Ceph storage cluster was physically relocated ahead of schedule, while another portion remains at the old location. One critical node is also currently offline, resulting in a degraded cluster state.
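
As a rough illustration of how a degraded state like this appears from the outside (a sketch only, not our internal runbook), the standard Ceph tooling can report which hosts and OSDs the cluster currently sees as down; the example below assumes admin access and the `ceph` CLI on a cluster node.

```python
# Illustrative only: list OSDs the cluster reports as down, grouped by host.
# Assumes `ceph osd tree --format json` works here (admin keyring present).
import json
import subprocess

tree = json.loads(subprocess.run(
    ["ceph", "osd", "tree", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout)

nodes = {n["id"]: n for n in tree["nodes"]}  # flat list of buckets and OSDs
for node in tree["nodes"]:
    if node.get("type") != "host":
        continue
    down = [nodes[c]["name"] for c in node.get("children", [])
            if c in nodes and nodes[c].get("status") == "down"]
    if down:
        print(f"{node['name']}: {', '.join(down)} down")
```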

As a result, CephFS (used to access base images and volumes) is in a hung state and is blocking all instance operations. This means that starting, stopping, or rebooting instances is currently not possible, although running instances remain online. Additionally, instances with attached volumes are unable to boot.
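
For context, a hung CephFS usually surfaces through the cluster's own health checks against the metadata servers. The sketch below shows one way such checks might be listed; it assumes admin access and the `ceph` CLI, and is illustrative rather than a record of our exact diagnostics.

```python
# Illustrative only: surface health checks related to CephFS / the MDS,
# which is typically where a hung filesystem shows up.
# Assumes the `ceph` CLI and an admin keyring on this host.
import json
import subprocess

health = json.loads(subprocess.run(
    ["ceph", "health", "detail", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout)

print("overall:", health.get("status"))
for name, check in health.get("checks", {}).items():
    # Checks such as FS_DEGRADED or MDS_SLOW_REQUEST would appear here.
    if name.startswith(("FS_", "MDS_")):
        print(name, "-", check["summary"]["message"])
```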

Our engineering team is actively working to recover the filesystem and restore the cluster’s health. Due to the complexity of this failure and the nature of Ceph’s consistency mechanisms, this process is delicate and time-consuming. At this time, we cannot provide a precise ETA for full restoration.

We will continue to post updates as we progress. We sincerely apologize for the inconvenience and understand how disruptive this is. Thank you for your continued patience.

Posted 1 May at 10:19am EDT.