Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Apigee, Cloud Filestore, Cloud Logging, Google BigQuery, Google Cloud Bigtable, Google Cloud Dataflow, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Google Compute Engine, Operations, Persistent Disk, Virtual Private Cloud (VPC)

Multiple services impacted in australia-southeast1.

Incident began at 2024-05-08 19:00 and ended at 2024-05-08 22:28 (all times are US/Pacific).

Previously affected location(s)

Sydney (australia-southeast1)

Date Time Description
21 May 2024 15:51 PDT

Incident Report

Summary

On Wednesday, 8 May 2024, multiple Google Cloud services experienced a partial service outage in australia-southeast1-a for varying durations of up to 2 hours and 55 minutes. The full list of impacted products and services is detailed below.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you.

Root Cause

On 8 May, at 18:44 US/Pacific, a public utility power issue resulted in an undervoltage condition followed by power loss that affected a portion of Google’s third-party data center in Sydney. As a result of this issue, the operating current exceeded the trip settings of the automatic transfer switch (ATS) units.

ATS units have trip settings to protect the load from electrical faults. Additionally, ATS units are configured in pairs to provide a redundant power path to the critical load.

In this case, both ATS units feeding the affected rows exceeded their trip settings due to overcurrent. Further investigation into the ATS units determined that they were configured with trip settings that were not in accordance with the site design.

Remediation and Prevention

Google engineers were alerted to the outage via internal monitoring on Wednesday, 8 May at 18:55 US/Pacific and immediately started an investigation. On-site data center operations were engaged at 19:00 US/Pacific, and the scope of the power loss was confirmed at 19:22 US/Pacific.

On-site engineers restored power to the affected rows at 19:00 US/Pacific by manually closing breakers for both of the ATS units.

On Wednesday, 8 May at 19:42 US/Pacific, network connectivity for the affected racks began recovering. All services had recovered by 21:55 US/Pacific, with the exception of a very small percentage of Persistent Disk devices, which required manual intervention.

On Thursday, 9 May at 07:47 US/Pacific, the public utility power issue was resolved.

All power had been switched back to utility feeds on Thursday, 9 May at 09:26 US/Pacific.

Google is committed to preventing a repeat of this issue and is completing the following actions:

  • A case has been opened with the utility provider to determine the cause of the power event that led to the undervoltage condition.
  • Audit and update all ATS device settings if needed based on current site load.
  • Complete a full audit of all operational procedures for the Sydney third-party data center location.

Detailed Description of Impact

On Wednesday, 8 May, from 18:45 to 21:40 US/Pacific, multiple Google Cloud services experienced a partial service outage in the australia-southeast1-a zone.

Persistent Disk:

  • From 18:45 to 21:45 US/Pacific, approximately 0.4% of Persistent Disk devices in australia-southeast1-a experienced failures for disk operations, including snapshots, clones, and attachments.
  • Approximately 0.05% of PD devices experienced extended impact and required manual intervention.

Google Cloud Dataflow:

  • From 18:45 to 21:55 US/Pacific, customers experienced increased latency for affected streaming jobs.
  • From 19:45 to 20:45 US/Pacific, some batch jobs took longer than normal to execute.
  • From 18:49 to 19:15 US/Pacific, a small number of new job submissions failed.

Google Cloud Pub/Sub:

  • For a total of 13 minutes between 18:46 and 19:19 US/Pacific, affected customers experienced intermittent request error rates of up to 1% and elevated message delivery latency.
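
As context for error spikes of this size: the Pub/Sub client libraries retry transient publish failures automatically, so brief windows of elevated errors are often absorbed without application changes. The following is a minimal sketch, assuming the google-cloud-pubsub Python library and placeholder project/topic names; it is illustrative only and not part of Google's incident response.

    # Minimal publish-with-timeout sketch; "example-project" and "example-topic"
    # are placeholders, not resources referenced in this incident.
    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("example-project", "example-topic")

    future = publisher.publish(topic_path, data=b"payload")
    try:
        # Blocks until the publish succeeds (built-in retries absorb brief error
        # spikes) or the timeout is reached.
        message_id = future.result(timeout=60)
        print(f"published message {message_id}")
    except TimeoutError:
        # Retries exhausted within the deadline; handle at the application level.
        print("publish did not complete within 60 seconds")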

Google BigQuery:

  • From 18:45 to 21:50 US/Pacific, 0.8% of jobs submitted through the Job API failed, and the Metadata API experienced 5 minutes of failures.
  • From 19:00 to 19:20 US/Pacific, 17% of projects experienced slower than normal performance.
  • From 21:25 to 22:25 US/Pacific, over 10% of projects experienced slower than normal performance.
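
For readers reproducing job behavior from this window: queries enter BigQuery through the Job API, and a failed job surfaces as an exception when its result is awaited, after which it can simply be resubmitted. The sketch below assumes the google-cloud-bigquery Python library and placeholder project, dataset, and table names; it is an illustration, not guidance specific to this incident.

    # Minimal Job API sketch; the project, dataset, and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")
    sql = "SELECT COUNT(*) AS n FROM `example-project.example_dataset.example_table`"

    job = client.query(sql)      # submits a query job via the Job API
    try:
        rows = job.result()      # waits for completion; raises if the job failed
        for row in rows:
            print(row["n"])
    except Exception as exc:     # a failed job can be retried by resubmitting the query
        print(f"job {job.job_id} failed: {exc}")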

Google Compute Engine:

  • From 18:45 to 19:30 US/Pacific, 11.8% of VMs in australia-southeast1-a were paused and restarted, and another 4.0% experienced pauses without restarting.

Cloud Filestore:

  • From 18:45 to 21:43 US/Pacific, affected customers were unable to access their NFS filestore in the australia-southeast1-a zone.

Virtual Private Cloud (VPC):

  • From 18:45 to 21:31 US/Pacific, affected customers experienced delays while creating new VMs and packet loss / unreachability for existing VMs. VMs in australia-southeast1-a that went down may have experienced delayed network programming when they were recreated. Roughly half of the traffic to australia-southeast1-a VMs was dropped.

Cloud SQL:

  • From 18:45 to 21:45 US/Pacific, affected customers were unable to access their Cloud SQL instances in australia-southeast1-a. High Availability instances successfully failed over to other zones and recovered in 1-4 minutes. The majority of affected zonal instances were inaccessible for 20-30 minutes, with a few experiencing extended recovery times of up to 3 hours.
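
The difference in recovery times above reflects instance configuration: High Availability (regional) instances keep a standby in a second zone and fail over automatically, while zonal instances wait for their zone to recover. As a hedged illustration only, using the Cloud SQL Admin API through the Google API Python client with placeholder project and instance names, HA is requested by setting REGIONAL availability:

    # Minimal sketch of requesting REGIONAL (HA) availability; all names are placeholders.
    from googleapiclient import discovery

    sqladmin = discovery.build("sqladmin", "v1beta4")

    body = {
        "name": "example-ha-instance",
        "region": "australia-southeast1",
        "databaseVersion": "POSTGRES_15",
        "settings": {
            "tier": "db-custom-2-7680",
            # A standby replica in a second zone enables automatic failover.
            "availabilityType": "REGIONAL",
        },
    }

    operation = sqladmin.instances().insert(project="example-project", body=body).execute()
    print(operation["name"])  # long-running operation to poll for completion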

Cloud Logging:

  • From 18:45 to 19:05 US/Pacific, affected customers may have experienced a small increase in error rates for inflight requests in this zone.
  • Cloud Logging is a regional service so the vast majority of the requests in the australia-southeast1 region were not affected.

Cloud Bigtable:

  • From 18:45 to 19:10 US/Pacific, affected customers would have experienced high error rates in australia-southeast1-a.

Cloud Apigee:

  • From 18:45 to 19:15 US/Pacific, there were multiple periods of impact ranging from 5 to 30 minutes, with error rates between 5% and 38%, due to nodes restarting.
  • During the impact periods, customers may have experienced a “GKE cluster is currently undergoing repair” error.

Memorystore for Redis:

  • From 18:46 to 21:31 US/Pacific, a subset of basic tier instances in the australia-southeast1-a zone would have been unavailable.
  • Affected standard tier instances may have experienced brief unavailability as they failed over to replicas.
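
Because a standard tier failover keeps the instance endpoint unchanged, clients that retry on dropped connections generally ride out the brief interruption. A minimal sketch of that pattern, assuming the redis-py client and a placeholder host address, is shown below; it is illustrative and not specific to any instance involved in this incident.

    # Minimal reconnect-with-backoff sketch; the host address is a placeholder.
    import time

    import redis

    r = redis.Redis(host="10.0.0.3", port=6379, socket_timeout=2)

    def get_with_retry(key, attempts=5, base_backoff=1.0):
        for attempt in range(attempts):
            try:
                return r.get(key)
            except (redis.ConnectionError, redis.TimeoutError):
                # Connection dropped (for example, during a failover); back off and retry.
                time.sleep(base_backoff * (2 ** attempt))
        raise RuntimeError("Redis unavailable after retries")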

Dialogflow:

  • From 18:45 to 19:15 US/Pacific, affected customers would have experienced up to a 3% error rate in australia-southeast1.

Google Kubernetes Engine:

  • From 18:45 to approximately 19:45 US/Pacific, 14% of GKE clusters in australia-southeast1 were unavailable.

9 May 2024 10:58 PDT

Mini Incident Report

We apologize for any inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 8 May 2024, 18:45

Incident End: 8 May 2024, 21:40

Duration: 2 hours and 55 minutes

Affected Services and Features:

  • Persistent Disk
  • Google Cloud Dataflow
  • Google Cloud Pub/Sub
  • Google BigQuery
  • Google Compute Engine
  • Cloud Filestore
  • Virtual Private Cloud (VPC)
  • Cloud SQL
  • Cloud Logging
  • Cloud Bigtable
  • Cloud Apigee

Regions/Zones: australia-southeast1

Description:

Multiple Google Cloud products experienced service disruptions of varying impact and duration, with the longest lasting 2 hours and 55 minutes, in the australia-southeast1 region. From preliminary analysis, the root cause of this incident is currently believed to be an unplanned power event: a power failover triggered by a utility company outage. Google will complete a full Incident Report in the following days that will provide a detailed root cause.

Customer Impact:

  • Persistent Disk - impacted users experienced slow or unavailable devices.
  • Google Cloud Dataflow - impacted users experienced increased watermark lag for streaming jobs in the australia-southeast1-a zone for a duration of 30 minutes.
  • Google Cloud Pub/Sub - users experienced an increased error rate for “Publish requests” for a duration of about 35 minutes.
  • Google BigQuery - impacted users experienced failures for BigQuery jobs in the australia-southeast1 region.
  • Google Compute Engine - impacted VMs went into repair mode for about 45 minutes.
  • Cloud Filestore - multiple Filestore instances in australia-southeast1-a were unavailable and had missing metrics for a duration of 2 hours 55 minutes, with the last impacted instance confirmed to have recovered at 21:43 PT.
  • Virtual Private Cloud (VPC) - the impacted users experienced packet loss, unavailability of existing VMs and delays while creating new VMs.
  • Cloud SQL - impacted users experienced errors when accessing their Cloud SQL database instances in the australia-southeast1-a zone.
  • Cloud Logging - Cloud Logging experienced a minor increase in ingestion errors in australia-southeast1 for a duration of 15 minutes.
  • Cloud Bigtable - users experienced a high error rate in the impacted region for a duration of about 25 minutes.
  • Cloud Apigee - impacted users received 5XX and 2XX errors for a duration of 30 minutes.

Additional details:

After service mitigation and full closure of the incident, a small, identified group of customers continued to experience Persistent Disk impact. This has since been resolved, with no further impact.

8 May 2024 22:28 PDT

The issue with Apigee, Cloud Filestore, Cloud Logging, Google BigQuery, Google Cloud Bigtable, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud SQL, Google Compute Engine, Persistent Disk, Virtual Private Cloud (VPC) has been resolved for all affected users as of Wednesday, 2024-05-08 21:40 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

8 May 2024 21:32 PDT

Summary: Multiple services impacted in australia-southeast1.

Description: We are experiencing an issue with Persistent Disk, Google Cloud Dataflow, Google Cloud Pub/Sub, Google BigQuery, Google Compute Engine, Cloud Filestore, Virtual Private Cloud (VPC), Cloud Logging, Cloud SQL, Cloud Bigtable, and Apigee beginning at Wednesday, 2024-05-08 18:45 US/Pacific.

A mitigation strategy has been identified. The services are now recovering.

We will provide an update by Wednesday, 2024-05-08 23:00 US/Pacific with current details.

Diagnosis: Multiple GCP services are experiencing issues in the australia-southeast1 region.

Persistent Disk: While most devices have restored their functionality, some users might encounter slow or unavailable devices.

Google Cloud Dataflow: Users experienced increasing watermark lag for streaming jobs. The issue with Google Cloud Dataflow was mitigated at 2024-05-08 19:47:27 PDT.

Google Cloud Pub/Sub: The Pub/Sub impact is mitigated.

Google BigQuery: Impacted users experienced failures with BigQuery jobs in the australia-southeast1 region. The issue with Google BigQuery has been resolved for all affected users as of Wednesday, 2024-05-08 21:13 US/Pacific.

Google Compute Engine: VMs went into repair for around 45 minutes and have started recovering.

Cloud Filestore: Filestore has partially recovered. However, a small subset of users are still unable to access their NFS Filestore instances in the australia-southeast1-a zone.

Virtual Private Cloud (VPC): The impacted users may face delays while creating new VMs and packet loss / unreachability for existing VMs.

Cloud SQL: A subset of Cloud SQL users are experiencing errors when accessing their Cloud SQL database instances in the australia-southeast1-a zone.

Cloud Logging: All requests were failing at the send-request step. The issue with Cloud Logging has been resolved for all affected users as of Wednesday, 2024-05-08 21:16:07 US/Pacific.

Cloud Bigtable: Cloud Bigtable experienced a high error rate for 25 minutes in australia-southeast1-a due to a power event. The issue with Cloud Bigtable has been resolved for all affected users as of Wednesday, 2024-05-08 20:08:30 US/Pacific.

Apigee: There was a minor outage due to a GKE error that caused all of the nodes to restart while the GKE cluster was undergoing repair. This resulted in a 30-minute outage for affected customers. The issue with Apigee has been resolved for all affected users as of Wednesday, 2024-05-08 20:34:47 US/Pacific.

Workaround: None at this time.

8 May 2024 20:29 PDT

Summary: Multiple services impacted in australia-southeast1.

Description: We are experiencing an issue with Google BigQuery, Cloud Filestore, and Cloud Pub/Sub beginning at Wednesday, 2024-05-08 18:45 US/Pacific.

A mitigation strategy has been identified. The services are now recovering.

We will provide an update by Wednesday, 2024-05-08 21:30 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Multiple GCP services are experiencing issues in the australia-southeast1 region.

Persistent Disk: While most devices have restored their functionality, some users might encounter slow or unavailable devices.

Google Cloud Dataflow: Users experienced increasing watermark lag for streaming jobs. The issue with Google Cloud Dataflow was mitigated at 2024-05-08 19:47:27 PDT.

Google Cloud Pub/Sub: The Pub/Sub impact is mitigated.

Google BigQuery: Impacted users may experience failures with BigQuery jobs in the australia-southeast1 region.

Google Compute Engine: VMs went into repair for around 45 minutes and have started recovering. The issue with Compute Engine was mitigated at 2024-05-08 19:43:43 PDT.

Cloud Filestore: Impacted customers are unable to access their NFS Filestore instances in the australia-southeast1-a zone.

Workaround: None at this time.