Google Cloud Pub/Sub Reliability User Guide: Part 1 Publishing

Kir Titievsky
Google Cloud - Community
Oct 19, 2020 · 13 min read

This is the first in a series of posts that will help users of Google Cloud Pub/Sub write reliable applications that use the service. The following parts will cover subscribing and administrative operations.

Yes, Pub/Sub is very reliable and highly available. No, it is not perfectly reliable. And that is where you, as an application developer, come in. My hope is that this set of articles will give you the background to design for extreme reliability. The articles are written by the product manager of Cloud Pub/Sub with a lot of help from Kamal Aboul-Hosn and others.

Types of publish errors

The first concern of an application using Pub/Sub is publishing: getting messages to the durable storage offered by Pub/Sub. Any unavailability of the publish API creates a risk of data loss. So, as an application designer, you must strike a balance between the complexity of your code and the types of unavailability it can handle without losing data. There are three main classes of unavailability to anticipate:

  • Transient failures: individual requests or seconds of degraded availability
  • Temporary unavailability: seconds to minutes
  • Extended unavailability: tens of minutes to hours

We look at some design considerations for each of these, as well as a few other cases. In addition, we briefly discuss the metrics and recommended alerts required to support any practical reliability design. We recommend reviewing the Cloud Pub/Sub Monitoring Guide for a more thorough review of metrics before delving into design.

Transient failures

Pub/Sub, like many multitenant, distributed services, is subject to a background rate of transient request failures. These are rare: typically one or fewer per thousand requests. The error rate will vary with your application behavior, network connectivity to GCP, and a host of other factors, so you should benchmark it for your particular application. There are two main kinds of transient request failures: errors explicitly returned by the service and request timeouts. These are usually not strongly correlated in time, at least on the scale of seconds. Therefore, the common strategy for dealing with both is simply retrying requests. Note that some errors, such as those caused by malformed requests or invalid resource identifiers, are not retryable; these are beyond the scope of this reliability discussion. While the Pub/Sub client libraries generally time out and retry requests for you, it is important to understand the retry settings and to handle, log, and alert (e.g. using log-based metrics) on permanent failures. The publisher guide is a good starting point for understanding these.

Request timeouts can be caused by the client (based on the configuration you chose as a developer), by the Pub/Sub service, or by the network connection between your application and the service. Like any multi-layered, multi-tenant, distributed system, Pub/Sub has a “long tail” of request latencies: while most requests complete in tens of milliseconds, it is possible for some to get stuck for seconds or even minutes before succeeding or failing. This may be surprising if you are used to interacting with a system with pre-provisioned, dedicated capacity and a single, local network hop between the publisher and the machine that processes the requests. The long-tail latencies and a background rate of errors are the price of horizontal scalability. The latency and errors can be caused by a temporarily overloaded machine that happens to handle your request, a network problem, or a host of other issues. By assuming your application can cope with this, Pub/Sub is free to load-balance traffic to resolve such problems and to scale up or down.

As the developer of a publisher application, you have to choose a timeout duration. The price of shorter timeouts is duplicate messages: if a request was going to succeed given time, timing out and retrying too early effectively publishes the same message twice, with different message IDs. This creates more trouble for subscribers and comes with additional processing and network cost per message. The benefit of aggressive timeouts, on the other hand, is that they can cut the tail latency. We recommend timeouts of 10 seconds as a starting point.

Once you have a request failure, you have to design a retry policy. Start by understanding typical errors and request timeouts for your application. You can do this using client-side metrics, or estimate it using the GCP API metrics (serviceruntime.googleapis.com/api/request_latencies, with the 99th percentile and 50th percentile aggregators in particular). Note that this measures how long it takes the service to complete the request and does not capture the latency of the network connection between your application and GCP. You will have to measure that with client-side instrumentation.
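As a concrete illustration, here is a minimal sketch of client-side latency instrumentation with the Java client library: it times each publish from the call to publish() until its future resolves, which includes batching delay, network time, and any retries. The class and method names are illustrative, and the printout stands in for whatever metrics system you use.

```java
import com.google.api.core.ApiFuture;
import com.google.api.core.ApiFutureCallback;
import com.google.api.core.ApiFutures;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;

public class PublishLatencyTimer {
  public static void timedPublish(Publisher publisher, ByteString data) {
    long startNanos = System.nanoTime();
    ApiFuture<String> future =
        publisher.publish(PubsubMessage.newBuilder().setData(data).build());
    ApiFutures.addCallback(
        future,
        new ApiFutureCallback<String>() {
          @Override
          public void onSuccess(String messageId) {
            // Includes batching delay, network round trips, and any retries.
            record(System.nanoTime() - startNanos, true);
          }

          @Override
          public void onFailure(Throwable t) {
            record(System.nanoTime() - startNanos, false);
          }
        },
        MoreExecutors.directExecutor());
  }

  private static void record(long elapsedNanos, boolean success) {
    // Placeholder: feed a histogram or a custom Cloud Monitoring metric here.
    System.out.printf("publish latency_ms=%.1f success=%b%n", elapsedNanos / 1e6, success);
  }
}
```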

Once you have a sense of the distribution of error rates and latencies, you have to configure the retry policy. The Pub/Sub client libraries generally adopt exponential backoff, starting from an initial timeout and continuing to retry up to a total timeout (see “Retrying Requests” in the Publisher Guide and the documentation for Java RetrySettings and the Python config). The most important settings here are:

  • Initial RPC Timeout: this determines how soon to abandon the initial RPC.
  • Total Timeout: how long to keep retrying.

We recommend starting with a total timeout of 10 minutes and an initial RPC timeout of 10 seconds.
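As a sketch of what these settings look like in the Java client library (assuming the gax RetrySettings builder and its threeten-bp Duration type; method names can vary slightly between library versions), the recommended starting values could be applied roughly like this. The backoff delays shown are illustrative, not library defaults.

```java
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class PublisherRetryConfig {
  public static Publisher create(String projectId, String topicId) throws Exception {
    RetrySettings retrySettings =
        RetrySettings.newBuilder()
            .setInitialRpcTimeout(Duration.ofSeconds(10)) // abandon the first RPC after 10 s
            .setRpcTimeoutMultiplier(1.0)                 // keep per-RPC timeouts constant
            .setMaxRpcTimeout(Duration.ofSeconds(10))
            .setInitialRetryDelay(Duration.ofMillis(100)) // first backoff delay (illustrative)
            .setRetryDelayMultiplier(2.0)                 // exponential backoff between retries
            .setMaxRetryDelay(Duration.ofSeconds(60))     // cap on the backoff delay
            .setTotalTimeout(Duration.ofMinutes(10))      // give up entirely after 10 minutes
            .build();

    return Publisher.newBuilder(TopicName.of(projectId, topicId))
        .setRetrySettings(retrySettings)
        .build();
  }
}
```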

Longer retry periods make your application more robust to short-term failures, but they also may extend the time it takes you to find out about the error condition through logs and alerts. They also expose your application to the risk of running out of memory: every message that is to be retried has to be stored in memory, so your machine needs enough RAM to hold at least the rate of incoming data multiplied by the total timeout duration. A shorter timeout, say 1 minute, may be appropriate if the publisher application is downstream from a load balancer and an application that can retry the request. For example, if the requests originate on a mobile client with a retry policy, failing quickly allows the mobile client to attempt sending the request to another front end rather than waiting for the response. The publisher flow control mechanism implemented in the Pub/Sub Client Libraries can help implement such fast failure safely.

Retrying does not always solve the problem. For example, you may encounter a non-retryable error: the application’s credentials may become invalid, or you may no longer have IAM permission to publish on the topic. In addition, only some server error response codes are treated as retryable by the client library; others are not retried. The client libraries generally provide safe defaults that can be configured. The Java Client Library default settings are a good guide, but you should be aware of the settings used by the client library you choose. It is also possible that the transient unavailability lasts longer than your total timeout, in which case timeouts and retryable errors effectively become permanent errors. For example, you may have experienced a network partition or degradation lasting longer than the total timeout period.

You have to handle this case explicitly: through publish request future callbacks when using the Client Libraries, or through another, custom mechanism otherwise. You could choose to save the message on a local disk in this case. The decision is a matter of trading off complexity and reliability. The most important thing to do is to surface the failure as a client-side metric (e.g. with custom metrics or log-based metrics) in Cloud Monitoring.
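A minimal sketch of such a failure callback with the Java client library is shown below. The log format and the idea of spilling to disk are illustrative; the only library-specific pieces are ApiFutures.addCallback and the ApiException.isRetryable() check.

```java
import com.google.api.core.ApiFuture;
import com.google.api.core.ApiFutureCallback;
import com.google.api.core.ApiFutures;
import com.google.api.gax.rpc.ApiException;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;

public class PublishWithFailureHandling {
  public static void publish(Publisher publisher, ByteString payload) {
    PubsubMessage message = PubsubMessage.newBuilder().setData(payload).build();
    ApiFuture<String> future = publisher.publish(message);
    ApiFutures.addCallback(
        future,
        new ApiFutureCallback<String>() {
          @Override
          public void onSuccess(String messageId) {
            // The message is durably stored by Pub/Sub; nothing more to do.
          }

          @Override
          public void onFailure(Throwable t) {
            // We only get here once retries are exhausted or the error is non-retryable.
            boolean retryable = t instanceof ApiException && ((ApiException) t).isRetryable();
            // Surface the failure: a structured log line like this can back a
            // log-based metric and an alert in Cloud Monitoring.
            System.err.println("PUBLISH_PERMANENT_FAILURE retryable=" + retryable + " cause=" + t);
            // Optionally spill the payload to local disk for later recovery (not shown here).
          }
        },
        MoreExecutors.directExecutor());
  }
}
```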

If retries do not solve your problem, you may need to give your application more time and more retries by adjusting the corresponding client settings, or you may need more machine or network capacity in case the issue was caused by an increase in traffic. In fact, insufficient CPU and network capacity is one of the most common sources of publish timeouts seen by the Pub/Sub team. It is also possible that you are facing a non-random failure: temporary or extended unavailability, the subject of the next sections.

Temporary unavailability

Retries will not be effective if you cannot get through to Pub/Sub for long enough. This might happen due to a network issue between your application and the Pub/Sub service or a regional outage of Pub/Sub. The issue need not be total unavailability: partial unavailability with a high request error rate may cause a high rate of retries, leading you to exceed available network, RAM, or CPU capacity. This is particularly likely if your application has limited connectivity to GCP or operates at rates on the order of 100 MB/s or more per machine in GCP. (This blog post offers some tips on understanding and optimizing the bandwidth available to your machines.)

Pub/Sub is designed to keep outages of data operations (publishing and receiving messages) regional. Specifically, Pub/Sub handles outages of a single zone without significant visible impact. For an outage to become regional, two availability zones must fail at the same time, as described in this architecture overview. Multi-regional or “global” outages affect only the regions where you have running publishers. If you have an application publishing from regions A, B and C, and regions A and B go down, the application in region C is unaffected. At the same time, traffic from applications in A and B is not load-balanced out to C or another region. This is because Pub/Sub deliberately keeps traffic within the region to prevent cascading failures. An exception is traffic from outside of the GCP network connecting to Pub/Sub’s global service endpoint: such requests will be load-balanced to the nearest available region. Therefore, prolonged outages of data operations, such as publishing, can by design be assumed to be regional. Any truly global failure is more likely to be a failure of a global dependency, such as networking (depending on how you connect to GCP), IAM, or load balancing, rather than of Pub/Sub itself.

A simple strategy for dealing with temporary unavailability is to use a long retry time and ensure that you have enough CPU and memory capacity on each publisher machine to buffer the data in memory while the service is unavailable. For example, a front end that handles user request logs by publishing them to Pub/Sub may handle 1,000 requests per second, each resulting in a 1 KB message. Weathering a 10 minute outage requires under 1 GB of RAM for messages. That is easily available, so you may opt for a 10 or even 20 minute retry duration. In contrast, if your publisher is part of a high-throughput analytics pipeline, it may easily handle 100 MB/s of traffic, which would require 60 GB of RAM to weather a 10 minute outage. So while a 10 minute outage can be treated as temporary and handled through retries in the first case, for higher-volume systems 10 minutes is no longer “temporary”: it is extended. Therefore, it is important for your application design to determine the conditions and time scale of extended outages, based on the expected traffic patterns (a back-of-the-envelope calculation is sketched below). In either case, it is important to set up logs on publish errors and alerts on high publish error rates and on the RAM utilization of your machines. These will be useful in the mitigation and analysis of failures.
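The sizing logic is simply rate multiplied by outage duration. A trivial helper, using the numbers from the examples above (replace them with your own rates), might look like this:

```java
public class BufferSizing {
  public static void main(String[] args) {
    long messagesPerSecond = 1_000;  // the front-end example above
    long bytesPerMessage = 1_024;    // ~1 KB per message
    long outageSeconds = 10 * 60;    // a 10 minute outage
    long requiredBytes = messagesPerSecond * bytesPerMessage * outageSeconds;
    // Prints roughly 0.6 GB; the 100 MB/s pipeline example works out to ~60 GB instead.
    System.out.printf("RAM needed to buffer the outage: %.1f GB%n", requiredBytes / 1e9);
  }
}
```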

Ways to help your application to actually survive a regional outage are discussed below.

Extended unavailability

The real difference between extended and temporary unavailability is whether you believe that simply retrying will eventually resolve the issue. While you generally cannot predict whether a temporary outage will become extended, the following ideas may help you decide how long “temporary” is for your application:

  • The length of time that incoming message data can be accumulated in machine RAM
  • Machine lifetime for machines with limited lifetimes (e.g. a preemptible VM)
  • Time between common network problems if your connection to GCP is prone to bouts of congestion of predictable duration
  • An explicit, tight recovery point objective (RPO) for your data

Ultimately, you will have to find the balance between simple, low cost design and extreme availability.

There are a few basic approaches to mitigating extended unavailability: alternative storage, failover, and redundant publishing. A simple approach to alternative storage is to first write all message data to logs on disk and then tail the logs. If a machine restarts, you can set it up to continue reading the logs automatically or recover the data from disk manually. As an optimization, you can write to disk only when Pub/Sub is unavailable and ensure that the data from disk is republished to Pub/Sub when service is restored. This approach can help you survive much longer unavailability than RAM would and even survive machine restarts. The downsides include additional publish latency under normal conditions, higher CPU and storage cost, and the need to maintain correctly functioning log rotation daemons as well as some mechanism to track which parts of which log files have yet to be published on each machine.
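A minimal sketch of the “spill to disk” piece of this design is shown below (Java 11+ for Files.writeString). The journal format, file naming, rotation, and the replay mechanism are all left out; those are the parts that require the most care in practice.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Base64;

public class DiskSpill {
  private final Path journal;

  public DiskSpill(Path journal) {
    this.journal = journal;
  }

  public synchronized void spill(byte[] payload) {
    try {
      // One base64-encoded payload per line keeps the journal easy to tail and replay.
      String line = Base64.getEncoder().encodeToString(payload) + System.lineSeparator();
      Files.writeString(journal, line, StandardCharsets.UTF_8,
          StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    } catch (Exception e) {
      // If the disk write also fails, this is a genuine data-loss event: alert loudly.
      System.err.println("SPILL_FAILED cause=" + e);
    }
  }
}
```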

For added redundancy, you might consider sending the data to a backup storage system rather than back to Pub/Sub. For example, your log rotation mechanism may upload all log files to GCS as soon as they are closed. This allows for some data loss between log rotations, but it is a relatively simple and cost-effective design for preventing catastrophic data loss. Note that GCS alone may be a viable alternative for log storage; however, it does not offer a mechanism for tracking acknowledgements and has inherently higher latency.

Another approach is regional failover, relying on the fact that Pub/Sub is designed to make regions independent failure domains for data operations. This independence of failure domains makes it possible to achieve a higher degree of availability by publishing to Pub/Sub from different regions or, if publishing from outside GCP, by using regional endpoints. The approach can be applied at different levels of your stack, from the front ends that generate traffic to the publisher code itself. The simplest and likely most reliable setup is full-stack redundancy. Here, you would run two sets of publishing front ends in two distinct regions, say A and B, fronted by a load balancer. Once the front ends in region A determine that they are in an extended outage, they would start rejecting upstream requests, which the load balancer would then route to the front ends in region B. This leaves some number of messages stuck on publishers in region A, but at least it makes newer data available, limiting the scope of the outage. The overall cost will be higher, application latency will go up, and some cross-region egress costs will be incurred as requests are routed to region B.

Types of publish failover designs.

You can also build a failover policy into the publisher client itself, using the Pub/Sub regional service endpoints. Each publisher maintains a list of Pub/Sub endpoints in GCP regions other than the one where it is running. Once a publisher determines it is facing an extended outage, it re-initializes the Pub/Sub client with a different regional endpoint and resolves all outstanding publish futures, when they fail, by publishing the messages to the new client instance. This, too, may cause additional latency and cross-region egress costs. An implementation would keep an internal estimate of error rates and latencies and at some point determine that the primary endpoint is no longer viable. At that point, it would rebuild the client with a failover regional endpoint and start publishing there. It would also regularly try publishing to the primary endpoint and switch back once that endpoint is determined to be viable again. Failovers should be performed with exponential backoff, scaling up the interval between attempts. Alternatively, failover can be performed manually with minimal tooling that restarts clients with a new endpoint. This kind of failover may be useful if load balancing at a higher level is not feasible. It can also offer an additional layer of protection when combined with full-stack failover, since messages that are waiting to be published in a region suffering a failure have a greater chance of making it out.
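A minimal sketch of the client-level piece is below, assuming the Java client library’s Publisher.Builder.setEndpoint method and the documented regional endpoint format (region-pubsub.googleapis.com:443). The health checking, error-rate tracking, and re-publishing logic that decide when to switch are deliberately left out.

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;

public class RegionalFailover {
  // Regional Pub/Sub service endpoints; pick regions appropriate for your deployment.
  private static final String[] ENDPOINTS = {
      "us-east1-pubsub.googleapis.com:443",
      "us-central1-pubsub.googleapis.com:443"
  };

  // Rebuild the publisher against the endpoint at the given index, e.g. after the
  // primary endpoint has been judged unhealthy for an extended period.
  public static Publisher publisherFor(TopicName topic, int endpointIndex) throws Exception {
    return Publisher.newBuilder(topic)
        .setEndpoint(ENDPOINTS[endpointIndex % ENDPOINTS.length])
        .build();
  }
}
```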

For cases where extreme reliability is required, the regional isolation of Pub/Sub can be used to publish each message twice or more, in different regions. You can have every message published to two regional endpoints synchronously. Alternatively, you may extend redundancy higher up your application stack and, say, forward each incoming user request to front ends in two distinct regions. This model is more costly than load balancing, but once it works, single-region failures do not cause any change in your application’s behavior. The cost is not just in the duplication of data and code. The real complexity penalty is that subscribers must deduplicate the messages from the two sources. Nevertheless, there are use cases where such measures are worthwhile. In fact, this is done by some of the bigger users of Pub/Sub within Google.
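A minimal sketch of the redundant-publish variant, reusing two Publisher instances built against different regional endpoints (for example, with the failover sketch above): the combined future resolves only once both regions have accepted the message, and deduplication on an application-supplied attribute is assumed to happen on the subscriber side.

```java
import com.google.api.core.ApiFuture;
import com.google.api.core.ApiFutures;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.PubsubMessage;
import java.util.List;

public class RedundantPublish {
  public static ApiFuture<List<String>> publishToBothRegions(
      Publisher regionA, Publisher regionB, PubsubMessage message) {
    // Publish the same message through two clients, each bound to a different regional endpoint.
    ApiFuture<String> a = regionA.publish(message);
    ApiFuture<String> b = regionB.publish(message);
    // Completes when both regions have durably stored the message; fails if either publish fails.
    return ApiFutures.allAsList(List.of(a, b));
  }
}
```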

Extreme redundancy for unusual errors

There are errors that do not neatly fit into the time-based classes described here and require an alternative approach. First, there are rare internal Pub/Sub errors that arise from misconfiguration or human error. These tend to affect subsets of projects or topics. Therefore, when designing failover for redundant systems, it is worth putting primary and failover resources into distinct GCP projects, even when they are in the same region.

Second, there are human errors having to do with access control. For example, a production account used by the publisher app may be accidentally deleted by a human operator. To mitigate this risk, use narrowly scoped permissions, separate service accounts per workload or per project, and, if possible, maintain and deploy your IAM configuration as code, including reviews and integration testing.

Alerts and Metrics

Whatever choices you make, there is always a finite possibility of a failure you have not anticipated. It is important, therefore, to have metrics with alerts that will let you know when your assumptions about the system’s behavior are violated. The alerts to start with are on publish request error rates, exceptionally high tail latency of publish requests, and machine memory usage. In addition, you should look for unexpected drops in publish traffic reported by the service, which may indicate that messages are not making it to the service.

In addition, you might choose to instrument your application to report custom metrics such as:

  • Amount of outstanding message data on client machines
  • Number of permanent (non-retryable) failures

Error counts may be set up as log-based metrics if some latency is acceptable.

Client-side logs may also be very useful. For example, you may choose to have your application log warnings whenever the backlog of outstanding requests exceeds a certain threshold, or compute total request latencies, including retries, and log those when they are high. Such log entries can be turned into log-based metrics with alerts.
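For example, a minimal sketch of a backlog warning might look like this (the names and the threshold are illustrative); the structured message makes it easy to define a log-based metric and alert on it in Cloud Monitoring.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

public class BacklogMonitor {
  private static final Logger log = Logger.getLogger(BacklogMonitor.class.getName());
  private static final long WARN_THRESHOLD = 10_000;  // illustrative; size from your RAM budget
  private final AtomicLong outstanding = new AtomicLong();

  // Call when a publish future is created.
  public void onPublishStarted() {
    long n = outstanding.incrementAndGet();
    if (n > WARN_THRESHOLD) {
      // A structured, greppable message that a log-based metric can count.
      log.warning("PUBLISH_BACKLOG_HIGH outstanding=" + n);
    }
  }

  // Call when a publish future succeeds or fails.
  public void onPublishCompleted() {
    outstanding.decrementAndGet();
  }
}
```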

Flow control when recovering from failures

When designing systems that recover from temporary and extended failures, it is important to pace the rate of recovery. Once you have built a large backlog due to, say, a network outage, a naive application might attempt to publish the entire buffer as quickly as possible once service is restored. This commonly leads to errors, as both the outbound bandwidth available to a machine and the overall publish quota are finite.

It is critical that you understand the bandwidth available to your machine and keep the rate at which you attempt to send message data to Pub/Sub under that limit. Otherwise, you will create network congestion, leading most requests to time out with DEADLINE_EXCEEDED errors even if the service is available and the network is healthy. A simple approach is to pause your publisher code whenever the number of outstanding requests or publish futures is above a threshold. This is supported by most Cloud Client Libraries; for example, in the Java client library, flow control is configured using the FlowControlSettings of a BatchingSettings object.
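A minimal sketch of that configuration in the Java client library, assuming the gax FlowControlSettings and BatchingSettings builders; the thresholds and limits shown are illustrative and should be sized from your bandwidth, RAM, and quota.

```java
import com.google.api.gax.batching.BatchingSettings;
import com.google.api.gax.batching.FlowControlSettings;
import com.google.api.gax.batching.FlowController;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class PublisherFlowControl {
  public static Publisher create(String projectId, String topicId) throws Exception {
    FlowControlSettings flowControlSettings =
        FlowControlSettings.newBuilder()
            .setMaxOutstandingElementCount(10_000L)           // block after 10k messages in flight
            .setMaxOutstandingRequestBytes(50L * 1024 * 1024) // or ~50 MB of data in flight
            .setLimitExceededBehavior(FlowController.LimitExceededBehavior.Block)
            .build();

    BatchingSettings batchingSettings =
        BatchingSettings.newBuilder()
            .setIsEnabled(true)
            .setElementCountThreshold(100L)           // send a batch after 100 messages...
            .setRequestByteThreshold(1024L * 1024L)   // ...or 1 MB of message data...
            .setDelayThreshold(Duration.ofMillis(10)) // ...or after 10 ms, whichever comes first
            .setFlowControlSettings(flowControlSettings)
            .build();

    return Publisher.newBuilder(TopicName.of(projectId, topicId))
        .setBatchingSettings(batchingSettings)
        .build();
  }
}
```

With LimitExceededBehavior.Block, publish() calls simply wait once the limits are reached, which paces backlog recovery without extra application logic.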

Next Steps

That is it for publishing. The next step is subscribing, covered in Part 2 of this series.
