
GCS connector sends a high volume of ListObjects requests, triggering GCS 429 throttling error #151

Closed
githubwua opened this issue Feb 21, 2019 · 13 comments

@githubwua (Contributor) commented Feb 21, 2019

Enabling implicit directories helps list all file objects, but it has the side effect of sending too many ListObjects requests to GCS. When this happens, GCS returns the following error:

Caused by: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 unknown
(the response body is Google's HTML "Sorry..." page: "We're sorry... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.")

gcsfuse mentions this caveat here:
https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#implicit-directories

Can we please roll back the change?
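
As a possible stopgap while a fix lands, directory inference can be turned off via the connector's fs.gs.implicit.dir.infer.enable property. A minimal sketch from Spark, assuming that property applies in your connector version; the app name is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: disable implicit-directory inference so the connector stops
// issuing the extra ListObjects probes. Trade-off: objects that live
// only under implicit directories may no longer be visible to jobs.
val spark = SparkSession.builder().appName("gcs-list-workaround").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("fs.gs.implicit.dir.infer.enable", "false")
```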

@zeldrinn commented Feb 21, 2019

I'm seeing an issue that may or may not be related, via GCP Dataproc. Specifically, I have a Spark job running on Dataproc that reads and writes an RDD from a GCS bucket containing implicit directories, and starting on Feb 15, certain interactions with the bucket appear to generate an enormous number of ListObjects operations (rising to 40K req/sec for over an hour). As far as I can tell, this spike in ListObjects operations seems to happen at the point at which the RDD starts being written to the GCS bucket (<bucket>/<implicit_rdd_dir>/part_00001, <bucket>/<implicit_rdd_dir>/part_00002, <bucket>/<implicit_rdd_dir>/part_00003, etc.).

I don't fully understand the commit referenced by this issue, but from what I gather, if it were in fact the cause of my spike in ListObjects calls, I should expect a 1:1 relationship between the number of ReadObject calls and the number of ListObjects calls, correct? That isn't quite what I'm seeing: I'm seeing a rapid rise to about 40K req/sec in ListObjects calls, with zero ReadObject calls during the period. However, there are other bucket operations during the period:

  • WriteObject (starts at about 12 req/sec and gradually diminishes to 4 req/sec over the span of the job)
  • DeleteObject (same as WriteObject)
  • RewriteObject (similar throughput pattern, but half the rate of WriteObject; i.e. starts at 6 req/sec and drops to 2 req/sec)
  • GetBucketMetadata (same as RewriteObject)

None of my Spark job code changed, and there haven't been any notable changes in the size or nature of the data that we're passing in to the Spark job. As mentioned above, this started happening spontaneously on Feb 15.
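
For concreteness, the write that coincides with the spike looks roughly like this (a sketch; the bucket and directory names are hypothetical placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the RDD write described above: saveAsTextFile materializes
// one object per partition under a directory that exists only
// implicitly in GCS (no placeholder object for the directory itself).
val spark = SparkSession.builder().appName("rdd-write-sketch").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 100)
rdd.saveAsTextFile("gs://my-bucket/implicit_rdd_dir")
// -> gs://my-bucket/implicit_rdd_dir/part-00000, part-00001, ...
```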

@medb (Contributor) commented Feb 21, 2019

Most probably this is a regression introduced in GCS connector 1.9.12 during performance optimizations, not related to the linked commit. We will have a fix shortly and will release a new GCS connector with it.

medb added a commit that referenced this issue Feb 21, 2019
@zeldrinn commented Feb 21, 2019

@medb Thanks for the update. From the information I provided, do you suspect that this issue would be the root cause of the scenario I described as well? Or do you think what I'm seeing is unrelated?

@zeldrinn commented Feb 21, 2019

@medb Also, do I need to do anything to upgrade to 1.9.14 (or is it 1.9.15?)? Or will the latest version be used automatically behind the scenes by Dataproc/Spark? Thanks!

@medb (Contributor) commented Feb 21, 2019

I think that this is a related issue. The problem with list requests is that they are paginated (1,000 objects per request), which is why, if you have a lot of objects in the same directory, you can see a higher ratio of list requests to other requests.
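
In other words, one logical directory listing fans out into ceil(n / 1000) ListObjects calls; a back-of-the-envelope sketch:

```scala
// Each ListObjects response returns at most pageSize objects, so one
// full listing of a directory holding n objects costs ceil(n / pageSize)
// requests; repeated listings multiply the request rate accordingly.
def listRequestsNeeded(objectCount: Long, pageSize: Long = 1000L): Long =
  (objectCount + pageSize - 1) / pageSize

println(listRequestsNeeded(250000L)) // 250 requests for a single listing
```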

I think we will have a GCS connector 1.9.15 build in an hour or so, so it would be great if you could validate the fix using the connectors init action:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/connectors

We will update Dataproc with the new GCS connector version automatically, but it will take around a week for the new Dataproc release to roll out, and you will need to recreate your cluster with the latest image.

@zeldrinn commented Feb 21, 2019

Great, thanks for the info @medb. Is there anything we can do in the short term to mitigate the issue prior to the new Dataproc release?

  1. If we create our root directory as explicit rather than implicit, would that resolve the issue?
  2. Would we also need to make sure any nested directories are also explicitly created?
  3. If this strategy would in fact resolve the issue, what's the best way to create explicit directories in a bucket? The only way I've been able to find so far is by creating them in the Storage UI (GCP Console)...

@medb (Contributor) commented Feb 21, 2019

  1. I don't think that creating the root directory will resolve this issue, because the GCS connector still sends list requests in parallel to improve performance in case the directory doesn't exist (the root cause of the bug).
  2. It would be best if all directories are created explicitly, but if you write your data through Spark/Hive they should already be created (see the sketch after this list).
  3. The best option for you is to either upgrade/downgrade the GCS connector to 1.9.15/1.9.11 using the connectors init action, or pin your cluster to an older Dataproc image version (1.3.24-deb9) that has GCS connector 1.9.11.
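
On the question of creating explicit directories without the Console, a sketch using the Hadoop FileSystem API that the GCS connector implements (bucket and directory names are hypothetical placeholders); through the connector, mkdirs() should create the placeholder objects that make a directory explicit:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: create explicit directory markers in a bucket. Via the GCS
// connector, mkdirs() writes the zero-byte placeholder object that
// turns an implicit directory into an explicit one.
val conf = new Configuration()
val fs = FileSystem.get(new URI("gs://my-bucket/"), conf)
fs.mkdirs(new Path("gs://my-bucket/explicit_rdd_dir"))
```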

@medb medb changed the title Commit 52f5055 (Fix directory inference) sends a high volume of ListObjects requests, triggering GCS 429 throttling error GCS connector sends a high volume of ListObjects requests, triggering GCS 429 throttling error Feb 21, 2019
@adamf commented Feb 21, 2019

We are also seeing this issue; we moved back to 1.3.24-deb9 and this solves the problem for now. The impact is a roughly 100,000-fold increase in ListObjects requests.

@medb (Contributor) commented Feb 21, 2019

GCS connector 1.9.15, which fixes this issue, was just released:
https://github.com/GoogleCloudPlatform/bigdata-interop/releases/tag/v1.9.15

@medb medb self-assigned this Feb 21, 2019
@medb medb closed this as completed in f4b5ec9 Feb 21, 2019
@zeldrinn commented Feb 21, 2019

@medb We've reverted to Dataproc 1.3.24-deb9 as well, and it solves the problem. We won't be able to verify connector version 1.9.15 just yet, but we'll likely be attempting it in the next week.

Thanks for the quick turnaround on this issue!

@jaketf commented Jun 18, 2019

Still seeing this issue on the latest Dataproc image (1.4), i.e. GCS connector v1.9.16.
Additional context: in this case we may have multiple clusters making list requests via the GCS connector at the same time.
Can this be handled better with exponential back-off or similar, to account for multiple systems on the same subnet querying GCS?
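
The kind of client-side back-off suggested here is not connector-specific; a generic sketch (the list call being retried is a hypothetical placeholder, not connector code):

```scala
import scala.util.Random

// Generic exponential back-off with jitter: on a retryable failure
// (e.g. an HTTP 429), sleep base * 2^attempt plus jitter, then retry,
// up to maxAttempts before giving up.
def withBackoff[T](maxAttempts: Int = 5, baseMillis: Long = 500L)(op: => T): T = {
  var attempt = 0
  var result: Option[T] = None
  while (result.isEmpty) {
    try result = Some(op)
    catch {
      case e: Exception if attempt < maxAttempts - 1 =>
        Thread.sleep(baseMillis * (1L << attempt) + Random.nextInt(100))
        attempt += 1
    }
  }
  result.get
}

// Usage sketch, with a hypothetical list call:
// val objects = withBackoff() { gcsListObjects(bucket, prefix) }
```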

@medb (Contributor) commented Jun 18, 2019

@jaketf from your description it seems that this is a different issue. Could you file a new GitHub issue for it? Please add more details about the current behavior (how many clusters with how many nodes are running simultaneously? what is the observed behavior, e.g. list request QPS?) and the expected/desirable behavior.

@jaketf commented Jun 19, 2019

@medb sure.
