
Retry 'Quota Exceeded' errors during function deployment #3246

Merged
merged 33 commits into from
Apr 6, 2021

Conversation

joehan (Contributor) commented Mar 31, 2021

Description

Adds logic to retry 'Quota Exceeded' errors during functions deployment. Fixes #2606 and #1372.

I also took this opportunity to deduplicate and test some error-handling logic that was repeated across the different tasks, by moving it into handler functions for the throttler.

Scenarios Tested

I tested this by deploying 240 functions across 3 regions; the deployment succeeded, with quite a few quota-error retries.
I also tested functions:delete by successfully deleting a large group of functions.
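The retry behavior described above can be sketched as a throttler handler that rethrows quota errors (so the retrying queue backs off and tries again) while recording all other failures as terminal. This is a hypothetical sketch, not the actual firebase-tools code: `DeployTask`, `isQuotaError`, and `makeDeploymentHandler` are illustrative names, and the error shape checked is an assumption.

```typescript
// Hypothetical sketch: a queue handler that retries only quota errors.
interface DeployTask {
  functionName: string;
  run: () => Promise<void>;
}

function isQuotaError(err: unknown): boolean {
  // Cloud APIs typically surface quota problems as HTTP 429 or a message
  // mentioning "quota"; the exact shape here is an assumption.
  const e = err as { status?: number; message?: string };
  return e?.status === 429 || /quota/i.test(e?.message ?? "");
}

function makeDeploymentHandler(
  onFailure: (name: string, err: unknown) => void
): (task: DeployTask) => Promise<void> {
  return async (task: DeployTask) => {
    try {
      await task.run();
    } catch (err) {
      if (isQuotaError(err)) {
        // Rethrow so the retrying queue backs off and re-runs the task.
        throw err;
      }
      // Non-quota errors are terminal: record them and swallow, so the
      // rest of the deployment can continue.
      onFailure(task.functionName, err);
    }
  };
}
```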

@google-cla google-cla bot added the cla: yes Manual indication that this has passed CLA. label Mar 31, 2021
@joehan joehan marked this pull request as ready for review March 31, 2021 21:28
@joehan joehan requested review from inlined and mbleigh March 31, 2021 21:34
@joehan joehan requested a review from taeold March 31, 2021 21:34
}

export function updateFunctionTask(
params: TaskParams,
fn: CloudFunctionTrigger,
sourceToken?: string,
joehan (Contributor, Author) commented:
I removed this from TaskParams after realizing I had accidentally introduced a bug where a sourceToken from a different region would be used. I could have fixed this by creating a different copy of TaskParams for every task - however, I think the code is easier to understand overall if we just separate this out.

try {
await Promise.all(queuePromises);
} catch (err) {
utils.reject(
A reviewer (Contributor) commented:
Is this error always going to be encountered because of large function batches? Can we detect whether quota errors specifically were the cause of the rejection?

Also wonder about literally forking the error message based on the number of functions deployed...

joehan (Contributor, Author) replied:

This will only ever throw if we exceed the max retries on a task, and we only retry quota errors.

I don't see any quotas we're likely to exceed that aren't related to large function deployments, but it might still be worth forking the error message at around 50 functions or so.

trigger: helper.getFunctionTrigger(fn),
labels: Object.assign({}, deploymentTool.labels(), fn.labels),
sourceUploadUrl: params.sourceUrl,
sourceToken: sourceToken,
A reviewer (Contributor) commented:

So where does this come from now if not the params above?

joehan (Contributor, Author) replied:

Same place as before (the onPollFn in runRegionalDeployment); I just moved it out of the TaskParams object and passed it into this function separately.

I found a bug while developing where, because every task shared the same TaskParams object, we'd overwrite the sourceToken and cause some deployments to fail because we sent a sourceToken from a different region.

I could have fixed it by making a new TaskParams for every region, but I figured it was cleaner and less bug-prone to separate it out: a mutable object passed around and shared by a bunch of simultaneously running async functions seemed like a recipe for more bugs.
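The fix described above can be illustrated with a small sketch (all names here are hypothetical, not the actual firebase-tools signatures): with the token as an explicit argument, each task is bound to its own region's token at call time, instead of reading a field that another region's poller may have overwritten.

```typescript
// Illustrative sketch of the race fix: sourceToken is a separate argument
// rather than a mutable field on a shared TaskParams object.
interface TaskParams {
  projectId: string;
  sourceUrl: string;
}

function buildUpdatePayload(
  params: TaskParams,
  functionName: string,
  sourceToken?: string
): Record<string, unknown> {
  return {
    name: functionName,
    sourceUploadUrl: params.sourceUrl,
    // Only include the token when this region's poller has produced one.
    ...(sourceToken ? { sourceToken } : {}),
  };
}

// One shared, immutable params object; per-region tokens stay per-region.
const params: TaskParams = { projectId: "demo", sourceUrl: "gs://bucket/src.zip" };
const usEast = buildUpdatePayload(params, "fnA", "token-us-east1");
const euWest = buildUpdatePayload(params, "fnB", "token-europe-west1");
```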

mbleigh (Contributor) commented Apr 1, 2021:

Oh also can you add a CHANGELOG for this?

src/deploy/functions/tasks.ts (thread resolved)
src/test/deploy/functions/tasks.spec.ts (thread resolved)
| "make public";

export interface DeploymentTask {
(): Promise<any>;
A reviewer (Member) commented:

The firebase-functions SDK had a good reason to have a function that was also an object. This design pattern comes with rough edges though (e.g. you can't reassign the value). I'm curious whether it would be better to have the function be a named member of the object.

joehan (Contributor, Author) replied:

Yeah, this is probably a little too cute, and would be easier to work with if the function was a named member. I'll make the switch.

}
timer.endTimer(task.functionName);
A reviewer (Member) commented:

Does this need to be in a finally block?

joehan (Contributor, Author) replied:

I don't think it matters in practice: the only time we don't hit this line is when we throw a quota error, and the Queue retries on errors, so endTimer will eventually be called on a later attempt.
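For contrast, the reviewer's suggestion can be sketched like this (the timer API here is assumed, not the real one in firebase-tools): a `finally` block guarantees the timer is stopped on the same attempt that started it, even when the task throws, rather than relying on a retry eventually reaching the success path.

```typescript
// Minimal sketch: endTimer in a finally block runs on both paths.
const starts = new Map<string, number>();
const durations = new Map<string, number>();

function startTimer(name: string): void {
  starts.set(name, Date.now());
}

function endTimer(name: string): void {
  durations.set(name, Date.now() - (starts.get(name) ?? Date.now()));
}

async function runTask(name: string, task: () => Promise<void>): Promise<void> {
  startTimer(name);
  try {
    await task();
  } finally {
    endTimer(name); // runs whether the task resolved or threw
  }
}
```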

backoff: 20000,
concurrency: 40,
maxBackoff: 40000,
handler: tasks.functionsDeploymentHandler(timer, errorHandler),
A reviewer (Member) commented:

Just thinking aloud but why doesn't schedulerQueue time its operations and why is errorHandler a parameter of handler instead of part of Queue?

It seems like an alternative design would be for Queue to always time its tasks and accept errorHandler. It would, in one place, handle quota errors and possibly only then do a longer backoff.

joehan (Contributor, Author) replied:

SchedulerQueue doesn't time its operations because we don't do anything with the results. We could record those times in GA like we do for function deployments, but scheduler deployments are an order of magnitude quicker than function deployments so it's not very interesting.

I like the idea of moving some of this functionality into Queue/Throttler, though. I've been hesitant to change that code more than necessary because it's used in a bunch of other places that I don't understand well enough to properly test; that said, I've definitely noticed a bunch of things I'd like to improve in there while working with it.

I don't really want to hold up this PR on moving that functionality over to Queue. At some point though (maybe our next fixit week?) I'd like to give throttler.ts some love and:

  • add timers for individual tasks
  • fix up some confusing cases where promises returned by the throttler never resolve
  • add explicit support for quotas (instead of tuning the throttler params, you'd provide your quota number and time period, and the throttler would set the concurrency appropriately, automatically retry quota errors, and slow down all tasks if it got a quota error on one)
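The "explicit support for quotas" idea above might look something like the following sketch. None of these names exist in throttler.ts today; they're hypothetical illustrations of deriving a concurrency limit from a quota, plus a capped exponential backoff in the spirit of the timeToWait helper this PR factors out (the PR's config uses backoff 20000 and maxBackoff 40000).

```typescript
// Hypothetical sketch of quota-aware throttler tuning.
interface QuotaConfig {
  callsPerPeriod: number; // e.g. 80 write calls
  periodMs: number;       // e.g. per 100 seconds
  avgTaskMs: number;      // rough duration of one task
}

// One worker issues roughly periodMs / avgTaskMs calls per period, so
// callsPerPeriod divided by that rate gives how many workers can run
// concurrently without exceeding the quota.
function concurrencyForQuota(q: QuotaConfig): number {
  const callsPerWorker = q.periodMs / q.avgTaskMs;
  return Math.max(1, Math.floor(q.callsPerPeriod / callsPerWorker));
}

// Exponential backoff capped at maxBackoff, doubling per retry.
function timeToWait(retryCount: number, backoff: number, maxBackoff: number): number {
  return Math.min(backoff * Math.pow(2, retryCount), maxBackoff);
}
```

With the PR's settings (backoff 20000, maxBackoff 40000), the wait would be 20s on the first retry and then stay pinned at 40s.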

@joehan joehan requested review from mbleigh and inlined April 2, 2021 18:30

describe("Function Deployment tasks", () => {
describe("functionsDeploymentHandler", () => {
const sandbox = sinon.createSandbox();
A reviewer (Member) commented:

It feels weird that this is initialized in global scope instead of beforeEach but reset in afterEach

@joehan joehan merged commit 810c54c into master Apr 6, 2021
@joehan joehan deleted the jh-tune-throttler branch April 6, 2021 20:35
devpeerapong pushed a commit to devpeerapong/firebase-tools that referenced this pull request Dec 14, 2021
* implement retries for quota errors

* added retries on quota errors to functions deployment

* refactoring functions deployment to use queue handler

* Merge master

* clean up

* add support for maxBackoff to throttler

* formats

* clean up

* chasing a error handling bug

* cleaning up debugging code

* separate out timeToWait and add unit tests for it

* style fixes to tests

* cleaning up console.logs

* added error handling on exceeded retries

* Improved error message

* fixing a race condition that could cause the wrong sourceToken to be used, and adding unit tests for handlers

* fixing up whitespace

* formats

* adding changelog and refactoring to avoid using hybrid type for tasks

* add newline

* format

* formatting

* pr fix
Labels
cla: yes Manual indication that this has passed CLA.
3 participants