-
Notifications
You must be signed in to change notification settings - Fork 906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry 'Quota Exceeded' errors during function deployment #3246
Conversation
…used, and adding unit tests for handlers
…ase-tools into jh-tune-throttler
} | ||
|
||
export function updateFunctionTask( | ||
params: TaskParams, | ||
fn: CloudFunctionTrigger, | ||
sourceToken?: string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this from TaskParams after realizing I had accidentally introduced a bug where a sourceToken from a different region would be used. I could have fixed this by creating a different copy of TaskParams for every task - however, I think the code is easier to understand overall if we just separate this out.
try { | ||
await Promise.all(queuePromises); | ||
} catch (err) { | ||
utils.reject( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this error always going to be encountered because of large function batches? Can we detect whether quota errors specifically were the cause of the rejection?
Also wonder about literally forking the error message based on the number of functions deployed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only ever throw if we exceed the max retries on a task, and we only retry quota errors.
I don't see any quotas that we're likely to exceed that aren't related to large function deployments, but it might still be worth forking the error message at 50? functions or so.
trigger: helper.getFunctionTrigger(fn), | ||
labels: Object.assign({}, deploymentTool.labels(), fn.labels), | ||
sourceUploadUrl: params.sourceUrl, | ||
sourceToken: sourceToken, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So where does this come from now if not the params above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same place as before (the onPollFn in runRegionalDeployment), I just moved it out of the TaskParams object and passed it into this function separately.
I found a bug while developing where, because every task was sharing the same TaskParam object, we'd overwrite the sourceToken, and cause some deployments to fail because we sent a sourceToken from a different region.
I could have fixed it by making a new TaskParams for every region, but I figured it was cleaner and less bug prone to separate it - having a mutating object passed around and shared by a bunch of simultaneously running async functions seemed poor.
Oh also can you add a |
src/deploy/functions/tasks.ts
Outdated
| "make public"; | ||
|
||
export interface DeploymentTask { | ||
(): Promise<any>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The firebase-functions SDK had a good reason to have a function that was also an object. This design pattern comes with rough edges though (e.g. you can't reassign the value). I'm curious whether it would be better to have the function be a named member of the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is probably a little too cute, and would be easier to work with if the function was a named member. I'll make the switch.
} | ||
timer.endTimer(task.functionName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be in a finally block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters because the only time we don't hit this line is if we throw a quota error, and the Queue will retry on errors, so endTimer will eventually be called.
backoff: 20000, | ||
concurrency: 40, | ||
maxBackoff: 40000, | ||
handler: tasks.functionsDeploymentHandler(timer, errorHandler), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking aloud but why doesn't schedulerQueue time its operations and why is errorHandler a parameter of handler instead of part of Queue?
It seems like an alternative design would be for Queue to always time its tasks and accept errorHandler. It would, in one place, handle quota errors and possibly only then do a longer backoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SchedulerQueue doesn't time its operations because we don't do anything with the results. We could record those times in GA like we do for function deployments, but scheduler deployments are an order of magnitude quicker than function deployments so it's not very interesting.
I like the idea of moving some of this functionality into Queue/Throttler though. I've been hesitant to change that code more than necessary because its used in a bunch of other places that i don't understand well enough to properly test - however, I've definitely noticed a bunch of things I'd like to improve in there while working with it.
I don't really want to hold up this PR on moving that functionality over to queue. At some point though (maybe our next fixit week?) I'd like to give throttler.ts some love and:
- add timers for individual tasks
- fix up some confusing cases where promises returned by the throttler never resolve
- add explicit support for quotas (instead of tuning the throttler params, you'd provide your quota number and time period, and the throttler would set the concurrency appropriately, automatically retry quota errors, and slow down all tasks if it got a quota error on one)
|
||
describe("Function Deployment tasks", () => { | ||
describe("functionsDeploymentHandler", () => { | ||
const sandbox = sinon.createSandbox(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels weird that this is initialized in global scope instead of beforeEach but reset in afterEach
* implement retries for quota errors * added retries on quota errors to functions deployment * refactoring functions deployment to use queue handler * Merge master * clean up * add support for maxBackoff to throttler * formats * clean up * chasing a error handling bug * cleaning up debugging code * separate out timeToWait and add unit tests for it * style fixes to tests * cleaning up console.logs * added error handling on exceeded retries * Improved error message * fixing a race condition that could cause the wrong sourceToken to be used, and adding unit tests for handlers * fixing up whitespace * formats * adding changelog and refactoring to avoid using hybrid type for tasks * add newline * format * formatting * pr fix
Description
Adds logic to retry 'Quota Exceeded' errors during functions deployment. Fixes #2606 and #1372.
I also took this opportunity to deduplicate + test some error handling logic that was repeated across the different tasks by moving it into handler functions for the throttler.
Scenarios Tested
I tested this by deploying 240 functions across 3 regions, and was able to do so successfully (with quite a few quota error retries).
I also tested out
functions:delete
by successfully deleting a large group of functions.