Skip to content

handle overflow in TaskInstance next_retry_datetime fixes 47971 #48557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

perry2of5
Copy link
Contributor

@perry2of5 perry2of5 commented Mar 30, 2025

This handles overflow when calculating the next execution time for a task instance by falling back to the configured maximum delay. The solution uses the same strategy that tenacity uses:
https://github.com/jd/tenacity/blob/main/tenacity/wait.py#L167

An alternate solution would be the determine the maximum tries that wouldn't exceed the maximum delay and then not calculate the timeout for values larger than that.

Something like

max_delay = self.task.max_retry_delay if self.task.max_retry_delay is not null else MAX_RETRY_DELAY
tries_before_max_delay = math.floor(math.log2(max_delay))
if self.try_number <= tries_before_max_delay:
     # existing logic
else:
    delay = max_delay

This fixes 47971

@perry2of5 perry2of5 requested review from XD-DENG and ashb as code owners March 30, 2025 22:25
@perry2of5 perry2of5 changed the title Use max delay to handle overflow in TaskInstance next_retry_datetime 47971 Use max delay to handle overflow in TaskInstance next_retry_datetime fixes #47971 Mar 30, 2025
@perry2of5 perry2of5 changed the title Use max delay to handle overflow in TaskInstance next_retry_datetime fixes #47971 Use max delay to handle overflow in TaskInstance next_retry_datetime fixes 47971 Mar 30, 2025
@perry2of5 perry2of5 force-pushed the task-next-retry-datetime-fix-overflow-47971 branch from 2eac3d5 to 3c95374 Compare March 30, 2025 22:54
@perry2of5
Copy link
Contributor Author

If you turn off whitespace changes, only 6 lines are added in the new version and none are removed. :)

@perry2of5 perry2of5 changed the title Use max delay to handle overflow in TaskInstance next_retry_datetime fixes 47971 handle overflow in TaskInstance next_retry_datetime fixes 47971 Apr 8, 2025
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 24, 2025
@uranusjr
Copy link
Member

Is it possible to localise the try-except block a bit more? Currently the block is a bit too large IMO.

@perry2of5 perry2of5 force-pushed the task-next-retry-datetime-fix-overflow-47971 branch from 3c95374 to 37deb82 Compare May 26, 2025 15:19
@perry2of5
Copy link
Contributor Author

The overflow error is likely two places:

on line 1120:
min_backoff = math.ceil(delay.total_seconds() * (2 ** (self.try_number - 1)))

And on line 1145: https://docs.python.org/3/library/datetime.html#timedelta-objects
delay = timedelta(seconds=delay_backoff_in_seconds)

Keeping the try block as it currently is avoids having two try blocks, but possibly two try blocks would be more readable. I'll fiddle with it and try to improve. Thanks for the feedback.

@perry2of5
Copy link
Contributor Author

perry2of5 commented May 26, 2025

Looking a little closer, we cap MAX_RETRY_DELAY at 24 * 60 * 60 so probably it can't throw at line 1145.... I'll need to be sure that isn't configurable at run time, but as long as it is always less than 999999999 days then I can shorten up the section in the try block. I'll try to get to it tonight.

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label May 27, 2025
@perry2of5 perry2of5 force-pushed the task-next-retry-datetime-fix-overflow-47971 branch from 37deb82 to 0eff257 Compare June 1, 2025 17:42
@perry2of5
Copy link
Contributor Author

Is it possible to localise the try-except block a bit more? Currently the block is a bit too large IMO.

Done. Thanks for the feedback, it is much cleaner now.

@perry2of5 perry2of5 force-pushed the task-next-retry-datetime-fix-overflow-47971 branch from dd39dbd to 67b6c67 Compare June 1, 2025 20:25
@perry2of5 perry2of5 force-pushed the task-next-retry-datetime-fix-overflow-47971 branch from 67b6c67 to 185ed8c Compare June 21, 2025 23:03
Comment on lines +1124 to +1125
except OverflowError:
min_backoff = MAX_RETRY_DELAY
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s add a log message here so people know better what’s going on if they get unexpected behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't warn when limiting the retry time for other reasons. I think it should be consistent one way or the other. I'm happy to add a message here and add another ticket to log when limiting the time until the next retry at other locations. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to another instance we do this kind of limiting? (We seem to only catch OverflowError in one other place and it’s for a different purpose)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lower in the function the code ensures the delay isn’t longer than a day, but it doesn’t log anything when it does so. I'm talking about this:

min(modded_hash, MAX_RETRY_DELAY) around line 1149

We don't inform the user we are putting a cap on the exponential time, so it seems inconsistent to log about capping the value due to overflow but not due to exceeding the MAX_RETRY_DELAY. Overflow is just a implementation detail due number representation limits.... I think capping the value due to overflow and due to exceeded the MAX_RETRY_DELAY could both be unexpected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry exponential backoff max float overflow
2 participants