Description
Apache Airflow version
2.11.0
If "Other Airflow 2 version" selected, which one?
2.9, 2.10
What happened?
During handling of an `on_failure_callback` in the dag processor, the callback failed due to a deadlock on the TI record in question. Because we only call `session.flush()` and `session.commit()` (L768-9) outside of the `for request in callback_requests:` loop, the session remains in an unusable state for the rest of the loop.
This particular DAG had hundreds of TIs fail due to a worker OOM issue, so every subsequent callback in the `callback_requests` list failed to run, because the broken session was never discarded and replaced with a new one.
What you think should happen instead?
A new session should be created for each callback in the loop. Short of that, if we fall into the exception block, we should discard the session there and create a new one.
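
To illustrate the difference, here is a minimal sketch that does not use Airflow or SQLAlchemy at all. `FakeSession`, `DeadlockError`, `PendingRollbackError`, and `run_callbacks` are hypothetical stand-ins I made up for this example; the only behavior they model is "once a statement fails, the session rejects everything until `rollback()` is called", which is an assumption about how the real session behaves here.

```python
class DeadlockError(Exception):
    """Stand-in for the database deadlock error (hypothetical)."""


class PendingRollbackError(Exception):
    """Stand-in for the error raised when a broken session is reused."""


class FakeSession:
    """Hypothetical stand-in for an ORM session, not Airflow's real one.

    After any statement fails, every later call raises until rollback().
    """

    def __init__(self):
        self.needs_rollback = False

    def execute(self, ti_id, locked_ids):
        if self.needs_rollback:
            raise PendingRollbackError(f"session unusable, cannot handle {ti_id}")
        if ti_id in locked_ids:
            self.needs_rollback = True
            raise DeadlockError(f"deadlock on {ti_id}")

    def commit(self):
        if self.needs_rollback:
            raise PendingRollbackError("cannot commit a broken session")

    def rollback(self):
        self.needs_rollback = False


def run_callbacks(callback_requests, locked_ids, isolate):
    """Return the TI ids whose callbacks ran without raising."""
    session = FakeSession()
    ran = []
    for ti_id in callback_requests:
        try:
            session.execute(ti_id, locked_ids)
            if isolate:
                # Commit per callback so a later rollback cannot discard it.
                session.commit()
            ran.append(ti_id)
        except Exception:
            if isolate:
                # Reset the session so the next callback starts clean.
                session.rollback()
    if not isolate and not session.needs_rollback:
        # Current shape: a single commit outside the loop.
        session.commit()
    return ran


requests = ["ti_1", "ti_2", "ti_3", "ti_4"]
locked = {"ti_2"}  # pretend another component holds a lock on ti_2

without_fix = run_callbacks(requests, locked, isolate=False)
with_fix = run_callbacks(requests, locked, isolate=True)
print(without_fix)  # ['ti_1']
print(with_fix)     # ['ti_1', 'ti_3', 'ti_4']
```

With `isolate=False`, the single deadlock on `ti_2` takes down every callback after it. With `isolate=True`, only the deadlocked callback is lost. Opening a fresh session per callback would give the same isolation; the rollback variant is shown only because it is the smaller change.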
How to reproduce
- Induce multiple `on_failure_callback` invocations
- During callback processing, have another component hold a lock on a TI while the callback for that TI is trying to execute
Operating System
debian
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
Runtime 11.18.0 / Airflow 2.9.3+astro.11
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct