Deadlock during callback renders session unusable #52264

Closed as not planned
@seanmuth

Apache Airflow version

2.11.0

If "Other Airflow 2 version" selected, which one?

2.9, 2.10

What happened?

While the DAG processor was handling an on_failure_callback, the callback failed because of a deadlock on the corresponding task instance (TI) record.

Because session.flush() and session.commit() are only called once (L768-9), outside of the "for request in callback_requests:" loop, the session is never reset after the failure and remains in an unusable state for the rest of the loop.

This particular DAG had hundreds of TIs fail because of a worker OOM issue. After the deadlock, every remaining callback in the callback_requests list failed to run, because the broken session was never discarded and replaced with a new one.
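The failure mode described above can be illustrated with a small, self-contained sketch. ToySession, PendingRollbackError and run_callbacks_shared_session are hypothetical stand-ins written for this example. They are not Airflow or SQLAlchemy code, and they only mimic the general behaviour of a session that refuses further work after a failed operation until it is rolled back.

```python
class PendingRollbackError(RuntimeError):
    """Stand-in for the error raised when a failed session is reused."""


class ToySession:
    """Hypothetical stand-in for an ORM session (assumption, not Airflow code)."""

    def __init__(self):
        self.needs_rollback = False

    def execute(self, ti_id, deadlock=False):
        # A session whose transaction has failed rejects all further work.
        if self.needs_rollback:
            raise PendingRollbackError("roll back before reusing this session")
        if deadlock:
            self.needs_rollback = True
            raise RuntimeError(f"deadlock detected on TI {ti_id}")
        return f"callback ran for TI {ti_id}"


def run_callbacks_shared_session(ti_ids, deadlocked):
    """Run every callback on ONE session, as in the behaviour reported here."""
    session = ToySession()
    results = {}
    for ti_id in ti_ids:
        try:
            results[ti_id] = session.execute(ti_id, deadlock=ti_id in deadlocked)
        except Exception as exc:
            # The error is recorded and the loop continues,
            # but nothing resets the session.
            results[ti_id] = f"failed: {type(exc).__name__}"
    return results


results = run_callbacks_shared_session([1, 2, 3, 4], deadlocked={2})
for ti_id, outcome in results.items():
    print(ti_id, outcome)
```

In this sketch only TI 2 hits the deadlock, yet TIs 3 and 4 fail as well because they reuse the same session.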

What you think should happen instead?

A new session should be created for each callback in the loop. Short of that, if we fall into the exception block, we should discard the session there and create a new one.
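One way to read this proposal is sketched below, assuming a session-per-callback approach. ToySession, new_session and run_callbacks_isolated are hypothetical names invented for the illustration. They are not the DAG processor's actual code, and a real fix would need to use Airflow's own session handling.

```python
from contextlib import contextmanager


class ToySession:
    """Hypothetical stand-in for an ORM session (assumption, not Airflow code)."""

    def __init__(self):
        self.failed = False
        self.closed = False

    def execute(self, ti_id, deadlock=False):
        if self.failed:
            raise RuntimeError("session unusable until rolled back")
        if deadlock:
            self.failed = True
            raise RuntimeError(f"deadlock detected on TI {ti_id}")
        return f"callback ran for TI {ti_id}"

    def rollback(self):
        self.failed = False

    def close(self):
        self.closed = True


@contextmanager
def new_session():
    """Yield a fresh session; roll back on error and always close it."""
    session = ToySession()
    try:
        yield session
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


def run_callbacks_isolated(ti_ids, deadlocked):
    """Give each callback its own session so one failure cannot spread."""
    results = {}
    for ti_id in ti_ids:
        try:
            with new_session() as session:
                results[ti_id] = session.execute(ti_id, deadlock=ti_id in deadlocked)
        except Exception as exc:
            results[ti_id] = f"failed: {exc}"
    return results


results = run_callbacks_isolated([1, 2, 3, 4], deadlocked={2})
for ti_id, outcome in results.items():
    print(ti_id, outcome)
```

Here only the deadlocked TI fails, and the callbacks for the other TIs still run. The second option from the proposal (resetting the session inside the exception block) would give the same isolation with a single long-lived session, at the cost of having to make sure the reset always happens.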

How to reproduce

  1. Induce multiple on_failure_callbacks.
  2. During callback processing, have another component hold a lock on a TI while the callback for that same TI is trying to execute.

Operating System

Debian

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

Runtime 11.18.0 / Airflow 2.9.3+astro.11

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
