Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix backfill occassional deadlocking #26161

Merged
merged 2 commits into from
Sep 6, 2022

Conversation

ephraimbuddy
Copy link
Contributor

@ephraimbuddy ephraimbuddy commented Sep 5, 2022

During backfilling, we set all task instances of the dagrun being run to scheduled,
this causes deadlocking of the dagrun whenever dagrun.update_state is called and
there's no running or schedulable task instances.
The fix was to remove the batch update of the task instances to scheduled state
and have the executor queue the tasks that the dependencies have been met.

closes: #25353, closes: #26044

@boring-cyborg boring-cyborg bot added the area:Scheduler Scheduler or dag parsing Issues label Sep 5, 2022
@potiuk
Copy link
Member

potiuk commented Sep 5, 2022

Comment: I think it would be nice to get some description of state changes and what State.NONE really is.

I thin it would be really great to get a visual representation of all state changes for tasks in a few "state diagrams". Or maybe I am missing it and it is somewhere? Maybe we can have a nice set of state diagrams in mermaid since it is now nice integrated in GitHub ?

Happy to collaborate on that one.

@kaxil
Copy link
Member

kaxil commented Sep 5, 2022

Comment: I think it would be nice to get some description of state changes and what State.NONE really is.

I thin it would be really great to get a visual representation of all state changes for tasks in a few "state diagrams". Or maybe I am missing it and it is somewhere? Maybe we can have a nice set of state diagrams in mermaid since it is now nice integrated in GitHub ?

Happy to collaborate on that one.

We have one at https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#task-instances but that might not be in mermaid ->

image

(https://airflow.apache.org/docs/apache-airflow/stable/_images/task_lifecycle_diagram.png)

@potiuk
Copy link
Member

potiuk commented Sep 5, 2022

Ah Thanks @kaxil -> I remembered we had one - just could not find it :)

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. So if I understand now - Scheduler will simply move tasks to "Scheduled" state and further and that should prevent the deadlocks as only scheduler(s) will be doing it ?

During backfilling, we set all task instances of the dagrun being run to scheduled,
this causes deadlocking of the dagrun whenever dagrun.update_state is called and
there's no running or schedulable task instances.
The fix was to remove the batch update of the task instances to scheduled state
 and have the executor queue the tasks that the dependencies have been met.
@ephraimbuddy
Copy link
Contributor Author

LGTM. So if I understand now - Scheduler will simply move tasks to "Scheduled" state and further and that should prevent the deadlocks as only scheduler(s) will be doing it ?

I have updated the code. Previously, it would be sent to the executor to be queued but now it follows scheduled -> queued etc.
Thanks for the question :)

@@ -351,18 +351,19 @@ def _task_instances_for_dag_run(self, dag_run, session=None):
dag_run.refresh_from_db()
make_transient(dag_run)

dag_run.dag = dag
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dag_run does not generally have a dag attribute (and I don’t think there’s any code expecting it), why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we call task_instance_scheduling_decision it tries to do self.get_dag() and if the dag attribute is not set, it fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test I added failed due to a lack of dag attribute but that could be because the dagrun on the test does not have the dag attribute. However, I decided that it's better to have the dag attribute set in the code instead of making the test pass by adding the attribute on the dr in the test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn these old code are so hard to follow.

@uranusjr uranusjr merged commit 6931fbf into apache:main Sep 6, 2022
@uranusjr uranusjr deleted the fix-backfill-stalling branch September 6, 2022 09:43
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Sep 13, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Dec 7, 2022
cherry-picked change from the community
apache/airflow#26161

Internal bug

Change-Id: I44c690ed56561adef420b7935947647f417d7f2e
GitOrigin-RevId: f846495d50f8de8412cccff345f6f265fd65adf9
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Dec 8, 2022
cherry-picked change from the community
apache/airflow#26161

Internal bug

Change-Id: I62478c4c1142a00f1f984e5d14d1af7754946b82
GitOrigin-RevId: c582c826563065ef2c7c37213bc2f7a4fdcb81d8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler Scheduler or dag parsing Issues type:bug-fix Changelog: Bug Fixes
Projects
None yet
4 participants