Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bad delete logic for dagruns #32684

Merged
merged 5 commits into from
Jul 19, 2023

Conversation

dstandish
Copy link
Contributor

airflow/www/utils.py Outdated Show resolved Hide resolved
airflow/www/utils.py Outdated Show resolved Hide resolved
dstandish and others added 3 commits July 19, 2023 09:53
@dstandish dstandish merged commit 7092cfd into apache:main Jul 19, 2023
42 checks passed
@dstandish dstandish deleted the fix-bad-delete-logic-for-dagruns branch July 19, 2023 20:27
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Aug 2, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.7.0 milestone Aug 2, 2023
@hussein-awala
Copy link
Member

I spent all day debugging to figure out how our ITs disappear from all dags at the same time, and why most of the dags were stuck (we're using depend_on_past), and after many analyzes of the log table and the RDS query history in datadog, I found that the query that should remove TIs from 2 dag runs removed over 2k TIs.

Luckily I found this PR which confirms my hypothesis. Thank you for this fix @dstandish!

However, I wonder if we can improve our release notes by adding a level for the bug and its impact on the whole stack. For example, this bug is very important and it could have a significant impact on the whole platform, despite this, it is added in the middle of the bug list with a generic commit message:

Fix bad delete logic for dagruns (#32684)

@RNHTTR
Copy link
Collaborator

RNHTTR commented Aug 31, 2023

Yeah this took me many hours to find :( I wonder if we should add a newsfragment?

hussein-awala added a commit to leboncoin/airflow that referenced this pull request Sep 1, 2023
@potiuk
Copy link
Member

potiuk commented Sep 4, 2023

And we have another similar issue from the user who suffered from it I think https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1693805145064939

@potiuk
Copy link
Member

potiuk commented Sep 4, 2023

@hussein-awala @RNHTTR -> what is the way to recover from that when your task instances have been deleted? Would back-filling work You think? Or do you need manually delete the DagRun ?

@lihan
Copy link
Contributor

lihan commented Sep 4, 2023

Related (#34082)

@hussein-awala
Copy link
Member

what is the way to recover from that when your task instances have been deleted? Would back-filling work You think? Or do you need manually delete the DagRun ?

For the old DagRuns I did nothing because they are not important, but for the new ones blocked because of the depends_on_past, I had to delete all these dag runs and the empty ones, and wait the scheduler to re-create them (catchup is True in my case):

DELETE from dag_run dr where dr.dag_id in (
    SELECT distinct (dr.dag_id) as dag_id
    FROM dag_run dr left join task_instance ti on dr.dag_id = ti.dag_id and dr.run_id = ti.run_id and ti.state is not NULL
    WHERE ti.task_id is NULL and dr.execution_date >= '2023-08-30 08:00:00'
) and dr.execution_date >= '2023-08-30 08:00:00';

DELETE from task_instance ti where ti.dag_id in (
    SELECT distinct (dr.dag_id) as dag_id
    FROM dag_run dr left join task_instance ti on dr.dag_id = ti.dag_id and dr.run_id = ti.run_id and ti.state is not NULL
    WHERE ti.task_id is NULL and dr.execution_date >= '2023-08-30 08:00:00'
) and ti.run_id in (
    SELECT distinct (dr.run_id) as run_id
    FROM dag_run dr left join task_instance ti on dr.dag_id = ti.dag_id and dr.run_id = ti.run_id and ti.state is not NULL
    WHERE ti.task_id is NULL and dr.execution_date >= '2023-08-30 08:00:00'
);

Then to fix the issue completely without upgrading to 2.7.0, I cherry-picked the fix (https://github.com/leboncoin/airflow/tree/lbc/2.6.3-r1), built Airflow and pushed it to our private PyPi then I used the patched version instead of the official one.

@potiuk
Copy link
Member

potiuk commented Sep 4, 2023

For the old DagRuns I did nothing because they are not important, but for the new ones blocked because of the depends_on_past, I had to delete all these dag runs and the empty ones, and wait the scheduler to re-create them (catchup is True in my case):

So likely deleting the DagRun and THEN back-filling should also work?

@hussein-awala
Copy link
Member

Yes, the TIs are already deleted from the DB, he can just delete the empty dag runs and recreate them using the backfill command, or the UI (trigger DAG w/ config, then just update the logical date)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:webserver Webserver related Issues type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants