
Add support for S3 dag bundle #46621


Open — wants to merge 15 commits into main from the s3bundle branch

Conversation

ismailsimsek

Implements an S3 dag bundle. With this, end users can set an S3 bucket and prefix (sub-folder) as a dag bundle source.

Related wiki: AIP-66: DAG Bundles & Parsing
Related project: AIP-66: DAG Bundles & Parsing
closes #9555
closes #8657
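
For context, an illustrative sketch of how such a bundle might be used. The import path and constructor arguments below are assumptions based on this PR's discussion, not a confirmed final API:

    # Illustrative only: the module path and argument names are assumptions
    # drawn from this PR's discussion, not necessarily the merged API.
    from airflow.providers.amazon.aws.bundles.s3 import S3DagBundle

    bundle = S3DagBundle(
        name="s3_dags",
        bucket_name="my-dag-bucket",
        prefix="dags/",              # optional sub-folder within the bucket
        aws_conn_id="aws_default",
    )
    bundle.refresh()                 # sync the bucket contents to the local bundle path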


boring-cyborg bot commented Feb 10, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@ismailsimsek
Author

Moved it to the provider package. It's ready for review. Any chance to trigger the test workflows?

@ismailsimsek force-pushed the s3bundle branch 4 times, most recently from f79f69f to d7073b2 on April 4, 2025 18:06
@o-nikolas
Contributor

> Moved it to the provider package. It's ready for review. Any chance to trigger the test workflows?

Triggered workflows again, will review soon 👍

Contributor

@o-nikolas left a comment


Looks reasonable to me (though I'm no expert on Bundles). There are some failing tests and static checks though that need looking at.

@ismailsimsek force-pushed the s3bundle branch 4 times, most recently from 974a4a3 to ed17613 on April 10, 2025 09:49
@ismailsimsek
Author

Code-related tests are passing now. The remaining failure is related to the Airflow version: the BaseDagBundle code is only available in Airflow 3, but the test is running with Airflow 2.9.x. Any idea how to fix it?
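
One common pattern for this situation is a module-level, version-based skip so the suite still passes on Airflow 2.x. A minimal sketch (the Airflow repo may use its own helper for this):

    import pytest
    from packaging.version import Version

    from airflow import __version__ as AIRFLOW_VERSION

    # Skip the whole test module on Airflow 2.x, where BaseDagBundle does not exist.
    pytestmark = pytest.mark.skipif(
        Version(AIRFLOW_VERSION) < Version("3.0.0"),
        reason="DAG bundles (BaseDagBundle) require Airflow >= 3.0",
    )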

@ismailsimsek
Author

@o-nikolas @jedcunningham it's ready for review.

@ismailsimsek requested a review from o-nikolas on April 28, 2025 15:35
@ismailsimsek force-pushed the s3bundle branch 2 times, most recently from 5780ad3 to a1c5c1c on May 3, 2025 18:59
@o-nikolas
Contributor

Looks reasonable to me, but I'm no Bundles specialist. I talked to Jed before he left on leave and he mentioned to loop in @dstandish on Bundles stuff. Does this look reasonable to you as well Daniel?

@potiuk
Member

potiuk commented May 21, 2025

Is it possible to run the following command with podman? It's asking for Docker:

 breeze static-checks --all-files --show-diff-on-failure --color always --initialize-environment

Did you try https://podman-desktop.io/docs/migrating-from-docker/managing-docker-compatibility

Member

@jedcunningham left a comment


Sorry for the delay on reviewing this. Happy to see it!

)
local_s3_objects = []
s3_bucket = self.s3_hook.get_bucket(self.bucket_name)
for obj in s3_bucket.objects.filter(Prefix=self.prefix):
Member


Forgive my s3 ignorance, but is there no higher level construct we can use to bring the local side up to date? Having to iterate through and do our own sync feels a bit odd.

If we do need to do this, it feels like this might be better in the hook - the concept of syncing a directory isn't necessarily s3 bundle specific, but could be more generally useful.

Author


Unfortunately there's nothing in the existing libs; there are only some CLI packages. Moved the sync logic to the S3 hook.

Contributor


Yeah, S3 is very odd in that the API and CLI diverge: the CLI supports a sync function and the API does not 😢
You could potentially shell out to the CLI command, but I don't love that either.
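
For context, the gap being discussed is that boto3 exposes object listing and per-object downloads but no single "sync" call, so the bundle (or hook) has to implement the loop itself. A minimal sketch of that idea using plain boto3 (names are illustrative, and the hook method this PR adds may differ; a real implementation would likely also compare timestamps or ETags instead of re-downloading everything):

    import os
    from pathlib import Path

    import boto3


    def sync_s3_prefix_to_local(bucket_name: str, prefix: str, local_dir: Path) -> None:
        """Mirror every object under ``prefix`` into ``local_dir``; illustrative sketch only."""
        bucket = boto3.resource("s3").Bucket(bucket_name)

        remote_files = set()
        for obj in bucket.objects.filter(Prefix=prefix):
            if obj.key.endswith("/"):  # skip zero-byte "directory" placeholder objects
                continue
            relative = os.path.relpath(obj.key, prefix) if prefix else obj.key
            remote_files.add(relative)
            target = local_dir / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            bucket.download_file(obj.key, str(target))

        # Delete local files that no longer exist under the prefix in S3.
        for path in local_dir.rglob("*"):
            if path.is_file() and str(path.relative_to(local_dir)) not in remote_files:
                path.unlink()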

Member


Gotcha. I'll let you make that call then @o-nikolas.

Contributor


Sorry for missing this. I say we stick with this code, so that we don't have a direct dependency on the CLI.

Member

@jedcunningham left a comment


Overall this is looking good. With some naming and test refactors, I'm happy with this, from my side at least.

@uranusjr
Member

Don’t use caplog in general. I think it’s not too difficult to rewrite the tests.
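
One way to rewrite such assertions without caplog is to patch the object's logger and assert on the call. A minimal, self-contained sketch with illustrative names (not the actual tests in this PR):

    import logging
    from unittest import mock


    class FakeBundle:
        """Stand-in for the real bundle class; illustrative only."""

        log = logging.getLogger("airflow.example.s3_dag_bundle")

        def refresh(self) -> None:
            self.log.warning("ignoring non-DAG object %s", "readme.txt")


    def test_refresh_warns_without_caplog():
        bundle = FakeBundle()
        # Patch the logger method instead of capturing records with caplog.
        with mock.patch.object(bundle.log, "warning") as mock_warning:
            bundle.refresh()
        mock_warning.assert_called_once_with("ignoring non-DAG object %s", "readme.txt")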

@alonahmias

Would really like to see the PR merged 😍

@eladkal
Contributor

eladkal commented Jun 26, 2025

The branch is 600 commits behind. I rebased it. Let's see what CI says.

@o-nikolas
Contributor

I would love to see this merged as well. @ismailsimsek the use of caplog needs to be addressed (i.e. removed). Do you need any help with this?

@ismailsimsek
Author

> I would love to see this merged as well. @ismailsimsek the use of caplog needs to be addressed (i.e. removed). Do you need any help with this?

That would be a great help, thank you! I don't have much time to work on this recently, so I'd really appreciate it.

Successfully merging this pull request may close these issues:

  • AIP-5 Remote DagFetcher
  • Add a configuration option to enable Airflow to look for DAGs in a specified S3 bucket