
Added Apache Arrow provider #52330


Draft
wants to merge 9 commits into main

Conversation

@dabla (Contributor) commented Jun 27, 2025

Added Apache Arrow provider in Airflow and implemented a basic AdbcHook.

@zeroshade


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@potiuk marked this pull request as draft on June 27, 2025 11:12
@potiuk (Member) commented Jun 27, 2025

Just for anyone looking here: this is a draft for discussion between me, @dabla, and @zeroshade. We will still need to start a DISCUSSION thread for the new provider, and we think Arrow and ADBC are a good addition, but we have to discuss the approach first :)

.. IF YOU WANT TO MODIFY TEMPLATE FOR THIS FILE, YOU SHOULD MODIFY THE TEMPLATE
``PROVIDER_README_TEMPLATE.rst.jinja2`` IN the ``dev/breeze/src/airflow_breeze/templates`` DIRECTORY

Package ``apache-airflow-providers-apache-arrow``
Member:

Technically this provider is an apache-arrow-adbc provider? I think we should be explicit about that since an "Arrow" provider could mean many different things (Flight, FlightSQL, ADBC, adapters between pandas/polars/duckdb/etc.)

Contributor Author:

Yes, but it could become more in the future. For example, in Airflow we also have the Microsoft Azure provider, which has multiple integrations available, namely Fabric, PowerBI, msgraph, etc. That could also be the case for the Apache Arrow provider: for now it will only offer the ADBC integration, but that could change in the future. In Airflow, one provider can contain multiple hooks/operators/triggers/sensors for a specific domain, in this case Apache Arrow.

maintainers = [
{name="Apache Software Foundation", email="[email protected]"},
]
keywords = [ "airflow-provider", "apache.arrow", "adbc", "airflow", "integration" ]
Member:

Should it be apache.arrow or apache-arrow, or both?

Contributor Author:

Normally, providers like Apache Kafka (or here, Apache Arrow) have a dot in between; maybe both are possible, I don't know if that's allowed.

Member:

I think for keywords, `-` is better.

Comment on lines +59 to +61

if driver:
    # Wheels bundle the shared library
    root = importlib_resources.files(driver)
Member:

This is only true for the current open-source Python packages; it wouldn't be true for other arbitrary drivers, so this is probably not the best route to go. The adbc_driver_manager package embeds the C++ ADBC driver manager, and that C++ library will perform the necessary searching based on the driver name. Since we aren't explicitly bundling the drivers with this hook provider, there isn't any need to include this method at all.
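
As a minimal sketch of what that delegation could look like (assuming the sqlite driver's shared library is resolvable by the embedded driver manager; see the note below about search paths):

```python
# Minimal sketch: hand the driver name to adbc_driver_manager and let the
# embedded C++ driver manager resolve libadbc_driver_sqlite.so itself,
# instead of probing wheel contents with importlib_resources.
from adbc_driver_manager.dbapi import connect

conn = connect(driver="adbc_driver_sqlite", db_kwargs={"uri": ":memory:"})
try:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    print(cur.fetchone())
finally:
    conn.close()
```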

Contributor Author:

Ok, but how would the adbc_driver_manager then be able to load the driver? Or will this be possible once the manifest is implemented in the adbc_driver_manager like you mentioned in our discussion?

Member:

Currently you would specify the driver to the adbc_driver_manager by providing either a shared lib name on the LD_LIBRARY_PATH or a full path to the desired dynamic library. Once the manifest stuff is implemented, you'd also be able to provide the name of a manifest to load.
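
In other words (a sketch; the path below is a hypothetical example):

```python
from adbc_driver_manager.dbapi import connect

# Bare library name: resolved to libadbc_driver_sqlite.so through the
# loader's search path (e.g. LD_LIBRARY_PATH).
conn = connect(driver="adbc_driver_sqlite", db_kwargs={"uri": ":memory:"})
conn.close()

# Full path to the dynamic library (hypothetical location).
conn = connect(
    driver="/usr/local/lib/libadbc_driver_sqlite.so",
    db_kwargs={"uri": ":memory:"},
)
conn.close()
```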


def get_conn(self) -> Connection:
    return connect(
        driver=self._driver_path(),
Member:

This should likely just be `driver = self.connection_extra_lower.get("driver")`.

Contributor Author:

Then driver would just be a Python string, not the actual driver?

Member:

Even currently, the driver is a string that points to the driver to load (self._driver_path() is the path to the driver to load).
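
So the suggested change boils down to something like this (a sketch, assuming the hook keeps DbApiHook's `connection_extra_lower` and the `uri` property used elsewhere in this draft):

```python
from adbc_driver_manager.dbapi import Connection, connect

def get_conn(self) -> Connection:
    # "driver" is a plain string from the Airflow connection extras; the
    # ADBC driver manager resolves it to the actual shared library.
    driver = self.connection_extra_lower.get("driver")
    return connect(driver=driver, db_kwargs={"uri": self.uri})
```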

return connect(
    driver=self._driver_path(),
    # entrypoint: Optional[str] = None,
    db_kwargs={"uri": self.uri},
Member:

We likely need to add more potential options here that can be passed along.

Contributor Author:

Yes, I just did the bare minimum so we could kick off the discussion.

Member:

And we can also add arbitrary values. A Connection in Airflow can have freely defined "extras", and you can remove all the "known" values from the dict and pass the remaining ones directly as driver args.
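
A sketch of that idea (the "driver" and "entrypoint" keys are the ones already in this draft; everything else is passed straight through):

```python
from adbc_driver_manager.dbapi import Connection, connect

def get_conn(self) -> Connection:
    # Pop the keys the hook itself understands; forward the remaining
    # connection extras directly to the driver as database options.
    extras = dict(self.connection_extra_lower)
    driver = extras.pop("driver", None)
    entrypoint = extras.pop("entrypoint", None)
    return connect(
        driver=driver,
        entrypoint=entrypoint,
        db_kwargs={"uri": self.uri, **extras},
    )
```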

host="file::memory:?cache=shared",
extra=json.dumps(
{
"driver": "adbc_driver_sqlite",
Member:

This requires libadbc_driver_sqlite.so to exist on the LD_LIBRARY_PATH; that might require adding something to the test requirements.
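
One way to cover this in tests (an assumption, not something settled here) is to add the adbc-driver-sqlite wheel to the test requirements: the wheel bundles libadbc_driver_sqlite.so, so the driver can be exercised without any LD_LIBRARY_PATH setup:

```python
# Sketch: the adbc-driver-sqlite wheel ships the shared library, so the
# driver loads without touching LD_LIBRARY_PATH.
import adbc_driver_sqlite.dbapi

conn = adbc_driver_sqlite.dbapi.connect(uri="file::memory:?cache=shared")
try:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    assert cur.fetchone() == (1,)
finally:
    conn.close()
```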

Contributor Author:

@zeroshade Thanks Matt for the thorough review, really appreciate it. I'll go ahead and fix the typos and such.

@vikramkoka (Contributor) commented:

As @potiuk mentioned, I believe this needs a devlist conversation first.
Looking forward to it

@potiuk (Member) commented Jun 28, 2025

> As @potiuk mentioned, I believe this needs a devlist conversation first. Looking forward to it

Yep. This one is mostly to gather learnings, get feedback from @zeroshade, and see how we can turn it into a "convincing" devlist proposal, by showing some use cases and a small POC of the implementation and what it allows :)

We'll experiment a bit with it and gather our thoughts and see what can come out of it.
