Added Apache Arrow provider #52330
Conversation
Just for anyone looking here: this is a draft for discussion between @dabla, @zeroshade, and me. We will still need to start a DISCUSSION thread for the new provider, and we think Arrow and ADBC would be a good addition. But we have to discuss the approach first :)
```rst
.. IF YOU WANT TO MODIFY TEMPLATE FOR THIS FILE, YOU SHOULD MODIFY THE TEMPLATE
   ``PROVIDER_README_TEMPLATE.rst.jinja2`` IN the ``dev/breeze/src/airflow_breeze/templates`` DIRECTORY

Package ``apache-airflow-providers-apache-arrow``
```
Technically this provider is an `apache-arrow-adbc` provider? I think we should be explicit about that, since an "Arrow" provider could mean many different things (Flight, FlightSQL, ADBC, adapters between pandas/polars/duckdb/etc.).
Yes, but it could become more in the future. For example, in Airflow we also have the Microsoft Azure provider, which bundles multiple integrations (Fabric, Power BI, MSGraph, etc.). The same could happen for the Apache Arrow provider: for now it will only offer the ADBC integration, but that could change. An Airflow provider can contain multiple hooks/operators/triggers/sensors for one specific domain, in this case Apache Arrow.
```toml
maintainers = [
    {name="Apache Software Foundation", email="[email protected]"},
]
keywords = [ "airflow-provider", "apache.arrow", "adbc", "airflow", "integration" ]
```
should it be `apache.arrow` or `apache-arrow` or both?
Normally providers like Apache Kafka, or here Apache Arrow, have a dot in between. Maybe both are possible; I don't know whether that's allowed.
I think for keywords, `-` is better.
```python
if driver:
    # Wheels bundle the shared library
    root = importlib_resources.files(driver)
```
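For illustration, here is roughly what that lookup does in isolation; the package name is an assumed example, since the surrounding hook code is not shown in the quoted hunk:

```python
# Hypothetical standalone version of the lookup above. It only works for
# drivers shipped as Python wheels (e.g. adbc_driver_sqlite), because those
# wheels bundle the shared library inside the installed package directory.
import importlib.resources as importlib_resources

root = importlib_resources.files("adbc_driver_sqlite")
print(root)  # e.g. .../site-packages/adbc_driver_sqlite
```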
This is only true for the current open source Python packages; it wouldn't be true for other arbitrary drivers, so this is probably not the best route to go. The `adbc_driver_manager` package embeds the C++ ADBC driver manager, and that C++ library performs the necessary searching based on the driver name. Since we aren't explicitly bundling the drivers with this hook provider, there isn't any need to include this method at all.
Ok, but how would the `adbc_driver_manager` then be able to load the driver? Or will this be possible once manifest support is implemented in the `adbc_driver_manager`, like you mentioned in our discussion?
Currently you would specify the driver to the `adbc_driver_manager` by providing either a shared lib on the `LD_LIBRARY_PATH` or a full path to the desired dynamic library. Once the manifest stuff is implemented, you'd also be able to provide the name of the manifest to load.
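In code, the two modes described here might look like the following sketch via the `adbc_driver_manager` DBAPI layer (the install path is hypothetical):

```python
from adbc_driver_manager import dbapi

# 1. Bare driver name: the embedded C++ driver manager searches the dynamic
#    loader paths (e.g. LD_LIBRARY_PATH on Linux) for libadbc_driver_sqlite.so.
conn = dbapi.connect(driver="adbc_driver_sqlite", db_kwargs={"uri": ":memory:"})

# 2. Full path to the shared library, bypassing the search entirely.
conn = dbapi.connect(
    driver="/usr/local/lib/libadbc_driver_sqlite.so",  # hypothetical path
    db_kwargs={"uri": ":memory:"},
)
```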
```python
def get_conn(self) -> Connection:
    return connect(
        driver=self._driver_path(),
```
this should likely just be `driver = self.connection_extra_lower.get("driver")`
Then `driver` would just be a Python string, not the actual driver?
Even currently, the `driver` is a string that points to the driver to load (`self._driver_path()` is the path to the driver to load).
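Putting the two comments together, a hedged sketch of `get_conn` along the suggested lines; the hook internals here are inferred from the diff fragments in this thread, not the PR's actual implementation:

```python
from adbc_driver_manager.dbapi import Connection, connect

# Sketch of an AdbcHook method, under the assumptions above.
def get_conn(self) -> Connection:
    # Either a bare driver name or a full path to the shared library works;
    # the driver manager resolves the string at load time.
    driver = self.connection_extra_lower.get("driver")
    return connect(driver=driver, db_kwargs={"uri": self.uri})
```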
```python
return connect(
    driver=self._driver_path(),
    # entrypoint: Optional[str] = None,
    db_kwargs={"uri": self.uri},
```
we likely need to add more potential options here that can be passed along
Yes, I just did the bare minimum so we could kick off the discussion.
And we can also add arbitrary values. A Connection in Airflow can have "extras" freely defined, so you can remove all the "known" values from the dict and pass the remaining ones directly as driver args.
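A sketch of that pass-through, with assumed names (`RESERVED_KEYS` and the helper are hypothetical, not part of the draft):

```python
RESERVED_KEYS = {"driver", "entrypoint"}  # keys the hook itself consumes

# Hypothetical AdbcHook helper: everything else in the connection extras is
# forwarded verbatim as driver options.
def _db_kwargs(self) -> dict:
    extras = dict(self.connection_extra_lower)  # copy before mutating
    for key in RESERVED_KEYS:
        extras.pop(key, None)
    extras["uri"] = self.uri  # the URI always comes from the connection itself
    return extras  # passed as db_kwargs to connect()
```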
host="file::memory:?cache=shared", | ||
extra=json.dumps( | ||
{ | ||
"driver": "adbc_driver_sqlite", |
this requires `libadbc_driver_sqlite.so` existing on the `LD_LIBRARY_PATH`; that might require adding something to the test requirements
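For context, the hunk above appears to come from a test fixture along these lines; the `conn_id` and `conn_type` values are assumptions:

```python
import json

from airflow.models import Connection

# Sketch of an Airflow Connection pointing the hook at a shared in-memory
# SQLite database through the ADBC SQLite driver.
ADBC_CONN = Connection(
    conn_id="adbc_default",
    conn_type="adbc",
    host="file::memory:?cache=shared",
    extra=json.dumps({"driver": "adbc_driver_sqlite"}),
)
```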
@zeroshade Thanks Matt for the thorough review, really appreciate it. I'll go ahead and fix the typos and such.
As @potiuk mentioned, I believe this needs a devlist conversation first.
Yep. This one is mostly to gather learnings, get feedback from @zeroshade, and see how we can turn it into a "convincing" devlist proposal, by showing some use cases and a small POC of the implementation and what it allows :). We'll experiment a bit with it, gather our thoughts, and see what can come out of it.
Added Apache Arrow provider in Airflow and implemented basic AdbcHook
@zeroshade
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in airflow-core/newsfragments.