-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[AIRFLOW-7104] Add Secret backend for GCP Secrets Manager #7795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@kaxil PTAL Thanks! |
**kwargs | ||
): | ||
self.connections_prefix = connections_prefix.rstrip("/") | ||
self.credentials, self.project_id = get_credentials_and_project_id( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The get_credentials_and_project_id method is an expensive operation. This causes an HTTP request/requests to be sent. Can you move it out of the constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it inside the client now:
@cached_property
def client(self) -> SecretManagerServiceClient:
"""
Create an authenticated KMS client
"""
scopes = _get_scopes(self.gcp_scopes)
self.credentials, self.project_id = get_credentials_and_project_id(
key_path=self.gcp_key_path,
scopes=scopes
)
_client = SecretManagerServiceClient(
credentials=self.credentials,
client_info=ClientInfo(client_library_version='airflow_v' + version.version)
)
return _client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick question: I understand that get_credentials_and_project_id
can be expensive, and it seems to be a good practice within the project. But I don't understand the impact of putting it in the constructor vs. in other methods.
Doesn't we always call the client method after constructing an instance? Otherwise, what would be the reason for constructing the instance without calling the method?
Another confusion is that if we put it in the constructor, we only need to call the method once during instantiation, and then we can keep calling other methods without invoking the HTTP request. Wouldn't it be better than making this HTTP request every time when we call the method that contains it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason for avoiding db / API calls in the constructor (in operators) is to avoid requests during DAG parsing. However, in this case, I don't think any operator will call GcpSecretsManagerSecretsBackend()
in its constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic because the backend is initialized/constructor during the initialization of the airflow.secret
module.
If you write something similar to the code below
from airflow.secret import CONFIG_SECTION
An HTTP request will be sent. You don't have to use this class for it to happen. It's enough to load it in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked more closely. Even the following code is enough to initiate the backend.
from airflow.models.base_hook import BaseHook
So it is done very often.
Codecov Report
@@ Coverage Diff @@
## master #7795 +/- ##
==========================================
- Coverage 86.93% 86.20% -0.73%
==========================================
Files 924 925 +1
Lines 44674 44836 +162
==========================================
- Hits 38836 38651 -185
- Misses 5838 6185 +347
Continue to review full report at Codecov.
|
[secrets] | ||
backend = airflow.providers.hashicorp.secrets.vault.VaultSecrets | ||
backend_kwargs = {"path":"connections","url":"http://127.0.0.1:8200","mount_point":"airflow"} | ||
backend_kwargs = {"connections_path":"conns","url":"http://127.0.0.1:8200","mount_point":"airflow"} | ||
|
||
For example, if your keys are under ``connections`` path in ``airflow`` mount_point, this | ||
would be accessible if you provide ``{"path": "connections"}`` and request | ||
For example, if your keys are under ``conns`` path in ``airflow`` mount_point, this | ||
would be accessible if you provide ``{"connections_path": "conns"}`` and request | ||
conn_id ``smtp_default``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@turbaszek @kaxil I changed connections
to conns
in the docstring to reduce the line length < 110.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use connections
and just break the lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer connections
too, please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be I should have asked this before. For some reason, I was under the impression that we can not break lines in the .cfg file. So can we break lines?
Something like this.
backend_kwargs =
{"connections_path":"connections","url":"http://127.0.0.1:8200","mount_point":"airflow"}
# or
backend_kwargs = {"connections_path":"connections",
"url":"http://127.0.0.1:8200",
"mount_point":"airflow"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiline config are handled fine as long as they are indented.
Example:
[Multiline Values]
chorus: I'm a lumberjack, and I'm okay
I sleep all night and I work all-day
I am also fine for you to disable long-line check for this particular case so it is easy for users to copy/paste.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I just updated the docstring with multiline values.
e0ae431
to
03b2d9e
Compare
Rebase with the master because of the |
""" | ||
secret_id = self.build_secret_id(conn_id=conn_id) | ||
# always return the latest version of the secret | ||
secret_version = "latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there. I'm Seth from the Secret Manager team. Thanks for building this integration. Feel free to reach out if you have future questions.
We strongly discourage users from pinning to "latest" in production workloads. Users should specifically pin to a version (e.g. 1, 2, 3). Is there a way we can update this to allow the user to specify a version id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello. I'm Kamil from Polidea. We are a Google partner and we are building many integrations for Google services at the request of Cloud Composer.
We have an abstraction layer that does not provide for this possibility, but I think we can introduce. special parameter syntax that lets you reference a specific version.
But I will explain the small context of how this change works so that you can make better design decisions. In previous versions of Airflow, all data was stored in a database (this behavior may have been overwritten by environment variables, so you can also store secret data in env var).
The user selects the connection using the operator parameter. The following example will use connections with ID = `'my-company-id``.
download_file = GCSToLocalOperator(
task_id="download_file",
object_name=BUCKET_FILE_LOCATION,
bucket=BUCKET_1,
filename=PATH_TO_SAVED_FILE,
gcp_conn_id='my-company-id`
)
Please note that the user does not specify the type of backend secret. It is a universal mechanism and this data may come from a database, from environment variables or from another service.
What do you think so that the user can define the connection version after the colon? The sample call will then look like this.
download_file = GCSToLocalOperator(
task_id="download_file",
object_name=BUCKET_FILE_LOCATION,
bucket=BUCKET_1,
filename=PATH_TO_SAVED_FILE,
gcp_conn_id='my-company-id:2'
)
It won't be a perfect solution, but I think it's easy to implement and easy to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethvargo Thanks for letting us know. Why is it not recommended, at least from Airflow's perspective, our users would want to pull the secrets with the most recent version (hence 'latest').
Can you give us some context to let us know how a user who does not know how many versions of a secret exists pull it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kaxil - here's the protocol we'd like most integrators to try and align on.
Latest is a floating alias that resolves to "the highest numbered secret version" for the secret. Often times, the team creating the secret is separate from the team consuming the secret. In these cases, the latest version may not be active or usable yet.
Generally, you want a workload to be coupled to a specific secret so you can roll forward and roll backward in the event something breaks.
It's fine to allow users to specify latest, but I don't think it should be the default behavior. You can see how we recommend specifying the format in the linked issue above.
Add Secret backend for GCP Secrets Manager
Issue link: AIRFLOW-7104
Make sure to mark the boxes below before creating PR: [x]
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.