Skip to content

[AIRFLOW-7104] Add Secret backend for GCP Secrets Manager #7795

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Mar 23, 2020

Conversation

xinbinhuang
Copy link
Contributor

@xinbinhuang xinbinhuang commented Mar 21, 2020

Add Secret backend for GCP Secrets Manager


Issue link: AIRFLOW-7104

Make sure to mark the boxes below before creating PR: [x]


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added area:docs area:secrets provider:google Google (including GCP) related issues labels Mar 21, 2020
@xinbinhuang
Copy link
Contributor Author

@kaxil PTAL Thanks!

**kwargs
):
self.connections_prefix = connections_prefix.rstrip("/")
self.credentials, self.project_id = get_credentials_and_project_id(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_credentials_and_project_id method is an expensive operation. This causes an HTTP request/requests to be sent. Can you move it out of the constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it inside the client now:

@cached_property
def client(self) -> SecretManagerServiceClient:
        """
        Create an authenticated KMS client
        """
        scopes = _get_scopes(self.gcp_scopes)
        self.credentials, self.project_id = get_credentials_and_project_id(
            key_path=self.gcp_key_path,
            scopes=scopes
        )
        _client = SecretManagerServiceClient(
            credentials=self.credentials,
            client_info=ClientInfo(client_library_version='airflow_v' + version.version)
        )
        return _client

Copy link
Contributor Author

@xinbinhuang xinbinhuang Mar 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick question: I understand that get_credentials_and_project_id can be expensive, and it seems to be a good practice within the project. But I don't understand the impact of putting it in the constructor vs. in other methods.

Doesn't we always call the client method after constructing an instance? Otherwise, what would be the reason for constructing the instance without calling the method?

Another confusion is that if we put it in the constructor, we only need to call the method once during instantiation, and then we can keep calling other methods without invoking the HTTP request. Wouldn't it be better than making this HTTP request every time when we call the method that contains it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for avoiding db / API calls in the constructor (in operators) is to avoid requests during DAG parsing. However, in this case, I don't think any operator will call GcpSecretsManagerSecretsBackend() in its constructor

Copy link
Member

@mik-laj mik-laj Mar 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic because the backend is initialized/constructor during the initialization of the airflow.secret module.
If you write something similar to the code below

from airflow.secret import CONFIG_SECTION

An HTTP request will be sent. You don't have to use this class for it to happen. It's enough to load it in memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked more closely. Even the following code is enough to initiate the backend.

from airflow.models.base_hook import BaseHook

So it is done very often.

@codecov-io
Copy link

codecov-io commented Mar 22, 2020

Codecov Report

Merging #7795 into master will decrease coverage by 0.72%.
The diff coverage is 95.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7795      +/-   ##
==========================================
- Coverage   86.93%   86.20%   -0.73%     
==========================================
  Files         924      925       +1     
  Lines       44674    44836     +162     
==========================================
- Hits        38836    38651     -185     
- Misses       5838     6185     +347     
Impacted Files Coverage Δ
airflow/providers/hashicorp/secrets/vault.py 76.47% <0.00%> (+11.03%) ⬆️
...oviders/google/cloud/utils/credentials_provider.py 93.58% <96.66%> (+1.75%) ⬆️
airflow/providers/google/cloud/hooks/base.py 97.02% <100.00%> (+1.34%) ⬆️
.../providers/google/cloud/secrets/secrets_manager.py 100.00% <100.00%> (ø)
...flow/providers/apache/cassandra/hooks/cassandra.py 21.51% <0.00%> (-72.16%) ⬇️
...w/providers/apache/hive/operators/mysql_to_hive.py 35.84% <0.00%> (-64.16%) ⬇️
airflow/kubernetes/volume_mount.py 44.44% <0.00%> (-55.56%) ⬇️
airflow/providers/postgres/operators/postgres.py 50.00% <0.00%> (-50.00%) ⬇️
airflow/providers/redis/operators/redis_publish.py 50.00% <0.00%> (-50.00%) ⬇️
airflow/kubernetes/volume.py 52.94% <0.00%> (-47.06%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fee827e...3a4b925. Read the comment docs.

Comment on lines 41 to 51
[secrets]
backend = airflow.providers.hashicorp.secrets.vault.VaultSecrets
backend_kwargs = {"path":"connections","url":"http://127.0.0.1:8200","mount_point":"airflow"}
backend_kwargs = {"connections_path":"conns","url":"http://127.0.0.1:8200","mount_point":"airflow"}

For example, if your keys are under ``connections`` path in ``airflow`` mount_point, this
would be accessible if you provide ``{"path": "connections"}`` and request
For example, if your keys are under ``conns`` path in ``airflow`` mount_point, this
would be accessible if you provide ``{"connections_path": "conns"}`` and request
conn_id ``smtp_default``.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turbaszek @kaxil I changed connections to conns in the docstring to reduce the line length < 110.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use connections and just break the lines?

Copy link
Member

@kaxil kaxil Mar 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer connections too, please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be I should have asked this before. For some reason, I was under the impression that we can not break lines in the .cfg file. So can we break lines?

Something like this.

backend_kwargs = 
        {"connections_path":"connections","url":"http://127.0.0.1:8200","mount_point":"airflow"}

# or
backend_kwargs = {"connections_path":"connections",
                  "url":"http://127.0.0.1:8200",
                  "mount_point":"airflow"}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiline config are handled fine as long as they are indented.

Example:

[Multiline Values]
chorus: I'm a lumberjack, and I'm okay
    I sleep all night and I work all-day

I am also fine for you to disable long-line check for this particular case so it is easy for users to copy/paste.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I just updated the docstring with multiline values.

@xinbinhuang xinbinhuang force-pushed the add_kms_secret_backend branch from e0ae431 to 03b2d9e Compare March 22, 2020 20:49
@xinbinhuang
Copy link
Contributor Author

Rebase with the master because of the elasticsearch breaking change. Travis should be green after the run.

"""
secret_id = self.build_secret_id(conn_id=conn_id)
# always return the latest version of the secret
secret_version = "latest"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there. I'm Seth from the Secret Manager team. Thanks for building this integration. Feel free to reach out if you have future questions.

We strongly discourage users from pinning to "latest" in production workloads. Users should specifically pin to a version (e.g. 1, 2, 3). Is there a way we can update this to allow the user to specify a version id?

Copy link
Member

@mik-laj mik-laj May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello. I'm Kamil from Polidea. We are a Google partner and we are building many integrations for Google services at the request of Cloud Composer.

We have an abstraction layer that does not provide for this possibility, but I think we can introduce. special parameter syntax that lets you reference a specific version.

But I will explain the small context of how this change works so that you can make better design decisions. In previous versions of Airflow, all data was stored in a database (this behavior may have been overwritten by environment variables, so you can also store secret data in env var).

The user selects the connection using the operator parameter. The following example will use connections with ID = `'my-company-id``.

    download_file = GCSToLocalOperator(
        task_id="download_file",
        object_name=BUCKET_FILE_LOCATION,
        bucket=BUCKET_1,
        filename=PATH_TO_SAVED_FILE,
        gcp_conn_id='my-company-id`
    )

Please note that the user does not specify the type of backend secret. It is a universal mechanism and this data may come from a database, from environment variables or from another service.

What do you think so that the user can define the connection version after the colon? The sample call will then look like this.

    download_file = GCSToLocalOperator(
        task_id="download_file",
        object_name=BUCKET_FILE_LOCATION,
        bucket=BUCKET_1,
        filename=PATH_TO_SAVED_FILE,
        gcp_conn_id='my-company-id:2'
    )

It won't be a perfect solution, but I think it's easy to implement and easy to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethvargo Thanks for letting us know. Why is it not recommended, at least from Airflow's perspective, our users would want to pull the secrets with the most recent version (hence 'latest').

Can you give us some context to let us know how a user who does not know how many versions of a secret exists pull it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kaxil - here's the protocol we'd like most integrators to try and align on.

Latest is a floating alias that resolves to "the highest numbered secret version" for the secret. Often times, the team creating the secret is separate from the team consuming the secret. In these cases, the latest version may not be active or usable yet.

Generally, you want a workload to be coupled to a specific secret so you can roll forward and roll backward in the event something breaks.

It's fine to allow users to specify latest, but I don't think it should be the default behavior. You can see how we recommend specifying the format in the linked issue above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:dev-tools area:secrets provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants