-
Notifications
You must be signed in to change notification settings - Fork 15.3k
Improve GCSToSFTPOperator paths handling #11284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The CI and PROD Docker Images for the build are prepared in a separate "Build Image" workflow, You can checks the status of those images in The workflow run |
Should we make this configurable to avoid breaking change? I see the point of downloading a single file but this may be problematic when downloading multiple files with the same name (for example structure like |
From what I understood by reading the code, only the part up to the wildcard character is stripped from the Did I get this right, @TobKed? |
@olchas , yes, you are right. In your example of wildcard folders will be created. @turbaszek I made this behavior optional for backward compatibility. PTAL |
@@ -52,6 +52,9 @@ class GCSToSFTPOperator(BaseOperator): | |||
:param destination_path: The sftp remote path. This is the specified directory path for | |||
uploading to the SFTP server. | |||
:type destination_path: str | |||
:param ignore_gcs_path_in_destination: (Optional) When set to False the path of the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It confused me a little bit. Can you explain this parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example: If you want to copy file from bucket: gs://bucket_name/data/2020/file.txt
to /tmp
directory, by default it will be copied reflecting object path on the bucket, so file path on sftp will be /tmp/data/2020/file.txt
. If parameter ignore_gcs_path_in_destination
will be set to True the file path on sftp will be /tmp/file.txt
.
I am open for proposals for better name for this parameter :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I do not have better solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an example to the description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Example was added.
590be6c
to
6b1e58e
Compare
if self.ignore_gcs_path_in_destination: | ||
source_object_basename = os.path.basename(self.source_object) | ||
destination_path = os.path.join(self.destination_path, source_object_basename) | ||
else: | ||
destination_path = os.path.join(self.destination_path, self.source_object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
def _resolve_destination_path(self, prefix=None):
if self.ignore_gcs_path_in_destination:
source_object_basename = os.path.basename(self.source_object, start=prefix)
return os.path.join(self.destination_path, source_object_basename)
else:
return os.path.join(self.destination_path, self.source_object)
then we can avoid duplication of the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @turbaszek , it is a great suggestion.
I modified it slightly: 9b9ef60
ec06638
to
e7694f4
Compare
@@ -52,6 +52,9 @@ class GCSToSFTPOperator(BaseOperator): | |||
:param destination_path: The sftp remote path. This is the specified directory path for | |||
uploading to the SFTP server. | |||
:type destination_path: str | |||
:param ignore_gcs_path_in_destination: (Optional) When set to False the path of the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:param ignore_gcs_path_in_destination: (Optional) When set to False the path of the file | |
:param use_relative_paths: (Optional) When set to False the path of the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or keep_directory_structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to keep_directory_structure
. Thanks @mik-laj !
DESTINATION_SFTP = "destination_path" | ||
|
||
|
||
# pylint: disable=unused-argument | ||
class TestGoogleCloudStorageToSFTPOperator(unittest.TestCase): | ||
@parameterized.expand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect 💪
5276964
to
40f928f
Compare
task_id="file-move-gsc-to-sftp", | ||
source_bucket=BUCKET_SRC, | ||
source_object=OBJECT_SRC_2, | ||
destination_path=DESTINATION_PATH_1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use values here instead of undefined constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed it. Thank you! FIxed.
40f928f
to
19d2004
Compare
I rebased on the latest master and fixed the logic mistake. Previously i forget that after changing parameter from |
We test it together with the customer. Pllease wait for customer feedback. |
e85e838
to
0cfa7d9
Compare
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
7da3d9b
to
8466393
Compare
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
8466393
to
a3192d4
Compare
@mik-laj any feedback about your tests? |
The PR needs to run all tests because it modifies core of Airflow! Please rebase it to latest master or ask committer to re-run it! |
As far as I know, they were successful. Tobias has more details. |
a3192d4
to
f4cde05
Compare
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
7b59717
to
50931c9
Compare
50931c9
to
2036748
Compare
This does not appear to be working for me - the documentation is unclear which should be used to exclude the GCS path structure from the destination path - True or False? I have tested with both and with both it is still adding elements of the GCS filepath that I do not want to include. This is a useful feature in some cases, but it feels like it should be default that the user controls definition of the source and destination paths without any unexpected behaviour. With
With
|
@marksworn I am sorry to hear that. Would like to create issue with link to this PR? Would you like to create fix for it? I would be happy to assist you with this. |
Improve GCSToSFTPOperator paths handling.
Now for every file on GCS is created separate folder on SFTP.
now:
EDIT:
I made previous behavior default (for backward compatibility) but possible to change it by
ignore_gcs_path_in_destination
argument: