Feat: Onboard The General Index Dataset #342

gkodukula · 2022-04-13T17:20:31Z

Checklist

Note: If an item applies to you, all of its sub-items must be fulfilled

(Required) This pull request is appropriately labeled
Please merge this pull request after it's approved
I'm adding or editing a dataset
- The Google Cloud Datasets team is aware of the proposed dataset
- I put all my code inside datasets/the_general_index and nothing outside of that directory

happyhuman · 2022-04-13T17:27:37Z

datasets/the_general_index/pipelines/_images/run_csv_transform_kub/csv_transform.py

+ ".csv", "-" + str(chunk_number) + ".csv"
+ )
+ df = pd.DataFrame()
+ df = pd.concat([df, chunk])


Looks like you are concatenating an empty DataFrame to chunk each time here. I think the result of lines 209 and 210 is the same as just df = chunk.

@happyhuman Gowtham and I are working on this together. It turns out that some rows in the source files exceed the maximum row string length that pandas accepts when reading CSV, and pandas does not provide a way to raise that maximum. For this implementation we therefore need to rewrite the file-reading process to use the csv module ("import csv") instead. As a result, we are not yet ready to push to production.
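The move to the standard-library csv module described in this reply could look roughly like the sketch below. It is not the code from this PR: the names `raise_field_size_limit` and `iter_csv_chunks`, the `chunk_size` parameter, and the in-memory demo data are all hypothetical. The only library facts relied on are `csv.field_size_limit`, which raises the csv module's per-field size cap, and `csv.DictReader`.

```python
import csv
import io
import sys
from typing import Iterator, List, TextIO


def raise_field_size_limit() -> int:
    """Raise the csv module's per-field size cap as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms reject values larger than a C long; back off and retry.
            limit //= 10


def iter_csv_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of row dicts, with at most chunk_size rows per list."""
    reader = csv.DictReader(handle)
    chunk: List[dict] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


raise_field_size_limit()

# Hypothetical data: one field is longer than the csv module's default
# limit of 131072 characters.
long_text = "x" * 200_000
demo = io.StringIO(f"id,text\n1,{long_text}\n2,short\n3,also short\n")
chunks = list(iter_csv_chunks(demo, chunk_size=2))
```

Each chunk is a list of dicts, so it can be passed to `pd.DataFrame(chunk)` directly, which would also make the concatenation with an empty DataFrame unnecessary.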

happyhuman · 2022-04-13T17:31:38Z

datasets/the_general_index/pipelines/_images/run_csv_transform_kub/csv_transform.py

+) -> pd.DataFrame:
+ for column in null_string_list:
+ logging.info(f"Removing null strings from column {column}")
+ df[column] = df[column].str.replace("\\N", "", regex=False)


Just wondering if it should be "\n"?

@happyhuman The "\N" is literal text in the data: it was written in place of null when the source data was dumped, so we are not actually looking for newlines. The "\N" is therefore correct here.
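To illustrate the distinction made in this reply: in Python source, "\\N" is a two-character string (a backslash followed by N), whereas "\n" is a single newline character. The sketch below uses made-up sample values and plain `str.replace` as a stand-in for the pandas `.str.replace("\\N", "", regex=False)` call in the diff.

```python
# Backslash + N: the literal text the reply says the dump uses for nulls.
NULL_MARKER = "\\N"

# Hypothetical sample values, not taken from the dataset.
raw_values = ["\\N", "10.1000/xyz", "line one\nline two", "prefix \\N"]

# Literal (non-regex) replacement removes the marker and nothing else.
cleaned = [value.replace(NULL_MARKER, "") for value in raw_values]
```

The third sample value keeps its real newline, since only the literal backslash-N marker is removed.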

happyhuman · 2022-04-13T17:38:08Z

datasets/the_general_index/pipelines/_images/run_csv_transform_kub/csv_transform.py

+
+
+def convert_dt_format(dt_str: str) -> str:
+ if not dt_str or str(dt_str).lower() == "nan" or str(dt_str).lower() == "nat":


If dt_str can be None, or nan or nat, a good way to check them all is to use pd.notnull(dt_str).

Also, if dt_str is expected to be of type str, then using str(dt_str) seems redundant.

@happyhuman The nan/nat issue has come up in previous PRs, and using notnull has not been successful before. However, we will try both suggestions again to see if we can get them to work. Thanks for the reminder.

@happyhuman pd.notnull will not work for NaN or NaT. Please see https://pandas.pydata.org/docs/reference/api/pandas.notnull.html

The only way to access lower is through the "str" object, so the only other alternative would be str.lower(dt_str), which is not much different, so I think this is moot. Let me know whether you still want me to change it to str.lower(dt_str).
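A note for anyone revisiting this thread: `pd.notnull` returns False for real `None`, `float("nan")`, and `pd.NaT` values, but returns True for the strings "nan" and "NaT", because those are ordinary non-null strings. If the values reach `convert_dt_format` already converted to strings (an assumption; the thread does not say), that would explain why the string comparison is still needed. Below is a stdlib-only sketch of a check that covers both cases. The helper name `is_missing_date` is hypothetical and is not part of this PR.

```python
import math
from typing import Any


def is_missing_date(value: Any) -> bool:
    """Return True for None, float NaN, empty strings, and 'nan'/'nat' text."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # str() also covers non-string inputs whose text form is "nan" or "NaT".
    return str(value).lower() in {"", "nan", "nat"}
```

With a helper like this, the condition at the top of `convert_dt_format` could be reduced to a single call, for example `if is_missing_date(dt_str):`.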

nlarge-google · 2022-05-26T23:58:06Z

@gkodukula this is failing in dev. please review and fix. thanks!

gkodukula added 11 commits March 31, 2022 18:59

feat: onboarding early un sdg version

1a8f539

feat: onboarding early un sdg version

27263b0

feat: Initial checkin. Not production ready.

f8662e5

fix: loads test data into dataframe. Not production ready.

507169b

fix: issues with datatypes when loading data.

6d5796a

fix: Tested locally. runs as intended.

ad0b260

fix: Changes to variables in pipeline.yaml preparing for multiple pipelines.

98be1ad

fix: misc changes. not ready for production.

5ad216b

fix: completed loading one file from end-to-end.

0c462a4

fix: expanded pipeline.yaml to include all data files for download and processing

8ed0bd5

fix: Added deletion of source data if it is being reloaded.

28316b4

gkodukula requested review from adlersantos, happyhuman and nlarge-google April 13, 2022 17:20

Merge remote-tracking branch 'upstream/main' into general_index

a7c8756

happyhuman reviewed Apr 13, 2022

gkodukula added 4 commits April 13, 2022 18:20

fix: next update

7920beb

fix: Testing in AF.

400747b

fix: Increased number of nodes in cluster in AF to ensure resources are available for running the pipelines.

a949057

fix: Resolved download file reference in code.

162680c

nlarge-google changed the title ~~Feat: Onboard The General index Dataset~~ Feat: Onboard The General Index Dataset Apr 15, 2022

gkodukula added 4 commits April 18, 2022 18:18

fix: Fixed delete source data function.

f759c0f

fix: resolved yaml lint issues

8b75e39

fix: black errors

7b51e69

fix: black errors

d36c350

gkodukula assigned adlersantos and nlarge-google Apr 22, 2022

fix: code fix, production ready

1884d9e

fix: Fixed the bad charecters from data giving loading issues. Production ready

ca16625

nlarge-google approved these changes Jun 8, 2022

nlarge-google merged commit 67d7216 into GoogleCloudPlatform:main Jun 8, 2022

release-please bot mentioned this pull request Jun 8, 2022

chore(main): release 4.1.0 #366

Merged
