
feat: add write_engine parameter to read_FORMATNAME methods to control how data is written to BigQuery #371

Open

tswast wants to merge 18 commits into main

Conversation

tswast (Collaborator) commented on Feb 6, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 323176126
🦕
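
For orientation, a hypothetical usage sketch of the new parameter. The write_engine values are the ones exercised by the system tests quoted later in this thread; the semantics noted in the comments are assumptions inferred from the discussion, not finalized documentation.

import bigframes.pandas as bpd

# Hypothetical sketch; the path is an example and the per-value semantics
# below are assumptions inferred from this thread, not finalized docs.
#   "default"            - let the library choose how to write the data
#   "bigquery_inline"    - inline small data directly into the generated SQL
#   "bigquery_load"      - write the data via a BigQuery load job
#   "bigquery_streaming" - write the data via the streaming insert API
df = bpd.read_csv(
    "gs://my-bucket/data.csv",  # example path, not from this PR
    write_engine="bigquery_load",
)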

product-auto-label bot added the size: m and api: bigquery labels on Feb 6, 2024, replaced size: m with size: l on Feb 7, and replaced size: l with size: xl on Feb 9.
tswast (Collaborator, Author) commented on Feb 9, 2024

Blocked by googleapis/python-bigquery#1815

tswast marked this pull request as ready for review on February 29, 2024 at 17:48 and requested reviews from a team (as code owners) and from shobsi.
tswast (Collaborator, Author) commented in a review thread:

Might make sense to expose some of these methods in a public place since they are shared by the client library, pandas-gbq, and now bigframes.

A similar method is exposed publicly at

https://github.com/googleapis/python-bigquery-pandas/blob/b3b9202980a17faa9df6fc8bba7785d984452893/pandas_gbq/gbq.py#L1147

but it's been deprecated for quite some time. I think we may want to revisit that deprecation and revive it. Will need to think carefully about circular dependencies, though.

tswast (Collaborator, Author) commented on Mar 1, 2024

Test failure is a real one:

E TypeError: Object of type bool_ is not JSON serializable
=================================== FAILURES ===================================
_____________ test_read_csv_gcs_default_engine[bigquery_streaming] _____________
[gw19] linux -- Python 3.11.6 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x7f0d0f897cd0>
scalars_dfs = ( bool_col bytes_col
rowindex ...... 2038-01-19 03:14:17.999999+00:00
8 False ...

[9 rows x 13 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20240229220731_1845bf/'
write_engine = 'bigquery_streaming'

@skip_legacy_pandas
@pytest.mark.parametrize(
    ("write_engine",),
    (
        ("default",),
        ("bigquery_inline",),
        ("bigquery_load",),
        ("bigquery_streaming",),
    ),
)
def test_read_csv_gcs_default_engine(session, scalars_dfs, gcs_folder, write_engine):
    scalars_df, _ = scalars_dfs
    if scalars_df.index.name is not None:
        path = gcs_folder + "test_read_csv_gcs_default_engine_w_index*.csv"
    else:
        path = gcs_folder + "test_read_csv_gcs_default_engine_wo_index*.csv"
    read_path = path.replace("*", FIRST_FILE)
    scalars_df.to_csv(path, index=False)
    dtype = scalars_df.dtypes.to_dict()
    dtype.pop("geography_col")
>   df = session.read_csv(
        read_path,
        # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
        dtype=dtype,
        write_engine=write_engine,
    )

tests/system/small/test_session.py:435:


bigframes/session/__init__.py:1162: in read_csv
return self._read_pandas(
bigframes/session/__init__.py:933: in _read_pandas
return self._read_pandas_bigquery_table(
bigframes/session/__init__.py:988: in _read_pandas_bigquery_table
table_expression = bigframes_io.pandas_to_bigquery_streaming(
bigframes/session/_io/bigquery.py:294: in pandas_to_bigquery_streaming
for errors in bqclient.insert_rows_from_dataframe(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3662: in insert_rows_from_dataframe
result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3605: in insert_rows
return self.insert_rows_json(table, json_rows, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3801: in insert_rows_json
response = self._call_api(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:827: in _call_api
return call()
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:293: in retry_wrapped_func
return retry_target(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:153: in retry_target
_retry_error_helper(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_base.py:212: in _retry_error_helper
raise final_exc from source_exc
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:144: in retry_target
result = target()
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/_http/__init__.py:479: in api_request
data = json.dumps(data)
/usr/local/lib/python3.11/json/__init__.py:231: in dumps
return _default_encoder.encode(obj)
/usr/local/lib/python3.11/json/encoder.py:200: in encode
chunks = self.iterencode(o, _one_shot=True)
/usr/local/lib/python3.11/json/encoder.py:258: in iterencode
return _iterencode(o, 0)


self = <json.encoder.JSONEncoder object at 0x7f0d3d5e8b90>, o = True

def default(self, o):
    """Implement this method in a subclass such that it returns
    a serializable object for ``o``, or calls the base implementation
    (to raise a ``TypeError``).

    For example, to support arbitrary iterators, you could
    implement default like this::

        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)

    """
>   raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

E TypeError: Object of type bool_ is not JSON serializable

/usr/local/lib/python3.11/json/encoder.py:180: TypeError
----------------------------- Captured stdout call -----------------------------
Query job 2b56fdd1-5dcc-4afd-8521-69b517d5afae is DONE.1.1 kB processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:2b56fdd1-5dcc-4afd-8521-69b517d5afae&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is RUNNING.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is DONE.0 Bytes processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
=============================== warnings summary ===============================

This will likely require a fix upstream in google-cloud-bigquery, but in the meantime I can make sure to use a vendored version of insert_rows_from_dataframe that can serialize a numpy bool_ value, similar to how there's a special case for NaN already.

Edit: I already fixed this in googleapis/python-bigquery#1816, waiting on version 3.18.0.
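
For reference, a minimal sketch of the kind of special-casing such a vendored helper could apply before rows reach json.dumps; this illustrates the technique, not the actual fix that landed in googleapis/python-bigquery#1816:

import math

import numpy

def _json_safe(value):
    # numpy.bool_ is not a subclass of bool, so json.dumps raises
    # TypeError on it; convert numpy scalars to native Python types.
    if isinstance(value, numpy.bool_):
        return bool(value)
    if isinstance(value, numpy.integer):
        return int(value)
    if isinstance(value, numpy.floating):
        value = float(value)
    # Mirror the existing NaN special case: emit null rather than NaN.
    if isinstance(value, float) and math.isnan(value):
        return None
    return value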

A contributor commented in a review thread:

[typo] file name *helpers.py

import bigframes.constants

ENGINE_ERROR_TEMPLATE = (
    "write_engine='{write_engine}' is incompatible with engine='{engine}'. "
A contributor commented:

Should we make it more helpful by suggesting what are valid combinations?
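
For example, a hypothetical sketch of how the message could enumerate the supported pairs; the combinations and helper below are placeholders, not the actual supported set or API:

# Hypothetical sketch; _SUPPORTED_COMBINATIONS is a placeholder name and
# the pairs below are illustrative, not the real supported set.
_SUPPORTED_COMBINATIONS = (
    ("c", "default"),
    ("c", "bigquery_load"),
    ("bigquery", "default"),
)

ENGINE_ERROR_TEMPLATE = (
    "write_engine='{write_engine}' is incompatible with engine='{engine}'. "
    "Supported (engine, write_engine) combinations: {supported}."
)

def engine_error_message(engine: str, write_engine: str) -> str:
    # Fill in the template, appending the list of supported combinations.
    return ENGINE_ERROR_TEMPLATE.format(
        engine=engine,
        write_engine=write_engine,
        supported=", ".join(map(str, _SUPPORTED_COMBINATIONS)),
    )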

("engine", "write_engine"),
(
("bigquery", "bigquery_streaming"),
("bigquery", "bigquery_inline"),
A contributor commented:

Add (None, "bigquery_external_table") too?

tswast (Collaborator, Author) replied:

Haven't implemented that one yet.

tswast added the do not merge label (indicates a pull request not ready for merge, due to either quality or timing) on Mar 6, 2024
tswast (Collaborator, Author) commented on Mar 6, 2024

Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

Labels: api: bigquery, do not merge, size: xl

2 participants