
feat: add write_engine parameter to read_FORMATNAME methods to control how data is written to BigQuery #371

Open

tswast wants to merge 18 commits into main

Conversation

tswast (Collaborator) commented on Feb 6, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 323176126
🦕
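
For orientation, a hypothetical usage sketch of the new parameter. The write_engine values are the ones exercised by the system tests quoted later in this thread; the semantics noted in the comments are assumptions inferred from the discussion, not finalized documentation.

import bigframes.pandas as bpd

# Hypothetical sketch; the path is an example and the per-value semantics
# below are assumptions inferred from this thread, not finalized docs.
#   "default"            - let the library choose how to write the data
#   "bigquery_inline"    - inline small data directly into the generated SQL
#   "bigquery_load"      - write the data via a BigQuery load job
#   "bigquery_streaming" - write the data via the streaming insert API
df = bpd.read_csv(
    "gs://my-bucket/data.csv",  # example path, not from this PR
    write_engine="bigquery_load",
)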

product-auto-label bot added the size: m and api: bigquery labels on Feb 6, 2024, replaced size: m with size: l on Feb 7, and replaced size: l with size: xl on Feb 9.
tswast (Collaborator, Author) commented on Feb 9, 2024

Blocked by googleapis/python-bigquery#1815

tswast marked this pull request as ready for review on February 29, 2024 at 17:48 and requested reviews from a team (as code owners) and from shobsi.
tswast (Collaborator, Author) commented in a review thread:

Might make sense to expose some of these methods in a public place since they are shared by the client library, pandas-gbq, and now bigframes.

A similar method is exposed publicly at

https://github.com/googleapis/python-bigquery-pandas/blob/b3b9202980a17faa9df6fc8bba7785d984452893/pandas_gbq/gbq.py#L1147

but it's been deprecated for quite some time. I think we may want to revisit that deprecation and revive it. Will need to think carefully about circular dependencies, though.

tswast (Collaborator, Author) commented on Mar 1, 2024

Test failure is a real one:

E TypeError: Object of type bool_ is not JSON serializable
=================================== FAILURES ===================================
_____________ test_read_csv_gcs_default_engine[bigquery_streaming] _____________
[gw19] linux -- Python 3.11.6 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x7f0d0f897cd0>
scalars_dfs = ( bool_col bytes_col
rowindex ...... 2038-01-19 03:14:17.999999+00:00
8 False ...

[9 rows x 13 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20240229220731_1845bf/'
write_engine = 'bigquery_streaming'

@skip_legacy_pandas
@pytest.mark.parametrize(
    ("write_engine",),
    (
        ("default",),
        ("bigquery_inline",),
        ("bigquery_load",),
        ("bigquery_streaming",),
    ),
)
def test_read_csv_gcs_default_engine(session, scalars_dfs, gcs_folder, write_engine):
    scalars_df, _ = scalars_dfs
    if scalars_df.index.name is not None:
        path = gcs_folder + "test_read_csv_gcs_default_engine_w_index*.csv"
    else:
        path = gcs_folder + "test_read_csv_gcs_default_engine_wo_index*.csv"
    read_path = path.replace("*", FIRST_FILE)
    scalars_df.to_csv(path, index=False)
    dtype = scalars_df.dtypes.to_dict()
    dtype.pop("geography_col")
>   df = session.read_csv(
        read_path,
        # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
        dtype=dtype,
        write_engine=write_engine,
    )

tests/system/small/test_session.py:435:


bigframes/session/__init__.py:1162: in read_csv
return self._read_pandas(
bigframes/session/__init__.py:933: in _read_pandas
return self._read_pandas_bigquery_table(
bigframes/session/__init__.py:988: in _read_pandas_bigquery_table
table_expression = bigframes_io.pandas_to_bigquery_streaming(
bigframes/session/_io/bigquery.py:294: in pandas_to_bigquery_streaming
for errors in bqclient.insert_rows_from_dataframe(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3662: in insert_rows_from_dataframe
result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3605: in insert_rows
return self.insert_rows_json(table, json_rows, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3801: in insert_rows_json
response = self._call_api(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:827: in _call_api
return call()
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:293: in retry_wrapped_func
return retry_target(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:153: in retry_target
_retry_error_helper(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_base.py:212: in _retry_error_helper
raise final_exc from source_exc
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:144: in retry_target
result = target()
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/_http/__init__.py:479: in api_request
data = json.dumps(data)
/usr/local/lib/python3.11/json/__init__.py:231: in dumps
return _default_encoder.encode(obj)
/usr/local/lib/python3.11/json/encoder.py:200: in encode
chunks = self.iterencode(o, _one_shot=True)
/usr/local/lib/python3.11/json/encoder.py:258: in iterencode
return _iterencode(o, 0)


self = <json.encoder.JSONEncoder object at 0x7f0d3d5e8b90>, o = True

def default(self, o):
    """Implement this method in a subclass such that it returns
    a serializable object for ``o``, or calls the base implementation
    (to raise a ``TypeError``).

    For example, to support arbitrary iterators, you could
    implement default like this::

        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)

    """
>   raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

E TypeError: Object of type bool_ is not JSON serializable

/usr/local/lib/python3.11/json/encoder.py:180: TypeError
----------------------------- Captured stdout call -----------------------------
Query job 2b56fdd1-5dcc-4afd-8521-69b517d5afae is DONE.1.1 kB processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:2b56fdd1-5dcc-4afd-8521-69b517d5afae&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is RUNNING.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is DONE.0 Bytes processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
=============================== warnings summary ===============================

This will likely require a fix upstream in google-cloud-bigquery, but in the meantime I can make sure to use a vendored version of insert_rows_from_dataframe that can serialize a numpy bool_ value, similar to how there's a special case for NaN already.

Edit: I already fixed this in googleapis/python-bigquery#1816, waiting on version 3.18.0.
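
For reference, a minimal sketch of the kind of special-casing such a vendored helper could apply before rows reach json.dumps; this illustrates the technique, not the actual fix that landed in googleapis/python-bigquery#1816:

import math

import numpy

def _json_safe(value):
    # numpy.bool_ is not a subclass of bool, so json.dumps raises
    # TypeError on it; convert numpy scalars to native Python types.
    if isinstance(value, numpy.bool_):
        return bool(value)
    if isinstance(value, numpy.integer):
        return int(value)
    if isinstance(value, numpy.floating):
        value = float(value)
    # Mirror the existing NaN special case: emit null rather than NaN.
    if isinstance(value, float) and math.isnan(value):
        return None
    return value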

A contributor commented in a review thread:

[typo] file name *helpers.py

import bigframes.constants

ENGINE_ERROR_TEMPLATE = (
    "write_engine='{write_engine}' is incompatible with engine='{engine}'. "
A contributor commented:

Should we make it more helpful by suggesting what are valid combinations?
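
For example, a hypothetical sketch of how the message could enumerate the supported pairs; the combinations and helper below are placeholders, not the actual supported set or API:

# Hypothetical sketch; _SUPPORTED_COMBINATIONS is a placeholder name and
# the pairs below are illustrative, not the real supported set.
_SUPPORTED_COMBINATIONS = (
    ("c", "default"),
    ("c", "bigquery_load"),
    ("bigquery", "default"),
)

ENGINE_ERROR_TEMPLATE = (
    "write_engine='{write_engine}' is incompatible with engine='{engine}'. "
    "Supported (engine, write_engine) combinations: {supported}."
)

def engine_error_message(engine: str, write_engine: str) -> str:
    # Fill in the template, appending the list of supported combinations.
    return ENGINE_ERROR_TEMPLATE.format(
        engine=engine,
        write_engine=write_engine,
        supported=", ".join(map(str, _SUPPORTED_COMBINATIONS)),
    )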

("engine", "write_engine"),
(
("bigquery", "bigquery_streaming"),
("bigquery", "bigquery_inline"),
A contributor commented:

Add (None, "bigquery_external_table") too?

tswast (Collaborator, Author) replied:

Haven't implemented that one yet.

tswast added the do not merge label (indicates a pull request not ready for merge, due to either quality or timing) on Mar 6, 2024
tswast (Collaborator, Author) commented on Mar 6, 2024

Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

Labels: api: bigquery, do not merge, size: xl

2 participants