GCP BigQuery dialect use_insertmanyvalues_wo_returning enablement #12038
-
Hi All, I was hoping that the ORM session-supported bulk inserting added in v2.0 would be an easy add-on feature for the existing BigQuery dialect. Since BigQuery does not enforce primary key constraints, I assume the easiest way to enable this would be with the use_insertmanyvalues_wo_returning dialect argument. I have attempted to do this by modifying the dialect before creating my engine with the following code:
This seems to correctly turn on the feature, but I am having issues with the compiled bulk insert statements where my parameter keys are not aligning to my statement keys:
It appears the key enumeration in my replaced parameters is not matching up with the keys in the replaced statement, which are not being enumerated at all. Are there any settings I should change to properly enable this feature? Dialect: https://github.com/googleapis/python-bigquery-sqlalchemy
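A minimal sketch of the flag-flipping approach described above. The `BigQueryDialect` class here is a stand-in for `sqlalchemy_bigquery.BigQueryDialect` (which would be imported and patched the same way before `create_engine()` is called); the connection URL is hypothetical:

```python
# Sketch of enabling insertmanyvalues on the dialect class before engine
# creation.  "BigQueryDialect" below is a stand-in; with the real package
# it would be:  from sqlalchemy_bigquery import BigQueryDialect

class BigQueryDialect:
    """Stand-in for sqlalchemy_bigquery.BigQueryDialect."""
    use_insertmanyvalues = False
    use_insertmanyvalues_wo_returning = False

# Flip both class-level flags before any engine is created, so every
# statement compiled against this dialect sees them:
BigQueryDialect.use_insertmanyvalues = True
BigQueryDialect.use_insertmanyvalues_wo_returning = True

# engine = create_engine("bigquery://some-project/some_dataset")  # hypothetical URL
```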
-
Would any of the following arguments assist:
-
"use_insertmanyvalues_wo_returning" is only used if you have a dialect where a regular executemany performs poorly. As far as primary key constraints, that has nothing to do with the use case for use_insertmanyvalues, where SQLAlchemy needs to know the values of server-generated columns as it performs an INSERT. If BigQuery has no server-generated values, then that's another reason why use_insertmanyvalues would not be needed, unless you are working around performance issues in executemany. As for the reason that the BigQuery dialect is not replacing parameters, I would first look at what seems to be an unusual casting syntax, where I see a single colon and a datatype embedded in the bound parameter:
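For context, a much-simplified sketch of the kind of expansion the insertmanyvalues feature performs. This is not SQLAlchemy's actual internals, and the `__N` key naming is purely illustrative; the point is that the enumerated parameter keys and the enumerated placeholders in the rewritten statement must line up:

```python
# Simplified sketch (not SQLAlchemy internals) of an insertmanyvalues-style
# expansion: the single-row VALUES clause is replaced by an enumerated
# multi-row clause, and the parameter keys are enumerated to match.
single_row = "(%(x)s, %(y)s)"
stmt = f"INSERT INTO t (x, y) VALUES {single_row}"
rows = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]

groups, params = [], {}
for i, row in enumerate(rows):
    groups.append(f"(%(x__{i})s, %(y__{i})s)")
    for key, value in row.items():
        params[f"{key}__{i}"] = value

# Placeholders and parameter keys are enumerated in lockstep; if they
# diverge, the driver cannot match parameters to the statement.
batched = stmt.replace(single_row, ", ".join(groups))
print(batched)
print(sorted(params))
```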
The official way one uses the Python DBAPI to process many rows efficiently is the executemany method. Every DBAPI should have an existing implementation that receives any number of dictionaries or tuples and processes them in the fastest way possible. In the absence of any need to deliver server-generated information about each record set after it's processed, executemany() is what SQLAlchemy Core and the ORM normally use for this purpose. The way that executemany works is up to the driver. For example, the mysqlclient DBAPI converts INSERT statements into a single, batched INSERT in a similar way to SQLAlchemy's insertmanyvalues feature. The pyodbc dialect includes a similar feature …
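As a concrete illustration of the executemany contract described above, using the stdlib sqlite3 driver rather than BigQuery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y TEXT)")

rows = [(1, "a"), (2, "b"), (3, "c")]
# executemany() hands the driver every parameter set at once; how the
# driver batches them (one statement per row, a multi-VALUES rewrite,
# etc.) is entirely up to the driver.
conn.executemany("INSERT INTO t (x, y) VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 3
```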