Skip to content

Pulling large dataframes using client.list_rows and rows.to_dataframe() leads to duplicates when setting max_results and start_index #1569

Open
@kdw2126

Description

@kdw2126

Environment details

  • OS type and version: Mac OS X 13.1
  • Python version: 3.10.9, although I've had the same issue in 3.11 and earlier versions
  • pip version: 23.1.2
  • google cloud library versions:

google-api-core 2.11.0
google-auth 2.18.0
google-cloud-bigquery 3.10.0
google-cloud-bigquery-storage 2.19.1
google-cloud-core 2.3.2
google-crc32c 1.5.0
google-resumable-media 2.5.0

Steps to reproduce

  1. Pull a large table (tables with a lot of columns seem to cause the largest problems) using client.list_rows, setting a large value for max_results and a non-zero start_index value.
  2. Use rows.to_dataframe to aggregate the results from that table into a DataFrame, which will (for whatever reason) cause multiple pages containing the same row data to be combined in a way that leads to duplicates.

Code example

from google.oauth2.service_account import Credentials
from google.cloud.bigquery import Client as BigQueryAPIClient

credentials = Credentials.from_service_account_file(
    INSERT_CREDENTIAL_FILE_HERE
)

client = BigQueryAPIClient(credentials=credentials, project="bigquery-public-data")

table = client.get_table("bigquery-public-data.patents.publications")
rows = client.list_rows(table, max_results=50000, start_index=50000)
data = rows.to_dataframe(progress_bar_type="tqdm")

print(data[data[["publication_number", "application_number"]].duplicated()])
print(data[data['publication_number'] == 'WO-2011055014-A1'])

These columns are not duplicated in the actual publications table, which can be confirmed with this SQL:

SELECT publication_number, application_number, COUNT(*) FROM `bigquery-public-data.patents.publications` 
GROUP BY publication_number, application_number
HAVING COUNT(*) > 1

Metadata

Metadata

Assignees

No one assigned

    Labels

    api: bigqueryIssues related to the googleapis/python-bigquery API.priority: p3Desirable enhancement or fix. May not be included in next release.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions