Open
Description
Environment details
- OS type and version: Mac OS X 13.1
- Python version: 3.10.9, although I've had the same issue in 3.11 and earlier versions
- pip version: 23.1.2
- google cloud library versions:
google-api-core 2.11.0
google-auth 2.18.0
google-cloud-bigquery 3.10.0
google-cloud-bigquery-storage 2.19.1
google-cloud-core 2.3.2
google-crc32c 1.5.0
google-resumable-media 2.5.0
Steps to reproduce
- Pull a large table (tables with a lot of columns seem to cause the largest problems) using client.list_rows, setting a large value for max_results and a non-zero start_index value.
- Use rows.to_dataframe to aggregate the results from that table into a DataFrame, which will (for whatever reason) cause multiple pages containing the same row data to be combined in a way that leads to duplicates.
Code example
from google.oauth2.service_account import Credentials
from google.cloud.bigquery import Client as BigQueryAPIClient
credentials = Credentials.from_service_account_file(
INSERT_CREDENTIAL_FILE_HERE
)
client = BigQueryAPIClient(credentials=credentials, project="bigquery-public-data")
table = client.get_table("bigquery-public-data.patents.publications")
rows = client.list_rows(table, max_results=50000, start_index=50000)
data = rows.to_dataframe(progress_bar_type="tqdm")
print(data[data[["publication_number", "application_number"]].duplicated()])
print(data[data['publication_number'] == 'WO-2011055014-A1'])
These columns are not duplicated in the actual publications table, which can be confirmed with this SQL:
SELECT publication_number, application_number, COUNT(*) FROM `bigquery-public-data.patents.publications`
GROUP BY publication_number, application_number
HAVING COUNT(*) > 1