BigQuery external table limitations

I have e-commerce product details stored in CSV format in a GCS bucket. Each file contains the attributes of a single product, and the number of such files is approximately 4 billion.

I am planning to use a BigQuery external table to query the data.

It worked when I tested with 10 million such files. My question is: will it still work when the number of files is as high as 4 billion? Even if a query takes time (15-30 mins), that is fine.

What are the other recommended options to query the data if we don't use the BigQuery external table? 


Hi @amlanroy1980 ,

Testing with 10 million files is a good start, and it's encouraging to hear it worked well. However, scaling up to 4 billion files introduces significant challenges due to the inherent limitations and overheads associated with managing such a large number of files.

Challenges:

  1. Metadata Overhead: As the number of files increases, so does the metadata that BigQuery needs to manage, potentially becoming a bottleneck.
  2. Query Performance: BigQuery's performance might degrade when handling billions of files, especially smaller ones, due to the overhead of opening and reading each file.
  3. Cost Concerns: Given that query costs in BigQuery are tied to the amount of data scanned, costs could escalate quickly if queries end up reading most of this very large dataset.

Potential Strategies:

  • Partitioning: Using natural divisions in your data (e.g., product categories, date) can help by reducing the scope of data scanned in each query, potentially improving performance (see the sketch after this list).
  • Hierarchical Structure: Organizing your files in a logical directory structure within your GCS bucket (by category, then subcategory) can aid in management but won't necessarily enhance query speed.
  • Batching: Consider periodically loading subsets of your data into BigQuery managed tables. While this adds some latency and requires infrastructure for data synchronization, it can offer faster query times.
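
For illustration, here is a minimal sketch of the partitioning idea using the google-cloud-bigquery Python client. The bucket name, dataset, and hive-style path layout (category=/dt=) are assumptions for the example and would need to match how your files are actually organized:

```python
# Sketch: define an external table over CSV files laid out in a
# hive-partitioned GCS path such as
#   gs://example-product-bucket/products/category=electronics/dt=2024-05-01/*.csv
# Bucket, project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://example-product-bucket/products/*"]
external_config.autodetect = True

# Let BigQuery infer the partition keys (category, dt) from the directory
# layout, so queries that filter on them can prune whole prefixes instead
# of touching every file.
hive_options = bigquery.HivePartitioningOptions()
hive_options.mode = "AUTO"
hive_options.source_uri_prefix = "gs://example-product-bucket/products/"
external_config.hive_partitioning = hive_options

table = bigquery.Table("my-project.my_dataset.products_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```

A query that filters on the partition columns (e.g., WHERE category = 'electronics') then only needs to consider the matching prefixes rather than the full 4 billion files.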

Alternatives to Consider:

  • Managed Tables: Directly importing your data into BigQuery managed tables could yield the best query performance, though it involves managing both the initial load and ongoing data updates (a load-job sketch follows this list).
  • Use of Wildcards: Employing wildcards in your external table definitions can ease managing file paths and schemas, though this alone doesn’t address all metadata overhead issues.
  • Cloud Storage FUSE: This solution allows GCS buckets to be mounted as filesystems, which might be useful for certain access patterns but is generally less suitable for enhancing SQL query performance.
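
As a rough sketch of the managed-table alternative, the snippet below batch-loads CSV files into a native table using a wildcard URI. The project, dataset, and bucket names are hypothetical, and in practice you would split the work across several load jobs (one per prefix) to stay within per-job limits:

```python
# Sketch: batch-load CSV files from GCS into a managed (native) BigQuery
# table using a wildcard URI. All identifiers are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# One wildcard URI per batch/prefix avoids listing billions of individual paths.
load_job = client.load_table_from_uri(
    "gs://example-product-bucket/products/batch-2024-05/*.csv",
    "my-project.my_dataset.products_native",
    job_config=job_config,
)
load_job.result()  # Wait for the load job to finish.
```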

Start with external tables combined with a strategic partitioning approach to see if it meets your performance needs. If you encounter performance issues, transitioning to managed tables for more frequently accessed queries might be beneficial.

Thanks, @ms4446 , for the reply.

We are already using a native BigQuery table for commonly used queries. We have identified the attributes that are frequently queried, extracted them from the JSON, and stored them in individual columns in a BigQuery native table. So I am not really worried about the performance of the external table; frequently used queries will be executed on the native table.

For the external table, I was wondering whether an external table will even work when the number of files is as high as 4 billion. It's fine if it takes time, maybe even 30-60 minutes.

Also, what are the possible options to make the external table more efficient for querying?

Hi @amlanroy1980 ,

BigQuery external tables can, in theory, handle a dataset of this size, subject to the documented quotas on external tables and source URIs. However, managing such a large number of files will come with its own challenges, particularly in terms of metadata management and the time it takes to initiate and execute queries. As you mentioned, if query execution times of up to 30-60 minutes are acceptable, then it's technically feasible to use an external table for your scenario.

Enhancing Efficiency of External Tables:

To make querying external tables more efficient, especially when dealing with such a vast number of files, consider the following strategies:

  1. Partitioning and Clustering:

    • Partitioning: Apply partitioning on an appropriate column (e.g., date, category) to reduce the number of files and data scanned per query.
    • Clustering: After partitioning, clustering the data based on frequently queried columns can further improve query performance by organizing data in each partition into a specified order.
  2. Incremental Loading:

    • Instead of querying all 4 billion files at once, consider creating a system where newer data is periodically added to a smaller, more manageable external table. This way, only a subset of the entire dataset is queried at any time, which can help manage performance degradation.
  3. Combining Wildcard Functions with Table Partitions:

    • Use wildcard functions in your table definitions to manage and simplify access to multiple files and directories. When combined with partitioning strategies, this can reduce the overhead caused by the vast number of files.
  4. Utilizing Columnar Storage Formats:

    • If possible, convert your CSV files into a columnar storage format like Parquet. BigQuery processes columnar formats more efficiently because they allow for better compression and faster data retrieval.
  5. Caching Strategies:

    • Implement caching mechanisms where possible. While BigQuery does not directly offer result caching for external tables, intermediate results can be materialized into native tables or materialized views for frequently accessed data that doesn't need real-time freshness (see the sketch after this list).
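
To illustrate points 1 and 5 together, here is a hedged sketch that materializes the results of a query over the external table into a partitioned, clustered native table that serves as a cache for frequent lookups. All table and column names (products_external, products_cache, updated_date, category, product_id) are assumptions for the example:

```python
# Sketch: materialize a filtered slice of the external table into a
# partitioned, clustered native table used as a query cache.
# All identifiers and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.products_cache",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(field="updated_date"),
    clustering_fields=["category", "product_id"],
)

sql = """
SELECT product_id, category, updated_date, price
FROM `my-project.my_dataset.products_external`
WHERE updated_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""

client.query(sql, job_config=job_config).result()
```

Frequent queries would then hit products_cache instead of re-scanning the external table, and the job can be rerun on a schedule to refresh the cache.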

Testing and Monitoring: Given the unique nature of your dataset and its size, it would be advisable to conduct extensive testing on a subset of your dataset. Monitoring the performance and costs associated with these tests will provide better insights into how well these strategies work in your specific environment.

Hi @ms4446 , I was wondering if there is a completely different approach that can be explored.

To summarize my use case: we are storing product data. There are close to 3 billion products, and they are stored as JSON. Products are constantly being updated.

We have identified important attributes, extracted them from JSON and stored them in columns in the native BigQuery table.

I am looking for a cost-effective approach for storing the complete JSONs and accessing them if required.

I also tried storing the JSON in a native BigQuery table, but I am facing problems because of the frequent updates to the data.

Hi @amlanroy1980 ,

Here are a few other strategies you might consider:

1. Cloud Firestore: This NoSQL database excels in managing JSON-like documents and supports real-time updates. It’s an excellent fit if you frequently need to query or update specific portions of your JSON data. For analytics, Firestore can be integrated with BigQuery, although this requires setting up either manual exports or automated pipelines, which might add some operational overhead.
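
As a quick sketch of what the Firestore option looks like in practice, the snippet below upserts a product document with a partial update; the collection and field names are hypothetical:

```python
# Sketch: upsert a product document in Cloud Firestore, which is well
# suited to frequent partial updates of JSON-like documents.
# Collection, document ID, and field names are hypothetical.
from google.cloud import firestore

db = firestore.Client()

db.collection("products").document("SKU-12345").set(
    {"category": "electronics", "price": 19.99, "attributes": {"color": "black"}},
    merge=True,  # Update only the provided fields, keeping the rest of the document intact.
)
```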

2. Change Data Capture (CDC) with a Message Queue (e.g., Pub/Sub): If your dataset experiences extremely frequent updates, a CDC strategy using Pub/Sub could efficiently manage these changes. This setup allows you to maintain synchronization with BigQuery and handle updates without significant performance impacts. Note that there may be some latency or consistency issues to manage when syncing between different systems.
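
A minimal sketch of the publishing side of such a CDC flow is shown below; the topic name and event payload are assumptions, and the downstream consumer (for example a BigQuery subscription or a Dataflow pipeline) would apply the changes to your native table:

```python
# Sketch: publish a product-update event to Pub/Sub as part of a CDC flow.
# Topic name and payload fields are hypothetical.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "product-updates")

event = {
    "product_id": "SKU-12345",
    "op": "UPDATE",  # INSERT / UPDATE / DELETE
    "payload": {"price": 19.99, "stock": 42},
}

future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print(f"Published message {future.result()}")
```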

3. Hybrid Approach with Data Tiering: This approach uses different storage solutions based on access frequency. Store highly accessed data in faster, slightly more costly solutions like Firestore or BigQuery, and keep full JSONs in a more cost-effective solution like GCS. This flexible strategy helps balance performance with cost.
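
In the tiered setup, retrieving the full JSON for a single product on demand is a simple GCS read; the bucket name and object path below are hypothetical:

```python
# Sketch: fetch the full JSON for one product from the cold tier (GCS)
# when it is not present in the hot tier. Bucket and object path are hypothetical.
import json
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-product-bucket")

def get_full_product_json(product_id: str) -> dict:
    # One object per product, keyed by product ID.
    blob = bucket.blob(f"products/full/{product_id}.json")
    return json.loads(blob.download_as_text())

product = get_full_product_json("SKU-12345")
```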

Recommendations for Moving Forward:

  • Firestore is best if real-time access and updates are critical.
  • GCS and BigQuery external tables offer great value for storing large amounts of data with occasional access needs.
  • CDC with Pub/Sub is ideal for managing very high-frequency updates efficiently.
  • A hybrid approach might provide the most flexibility, depending on how varied your access needs are.

I recommend testing these solutions with a small subset of your data to determine which configuration best meets your specific requirements.