I have e-commerce product details stored in CSV format in a GCS bucket. Each file contains the attributes of a single product. The number of such files is approximately 4 billion.
I am planning to use a BigQuery external table to query the data.
It worked when I tested with 10 million such files. My question is, will it work when the number of files is as high as 4 billion? Even if it takes time (15-30 mins), it is fine.
What are the other recommended options to query the data if we don't use the BigQuery external table?
Hi @amlanroy1980 ,
Testing with 10 million files is a good start, and it's encouraging to hear it worked well. However, scaling up to 4 billion files introduces significant challenges due to the inherent limitations and overheads associated with managing such a large number of files.
Challenges:
Potential Strategies:
Alternatives to Consider:
Start with external tables combined with a strategic partitioning approach to see if it meets your performance needs. If you encounter performance issues, transitioning to managed tables for more frequently accessed queries might be beneficial.
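As a starting point, a hive-partitioned external table might look like the following. This is a sketch only: the bucket path, dataset name, and partition keys (`category`, `dt`) are assumptions, not taken from your setup. With a hive-style layout (e.g. `gs://my-bucket/products/category=electronics/dt=2024-01-01/...`), BigQuery can prune files by partition before scanning.

```sql
-- Sketch; bucket, dataset, and partition columns are placeholders.
CREATE EXTERNAL TABLE my_dataset.products_ext
WITH PARTITION COLUMNS (
  category STRING,
  dt DATE
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/products/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/products',
  require_hive_partition_filter = TRUE  -- force queries to supply a partition filter
);
```

Setting `require_hive_partition_filter = TRUE` rejects queries that would otherwise list and scan every file, which matters at this file count.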
Thanks, @ms4446 , for the reply.
We are already using a native BigQuery table for commonly used queries. We have identified the attributes that are frequently queried, extracted them from JSON, and stored them in individual columns in a BigQuery native table. So I am not really worried about the performance of the external table; frequently used queries will be executed on the native table.
For the external table, I was wondering whether an external table will even work if the number of files is as high as 4 billion. It's fine if it takes time; maybe even 30-60 minutes.
Also, what are the possible options to make the external table more efficient for querying?
Hi @amlanroy1980 ,
BQ external tables can, in theory, handle a dataset of this size. However, managing such a large number of files comes with challenges, particularly in terms of metadata management and the time it takes to initiate and execute queries. As you mentioned, if query execution times of up to 30-60 minutes are acceptable, then it's technically feasible to use an external table for your scenario.
Enhancing Efficiency of External Tables:
To make querying external tables more efficient, especially when dealing with such a vast number of files, consider the following strategies:
Partitioning and Clustering:
Incremental Loading:
Combining Wildcard Functions with Table Partitions:
Utilizing Columnar Storage Formats:
Caching Strategies:
Testing and Monitoring: Given the unique nature of your dataset and its size, it would be advisable to conduct extensive testing on a subset of your dataset. Monitoring the performance and costs associated with these tests will provide better insights into how well these strategies work in your specific environment.
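To illustrate the columnar-format point above, here is a hedged sketch of compacting the small CSV files into Parquet with `EXPORT DATA`. The table and bucket names are placeholders. Rewriting billions of small CSVs as a much smaller number of large Parquet files cuts the file-listing and metadata overhead, and the columnar format lets BigQuery read only the columns a query actually touches.

```sql
-- Sketch; products_raw_ext is an assumed external table over the raw CSVs,
-- and the output bucket path is a placeholder.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/products_parquet/dt=2024-05-01/*.parquet',
  format = 'PARQUET'
) AS
SELECT *
FROM my_dataset.products_raw_ext
WHERE dt = '2024-05-01';  -- compact one partition at a time
```

Running this incrementally per partition keeps each export job bounded and pairs naturally with the incremental-loading strategy above.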
Hi @ms4446 , I was wondering if there is a completely different approach that can be explored.
To summarize my use case: we are storing product data. There are close to 3 billion products, stored as JSON, and the products are constantly being updated.
We have identified important attributes, extracted them from JSON and stored them in columns in the native BigQuery table.
I am looking for a cost-effective approach for storing the complete JSONs and accessing them if required.
I also tried storing the JSON in a native BigQuery table, but I am facing problems because of the frequent updates to the data.
Hi @amlanroy1980 ,
Here are a few other strategies you might consider:
1. Cloud Firestore: This NoSQL database excels in managing JSON-like documents and supports real-time updates. It’s an excellent fit if you frequently need to query or update specific portions of your JSON data. For analytics, Firestore can be integrated with BigQuery, although this requires setting up either manual exports or automated pipelines, which might add some operational overhead.
2. Change Data Capture (CDC) with a Message Queue (e.g., Pub/Sub): If your dataset experiences extremely frequent updates, a CDC strategy using Pub/Sub could efficiently manage these changes. This setup allows you to maintain synchronization with BigQuery and handle updates without significant performance impacts. Note that there may be some latency or consistency issues to manage when syncing between different systems.
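For the CDC route, the usual pattern is to land the updates from Pub/Sub in a staging table and apply them to the main table in batches with `MERGE`, instead of issuing row-by-row UPDATEs. The sketch below assumes hypothetical table and column names (`products`, `product_updates`, `product_id`, `product_json`):

```sql
-- Sketch; table and column names are assumptions.
-- A pipeline drains Pub/Sub into product_updates, then this runs periodically.
MERGE my_dataset.products AS t
USING my_dataset.product_updates AS s
ON t.product_id = s.product_id
WHEN MATCHED THEN
  UPDATE SET t.product_json = s.product_json,
             t.updated_at   = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (product_id, product_json, updated_at)
  VALUES (s.product_id, s.product_json, s.updated_at);
```

Batching updates this way sidesteps the per-row DML limits and costs that make frequent individual updates to a native table painful.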
3. Hybrid Approach with Data Tiering: This approach uses different storage solutions based on access frequency. Store highly accessed data in faster, slightly more costly solutions like Firestore or BigQuery, and keep full JSONs in a more cost-effective solution like GCS. This flexible strategy helps balance performance with cost.
Recommendations for Moving Forward:
I recommend testing these solutions with a small subset of your data to determine which configuration best meets your specific requirements.