Improve performance of "provenance summary" generation query#612
Improve performance of "provenance summary" generation query#612SandeepTuniki wants to merge 1 commit into
Conversation
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request optimizes the provenance summary aggregation by filtering Spanner queries to fetch only the place types and names for the specific place IDs present in the dataset, rather than fetching all nodes and edges. While this significantly reduces data transfer, the feedback highlights a potential runtime failure if a dataset contains more than 10,000 distinct places due to Spanner's IN clause limit. Additionally, minor improvements are suggested to fix duplicate step numbers in the SQL comments and to use an explicit CROSS JOIN UNNEST instead of a comma join for better readability.
| SET place_dcids_str = ( | ||
| SELECT IFNULL(STRING_AGG(FORMAT("'%s'", REPLACE(observation_about, "'", "\\'")), ','), "''") | ||
| FROM `temp_dataset_places` | ||
| ); |
There was a problem hiding this comment.
Potential Runtime Failure with Spanner's IN Clause Limit (10,000 values)
The current implementation constructs a comma-separated list of place DCIDs (place_dcids_str) and pushes it down to Spanner via EXTERNAL_QUERY using an IN clause:
subject_id IN (%s)
However, Google Cloud Spanner has a strict limit of 10,000 values in an IN operator (see Spanner Query Limits). If a dataset contains more than 10,000 distinct places (for example, US zip codes, census tracts, or school districts), this query will fail at runtime with a Spanner error.
Additionally, very large lists of DCIDs can exceed Spanner's query size limit (1 MB) or BigQuery's unresolved query length limit.
Recommended Solution:
Fallback / Batching / Conditional Execution:
If you want to keep the pushdown approach for smaller datasets, you can check the count of distinct places first. If it exceeds 10,000, fallback to fetching all or use a different strategy:
DECLARE place_count INT64;
SET place_count = (SELECT COUNT(*) FROM `temp_dataset_places`);
IF place_count <= 10000 THEN
-- Execute the filtered pushdown queries
...
ELSE
-- Fallback to fetching all or another safe method
...
END IF;References
- When writing BigQuery SQL queries, single quotes inside single-quoted string literals can be escaped using either backslash escaping (') or SQL-standard quote doubling ('').
- Do not use JOINs in Spanner queries executed via EXTERNAL_QUERY when using Databoost, as Databoost requires no joins in the Spanner query.
| LEFT JOIN `temp_type_edges_filtered` edges ON raw.observation_about = edges.subject_id; | ||
|
|
||
| -- Step 5: Aggregate Place Type Summaries (with distinct top_places) | ||
| -- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places |
There was a problem hiding this comment.
Duplicate Step Numbers in SQL Comments
There are multiple comments labeled as Step 6 in the SQL query:
- Line 123:
-- Step 6: Join observations with filtered place_type only - Line 132:
-- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places - Line 194:
-- Step 6: Final aggregation and export to Cache
Please update the step numbers sequentially (e.g., Step 6, Step 7, Step 8) to maintain readability and clarity.
| -- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places | |
| -- Step 7: Aggregate Place Type Summaries and attach names to top 3 sample places |
| FROM top_place_dcids tp, | ||
| UNNEST(tp.top_dcids) as dcid |
There was a problem hiding this comment.
Use Explicit CROSS JOIN UNNEST for Readability
Using a comma join (FROM top_place_dcids tp, UNNEST(...)) is valid in BigQuery but can be less readable and sometimes ambiguous. Using an explicit CROSS JOIN is the preferred standard practice.
| FROM top_place_dcids tp, | |
| UNNEST(tp.top_dcids) as dcid | |
| FROM top_place_dcids tp | |
| CROSS JOIN UNNEST(tp.top_dcids) as dcid |
This PR improves the performance of "provenance summary" generation query.
Broadly, the query improvements are due to:
To measure the improvement, I did some benchmarking on staging DB with Jetski's help on 3 different imports of varying sizes. Below are the results: