Improve performance of "provenance summary" generation query #612

Draft
SandeepTuniki wants to merge 1 commit into master from fix-provenance-summary-spanner-pushdown

@SandeepTuniki SandeepTuniki commented Jul 4, 2026

Contributor

This PR improves the performance of the "provenance summary" generation query.

The improvements come from:

  • Filtering early on the Spanner side so that less data is streamed to BigQuery.
  • On the BigQuery side:
    • Buffering the streamed data in temp tables instead of loading it directly into memory.
    • Applying further filters before the joins to reduce intermediate processing.
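
For reviewers, the filter-before-join idea on the BigQuery side can be sketched roughly as follows. This is an illustration only, not the query in this PR: `temp_raw_observations`, `temp_type_edges`, `variable_measured`, `object_id`, and the sample rows are hypothetical stand-ins, while `temp_dataset_places` and `temp_type_edges_filtered` borrow names that appear in the diff. In the PR the edge filtering is pushed down to Spanner; here it is done locally so the script runs standalone.

```sql
-- Hypothetical sample data standing in for observations streamed from Spanner.
CREATE TEMP TABLE temp_raw_observations AS
SELECT * FROM UNNEST([
  STRUCT('geoId/06' AS observation_about, 'Count_Person' AS variable_measured),
  STRUCT('geoId/06', 'Count_Household'),
  STRUCT('country/USA', 'Count_Person')
]);

-- Hypothetical sample of place-type edges.
CREATE TEMP TABLE temp_type_edges AS
SELECT * FROM UNNEST([
  STRUCT('geoId/06' AS subject_id, 'State' AS object_id),
  STRUCT('country/USA', 'Country'),
  STRUCT('geoId/48', 'State')  -- not referenced by any observation
]);

-- Reduce the observations to the distinct places they mention.
CREATE TEMP TABLE temp_dataset_places AS
SELECT DISTINCT observation_about
FROM temp_raw_observations;

-- Keep only the edges for those places, before any join runs.
CREATE TEMP TABLE temp_type_edges_filtered AS
SELECT e.subject_id, e.object_id
FROM temp_type_edges e
WHERE e.subject_id IN (SELECT observation_about FROM temp_dataset_places);

-- The join now touches only the pre-filtered edges.
SELECT raw.observation_about, raw.variable_measured, edges.object_id AS place_type
FROM temp_raw_observations raw
LEFT JOIN temp_type_edges_filtered edges
  ON raw.observation_about = edges.subject_id;
```

The point of the ordering is that the `geoId/48` edge is dropped before the join, so the join input shrinks with the dataset rather than with the full graph.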

To measure the improvement, I benchmarked three imports of varying sizes on the staging DB, with Jetski's help. The results are below:

| Scale Category | Dataset Name | Time Series Count | Total Observations | BigQuery Data Processed | Total Execution Time |
|---|---|---|---|---|---|
| Small | Adolescent_Birth_Rate | 439 | 7,403 | 18.3 MB | 159.75s (~2.6 min) |
| Medium | BEAGDPv2 | 84,123 | ~2.1 Million | 579 MB | 151.60s (~2.5 min) |
| Very Large | BLS_QCEW | 58,446,698 | ~640 Million | 2.77 TB | 869.76s (~14.4 min) |

@codacy-production

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request optimizes the provenance summary aggregation by filtering Spanner queries to fetch only the place types and names for the specific place IDs present in the dataset, rather than fetching all nodes and edges. While this significantly reduces data transfer, the feedback highlights a potential runtime failure if a dataset contains more than 10,000 distinct places due to Spanner's IN clause limit. Additionally, minor improvements are suggested to fix duplicate step numbers in the SQL comments and to use an explicit CROSS JOIN UNNEST instead of a comma join for better readability.

Comment on lines +100 to +103
SET place_dcids_str = (
SELECT IFNULL(STRING_AGG(FORMAT("'%s'", REPLACE(observation_about, "'", "\\'")), ','), "''")
FROM `temp_dataset_places`
);

high

Potential Runtime Failure with Spanner's IN Clause Limit (10,000 values)

The current implementation constructs a comma-separated list of place DCIDs (place_dcids_str) and pushes it down to Spanner via EXTERNAL_QUERY using an IN clause:
subject_id IN (%s)

However, Google Cloud Spanner has a strict limit of 10,000 values in an IN operator (see Spanner Query Limits). If a dataset contains more than 10,000 distinct places (for example, US zip codes, census tracts, or school districts), this query will fail at runtime with a Spanner error.

Additionally, very large lists of DCIDs can exceed Spanner's query statement length limit (1 million characters) or BigQuery's unresolved query length limit.

Recommended Solution:

Fallback / Batching / Conditional Execution:
If you want to keep the pushdown approach for smaller datasets, you can check the count of distinct places first. If it exceeds 10,000, fall back to fetching everything or use a different strategy:

DECLARE place_count INT64;
SET place_count = (SELECT COUNT(*) FROM `temp_dataset_places`);

IF place_count <= 10000 THEN
  -- Execute the filtered pushdown queries
  ...
ELSE
  -- Fallback to fetching all or another safe method
  ...
END IF;
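
The batching option could look roughly like the sketch below, which splits the places into groups that each fit under the 10,000-value limit and builds one IN list per group. The names `temp_dataset_places_batched`, `batch_size`, `num_batches`, and `batch_idx` are hypothetical, and the pushdown step itself is left as a comment because its exact form is not shown in this thread. If merged into the existing script, the `place_dcids_str` declaration would already exist and should not be repeated.

```sql
DECLARE batch_size INT64 DEFAULT 10000;  -- IN-operator limit cited above
DECLARE num_batches INT64;
DECLARE batch_idx INT64 DEFAULT 0;
DECLARE place_dcids_str STRING;

-- Assign every distinct place to a batch of at most batch_size rows.
CREATE TEMP TABLE temp_dataset_places_batched AS
SELECT
  observation_about,
  DIV(ROW_NUMBER() OVER (ORDER BY observation_about) - 1, batch_size) AS batch_id
FROM (SELECT DISTINCT observation_about FROM `temp_dataset_places`);

-- Zero batches when the dataset has no places.
SET num_batches = (
  SELECT IFNULL(MAX(batch_id) + 1, 0) FROM temp_dataset_places_batched
);

WHILE batch_idx < num_batches DO
  -- Build the quoted, comma-separated list for this batch only.
  SET place_dcids_str = (
    SELECT STRING_AGG(FORMAT("'%s'", REPLACE(observation_about, "'", "\\'")), ',')
    FROM temp_dataset_places_batched
    WHERE batch_id = batch_idx
  );

  -- Run the existing filtered pushdown for this batch here, appending its
  -- rows to the target temp table (INSERT INTO) instead of recreating it.

  SET batch_idx = batch_idx + 1;
END WHILE;
```

This trades one large query for several smaller ones, so it may only pay off while the number of batches stays modest. A smaller `batch_size` may also be needed if long DCIDs push a single list toward the query length limits.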
References
  1. When writing BigQuery SQL queries, single quotes inside single-quoted string literals can be escaped using either backslash escaping (\') or SQL-standard quote doubling ('').
  2. Do not use JOINs in Spanner queries executed via EXTERNAL_QUERY when using Data Boost, as Data Boost requires no joins in the Spanner query.

LEFT JOIN `temp_type_edges_filtered` edges ON raw.observation_about = edges.subject_id;

-- Step 5: Aggregate Place Type Summaries (with distinct top_places)
-- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places

medium

Duplicate Step Numbers in SQL Comments

There are multiple comments labeled as Step 6 in the SQL query:

  • Line 123: -- Step 6: Join observations with filtered place_type only
  • Line 132: -- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places
  • Line 194: -- Step 6: Final aggregation and export to Cache

Please update the step numbers sequentially (e.g., Step 6, Step 7, Step 8) to maintain readability and clarity.

Suggested change
- -- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places
+ -- Step 7: Aggregate Place Type Summaries and attach names to top 3 sample places

Comment on lines +177 to +178
FROM top_place_dcids tp,
UNNEST(tp.top_dcids) as dcid

medium

Use Explicit CROSS JOIN UNNEST for Readability

Using a comma join (FROM top_place_dcids tp, UNNEST(...)) is valid in BigQuery but can be less readable and sometimes ambiguous. Using an explicit CROSS JOIN is the preferred standard practice.

Suggested change
- FROM top_place_dcids tp,
- UNNEST(tp.top_dcids) as dcid
+ FROM top_place_dcids tp
+ CROSS JOIN UNNEST(tp.top_dcids) as dcid
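
As a standalone illustration of the explicit form, the query below flattens an array column with `CROSS JOIN UNNEST`. Only the names `top_place_dcids` and `top_dcids` come from the diff. The `place_type` column and the sample rows are hypothetical.

```sql
-- Hypothetical sample rows standing in for the real top_place_dcids CTE.
WITH top_place_dcids AS (
  SELECT 'Country' AS place_type, ['country/USA', 'country/IND'] AS top_dcids
  UNION ALL
  SELECT 'State', ['geoId/06']
)
SELECT tp.place_type, dcid
FROM top_place_dcids tp
CROSS JOIN UNNEST(tp.top_dcids) AS dcid;  -- one output row per array element
```

With this sample data the result has three rows, the same as the comma-join form would produce, so the change affects readability only.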
