Improve performance of "provenance summary" generation query by SandeepTuniki · Pull Request #612 · datacommonsorg/import

SandeepTuniki · 2026-07-04T17:37:32Z

This PR improves the performance of "provenance summary" generation query.

Broadly, the query improvements are due to:

Performing some early filtering on Spanner side so that less data is streamed to BigQuery.
On Bigquery side:
- I created temp tables to buffer the streamed data instead of loading it into memory directly.
- In addition, I applied further filters on BQ before applying joins to reduce intermediate processing.

To measure the improvement, I did some benchmarking on staging DB with Jetski's help on 3 different imports of varying sizes. Below are the results:

Scale Category	Dataset Name	Time Series Count	Total Observations	BigQuery Data Processed	Total Execution Time
Small	Adolescent_Birth_Rate	439	7,403	18.3 MB	159.75s (~2.6 min)
Medium	BEAGDPv2	84,123	~2.1 Million	579 MB	151.60s (~2.5 min)
Very Large	BLS_QCEW	58,446,698	~640 Million	2.77 TB	869.76s (~14.4 min)

…dge lookups

codacy-production · 2026-07-04T17:39:07Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results

Complexity 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

gemini-code-assist

Code Review

This pull request optimizes the provenance summary aggregation by filtering Spanner queries to fetch only the place types and names for the specific place IDs present in the dataset, rather than fetching all nodes and edges. While this significantly reduces data transfer, the feedback highlights a potential runtime failure if a dataset contains more than 10,000 distinct places due to Spanner's IN clause limit. Additionally, minor improvements are suggested to fix duplicate step numbers in the SQL comments and to use an explicit CROSS JOIN UNNEST instead of a comma join for better readability.

gemini-code-assist · 2026-07-04T17:39:49Z

+        SET place_dcids_str = (
+          SELECT IFNULL(STRING_AGG(FORMAT("'%s'", REPLACE(observation_about, "'", "\\'")), ','), "''")
+          FROM `temp_dataset_places`
+        );


Potential Runtime Failure with Spanner's IN Clause Limit (10,000 values)

The current implementation constructs a comma-separated list of place DCIDs (place_dcids_str) and pushes it down to Spanner via EXTERNAL_QUERY using an IN clause:
subject_id IN (%s)

However, Google Cloud Spanner has a strict limit of 10,000 values in an IN operator (see Spanner Query Limits). If a dataset contains more than 10,000 distinct places (for example, US zip codes, census tracts, or school districts), this query will fail at runtime with a Spanner error.

Additionally, very large lists of DCIDs can exceed Spanner's query size limit (1 MB) or BigQuery's unresolved query length limit.

Recommended Solution:

Fallback / Batching / Conditional Execution:
If you want to keep the pushdown approach for smaller datasets, you can check the count of distinct places first. If it exceeds 10,000, fallback to fetching all or use a different strategy:

DECLARE place_count INT64; SET place_count = (SELECT COUNT(*) FROM `temp_dataset_places`); IF place_count <= 10000 THEN -- Execute the filtered pushdown queries ... ELSE -- Fallback to fetching all or another safe method ... END IF;

References

When writing BigQuery SQL queries, single quotes inside single-quoted string literals can be escaped using either backslash escaping (') or SQL-standard quote doubling ('').

Do not use JOINs in Spanner queries executed via EXTERNAL_QUERY when using Databoost, as Databoost requires no joins in the Spanner query.

gemini-code-assist · 2026-07-04T17:39:49Z

+        LEFT JOIN `temp_type_edges_filtered` edges ON raw.observation_about = edges.subject_id;

-        -- Step 5: Aggregate Place Type Summaries (with distinct top_places)
+        -- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places


Duplicate Step Numbers in SQL Comments

There are multiple comments labeled as Step 6 in the SQL query:

Line 123: -- Step 6: Join observations with filtered place_type only

Line 132: -- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places

Line 194: -- Step 6: Final aggregation and export to Cache

Please update the step numbers sequentially (e.g., Step 6, Step 7, Step 8) to maintain readability and clarity.

Suggested change

-- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places

-- Step 7: Aggregate Place Type Summaries and attach names to top 3 sample places

gemini-code-assist · 2026-07-04T17:39:49Z

+          FROM top_place_dcids tp,
+          UNNEST(tp.top_dcids) as dcid


Use Explicit CROSS JOIN UNNEST for Readability

Using a comma join (FROM top_place_dcids tp, UNNEST(...)) is valid in BigQuery but can be less readable and sometimes ambiguous. Using an explicit CROSS JOIN is the preferred standard practice.

Suggested change

FROM top_place_dcids tp,

UNNEST(tp.top_dcids) as dcid

FROM top_place_dcids tp

CROSS JOIN UNNEST(tp.top_dcids) as dcid

Fix ProvenanceSummaryGenerator Spanner pushdown filter for Node and E…

201e8e5

…dge lookups

gemini-code-assist Bot reviewed Jul 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance of "provenance summary" generation query#612

Improve performance of "provenance summary" generation query#612
SandeepTuniki wants to merge 1 commit into
masterfrom
fix-provenance-summary-spanner-pushdown

SandeepTuniki commented Jul 4, 2026 •

edited

Loading

Uh oh!

codacy-production Bot commented Jul 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jul 4, 2026

Uh oh!

gemini-code-assist Bot Jul 4, 2026

Uh oh!

gemini-code-assist Bot Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	-- Step 6: Aggregate Place Type Summaries and attach names to top 3 sample places
	-- Step 7: Aggregate Place Type Summaries and attach names to top 3 sample places

Uh oh!

Conversation

SandeepTuniki commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codacy-production Bot commented Jul 4, 2026

Up to standards ✅

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 4, 2026

Choose a reason for hiding this comment

Potential Runtime Failure with Spanner's IN Clause Limit (10,000 values)

Recommended Solution:

Uh oh!

gemini-code-assist Bot Jul 4, 2026

Choose a reason for hiding this comment

Duplicate Step Numbers in SQL Comments

Uh oh!

gemini-code-assist Bot Jul 4, 2026

Choose a reason for hiding this comment

Use Explicit CROSS JOIN UNNEST for Readability

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SandeepTuniki commented Jul 4, 2026 •

edited

Loading

Potential Runtime Failure with Spanner's `IN` Clause Limit (10,000 values)

Use Explicit `CROSS JOIN UNNEST` for Readability