feat: Implement Statvar Calculation aggregation by SandeepTuniki · Pull Request #589 · datacommonsorg/import

SandeepTuniki · 2026-06-25T13:03:12Z

No description provided.

…and Spanner

codacy-production · 2026-06-25T13:04:18Z

Not up to standards ⛔

🔴 Issues 10 medium · 18 minor

Alerts:
⚠ 28 issues (≤ 0 issues of at least minor severity)

Results:
28 new issues

Category Results

Documentation 12 minor

Security 4 medium

CodeStyle 6 minor

Complexity 6 medium

View in Codacy

🟢 Metrics 54 complexity

Metric Results

Complexity 54

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

gemini-code-assist

Code Review

This pull request introduces the StatVarCalculationGenerator class along with its corresponding end-to-end integration tests. This generator builds and executes multi-statement SQL scripts using BigQuery Federation to perform mathematical operations (such as DIVIDE, MULTIPLY, ADD, and SUBTRACT) on statistical variables, constructing output TimeSeries and Observation rows to write back to Spanner. The review feedback highlights two critical SQL injection vulnerabilities: the multiplier value is embedded into the SQL query without validation or casting, and the regular expressions (sv_regex and mm_regex) are not escaped, which could allow malicious inputs to alter the query structure. Both issues should be addressed by casting the multiplier to a float and escaping single quotes in the regex filters.

SandeepTuniki · 2026-06-25T13:25:47Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the StatVarCalculationGenerator class to perform statistical variable calculations (such as DIVIDE, MULTIPLY, ADD, and SUBTRACT) using BigQuery Federation and write the results back to Spanner, along with comprehensive integration tests. The review feedback highlights several critical issues: a major performance bottleneck caused by multiple full table scans on the Spanner Observation table, a correctness bug where the provenance column is missing from the TimeSeries export, an incorrect escaping helper used for BigQuery-only literals, and a regex escaping bug inside BigQuery raw string literals.

…raw string escaping

n-h-diaz

Thanks! very cool sql :)

just want to double check: I see import_name_regex as an option to filter facets in the config - is anything needed to handle this here? or will the other facet matching take care of it? (and for my own understanding, does the calculation do a cross product of all input imports that match the facets or only within an import?)

also curious if you've run this on the existing configs and what the performance is like? (the climate calculation is huge(!), so want to make sure this is feasible. for example, if it gets very large, maybe we can do more aggressive filtering early on vs fetching the entire input import from spanner)

also since we're doing this for a subset of imports, we should ensure that all required input imports are included within the same aggregation run (not for this pr, but something to keep in mind)

…ests

SandeepTuniki · 2026-06-30T06:20:39Z

Thanks for the review and suggestions, Natalie!

To answer your questions and update you on the changes since your review:

import_name_regex: Yes, my bad. This was a gap in my porting. Thanks for catching this. I've now implemented support for import_name_regex.
Join Behaviour: Yes, the calculation performs a cross-import join based on place-date/extra_entities (it does not restrict inputs to the same import/provenance). It is the same behaviour in the existing flume pipeline.
Scaling & performance testing: I haven't done full scale testing on our full KG for these queries. To keep the development flow fast, I've primarily relied on e2e tests for correctness, but I'm banking on BQ to scale the performance for our full KG. However, I did some scalability testing on an earlier place aggregation and used those lessons to apply the same patterns here (ex: creating temp tables in BQ, streaming only the required data from Spanner etc). I still anticipate running into a few scaling issues once we run these on full KG, but I'm optimistic that they can be fixed with minor tweaks in the queries.
Also, thanks for catching the sub-optimal part of query which was loading most of the Observation table's data into BQ 🙏. I updated the query to load only those observations which belong to the specified imports.
Dependency of input imports: This orchestration will be handled through the changes in feat: Support configurable aggregations setup #582.

Thanks for approving the PR. I'll leave the PR open for a day in case you might want to respond to code changes, or for any further comments. I'll merge it otherwise.

Implement StatVar calculation aggregations using BigQuery Federation …

8383b01

…and Spanner

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated

Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated

Security: Validate multiplier and escape single quotes in regex filters

b7c1803

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

SandeepTuniki added 2 commits June 25, 2026 19:09

Optimize: Cache Spanner inputs in BigQuery temp tables and fix regex …

e2aaeac

…raw string escaping

Fix: Resolve Codacy warnings and expand metadata assertions in E2E tests

30c1ff3

SandeepTuniki marked this pull request as ready for review June 26, 2026 03:47

SandeepTuniki requested review from n-h-diaz and vish-cs June 26, 2026 03:47

n-h-diaz approved these changes Jun 26, 2026

View reviewed changes

SandeepTuniki added 2 commits June 30, 2026 10:53

Optimize: Implement early Spanner-side filtering for Observation table

26faeab

Feat: Implement import_name_regex filtering and add E2E integration t…

b24d3c5

…ests

Merge branch 'master' into statvar-calculation-aggregation

3d897bc

SandeepTuniki enabled auto-merge (squash) July 1, 2026 02:23

SandeepTuniki merged commit f5c4b24 into master Jul 1, 2026
10 of 11 checks passed

SandeepTuniki deleted the statvar-calculation-aggregation branch July 1, 2026 02:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Implement Statvar Calculation aggregation#589

feat: Implement Statvar Calculation aggregation#589
SandeepTuniki merged 7 commits into
masterfrom
statvar-calculation-aggregation

SandeepTuniki commented Jun 25, 2026

Uh oh!

codacy-production Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

SandeepTuniki commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

n-h-diaz left a comment

Uh oh!

SandeepTuniki commented Jun 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

SandeepTuniki commented Jun 25, 2026

Uh oh!

codacy-production Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not up to standards ⛔

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

SandeepTuniki commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

n-h-diaz left a comment

Choose a reason for hiding this comment

Uh oh!

SandeepTuniki commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codacy-production Bot commented Jun 25, 2026 •

edited

Loading

SandeepTuniki commented Jun 30, 2026 •

edited

Loading