Skip to content

feat: Implement Statvar Calculation aggregation#589

Merged
SandeepTuniki merged 7 commits into
masterfrom
statvar-calculation-aggregation
Jul 1, 2026
Merged

feat: Implement Statvar Calculation aggregation#589
SandeepTuniki merged 7 commits into
masterfrom
statvar-calculation-aggregation

Conversation

@SandeepTuniki

Copy link
Copy Markdown
Contributor

No description provided.

@codacy-production

codacy-production Bot commented Jun 25, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 10 medium · 18 minor

Alerts:
⚠ 28 issues (≤ 0 issues of at least minor severity)

Results:
28 new issues

Category Results
Documentation 12 minor
Security 4 medium
CodeStyle 6 minor
Complexity 6 medium

View in Codacy

🟢 Metrics 54 complexity

Metric Results
Complexity 54

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the StatVarCalculationGenerator class along with its corresponding end-to-end integration tests. This generator builds and executes multi-statement SQL scripts using BigQuery Federation to perform mathematical operations (such as DIVIDE, MULTIPLY, ADD, and SUBTRACT) on statistical variables, constructing output TimeSeries and Observation rows to write back to Spanner. The review feedback highlights two critical SQL injection vulnerabilities: the multiplier value is embedded into the SQL query without validation or casting, and the regular expressions (sv_regex and mm_regex) are not escaped, which could allow malicious inputs to alter the query structure. Both issues should be addressed by casting the multiplier to a float and escaping single quotes in the regex filters.

Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
@SandeepTuniki

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the StatVarCalculationGenerator class to perform statistical variable calculations (such as DIVIDE, MULTIPLY, ADD, and SUBTRACT) using BigQuery Federation and write the results back to Spanner, along with comprehensive integration tests. The review feedback highlights several critical issues: a major performance bottleneck caused by multiple full table scans on the Spanner Observation table, a correctness bug where the provenance column is missing from the TimeSeries export, an incorrect escaping helper used for BigQuery-only literals, and a regex escaping bug inside BigQuery raw string literals.

Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
@SandeepTuniki SandeepTuniki marked this pull request as ready for review June 26, 2026 03:47

@n-h-diaz n-h-diaz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! very cool sql :)

just want to double check: I see import_name_regex as an option to filter facets in the config - is anything needed to handle this here? or will the other facet matching take care of it? (and for my own understanding, does the calculation do a cross product of all input imports that match the facets or only within an import?)

also curious if you've run this on the existing configs and what the performance is like? (the climate calculation is huge(!), so want to make sure this is feasible. for example, if it gets very large, maybe we can do more aggressive filtering early on vs fetching the entire input import from spanner)

also since we're doing this for a subset of imports, we should ensure that all required input imports are included within the same aggregation run (not for this pr, but something to keep in mind)

@SandeepTuniki

SandeepTuniki commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review and suggestions, Natalie!

To answer your questions and update you on the changes since your review:

  • import_name_regex: Yes, my bad. This was a gap in my porting. Thanks for catching this. I've now implemented support for import_name_regex.
  • Join Behaviour: Yes, the calculation performs a cross-import join based on place-date/extra_entities (it does not restrict inputs to the same import/provenance). It is the same behaviour in the existing flume pipeline.
  • Scaling & performance testing: I haven't done full scale testing on our full KG for these queries. To keep the development flow fast, I've primarily relied on e2e tests for correctness, but I'm banking on BQ to scale the performance for our full KG. However, I did some scalability testing on an earlier place aggregation and used those lessons to apply the same patterns here (ex: creating temp tables in BQ, streaming only the required data from Spanner etc). I still anticipate running into a few scaling issues once we run these on full KG, but I'm optimistic that they can be fixed with minor tweaks in the queries.
  • Also, thanks for catching the sub-optimal part of query which was loading most of the Observation table's data into BQ 🙏. I updated the query to load only those observations which belong to the specified imports.
  • Dependency of input imports: This orchestration will be handled through the changes in feat: Support configurable aggregations setup #582.

Thanks for approving the PR. I'll leave the PR open for a day in case you might want to respond to code changes, or for any further comments. I'll merge it otherwise.

@SandeepTuniki SandeepTuniki enabled auto-merge (squash) July 1, 2026 02:23
@SandeepTuniki SandeepTuniki merged commit f5c4b24 into master Jul 1, 2026
10 of 11 checks passed
@SandeepTuniki SandeepTuniki deleted the statvar-calculation-aggregation branch July 1, 2026 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants