Skip to content

feat: Implement super enum aggregations#598

Draft
SandeepTuniki wants to merge 2 commits into
masterfrom
super-enum-aggregation
Draft

feat: Implement super enum aggregations#598
SandeepTuniki wants to merge 2 commits into
masterfrom
super-enum-aggregation

Conversation

@SandeepTuniki

Copy link
Copy Markdown
Contributor

No description provided.

@codacy-production

codacy-production Bot commented Jun 29, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 2 high · 13 medium · 12 minor

Alerts:
⚠ 27 issues (≤ 0 issues of at least minor severity)

Results:
27 new issues

Category Results
UnusedCode 2 medium
Documentation 9 minor
ErrorProne 2 high
Security 10 medium
CodeStyle 3 minor
Complexity 1 medium

View in Codacy

🟢 Metrics 26 complexity

Metric Results
Complexity 26

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the SuperEnumAggregationGenerator class to perform Super Enum aggregations using a Pure BigQuery approach, along with corresponding end-to-end integration tests. The reviewer provided valuable feedback on optimizing the BigQuery script: first, to avoid a full table scan on Spanner's Observation table by filtering the query using a list of eligible statistical variables; and second, to dynamically construct the provenance of generated edges using the source edge's provenance instead of hardcoding the first import name, thereby properly supporting multiple imports.

Comment on lines +344 to +345
FROM EXTERNAL_QUERY("{connection_id}",
'''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation''') o

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Querying the entire Observation table without any filter on variable_measured inside the EXTERNAL_QUERY will trigger a full table scan on Spanner's largest table. In production, this will lead to severe performance degradation, high query costs, and potential timeouts.

To optimize this, we should fetch the list of eligible source statistical variables in Python first, and then pass them as a list of literals to filter the Observation query directly in Spanner.

Suggested change
FROM EXTERNAL_QUERY("{connection_id}",
'''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation''') o
FROM EXTERNAL_QUERY("{connection_id}",
'''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation
WHERE variable_measured IN ({eligible_svs_placeholder})''') o

Comment on lines +124 to +126
SELECT subject_id, predicate, object_id
FROM EXTERNAL_QUERY("{connection_id}",
'''SELECT subject_id, predicate, object_id FROM Edge

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To support multiple imports correctly and avoid hardcoding the provenance of the first import, we should select the provenance column from the Edge table in Spanner. This will allow us to dynamically construct the provenance for the generated edges.

Suggested change
SELECT subject_id, predicate, object_id
FROM EXTERNAL_QUERY("{connection_id}",
'''SELECT subject_id, predicate, object_id FROM Edge
SELECT subject_id, predicate, object_id, provenance
FROM EXTERNAL_QUERY("{connection_id}",
'''SELECT subject_id, predicate, object_id, provenance FROM Edge

subject_id,
predicate,
object_id,
CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum') AS provenance

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of hardcoding the provenance using the first import name (safe_names[0]), we can dynamically construct it using the provenance of the source edge (p.provenance). This ensures that if multiple imports are processed together, each generated edge retains its correct provenance mapping.

Suggested change
CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum') AS provenance
CONCAT(p.provenance, '_SuperEnum') AS provenance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant