feat: Implement super enum aggregations by SandeepTuniki · Pull Request #598 · datacommonsorg/import

SandeepTuniki · 2026-06-29T06:24:21Z

No description provided.

codacy-production · 2026-06-29T06:25:23Z

Not up to standards ⛔

🔴 Issues 2 high · 13 medium · 12 minor

Alerts:
⚠ 27 issues (≤ 0 issues of at least minor severity)

Results:
27 new issues

Category Results

UnusedCode 2 medium

Documentation 9 minor

ErrorProne 2 high

Security 10 medium

CodeStyle 3 minor

Complexity 1 medium

🟢 Metrics 26 complexity

Metric Results

Complexity 26


gemini-code-assist

Code Review

This pull request introduces the SuperEnumAggregationGenerator class to perform Super Enum aggregations using a Pure BigQuery approach, along with corresponding end-to-end integration tests. The reviewer provided valuable feedback on optimizing the BigQuery script: first, to avoid a full table scan on Spanner's Observation table by filtering the query using a list of eligible statistical variables; and second, to dynamically construct the provenance of generated edges using the source edge's provenance instead of hardcoding the first import name, thereby properly supporting multiple imports.

gemini-code-assist · 2026-06-29T06:26:38Z

+          FROM EXTERNAL_QUERY("{connection_id}",
+            '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation''') o


Querying the entire Observation table without any filter on variable_measured inside the EXTERNAL_QUERY will trigger a full table scan on Spanner's largest table. In production, this will lead to severe performance degradation, high query costs, and potential timeouts.

To optimize this, we should fetch the list of eligible source statistical variables in Python first, and then pass them as a list of literals to filter the Observation query directly in Spanner.

Suggested change

-          FROM EXTERNAL_QUERY("{connection_id}",
-            '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation''') o
+          FROM EXTERNAL_QUERY("{connection_id}",
+            '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation
+            WHERE variable_measured IN ({eligible_svs_placeholder})''') o
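
One way the suggested `{eligible_svs_placeholder}` could be filled in from Python is sketched below. This is only an illustration, not the PR's code: the helper names `build_sv_in_list` and `build_observation_source` are hypothetical, and the allowlist of characters in a statistical variable ID is an assumption. The sketch rejects unexpected IDs instead of escaping them, because the literal ends up nested inside a BigQuery triple-quoted string that is then parsed again by Spanner.

```python
import re

# Assumption: statistical variable IDs only contain these characters.
# Anything else is rejected rather than escaped, since the value is embedded
# in SQL that is parsed twice (by BigQuery, then by Spanner).
_SAFE_DCID = re.compile(r"[A-Za-z0-9_./:\-]+")


def build_sv_in_list(eligible_svs):
    """Return a comma-separated list of double-quoted SQL string literals."""
    unique_svs = sorted(set(eligible_svs))
    if not unique_svs:
        # An empty IN () list is invalid SQL, so fail early.
        raise ValueError("no eligible statistical variables to filter on")
    for sv in unique_svs:
        if not _SAFE_DCID.fullmatch(sv):
            raise ValueError(f"unsafe statistical variable id: {sv!r}")
    # Double quotes avoid clashing with the enclosing '''...''' string.
    return ", ".join(f'"{sv}"' for sv in unique_svs)


def build_observation_source(connection_id, eligible_svs):
    """Build the EXTERNAL_QUERY(...) fragment with the filter pushed to Spanner."""
    in_list = build_sv_in_list(eligible_svs)
    return (
        f'EXTERNAL_QUERY("{connection_id}",\n'
        "  '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value\n"
        "     FROM Observation\n"
        f"     WHERE variable_measured IN ({in_list})''')"
    )
```

If the list of eligible variables is very large, the generated query text may approach query length limits; splitting the list into batches would be one option in that case.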

gemini-code-assist · 2026-06-29T06:26:38Z

+        SELECT subject_id, predicate, object_id
+        FROM EXTERNAL_QUERY("{connection_id}",
+          '''SELECT subject_id, predicate, object_id FROM Edge


To support multiple imports correctly and avoid hardcoding the provenance of the first import, we should select the provenance column from the Edge table in Spanner. This will allow us to dynamically construct the provenance for the generated edges.

Suggested change

-        SELECT subject_id, predicate, object_id
-        FROM EXTERNAL_QUERY("{connection_id}",
-          '''SELECT subject_id, predicate, object_id FROM Edge
+        SELECT subject_id, predicate, object_id, provenance
+        FROM EXTERNAL_QUERY("{connection_id}",
+          '''SELECT subject_id, predicate, object_id, provenance FROM Edge

gemini-code-assist · 2026-06-29T06:26:38Z

+          subject_id,
+          predicate,
+          object_id,
+          CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum') AS provenance


Instead of hardcoding the provenance using the first import name (safe_names[0]), we can dynamically construct it using the provenance of the source edge (p.provenance). This ensures that if multiple imports are processed together, each generated edge retains its correct provenance mapping.

Suggested change

-          CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum') AS provenance
+          CONCAT(p.provenance, '_SuperEnum') AS provenance
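
The difference between the two expressions can be illustrated in plain Python. This is a sketch under stated assumptions, not the PR's code: the prefix `dc/base/`, the import names, and the edge rows are made-up sample values, and it assumes the source edge's `provenance` already has the `{prefix}{import name}` form, which the review comment does not confirm.

```python
SUFFIX = "_SuperEnum"


def hardcoded_provenance(prefix, safe_names):
    # Mirrors CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum'):
    # every generated edge is attributed to the first import.
    return f"{prefix}{safe_names[0]}{SUFFIX}"


def derived_provenance(source_provenance):
    # Mirrors CONCAT(p.provenance, '_SuperEnum'):
    # each generated edge follows the provenance of its source edge.
    return f"{source_provenance}{SUFFIX}"


# Hypothetical sample data: two imports processed together.
prefix = "dc/base/"
safe_names = ["ImportA", "ImportB"]
source_edges = [
    {"subject_id": "sv1", "provenance": "dc/base/ImportA"},
    {"subject_id": "sv2", "provenance": "dc/base/ImportB"},
]

before = [hardcoded_provenance(prefix, safe_names) for _ in source_edges]
after = [derived_provenance(edge["provenance"]) for edge in source_edges]
```

With this sample data, `before` attributes both edges to `ImportA`, while `after` keeps `ImportA` and `ImportB` separate. If the stored provenance does not already include the prefix, the two expressions would also differ in format, which would be worth checking before adopting the suggestion.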

Implement super enum aggregation using pure BigQuery

74d5a73

gemini-code-assist Bot reviewed Jun 29, 2026

Refactor SQL helpers to functions, align strategy with C++, and simplify/reformat tests

32f55d8

feat: Implement super enum aggregations #598
SandeepTuniki wants to merge 2 commits into master from super-enum-aggregation
