feat: Implement super enum aggregations#598
Conversation
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| UnusedCode | 2 medium |
| Documentation | 9 minor |
| ErrorProne | 2 high |
| Security | 10 medium |
| CodeStyle | 3 minor |
| Complexity | 1 medium |
🟢 Metrics 26 complexity
Metric Results Complexity 26
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request introduces the SuperEnumAggregationGenerator class to perform Super Enum aggregations using a Pure BigQuery approach, along with corresponding end-to-end integration tests. The reviewer provided valuable feedback on optimizing the BigQuery script: first, to avoid a full table scan on Spanner's Observation table by filtering the query using a list of eligible statistical variables; and second, to dynamically construct the provenance of generated edges using the source edge's provenance instead of hardcoding the first import name, thereby properly supporting multiple imports.
| FROM EXTERNAL_QUERY("{connection_id}", | ||
| '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation''') o |
There was a problem hiding this comment.
Querying the entire Observation table without any filter on variable_measured inside the EXTERNAL_QUERY will trigger a full table scan on Spanner's largest table. In production, this will lead to severe performance degradation, high query costs, and potential timeouts.
To optimize this, we should fetch the list of eligible source statistical variables in Python first, and then pass them as a list of literals to filter the Observation query directly in Spanner.
| FROM EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation''') o | |
| FROM EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT variable_measured, entity1, extra_entities_id, facet_id, date, value FROM Observation | |
| WHERE variable_measured IN ({eligible_svs_placeholder})''') o |
| SELECT subject_id, predicate, object_id | ||
| FROM EXTERNAL_QUERY("{connection_id}", | ||
| '''SELECT subject_id, predicate, object_id FROM Edge |
There was a problem hiding this comment.
To support multiple imports correctly and avoid hardcoding the provenance of the first import, we should select the provenance column from the Edge table in Spanner. This will allow us to dynamically construct the provenance for the generated edges.
| SELECT subject_id, predicate, object_id | |
| FROM EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT subject_id, predicate, object_id FROM Edge | |
| SELECT subject_id, predicate, object_id, provenance | |
| FROM EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT subject_id, predicate, object_id, provenance FROM Edge |
| subject_id, | ||
| predicate, | ||
| object_id, | ||
| CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum') AS provenance |
There was a problem hiding this comment.
Instead of hardcoding the provenance using the first import name (safe_names[0]), we can dynamically construct it using the provenance of the source edge (p.provenance). This ensures that if multiple imports are processed together, each generated edge retains its correct provenance mapping.
| CONCAT('{prefix}', '{safe_names[0]}', '_SuperEnum') AS provenance | |
| CONCAT(p.provenance, '_SuperEnum') AS provenance |
…ify/reformat tests
No description provided.