Skip to content

Feature/emt 391 improvements#3049

Draft
olacognite wants to merge 14 commits into
feature/EMT-391from
feature/EMT-391_improvements
Draft

Feature/emt 391 improvements#3049
olacognite wants to merge 14 commits into
feature/EMT-391from
feature/EMT-391_improvements

Conversation

@olacognite

@olacognite olacognite commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Description

We want to add the EM aliasing deployment pack #3005
This is a giant PR with some styling shortcomings and some dead code also. A claude summary of possible actions was given as follows:

Overall verdict

The functionality is well-tested (98% patch coverage) and cleanly isolated, but the code reads as if it were written in another language's idiom — Java/C#-style enterprise patterns transplanted onto a Python codebase that has its own, much lighter conventions. The single biggest risk for a team unfamiliar with the codebase is that future maintainers will have two competing styles to reconcile. None of this blocks merge given the feature flag, but several items are worth fixing before this becomes the template others copy.

Structural deviations from the codebase

This is the heart of it. The existing toolkit is pragmatic Python; this PR imports a different design culture. Concretely:

A dependency-injection container written by hand.

bootstrap/bootstrapper.py is ~10 provide_* factory functions wiring objects together with x or provide_default() fallbacks. This is a Spring/Guice idiom. Nothing else in the toolkit does this — collaborators construct objects directly. For the actual call path (one command, one facade), this entire file could be a couple of constructor calls.

Abstraction layers with exactly one implementation.

There are five ABCs in the new tree (RuleDefinitionRegistry, RulesDiscovery, AliasingKuiperBuilder, AliasingKuiperBuilderFactory, ExpressionComposer), each with a single Default*/Local* concrete class. A Builder, plus a BuilderFactory that does nothing but call the builder's constructor, plus a Facade over the factory, is three layers to do one thing. The toolkit's existing commands subclass ToolkitCommand and implement execute() — flat and direct. A reader coming from the rest of the repo will spend real time tracing facade → factory → builder → registry → composer to find where work actually happens.

Filesystem-glob + importlib rule discovery.

registry/rules_discovery.py globs rules/*.py and dynamically imports each module to find RuleDefinition subclasses. This is the only place in _cdf_tk that uses importlib.import_module for discovery. The codebase's established idiom for the same problem is subclasses() (used in resource_ios, builders, yaml_classes, etc.). The dynamic version is slower (cold-start is already a known toolkit pain point — see CDF-27851), harder to trace statically, and silently swallows import failures into log warnings, so a broken rule just vanishes from the registry instead of failing loudly. A plain explicit registry dict, or subclasses() after importing the package, would match convention and remove ~80 lines.

Single-string value objects.

RuleName and RuleDescription in rules/base.py wrap one str each with a post_init non-empty check. Per-rule context *Builder classes (e.g. CaseTransformationContextBuilder) do the same for one field. This is ceremony Python usually expresses with a validated field or a guard clause.

Custom exception hierarchy that bypasses the toolkit's.

io/errors.py defines YamlReadError / InvalidRuleFormatError deriving from bare Exception, and the builder/registry raise BuilderConstraintError, RuleDefinitionNotFoundError, and plain ValueError/FileNotFoundError. The CLI's top-level handler (_cdf.py:178) catches ToolkitError to render clean error messages. None of these inherit from ToolkitError, so a user pointing the command at a malformed YAML gets a raw Python traceback instead of a formatted toolkit error. This is a user-facing behavioral deviation, not just stylistic — worth fixing. The repo already has ToolkitFileNotFoundError, ToolkitValueError, etc. for exactly this.

Not using the toolkit's file I/O helpers.

YamlRulesReader.read_file takes a str path (rest of the codebase passes pathlib.Path) and uses raw open() + yaml.safe_load; the command writes output with Path.write_text. The toolkit has read_yaml_file, safe_read, safe_write, and yaml_safe_dump (used throughout modules.py and elsewhere) that handle encoding and error wrapping consistently.

Concrete bugs / dead code

  • Dead method:YamlRulesReader._validate_rules_key_exists (lines ~122–140) is defined but never called — it was superseded by _validate_root_structure. Remove it.
  • Dead computation: in aliasing_kuiper_builder.py._resolve_composite_rules, resolved_specs is built up (.append(...)) and then never used — only expanded_sub_rules feeds the result. Either it's leftover from a refactor or it signals an intended check that got dropped. Worth confirming with the author which.
  • Non-exhaustive match:CaseTransformationRuleDefinition.create_kuiper_macro matches UPPERCASE/LOWERCASE with no fallback. If a third CaseStrategy is ever added, expression is unbound → UnboundLocalError at runtime rather than a clear error. Add an else/case _.
  • Unused public surface: the Rule / RuleName / RuleDescription classes in rules/base.py aren't referenced anywhere in shipping code (the real flow uses the separate AliasingRule dataclass). Two parallel "rule" types in one feature is confusing; drop the unused one.
  • O(n²) duplicate check:_validate_rules does rule_names.count(name) inside a set comprehension over the same list. Trivial at current scale, but collections.Counter is the idiomatic one-liner.

Bump

  • Patch
  • Skip

Changelog

Added/Changed/Improved/Removed/Fixed

See commits, this does not go into the changelog anyway, since we are merging this into another PR branch.

@gemini-code-assist

Copy link
Copy Markdown
Contributor
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

@olacognite

Copy link
Copy Markdown
Contributor Author

/gemini review

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

☂️ Code Coverage

current status: ✅

Overall Coverage

Statements Covered Coverage Threshold Status
43513 37490 86% 80% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/api/facade.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/assembly/aliasing_kuiper_builder.py 96% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/bootstrap/bootstrapper.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/io/errors.py 97% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/io/yaml_rules_reader.py 88% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/registry/registry.py 96% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/registry/rules_discovery.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/base.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/case_transformation.py 93% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/character_substitution.py 93% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/composite.py 98% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/leading_zero_normalization.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/prefix_suffix.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/regex_substitution.py 100% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/aliasing/rules/value_expansion.py 96% 🟢
cognite_toolkit/_cdf_tk/commands/entity_matching/entity_matching.py 100% 🟢
TOTAL 97% 🟢

updated for commit: 6ddbed5 by action🐍

@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.29630% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.15%. Comparing base (10e49e3) to head (6ddbed5).

Files with missing lines Patch % Lines
...ity_matching/aliasing/rules/case_transformation.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                 @@
##           feature/EMT-391    #3049      +/-   ##
===================================================
- Coverage            86.20%   86.15%   -0.05%     
===================================================
  Files                  493      489       -4     
  Lines                43774    43513     -261     
===================================================
- Hits                 37736    37490     -246     
+ Misses                6038     6023      -15     
Files with missing lines Coverage Δ
...tk/commands/entity_matching/aliasing/api/facade.py 100.00% <100.00%> (ø)
...ching/aliasing/assembly/aliasing_kuiper_builder.py 96.20% <100.00%> (-0.39%) ⬇️
...entity_matching/aliasing/bootstrap/bootstrapper.py 100.00% <100.00%> (ø)
..._tk/commands/entity_matching/aliasing/io/errors.py 96.87% <100.00%> (+0.10%) ⬆️
...s/entity_matching/aliasing/io/yaml_rules_reader.py 87.50% <100.00%> (+5.50%) ⬆️
...ands/entity_matching/aliasing/registry/registry.py 95.65% <100.00%> (+0.19%) ⬆️
...tity_matching/aliasing/registry/rules_discovery.py 100.00% <100.00%> (+14.28%) ⬆️
...tk/commands/entity_matching/aliasing/rules/base.py 100.00% <ø> (ø)
..._matching/aliasing/rules/character_substitution.py 93.33% <ø> (-1.41%) ⬇️
...mmands/entity_matching/aliasing/rules/composite.py 98.03% <ø> (-0.27%) ⬇️
... and 6 more

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the entity matching aliasing module by replacing the builder factory pattern with a callable builder provider, simplifying rule discovery to a static list, and removing unused context builders and rule classes. Additionally, error classes now inherit from ToolkitValueError, and file operations have been updated to use safe_read and safe_write utilities. Feedback was provided regarding a lambda assignment in bootstrapper.py that lacks explicit type hints and violates PEP 8 (E731).

if builder_provider is None:
registry = LocalRuleDefinitionRegistry.bootstrap(LocalRulesDiscovery.create())
composer = DefaultExpressionComposer(provide_aliasing_composition_config())
resolved_builder_provider = lambda: DefaultAliasingKuiperBuilder(registry=registry, composer=composer)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Assigning a lambda expression to a variable violates PEP 8 (E731) and lacks explicit type hints, which goes against the repository style guide's emphasis on strong typing and requiring type hints for all functions.\n\nUsing a nested helper function with explicit type hints is more idiomatic and type-safe.

        def resolved_builder_provider() -> AliasingKuiperBuilder:\n            return DefaultAliasingKuiperBuilder(registry=registry, composer=composer)
References
  1. Strong Typing: Use type hints extensively with pyright. Required: All functions, methods, and class attributes must have type hints. (link)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gemini-code-assist addressed now in ce59665

What do you think about the other changes and how they match the issues in the description?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in commit ce59665 and the broader set of refactors effectively address the structural and stylistic concerns raised in the PR description.

Specifically, the following improvements align well with the toolkit's established patterns:

  • Simplification of DI and Abstraction Layers: Removing the AliasingKuiperBuilderFactory and the various *ContextBuilder classes significantly flattens the architecture, making the code much easier to trace and maintain. The move to a simple Callable provider for the builder is a pragmatic, idiomatic approach.
  • Explicit Rule Discovery: Replacing the dynamic importlib discovery with an explicit list of definitions in rules_discovery.py improves static analysis, removes potential runtime import failures, and aligns with the toolkit's preference for explicit registration.
  • Exception Hierarchy Alignment: Inheriting from ToolkitValueError ensures that errors are caught by the CLI's top-level handler, providing a consistent user experience compared to raw tracebacks.
  • Standardized I/O: Adopting pathlib.Path and the toolkit's safe_read/safe_write helpers resolves the inconsistencies with the rest of the codebase.
  • Code Cleanup: Removing the dead methods, unused Rule classes, and redundant computations cleans up the surface area of the feature significantly.

Overall, these changes reconcile the new feature with the existing toolkit design culture, making it much more maintainable for future contributors.

Comment on lines +50 to +63
except FileNotFoundError as e:
raise YamlReadError(
"File not found",
file_path=file_path,
file_path=str(file_path),
) from e
except yaml.YAMLError as e:
raise YamlReadError(
f"Invalid YAML syntax: {e!s}",
file_path=file_path,
file_path=str(file_path),
) from e
except Exception as e:
raise YamlReadError(
f"Error reading file: {e!s}",
file_path=file_path,
file_path=str(file_path),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file_path.as_posix() ensures correct printing on Windows :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olacognite olacognite requested a review from ronpal June 4, 2026 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants