Skip to content

Eliminate trace() entry duplication in ColumnRepresentation#2608

Open
piotrszul wants to merge 6 commits intomainfrom
issue/2594
Open

Eliminate trace() entry duplication in ColumnRepresentation#2608
piotrszul wants to merge 6 commits intomainfrom
issue/2594

Conversation

@piotrszul
Copy link
Copy Markdown
Collaborator

Summary

Fixes #2594 — a single source-level trace() call produced 2× or more TraceCollector entries when its result column was consumed by count(), exists(), empty(), last(), combine(), or the | union operator.

Root cause. Several ColumnRepresentation methods compiled to Spark when(...).otherwise(...) patterns that referenced the traced column more than once. Because TraceExpression is Nondeterministic, Spark's CSE does not deduplicate it — each reference in the expression tree fires independently.

Fix. Introduces ColumnHelpers.let(value, body) — a let-binding primitive built on Spark higher-order array functions (array + transform + element_at). array(value) materialises the operand once at codegen time; the body lambda then references the bound parameter, not the original expression. The result is a pure Column with no logical-plan dependency, safe in every relational context.

All 14 affected methods in ColumnRepresentation are rewritten:

  • Spark builtins (no let): count, isEmpty, last, normaliseNull, aggregate, plural — single-reference rewrites using coalesce, try_element_at, nullif, filter(array(...)).
  • let helper: singular, filter (scalar), toArray (scalar), transform (scalar) — conditionals where both branches need the operand.

What's new

  • ColumnHelpers.java — new let primitive
  • ColumnHelpersTest.java — unit tests for let (identity, multi-ref, single-fire over TraceExpression)
  • ColumnRepresentationTraceTest.java — Layer B regression guard: one test per ColumnRepresentation method, asserts exactly one trace fire per row for both 1-row and 3-row inputs
  • TraceFunctionTest — drops @Tag("known-failing") from the duplication matrix; adds 3 new rows exercising count > 0, where(...).first(), and combine()
  • checkstyle.xml / suppressions.xmlRegexpMultiline rule (scoped to *Representation*.java) flags any future when(x...).otherwise(x...) patterns, failing CI before the bug can recur
  • PathlingContext — startup assertion that spark.sql.legacy.sizeOfNull = false (required for the coalesce(size(c), 0) rewrite)
  • fhirpath/pom.xml — removes the known-failing Surefire exclusion; all tests now run in the default suite

Test plan

  • All TraceFunctionTest$TraceEntryCountTest rows pass (was 6 known-failing, now 13 pass)
  • ColumnRepresentationTraceTest passes — every method fires exactly once per row
  • ColumnHelpersTest passes
  • mvn clean verify -pl fhirpath,encoders,library-api — all tests pass, Checkstyle clean
  • PathlingContextTest.create_rejectsLegacySizeOfNullEnabled passes
  • Checkstyle negative test: reverting count() to the old pattern triggers the RepeatedSqlEvaluation rule

Notes

  • trace() inside SQL aggregate functions (sum, count, avg, …) is a separate Spark constraint on Nondeterministic expressions that this PR does not address. Tracked in Document trace() incompatibility with SQL aggregate functions #2607.
  • The lambda-let pattern introduces a small single-element array allocation per let call. This is accepted; a custom Catalyst LetExpression remains available as a future optimisation if benchmarking shows it matters.

🤖 Generated with Claude Code

piotrszul and others added 4 commits April 24, 2026 16:54
Add regression tests that encode the reproduction matrix from #2594:
trace() entries are emitted more than once when the traced column is
consumed by count(), exists(), empty(), combine(), or the | union
operator. The fix will land in a follow-up change.

Tests tagged known-failing are excluded from the default Surefire run
via a project property; opt in with -Dpathling.test.excludedGroups=none
-Dgroups=known-failing. The fhirpath-trace capability gains a new
requirement documenting that collector entries must match the number of
logical trace invocations, with the matrix as its scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites multi-reference when/otherwise patterns in ColumnRepresentation
so that Nondeterministic operands (such as trace()) are evaluated exactly
once per row, not once per reference in the Spark expression tree.

The fix introduces a ColumnHelpers.let(value, body) primitive built on
Spark higher-order array functions (array + transform + element_at). This
materialises the operand once at codegen time and binds the result to the
lambda parameter; multiple references to the parameter inside the body do
not re-evaluate the operand. The resulting expression is a pure Spark
Column with no logical-plan dependency, composing in any relational
context including Window.over.

Affected methods rewritten in ColumnRepresentation:
- count() and isEmpty() array branches: replaced with coalesce(size(c), 0)
  and coalesce(size(c).equalTo(0), true)
- last() array branch: replaced with try_element_at(c, -1)
- normaliseNull() array branch: replaced with nullif(c, array())
- aggregate() both branches: replaced with coalesce(aggregate(...), zero)
  and coalesce(c, zero)
- plural() both branches: coalesce(a, array()) and filter(array(c), ...)
- singular(), filter(), toArray(), transform() scalar branches: wrapped
  with let() so the conditional body references the bound parameter, not
  the original operand

A Checkstyle RegexpMultiline rule scoped to *Representation*.java files
guards against future regressions: any new when(P uses x, V uses x) or
when(P uses x).otherwise(V uses x) pattern in those files fails CI.

A startup-time assertion in PathlingContext verifies that
spark.sql.legacy.sizeOfNull is false (the Spark 3.0+ default), which the
coalesce(size(c), 0) rewrite relies on for correct null-array handling.

Closes #2594. Follow-up documentation issue: #2607.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Syncs three new trace entry-count scenarios and the new SQL aggregate
incompatibility requirement from the change delta spec into the main
fhirpath-trace capability spec. Moves the change to the archive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-project-automation github-project-automation Bot moved this to Backlog in Pathling May 8, 2026
@piotrszul piotrszul moved this from Backlog to In progress in Pathling May 8, 2026
@piotrszul piotrszul added the bug Something isn't working label May 8, 2026
…esentationTraceTest

Closes the coverage gap identified in PR #2608 review: the contains()
method was rewritten to use let() but had no corresponding test in the
Layer B regression guard suite.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@johngrimes johngrimes added the library-api Relating to the library API label May 8, 2026
@johngrimes johngrimes changed the title fix: Eliminate trace() entry duplication in ColumnRepresentation Eliminate trace() entry duplication in ColumnRepresentation May 8, 2026
- Correct "at code generation time" to "per row at execution time" in
  ColumnHelpers class Javadoc (factual error)
- Extend let() method Javadoc to cover nondeterminism introduced by body,
  not only by value
- Fix SuppressWithPlainTextCommentFilter comment to accurately describe
  range suppression semantics (not same-line)
- Improve singular(), plural(), and count() inline comments to explain
  why single-reference matters, not just that it exists
- Document the identity-element precondition on aggregate()
- Add let_nullValue_propagatesNullAndFiresOnce test to ColumnHelpersTest
- Add filter_array_singleFire test to ColumnRepresentationTraceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working library-api Relating to the library API

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

trace() entries are duplicated by non-deterministic expression re-evaluation

2 participants