Eliminate trace() entry duplication in ColumnRepresentation#2608
Conversation
Add regression tests that encode the reproduction matrix from #2594: trace() entries are emitted more than once when the traced column is consumed by count(), exists(), empty(), combine(), or the | union operator. The fix will land in a follow-up change. Tests tagged known-failing are excluded from the default Surefire run via a project property; opt in with -Dpathling.test.excludedGroups=none -Dgroups=known-failing. The fhirpath-trace capability gains a new requirement documenting that collector entries must match the number of logical trace invocations, with the matrix as its scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
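The opt-out mechanism described above could take roughly this shape in the Maven build (a hypothetical pom.xml sketch: the property name comes from the commit message, the surrounding structure is illustrative):

```xml
<!-- Hedged sketch: known-failing tests are excluded by default via a
     project property, which -Dpathling.test.excludedGroups=none can
     override from the command line. -->
<properties>
  <pathling.test.excludedGroups>known-failing</pathling.test.excludedGroups>
</properties>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <excludedGroups>${pathling.test.excludedGroups}</excludedGroups>
      </configuration>
    </plugin>
  </plugins>
</build>
```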
Rewrites multi-reference when/otherwise patterns in ColumnRepresentation so that Nondeterministic operands (such as trace()) are evaluated exactly once per row, not once per reference in the Spark expression tree.

The fix introduces a ColumnHelpers.let(value, body) primitive built on Spark higher-order array functions (array + transform + element_at). This materialises the operand once per row and binds the result to the lambda parameter; multiple references to the parameter inside the body do not re-evaluate the operand. The resulting expression is a pure Spark Column with no logical-plan dependency, composing in any relational context including Window.over.

Affected methods rewritten in ColumnRepresentation:
- count() and isEmpty() array branches: replaced with coalesce(size(c), 0) and coalesce(size(c).equalTo(0), true)
- last() array branch: replaced with try_element_at(c, -1)
- normaliseNull() array branch: replaced with nullif(c, array())
- aggregate() both branches: replaced with coalesce(aggregate(...), zero) and coalesce(c, zero)
- plural() both branches: coalesce(a, array()) and filter(array(c), ...)
- singular(), filter(), toArray(), transform() scalar branches: wrapped with let() so the conditional body references the bound parameter, not the original operand

A Checkstyle RegexpMultiline rule scoped to *Representation*.java files guards against future regressions: any new when/otherwise pattern in those files that references the same operand in both the predicate and a branch value, or in both branches, fails CI.

A startup-time assertion in PathlingContext verifies that spark.sql.legacy.sizeOfNull is false (the Spark 3.0+ default), which the coalesce(size(c), 0) rewrite relies on for correct null-array handling.

Closes #2594. Follow-up documentation issue: #2607.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
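The commit above describes let() only in Spark terms. A Spark-free Java analogue of the same encoding (hypothetical names throughout; List.of, stream().map and get(0) stand in for Spark's array, transform and element_at) shows why a bound parameter fires once while a repeated reference fires per mention:

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LetSketch {

  // Analogue of ColumnHelpers.let(value, body): materialise the operand
  // once, then let the body reference the bound result any number of times.
  static <A, B> B let(final Supplier<A> value, final Function<A, B> body) {
    final List<A> array = List.of(value.get());   // array(value): one evaluation
    final List<B> mapped =
        array.stream().map(body).collect(Collectors.toList()); // transform(array, body)
    return mapped.get(0);                          // element_at(..., 1)
  }

  public static void main(final String[] args) {
    final int[] fires = {0};
    final Supplier<Integer> traced = () -> {       // stands in for trace()
      fires[0]++;
      return 21;
    };

    // Naive when/otherwise shape: two references, two evaluations.
    final int naive = traced.get() > 0 ? traced.get() : 0;
    System.out.println(naive + " after " + fires[0] + " fires"); // 21 after 2 fires

    fires[0] = 0;
    // let-bound shape: the body mentions x twice, but the operand fires once.
    final int bound = let(traced, x -> x > 0 ? x : 0);
    System.out.println(bound + " after " + fires[0] + " fires"); // 21 after 1 fires
  }
}
```

The real primitive builds the same shape out of Spark Column expressions rather than Java values, which is what makes it composable in any relational context.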
Syncs three new trace entry-count scenarios and the new SQL aggregate incompatibility requirement from the change delta spec into the main fhirpath-trace capability spec. Moves the change to the archive. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…esentationTraceTest

Closes the coverage gap identified in PR #2608 review: the contains() method was rewritten to use let() but had no corresponding test in the Layer B regression guard suite.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Correct "at code generation time" to "per row at execution time" in ColumnHelpers class Javadoc (factual error)
- Extend let() method Javadoc to cover nondeterminism introduced by the body, not only by the value
- Fix SuppressWithPlainTextCommentFilter comment to accurately describe range suppression semantics (not same-line)
- Improve singular(), plural(), and count() inline comments to explain why single-reference matters, not just that it exists
- Document the identity-element precondition on aggregate()
- Add let_nullValue_propagatesNullAndFiresOnce test to ColumnHelpersTest
- Add filter_array_singleFire test to ColumnRepresentationTraceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
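The identity-element precondition on aggregate() can be made concrete with a small Java sketch (a hypothetical helper, not the Pathling API) of the coalesce(aggregate(c, zero, merge), zero) shape:

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Illustration of the coalesce(aggregate(...), zero) rewrite: a null
// collection falls back to zero, so zero must be the value the fold
// produces for an empty input — i.e. an identity element of merge —
// for the two branches to agree.
public class AggregateSketch {

  static <T> T aggregateOrZero(final List<T> c, final T zero,
                               final BinaryOperator<T> merge) {
    if (c == null) {
      return zero;                 // the coalesce(..., zero) branch
    }
    T acc = zero;                  // the aggregate(c, zero, merge) branch
    for (final T element : c) {
      acc = merge.apply(acc, element);
    }
    return acc;
  }

  public static void main(final String[] args) {
    // 0 is the identity of +, so null and empty inputs agree.
    System.out.println(aggregateOrZero(null, 0, Integer::sum));             // 0
    System.out.println(aggregateOrZero(List.of(), 0, Integer::sum));        // 0
    System.out.println(aggregateOrZero(List.of(1, 2, 3), 0, Integer::sum)); // 6
  }
}
```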
Summary
Fixes #2594 — a single source-level trace() call produced 2× or more TraceCollector entries when its result column was consumed by count(), exists(), empty(), last(), combine(), or the | union operator.

Root cause. Several ColumnRepresentation methods compiled to Spark when(...).otherwise(...) patterns that referenced the traced column more than once. Because TraceExpression is Nondeterministic, Spark's CSE does not deduplicate it — each reference in the expression tree fires independently.

Fix. Introduces ColumnHelpers.let(value, body) — a let-binding primitive built on Spark higher-order array functions (array + transform + element_at). array(value) materialises the operand once per row; the body lambda then references the bound parameter, not the original expression. The result is a pure Column with no logical-plan dependency, safe in every relational context.

All 14 affected methods in ColumnRepresentation are rewritten:
- Without let: count, isEmpty, last, normaliseNull, aggregate, plural — single-reference rewrites using coalesce, try_element_at, nullif, filter(array(...)).
- With the let helper: singular, filter (scalar), toArray (scalar), transform (scalar) — conditionals where both branches need the operand.

What's new
- ColumnHelpers.java — new let primitive
- ColumnHelpersTest.java — unit tests for let (identity, multi-ref, single-fire over TraceExpression)
- ColumnRepresentationTraceTest.java — Layer B regression guard: one test per ColumnRepresentation method, asserts exactly one trace fire per row for both 1-row and 3-row inputs
- TraceFunctionTest — drops @Tag("known-failing") from the duplication matrix; adds 3 new rows exercising count > 0, where(...).first(), and combine()
- checkstyle.xml / suppressions.xml — RegexpMultiline rule (scoped to *Representation*.java) flags any future when(x...).otherwise(x...) patterns, failing CI before the bug can recur
- PathlingContext — startup assertion that spark.sql.legacy.sizeOfNull = false (required for the coalesce(size(c), 0) rewrite)
- fhirpath/pom.xml — removes the known-failing Surefire exclusion; all tests now run in the default suite

Test plan
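The RegexpMultiline guard listed above might look roughly like this (the module and RepeatedSqlEvaluation id come from this PR; the regex and message are illustrative, and the scoping to *Representation*.java lives in suppressions.xml):

```xml
<!-- Illustrative shape only: flags a when(...).otherwise(...) that names
     the same operand in both the predicate and the otherwise branch. -->
<module name="RegexpMultiline">
  <property name="id" value="RepeatedSqlEvaluation"/>
  <property name="format"
            value="functions\.when\([^;]*\bvalue\b[^;]*\)\s*\.otherwise\([^;]*\bvalue\b"/>
  <property name="message"
            value="Operand referenced in both branches; use ColumnHelpers.let(...) instead."/>
</module>
```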
- TraceFunctionTest$TraceEntryCountTest rows pass (was 6 known-failing, now 13 pass)
- ColumnRepresentationTraceTest passes — every method fires exactly once per row
- ColumnHelpersTest passes
- mvn clean verify -pl fhirpath,encoders,library-api — all tests pass, Checkstyle clean
- PathlingContextTest.create_rejectsLegacySizeOfNullEnabled passes
- Reverting count() to the old pattern triggers the RepeatedSqlEvaluation rule

Notes
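The sizeOfNull assertion exercised by create_rejectsLegacySizeOfNullEnabled can be sketched without a SparkSession (hypothetical class and method names; the real check reads the config from the live session):

```java
import java.util.Map;

// Sketch of the startup guard: the coalesce(size(c), 0) rewrite assumes
// the Spark 3.0+ default, where size(NULL) is NULL (so coalesce yields 0).
// With the legacy setting, size(NULL) is -1 and coalesce would report -1.
public class SizeOfNullGuard {

  static void assertSizeOfNullDisabled(final Map<String, String> sparkConf) {
    final String value =
        sparkConf.getOrDefault("spark.sql.legacy.sizeOfNull", "false");
    if (Boolean.parseBoolean(value)) {
      throw new IllegalStateException(
          "spark.sql.legacy.sizeOfNull must be false: the legacy setting "
              + "makes size(NULL) return -1, breaking coalesce(size(c), 0)");
    }
  }

  public static void main(final String[] args) {
    assertSizeOfNullDisabled(Map.of());  // Spark 3.0+ default: passes
    try {
      assertSizeOfNullDisabled(Map.of("spark.sql.legacy.sizeOfNull", "true"));
      throw new AssertionError("expected rejection");
    } catch (final IllegalStateException expected) {
      System.out.println("rejected legacy sizeOfNull");
    }
  }
}
```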
- trace() inside SQL aggregate functions (sum, count, avg, …) is a separate Spark constraint on Nondeterministic expressions that this PR does not address. Tracked in #2607 (Document trace() incompatibility with SQL aggregate functions).
- The array + transform + element_at encoding adds a small overhead per let call. This is accepted; a custom Catalyst LetExpression remains available as a future optimisation if benchmarking shows it matters.

🤖 Generated with Claude Code