Skip to content

test: cover dataset run item observation links#1714

Open
marliessophie wants to merge 1 commit into
mainfrom
codex/lfe-10352-python-dataset-run-item-observation
Open

test: cover dataset run item observation links#1714
marliessophie wants to merge 1 commit into
mainfrom
codex/lfe-10352-python-dataset-run-item-observation

Conversation

@marliessophie

@marliessophie marliessophie commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

  • Add a targeted regression test that dataset experiments create dataset run items with the root span observation id.
  • Confirms the current Python SDK path already passes observation_id=span.id.

Linear: LFE-10352

Verification

  • Not run per request.

Greptile Summary

This PR adds a targeted unit test verifying that DatasetClient.run_experiment correctly links each dataset run item to the root span's observation ID, guarding against a specific regression class where the wrong span ID (or no span ID) could be passed to dataset_run_items.create.

  • The test monkeypatches dataset_run_items.create to capture kwargs, runs a single-item experiment, flushes the in-memory exporter, and then asserts both trace_id and observation_id match the root \"experiment-item-run\" span's context — confirming the existing observation_id=span.id path already passes.

Confidence Score: 5/5

Test-only addition with no production code changes; the existing SDK path it exercises already behaves correctly.

The change is a single new test file with no modifications to production code. The test correctly exercises the observation-ID linking path through a real DatasetClient, a monkeypatched API layer, and the in-memory OTel exporter already used by the unit test suite.

No files require special attention.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Test
    participant DatasetClient
    participant Langfuse as Langfuse Client
    participant Span as OTel Span (experiment-item-run)
    participant Mock as Mock dataset_run_items.create

    Test->>DatasetClient: "run_experiment(name=experiment, task=...)"
    DatasetClient->>Langfuse: "run_experiment(data=[item], ...)"
    Langfuse->>Span: start_as_current_observation(experiment-item-run)
    Span-->>Langfuse: "span (span.id = root observation id)"
    Langfuse->>Mock: "create(trace_id=trace_id, observation_id=span.id, ...)"
    Mock-->>Langfuse: DatasetRunItem
    Langfuse-->>Span: end span
    Test->>Langfuse: flush()
    Test->>Test: get_span(experiment-item-run)
    Test->>Test: "assert created_run_item trace_id == root_span.context.trace_id"
    Test->>Test: "assert created_run_item observation_id == root_span.context.span_id"
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Test
    participant DatasetClient
    participant Langfuse as Langfuse Client
    participant Span as OTel Span (experiment-item-run)
    participant Mock as Mock dataset_run_items.create

    Test->>DatasetClient: "run_experiment(name=experiment, task=...)"
    DatasetClient->>Langfuse: "run_experiment(data=[item], ...)"
    Langfuse->>Span: start_as_current_observation(experiment-item-run)
    Span-->>Langfuse: "span (span.id = root observation id)"
    Langfuse->>Mock: "create(trace_id=trace_id, observation_id=span.id, ...)"
    Mock-->>Langfuse: DatasetRunItem
    Langfuse-->>Span: end span
    Test->>Langfuse: flush()
    Test->>Test: get_span(experiment-item-run)
    Test->>Test: "assert created_run_item trace_id == root_span.context.trace_id"
    Test->>Test: "assert created_run_item observation_id == root_span.context.span_id"
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
tests/unit/test_dataset_run_item_observation_id.py:63-64
If `run_experiment` ever fails to call `dataset_run_items.create` (e.g. the item is skipped for any reason), `created_run_item` stays as an empty `{}` and the assertion on `created_run_item["trace_id"]` raises a `KeyError` instead of a meaningful `AssertionError`. Adding an explicit guard makes the failure message immediately actionable.

```suggestion
    root_span = get_span("experiment-item-run")
    assert created_run_item, "dataset_run_items.create was never called"
    assert created_run_item["trace_id"] == format(root_span.context.trace_id, "032x")
```

Reviews (1): Last reviewed commit: "test: cover dataset run item observation..." | Re-trigger Greptile

@github-actions

Copy link
Copy Markdown

@claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test-only addition; uses existing fixtures and pins the observation_id linkage behavior.

Extended reasoning...

Overview

This PR adds a single new file tests/unit/test_dataset_run_item_observation_id.py containing one regression test that verifies DatasetClient.run_experiment creates a dataset run item linked to the root span's trace and observation IDs. No production code is modified.

Security risks

None. The change is confined to a unit test file under tests/unit/ and uses an in-memory client with a monkeypatched API method.

Level of scrutiny

Low. Test-only additions for regression coverage carry minimal risk — the worst case is a flaky/failing test, which would surface in CI rather than affect users. The test depends on existing fixtures (langfuse_memory_client, get_span) defined in tests/conftest.py and standard public symbols (DatasetClient, Dataset, DatasetItem, DatasetRunItem, DatasetStatus).

Other factors

The bug hunting system reported no issues. The test pins a specific contract (observation_id is the root span's hex span id) that aligns with the linked Linear ticket LFE-10352, so it serves as useful regression coverage. PR description notes the test was not executed locally, but CI will validate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant