
fix(langchain): handle Anthropic cache_creation nested-dict in _parse_usage_model #1698

Open
ASAD-BE18 wants to merge 5 commits into langfuse:main from ASAD-BE18:fix/anthropic-cache-creation-usage

Conversation

@ASAD-BE18 ASAD-BE18 commented Jun 10, 2026


Closes #1697

What changed

_parse_usage_model in langfuse/langchain/CallbackHandler.py now handles Anthropic's cache_creation field when it arrives as a nested dict (tiered prompt-caching API, langchain-anthropic ≥ 0.3.x).

Root cause

Anthropic's extended prompt-caching API includes a cache_creation key in the LLM output usage dict:

{
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}

v2.x (≤ 2.60.10) — The old final filter (v is not None and not isinstance(v, str)) let the nested dict pass through into UpdateGenerationBody.usageDetails, which is typed as Union[Dict[str, int], ...]. Pydantic raised a ValidationError ("value is not a valid integer") that was swallowed by the ingestion queue, silently dropping the entire generation end() event — every trace had endTime=null and input=0 / output=0.

v4.x (HEAD before this PR) — The isinstance(v, int) filter correctly prevented the crash, but the cache_creation dict was silently discarded, losing all cache-creation token data.
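
The difference between the two filters can be sketched on the sample payload above. This is a simplified illustration only: `filter_v2` and `filter_v4` are hypothetical stand-ins for the two filter conditions quoted in this description, not functions from the SDK.

```python
from typing import Any, Dict

# Sample usage payload from the description above.
usage: Dict[str, Any] = {
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}


def filter_v2(usage_model: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the v2.x condition: drop None and str values."""
    return {
        k: v
        for k, v in usage_model.items()
        if v is not None and not isinstance(v, str)
    }


def filter_v4(usage_model: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the v4.x condition: keep int values only."""
    return {k: v for k, v in usage_model.items() if isinstance(v, int)}


v2_result = filter_v2(usage)
v4_result = filter_v4(usage)

# v2.x: the nested dict survives and would later fail Dict[str, int] validation.
print(isinstance(v2_result["cache_creation"], dict))  # True
# v4.x: the nested dict is dropped, so the 500 cache-creation tokens are lost.
print("cache_creation" in v4_result)  # False
```

Neither outcome preserves the per-tier counts, which is the gap the change below is meant to close.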

Fix

Before the final isinstance(v, int) filter, pop cache_creation and flatten each per-tier value into a cache_creation_{tier_key} entry (matching the pattern already used for input_token_details). If the total is non-zero, also set cache_creation_input_tokens via setdefault so the legacy scalar field is not overwritten if already present.

# Anthropic extended prompt caching: cache_creation is a dict keyed by cache tier.
if "cache_creation" in usage_model and isinstance(usage_model["cache_creation"], dict):
    cache_creation = usage_model.pop("cache_creation")
    total = 0
    for tier_key, tier_val in cache_creation.items():
        if isinstance(tier_val, int):
            usage_model[f"cache_creation_{tier_key}"] = tier_val
            total += tier_val
    if total > 0:
        usage_model.setdefault("cache_creation_input_tokens", total)
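
To try the block outside the SDK, a minimal standalone sketch follows. The wrapper name `flatten_cache_creation` is hypothetical and exists only for illustration; in the PR the logic lives inline in `_parse_usage_model`.

```python
from typing import Any, Dict


def flatten_cache_creation(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical standalone wrapper around the block shown above."""
    usage_model = dict(usage)  # work on a copy so the caller's dict is untouched
    if isinstance(usage_model.get("cache_creation"), dict):
        cache_creation = usage_model.pop("cache_creation")
        total = 0
        for tier_key, tier_val in cache_creation.items():
            if isinstance(tier_val, int):
                usage_model[f"cache_creation_{tier_key}"] = tier_val
                total += tier_val
        if total > 0:
            # setdefault keeps an existing legacy scalar if one is present.
            usage_model.setdefault("cache_creation_input_tokens", total)
    return usage_model


flattened = flatten_cache_creation(
    {
        "input_tokens": 9454,
        "output_tokens": 380,
        "cache_read_input_tokens": 0,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 500,
            "ephemeral_5m_input_tokens": 0,
        },
    }
)
print(flattened["cache_creation_ephemeral_1h_input_tokens"])  # 500
print(flattened["cache_creation_input_tokens"])  # 500

# All-zero tiers: per-tier keys are kept, but no aggregate is added.
zeros = flatten_cache_creation(
    {
        "input_tokens": 10,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 0,
            "ephemeral_5m_input_tokens": 0,
        },
    }
)
print("cache_creation_input_tokens" in zeros)  # False
```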

Tests added (tests/unit/test_parse_usage_model.py)

| Test | What it verifies |
| --- | --- |
| test_anthropic_cache_creation_dict_flattened | Per-tier keys present; aggregate = sum of tiers; original dict key absent |
| test_anthropic_cache_creation_all_zeros_no_aggregate | Zero-value tiers stored; no aggregate added when total is zero |
| test_anthropic_cache_creation_legacy_field_not_overwritten | setdefault preserves existing cache_creation_input_tokens value |

Existing test_standard_tier_input_token_details and test_priority_tier_not_subtracted still pass unchanged.

Verification

uv run --frozen pytest tests/unit/test_parse_usage_model.py -v
uv run --frozen ruff check langfuse/langchain/CallbackHandler.py tests/unit/test_parse_usage_model.py
uv run --frozen ruff format --check langfuse/langchain/CallbackHandler.py tests/unit/test_parse_usage_model.py

All pass. No live-provider tests are needed since the fix is unit-testable.

Greptile Summary

This PR fixes _parse_usage_model in langfuse/langchain/CallbackHandler.py to correctly handle the Anthropic extended prompt-caching API where cache_creation arrives as a nested dict of per-tier token counts rather than a scalar integer.

  • The new block pops cache_creation when it is a dict, flattens each tier into a cache_creation_{tier_key} key, and uses setdefault to set the legacy aggregate cache_creation_input_tokens only when the total is non-zero, preserving any pre-existing scalar value.
  • Three focused unit tests cover the flattening, the all-zeros case, and the legacy-field preservation case.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to a single parsing helper, all existing tests still pass, and three new unit tests cover the key scenarios introduced by this fix.

The fix is a well-contained addition inside an already-guarded if isinstance(usage_model, dict) block, the logic is straightforward dict-flattening with no side effects on other fields, and the setdefault correctly prevents clobbering an existing scalar. The one theoretical gap — a non-integer tier value being silently skipped — mirrors how the rest of the function already handles unexpected types via the final isinstance(v, int) filter.
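
The skipped-value gap mentioned above can be illustrated with a small sketch. The payload is hypothetical (the float and `None` tier values are invented for illustration, not observed Anthropic output), and the loop reproduces only the per-tier guard from the PR.

```python
from typing import Any, Dict

# Hypothetical payload: one valid tier, plus a float and a None tier value.
cache_creation: Dict[str, Any] = {
    "ephemeral_1h_input_tokens": 500,
    "ephemeral_5m_input_tokens": 12.5,
    "unknown_tier": None,
}

flattened: Dict[str, int] = {}
total = 0
for tier_key, tier_val in cache_creation.items():
    if isinstance(tier_val, int):  # same per-tier guard as in the PR
        flattened[f"cache_creation_{tier_key}"] = tier_val
        total += tier_val

print(flattened)  # {'cache_creation_ephemeral_1h_input_tokens': 500}
print(total)  # 500
```

Non-integer tiers contribute neither a flattened key nor anything to the aggregate, and no error is raised.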

No files require special attention.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_parse_usage_model called] --> B{has __dict__?}
    B -- yes --> C[usage equals usage.__dict__]
    B -- no --> D[usage_model equals usage.copy]
    C --> D
    D --> E{OpenAI format?}
    E -- yes --> F[return usage_model as-is]
    E -- no --> G[Apply conversion_list key renames]
    G --> H[Flatten input_token_details]
    H --> I[Flatten output_token_details]
    I --> J[Flatten Vertex AI token details]
    J --> N{cache_creation present AND is dict?}
    N -- no --> P
    N -- yes --> O[Pop cache_creation dict and flatten tier keys]
    O --> O2{total greater than 0?}
    O2 -- yes --> O3[setdefault cache_creation_input_tokens equals total]
    O2 -- no --> P
    O3 --> P[Final filter: keep only int values]
    P --> Q{usage_model empty?}
    Q -- yes --> R[return None]
    Q -- no --> S[return usage_model]

Reviews (1): Last reviewed commit: "fix(langchain): handle Anthropic cache_c..."

…_usage_model

Anthropic's extended prompt-caching API returns cache_creation as a nested
dict keyed by cache tier (e.g. {"ephemeral_1h_input_tokens": 500,
"ephemeral_5m_input_tokens": 0}) inside the LLM output usage object.

Previously this value was passed through to UpdateGenerationBody.usageDetails
unchanged.  On v2.x (final filter: not None/not str) the nested dict
survived, causing a Pydantic ValidationError ("value is not a valid integer")
that was silently swallowed by the ingestion queue — dropping the entire
generation end() event and leaving every trace with endTime=null and
input/output=0.  On v4.x (final filter: isinstance(v, int)) the nested dict
was silently discarded, losing all cache-creation token data.

Fix: before the final int-filter, pop cache_creation, flatten each tier
value into a cache_creation_{tier_key} entry, and set the aggregated
cache_creation_input_tokens total via setdefault so the legacy field is
not overwritten if already present.

Closes langfuse#1697

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@CLAassistant

CLAassistant commented Jun 10, 2026


CLA assistant check
All committers have signed the CLA.

@hassiebp

Collaborator

@claude review

@claude claude Bot left a comment


LGTM — small, well-scoped fix that flattens Anthropic's nested cache_creation dict; production logic is correct and the inline nit is a test-quality improvement, not a blocker.

Extended reasoning...

Overview

This PR adds a small block inside _parse_usage_model in langfuse/langchain/CallbackHandler.py to handle Anthropic's extended prompt-caching API, where cache_creation arrives as a nested dict (per-tier token counts) rather than a scalar. The block pops the dict, flattens each tier into cache_creation_{tier_key} (mirroring the existing input_token_details pattern), and uses setdefault to expose an aggregate cache_creation_input_tokens without clobbering a pre-existing scalar. Three new unit tests cover the flatten case, all-zeros case, and legacy-field preservation; existing tier tests are unchanged.

Security risks

None. The change is a pure dict-transformation with isinstance guards; no I/O, auth, crypto, or untrusted-input parsing surface is touched.

Level of scrutiny

Low. This is a self-contained parsing helper change behind isinstance(usage_model, dict) and isinstance(usage_model["cache_creation"], dict) guards, with isinstance(int) per-tier filtering that matches the function's existing defensive style. The production logic correctly handles the three relevant cases (flatten, all-zeros aggregate suppression, legacy-field preservation).

Other factors

The single finding is a nit about test_anthropic_cache_creation_legacy_field_not_overwritten — the legacy value (300) equals the tier sum (300), so the test would also pass under a regressed plain-assignment implementation. The production code is correct; this is purely a test-discrimination improvement worth picking up but not blocking. The maintainer explicitly requested review and there are no outstanding human-reviewer comments.

Comment on lines +57 to +73
def test_anthropic_cache_creation_legacy_field_not_overwritten():
    """If both the legacy cache_creation_input_tokens (int) and the new cache_creation
    (dict) are present, the legacy value is preserved and the dict total is not added."""
    usage = {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation_input_tokens": 300,  # legacy field already present
        "cache_creation": {
            "ephemeral_1h_input_tokens": 200,
            "ephemeral_5m_input_tokens": 100,
        },
    }
    result = _parse_usage_model(usage)

    # setdefault must not overwrite the existing legacy value
    assert result["cache_creation_input_tokens"] == 300
    assert "cache_creation" not in result


🟡 The test_anthropic_cache_creation_legacy_field_not_overwritten test uses legacy cache_creation_input_tokens=300 and tier values that sum to 200+100=300 — the same value. Because the legacy value equals the tier sum, the assertion result["cache_creation_input_tokens"] == 300 would pass even if setdefault were regressed to a plain assignment (usage_model["cache_creation_input_tokens"] = total), defeating the test's stated purpose of guarding setdefault semantics. Use a legacy value that differs from the tier sum (e.g. cache_creation_input_tokens=999 with tiers summing to 300) so a regression to direct assignment would be caught.

Extended reasoning...

What the bug is. The new test test_anthropic_cache_creation_legacy_field_not_overwritten (tests/unit/test_parse_usage_model.py:57-73) is intended to prove that the production code uses setdefault("cache_creation_input_tokens", total) — i.e. that a pre-existing legacy scalar value is preserved instead of being overwritten by the dict total. As constructed, however, the test cannot distinguish between the correct setdefault implementation and a buggy plain-assignment regression.

How it manifests. The fixture sets cache_creation_input_tokens=300 (legacy scalar) and a nested cache_creation dict whose tiers sum to 200+100=300. The only assertion on this field is assert result["cache_creation_input_tokens"] == 300. Since the legacy value (300) numerically equals the tier sum (300), both implementations produce the same final value:

  • Correct (setdefault): key already exists with 300 → kept at 300 ✓
  • Regressed (plain assignment): key overwritten with total=300 → ends at 300 ✓

Both branches satisfy the assertion, so the test is silent on which behavior the code actually has.

Why existing code doesn't prevent it. Nothing else in the test file pins cache_creation_input_tokens to a value different from the tier sum. The two sibling tests use a single non-overlapping field (test_anthropic_cache_creation_dict_flattened has no legacy field present; test_anthropic_cache_creation_all_zeros_no_aggregate asserts the key is absent). So no test in the suite would fail if setdefault were swapped for direct assignment.

Step-by-step proof. Mentally apply the regressed implementation usage_model["cache_creation_input_tokens"] = total to the fixture:

  1. Input dict contains cache_creation_input_tokens: 300 and cache_creation: {ephemeral_1h_input_tokens: 200, ephemeral_5m_input_tokens: 100}.
  2. The regressed branch pops cache_creation, computes total = 200 + 100 = 300.
  3. usage_model["cache_creation_input_tokens"] = 300 overwrites the legacy 300 with the new 300 — value unchanged.
  4. Final filter keeps all ints. result["cache_creation_input_tokens"] == 300 — assertion passes despite the regression.

Impact. Pure test-quality issue, no production impact. The production code in langfuse/langchain/CallbackHandler.py correctly uses setdefault and the fix as shipped is correct. But the regression test guarding it has zero discriminating power — a future refactor (or AI-assisted edit) that accidentally drops setdefault for direct assignment would slip through CI.

How to fix. Change the fixture so the legacy value differs from the tier sum, e.g.:

"cache_creation_input_tokens": 999,  # legacy field already present
"cache_creation": {
    "ephemeral_1h_input_tokens": 200,
    "ephemeral_5m_input_tokens": 100,
},

and assert result["cache_creation_input_tokens"] == 999. Under setdefault the legacy 999 wins; under plain assignment the value would become 300 and the assertion would fail — giving the test the discriminating power it claims to have.
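
The argument can be checked with a small self-contained sketch. The `parse` helper and its `use_setdefault` flag are hypothetical: they reduce the cache_creation block to its essentials so that the correct and the regressed variant can be compared side by side.

```python
from typing import Any, Dict


def parse(usage: Dict[str, Any], use_setdefault: bool) -> Dict[str, Any]:
    """Hypothetical reduction of the block; the flag toggles the regression."""
    usage_model = dict(usage)
    if isinstance(usage_model.get("cache_creation"), dict):
        total = 0
        for tier_key, tier_val in usage_model.pop("cache_creation").items():
            if isinstance(tier_val, int):
                usage_model[f"cache_creation_{tier_key}"] = tier_val
                total += tier_val
        if total > 0:
            if use_setdefault:
                usage_model.setdefault("cache_creation_input_tokens", total)
            else:
                # Regressed behaviour: plain assignment overwrites the legacy value.
                usage_model["cache_creation_input_tokens"] = total
    return usage_model


def fixture(legacy: int) -> Dict[str, Any]:
    """Test fixture from the review, with a configurable legacy value."""
    return {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation_input_tokens": legacy,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 200,
            "ephemeral_5m_input_tokens": 100,
        },
    }


key = "cache_creation_input_tokens"

# Legacy value equals the tier sum: both variants agree, so the test cannot tell them apart.
same_300 = parse(fixture(300), True)[key] == parse(fixture(300), False)[key]

# Legacy value differs from the tier sum: the variants diverge.
correct_999 = parse(fixture(999), True)[key]
regressed_999 = parse(fixture(999), False)[key]

print(same_300, correct_999, regressed_999)  # True 999 300
```

With the 999 fixture, an assertion on 999 passes only for the setdefault variant.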

Author


fixed in fe3af4d

The legacy value (300) equalled the tier sum (200+100=300), so the
assertion would pass even if setdefault were replaced with a plain
assignment. Use 999 to make the test discriminating.


Development

Successfully merging this pull request may close these issues.

fix(langchain): handle Anthropic cache_creation nested-dict in usage — causes silent generation drop on v2.x, data loss on v4.x

3 participants