16 changes: 16 additions & 0 deletions langfuse/langchain/CallbackHandler.py
@@ -1664,6 +1664,22 @@ def _parse_usage_model(usage: Union[pydantic.BaseModel, dict]) -> Any:
0, usage_model[f"input_modality_{item['modality']}"] - value
)

# Anthropic extended prompt caching: cache_creation is a dict keyed by cache tier.
# Example: {"ephemeral_1h_input_tokens": 500, "ephemeral_5m_input_tokens": 0}
# Flatten into individual keys and expose an aggregated total that mirrors the
# legacy cache_creation_input_tokens field for backward-compatible cost tracking.
if "cache_creation" in usage_model and isinstance(
    usage_model["cache_creation"], dict
):
    cache_creation = usage_model.pop("cache_creation")
    total = 0
    for tier_key, tier_val in cache_creation.items():
        if isinstance(tier_val, int):
            usage_model[f"cache_creation_{tier_key}"] = tier_val
            total += tier_val
    if total > 0:
        usage_model.setdefault("cache_creation_input_tokens", total)

usage_model = {k: v for k, v in usage_model.items() if isinstance(v, int)}

return usage_model if usage_model else None
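
To try the flattening outside the handler, the hunk's logic can be approximated as a standalone function. This is only a sketch: `flatten_cache_creation` is a hypothetical name that does not exist in langfuse, and unlike the hunk it works on a copy of its input instead of mutating it.

```python
from typing import Any, Dict


def flatten_cache_creation(usage_model: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the hunk above; not part of langfuse."""
    usage_model = dict(usage_model)  # copy so the caller's dict is untouched
    if isinstance(usage_model.get("cache_creation"), dict):
        cache_creation = usage_model.pop("cache_creation")
        total = 0
        for tier_key, tier_val in cache_creation.items():
            if isinstance(tier_val, int):
                # One flat key per cache tier
                usage_model[f"cache_creation_{tier_key}"] = tier_val
                total += tier_val
        if total > 0:
            # setdefault keeps a legacy scalar if the provider already sent one
            usage_model.setdefault("cache_creation_input_tokens", total)
    return usage_model


flattened = flatten_cache_creation(
    {
        "input_tokens": 10,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 500,
            "ephemeral_5m_input_tokens": 0,
        },
    }
)
print(flattened)
```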
72 changes: 72 additions & 0 deletions tests/unit/test_parse_usage_model.py
@@ -1,6 +1,78 @@
from langfuse.langchain.CallbackHandler import _parse_usage_model


def test_anthropic_cache_creation_dict_flattened():
    """Anthropic extended caching: cache_creation dict is flattened into per-tier keys
    and an aggregated cache_creation_input_tokens total is added."""
    usage = {
        "input_tokens": 9454,
        "output_tokens": 380,
        "cache_read_input_tokens": 0,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 500,
            "ephemeral_5m_input_tokens": 200,
        },
    }
    result = _parse_usage_model(usage)

    # Core fields survive
    assert result["input"] == 9454
    assert result["output"] == 380
    assert result["cache_read_input_tokens"] == 0

    # Per-tier keys are present and individually correct
    assert result["cache_creation_ephemeral_1h_input_tokens"] == 500
    assert result["cache_creation_ephemeral_5m_input_tokens"] == 200

    # Aggregated total equals sum of all tiers
    assert result["cache_creation_input_tokens"] == 700

    # The original nested dict must not be present
    assert "cache_creation" not in result


def test_anthropic_cache_creation_all_zeros_no_aggregate():
    """When all cache_creation tier values are zero no aggregate key is added
    (avoids noise in traces where caching did not fire)."""
    usage = {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 0,
            "ephemeral_5m_input_tokens": 0,
        },
    }
    result = _parse_usage_model(usage)

    assert result["input"] == 100
    assert result["output"] == 50
    # Per-tier zero keys are still stored
    assert result["cache_creation_ephemeral_1h_input_tokens"] == 0
    assert result["cache_creation_ephemeral_5m_input_tokens"] == 0
    # No aggregate added when total is zero
    assert "cache_creation_input_tokens" not in result
    assert "cache_creation" not in result


def test_anthropic_cache_creation_legacy_field_not_overwritten():
    """If both the legacy cache_creation_input_tokens (int) and the new cache_creation
    (dict) are present, the legacy value is preserved and the dict total is not added."""
    usage = {
        "input_tokens": 100,
        "output_tokens": 50,
        "cache_creation_input_tokens": 999,  # legacy field already present; intentionally != tier sum (300)
        "cache_creation": {
            "ephemeral_1h_input_tokens": 200,
            "ephemeral_5m_input_tokens": 100,
        },
    }
    result = _parse_usage_model(usage)

    # setdefault must not overwrite the existing legacy value
    assert result["cache_creation_input_tokens"] == 999
    assert "cache_creation" not in result
Comment on lines +57 to +73


🟡 The test_anthropic_cache_creation_legacy_field_not_overwritten test uses legacy cache_creation_input_tokens=300 and tier values that sum to 200+100=300 — the same value. Because the legacy value equals the tier sum, the assertion result["cache_creation_input_tokens"] == 300 would pass even if setdefault were regressed to a plain assignment (usage_model["cache_creation_input_tokens"] = total), defeating the test's stated purpose of guarding setdefault semantics. Use a legacy value that differs from the tier sum (e.g. cache_creation_input_tokens=999 with tiers summing to 300) so a regression to direct assignment would be caught.

Extended reasoning...

What the bug is. The new test test_anthropic_cache_creation_legacy_field_not_overwritten (tests/unit/test_parse_usage_model.py:57-73) is intended to prove that the production code uses setdefault("cache_creation_input_tokens", total) — i.e. that a pre-existing legacy scalar value is preserved instead of being overwritten by the dict total. As constructed, however, the test cannot distinguish between the correct setdefault implementation and a buggy plain-assignment regression.

How it manifests. The fixture sets cache_creation_input_tokens=300 (legacy scalar) and a nested cache_creation dict whose tiers sum to 200+100=300. The only assertion on this field is assert result["cache_creation_input_tokens"] == 300. Since the legacy value (300) numerically equals the tier sum (300), both implementations produce the same final value:

  • Correct (setdefault): key already exists with 300 → kept at 300 ✓
  • Regressed (plain assignment): key overwritten with total=300 → ends at 300 ✓

Both branches satisfy the assertion, so the test is silent on which behavior the code actually has.

Why existing code doesn't prevent it. Nothing else in the test file pins cache_creation_input_tokens to a value different from the tier sum. Neither sibling test supplies the legacy scalar and the nested dict together (test_anthropic_cache_creation_dict_flattened has no legacy field present; test_anthropic_cache_creation_all_zeros_no_aggregate asserts the key is absent). So no test in the suite would fail if setdefault were swapped for direct assignment.

Step-by-step proof. Mentally apply the regressed implementation usage_model["cache_creation_input_tokens"] = total to the fixture:

  1. Input dict contains cache_creation_input_tokens: 300 and cache_creation: {ephemeral_1h_input_tokens: 200, ephemeral_5m_input_tokens: 100}.
  2. The regressed branch pops cache_creation, computes total = 200 + 100 = 300.
  3. usage_model["cache_creation_input_tokens"] = 300 overwrites the legacy 300 with the new 300 — value unchanged.
  4. Final filter keeps all ints. result["cache_creation_input_tokens"] == 300 — assertion passes despite the regression.
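
The four steps can also be traced in code. The sketch below is an assumption-laden stand-in, not the langfuse source: `regressed_aggregate` is a hypothetical function that reproduces only the cache_creation branch with plain assignment swapped in for setdefault, and it is run against the pre-fix fixture described in this comment.

```python
def regressed_aggregate(usage_model):
    """Hypothetical regressed variant: plain assignment instead of setdefault."""
    usage_model = dict(usage_model)  # copy so the fixture is not mutated
    cache_creation = usage_model.pop("cache_creation")  # step 2: pop the dict
    total = 0
    for tier_key, tier_val in cache_creation.items():
        if isinstance(tier_val, int):
            usage_model[f"cache_creation_{tier_key}"] = tier_val
            total += tier_val
    if total > 0:
        # step 3: overwrites whatever legacy value was present
        usage_model["cache_creation_input_tokens"] = total
    # step 4: final filter keeps only int values
    return {k: v for k, v in usage_model.items() if isinstance(v, int)}


# step 1: the pre-fix fixture, where the legacy value equals the tier sum
old_fixture = {
    "input_tokens": 100,
    "output_tokens": 50,
    "cache_creation_input_tokens": 300,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 200,
        "ephemeral_5m_input_tokens": 100,
    },
}

result = regressed_aggregate(old_fixture)
# The original assertion still holds even though the legacy value was overwritten.
assert result["cache_creation_input_tokens"] == 300
```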

Impact. Pure test-quality issue, no production impact. The production code in langfuse/langchain/CallbackHandler.py correctly uses setdefault and the fix as shipped is correct. But the regression test guarding it has zero discriminating power — a future refactor (or AI-assisted edit) that accidentally drops setdefault for direct assignment would slip through CI.

How to fix. Change the fixture so the legacy value differs from the tier sum, e.g.:

"cache_creation_input_tokens": 999,  # legacy field already present
"cache_creation": {
    "ephemeral_1h_input_tokens": 200,
    "ephemeral_5m_input_tokens": 100,
},

and assert result["cache_creation_input_tokens"] == 999. Under setdefault the legacy 999 wins; under plain assignment the value would become 300 and the assertion would fail — giving the test the discriminating power it claims to have.
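
A quick way to confirm that the revised fixture separates the two behaviors is to run both side by side. This is a minimal sketch under stated assumptions: `aggregate` and its `overwrite` flag are hypothetical and exist only for this comparison; the real handler has no such switch.

```python
def aggregate(usage_model, overwrite):
    """Hypothetical stand-in for the cache_creation branch.

    overwrite=False mimics setdefault; overwrite=True mimics plain assignment.
    """
    usage_model = dict(usage_model)  # copy so the fixture can be reused
    tiers = usage_model.pop("cache_creation")
    total = sum(v for v in tiers.values() if isinstance(v, int))
    if total > 0:
        if overwrite:
            usage_model["cache_creation_input_tokens"] = total
        else:
            usage_model.setdefault("cache_creation_input_tokens", total)
    return usage_model


new_fixture = {
    "cache_creation_input_tokens": 999,  # legacy value, deliberately != tier sum
    "cache_creation": {
        "ephemeral_1h_input_tokens": 200,
        "ephemeral_5m_input_tokens": 100,
    },
}

kept = aggregate(new_fixture, overwrite=False)
clobbered = aggregate(new_fixture, overwrite=True)

assert kept["cache_creation_input_tokens"] == 999       # setdefault: legacy wins
assert clobbered["cache_creation_input_tokens"] == 300  # assignment: tier sum wins
```

With the legacy value at 999 the two results differ, so an assertion on 999 fails as soon as the code regresses to assignment.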

Author


fixed in fe3af4d



def test_standard_tier_input_token_details():
    """Standard tier: audio and cache_read are subtracted from input."""
    usage = {