Inflated Code Output LOC from VS Code edit-session payloads #127

@TamasBoncz

The Output → Code Output view can significantly overcount AI-generated lines of code for VS Code Copilot agent sessions by summing persisted edit operation payloads instead of estimating unique produced code.


Summary

In one investigated local dataset, the chart showed 91,583 LOC for 2026-06-08. Almost all of it came from VS Code chatEditingSessions edit-state operations, not from AI response code blocks.

| Measure | LOC |
| --- | --- |
| Total shown for 2026-06-08 | 91,583 |
| AI response code blocks | 58 |
| VS Code edit-state payloads | 91,525 |

Root Cause

The LOC metric sums every persisted VS Code edit operation payload. For multi-round Copilot agent sessions, those payloads include repeated whole-file replacements, so the same file's lines are counted many times.

Example:

  1. Agent writes a 1,200-line file → VS Code stores a whole-file snapshot.
  2. Agent revises it → VS Code stores the whole file again.
  3. Agent revises it again → VS Code stores it again.

A user would expect ~1,200 produced lines. The current counter reports 3,600.
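
The arithmetic in this example can be reproduced with a small sketch. The `Snapshot` shape and the `naivePayloadLoc` helper are hypothetical; they only mimic the counter as it is described in this issue and are not taken from the extension's code.

```typescript
// Hypothetical stand-in for a persisted whole-file edit payload.
interface Snapshot {
  fileUri: string;
  text: string;
}

// Count lines in a payload (a trailing newline does not start a new line).
function lineCount(text: string): number {
  if (text.length === 0) return 0;
  const lines = text.split("\n");
  return text.endsWith("\n") ? lines.length - 1 : lines.length;
}

// Mimics the behavior described above: every payload is summed as new output.
function naivePayloadLoc(snapshots: Snapshot[]): number {
  return snapshots.reduce((sum, s) => sum + lineCount(s.text), 0);
}

// Build a 1,200-line file and two light revisions of it.
const v1 = Array.from({ length: 1200 }, (_, i) => `line ${i}`).join("\n") + "\n";
const v2 = v1.replace("line 10\n", "line 10 (revised)\n");
const v3 = v2.replace("line 20\n", "line 20 (revised)\n");

const snapshots: Snapshot[] = [v1, v2, v3].map((text) => ({
  fileUri: "file:///example.py", // illustrative path only
  text,
}));

const counted = naivePayloadLoc(snapshots); // three snapshots of 1,200 lines each
const finalSize = lineCount(v3);            // size of the file the user ends up with
console.log({ counted, finalSize });
```

Each revision changes one line, yet the summed count grows by the full file size every time.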


Why It Depends on the Model

The overcount tracks which edit tool the model uses:

| Model family | Requests | Counted LOC | Real LOC | Inflation | apply_patch | string-edit |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI (gpt-5.5, etc.) | 85 | 171,027 | 115,632 | 1.48× | 96% | 0% |
| Anthropic (Claude) | 1,213 | 160,402 | 184,919 | 0.87× | 0% | 90% |
| Mixed / copilot-auto | 211 | 52,904 | 50,046 | 1.06× | 68% | 27% |

  • OpenAI / apply_patch: the model emits a diff, but the tool writes the entire file back for every change, so VS Code records each apply_patch as a whole-file textEdit. Repeated applies on the same file inflate the count.
  • Anthropic / replace_string_in_file: targeted search-and-replace. VS Code records only the changed region, so there is little or no inflation (a slight undercount instead).
  • Mixed: inflation is proportional to how often apply_patch is chosen.

Deep-Dive Verification (one request)

The largest request ("Start implementation", request_78fc0984) was reconstructed from its session JSONL and cross-checked three ways:

| Measure | LOC |
| --- | --- |
| Current counter (sum of every edit payload) | 16,727 |
| Genuinely new lines (diff vs previous version) | ~5,294 |
| Lines the model actually emitted (patch + create payloads) | ~1,124 |

The model ran 59 tool rounds and edited almost entirely via 23 small apply_patch diffs — it never re-emitted whole files. Yet VS Code persisted 25 whole-file snapshots and the counter summed all of them.

haco/orchestrator.py — 9 whole-file snapshots:

| Snapshot | Stored file lines (counted) |
| --- | --- |
| 1 | 916 |
| 2 | 1,190 |
| … | … |
| 9 | 1,207 |
| Sum (current) | 10,517 |
| Final file | ~1,207 |

Token usage corroborates: 32,558 real output tokens is consistent with ~1,124 lines of patch text, not 16,727 produced lines.


Actual vs Expected Behavior

Actual: AI-Generated LoC = sum of all edit operation payload sizes. Whole-file replacements and repeated revisions are summed as if each were new output.

Expected: Estimate unique or net AI-produced LOC. Repeated whole-file rewrites of the same file within one request should not multiply the file's line count.


Proposed Fix: Incremental Per-File Diff (Variant D)

Within each request, keep a running copy of each file's content. For every textEdit, reconstruct the resulting file state and count only lines new compared to the previous version of that file.

// per (requestId, fileUri), operations sorted by epoch
let prev = seedFromInitialContents(fileUri) ?? "";  // "" for newly created files
let produced = 0;
for (const op of fileOps) {
  const next = applyEdits(prev, op.edits);   // reconstruct file state after this op
  produced += addedLineCount(prev, next);    // line-level diff: only new/changed lines
  prev = next;
}
editLocIndex.set(requestId, fileUri, produced);
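
The `applyEdits` step above is left abstract. A minimal sketch of what it could look like follows, assuming a simplified, hypothetical edit shape (`LineEdit`) that replaces whole lines. The persisted VS Code edit format is not shown in this issue, so the field names here are illustrative only.

```typescript
// Hypothetical, simplified edit shape: replaces lines [startLine, endLine) with `text`.
// This sketch assumes whole-line, non-overlapping edits expressed against `prev`.
interface LineEdit {
  startLine: number; // 0-based, inclusive
  endLine: number;   // 0-based, exclusive
  text: string;      // replacement lines joined by "\n"; "" deletes the range
}

function applyEdits(prev: string, edits: LineEdit[]): string {
  const lines = prev === "" ? [] : prev.split("\n");
  // Apply from the bottom of the file upward so that each edit's
  // line numbers are still valid when it is applied.
  const ordered = [...edits].sort((a, b) => b.startLine - a.startLine);
  for (const e of ordered) {
    const replacement = e.text === "" ? [] : e.text.split("\n");
    lines.splice(e.startLine, e.endLine - e.startLine, ...replacement);
  }
  return lines.join("\n");
}

// A targeted edit touches one line; a whole-file replacement spans every line.
const base = "a\nb\nc";
const targeted = applyEdits(base, [{ startLine: 1, endLine: 2, text: "B" }]);
const wholeFile = applyEdits(base, [{ startLine: 0, endLine: 3, text: "a\nB\nc\nd" }]);
console.log(JSON.stringify(targeted), JSON.stringify(wholeFile));
```

Both kinds of edit end in a reconstructed file state, which is what lets the per-file loop treat targeted and whole-file operations the same way.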

addedLineCount uses a linear multiset difference (hash each line with a 32-bit charCodeAt scan, tally previous hashes in a Map<number, count>, count hashes not already present). No LCS/Myers diff — keeps the step O(C) in payload characters.
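
One way the multiset difference could be written is sketched below. The hash is an FNV-1a-style mix over UTF-16 code units, which is an assumption (the issue only specifies a 32-bit `charCodeAt` scan), and `forEachLineHash` is a hypothetical helper name.

```typescript
// Visit a 32-bit hash for every line without allocating substrings (no split()).
function forEachLineHash(text: string, visit: (hash: number) => void): void {
  const BASIS = 0x811c9dc5; // FNV-1a offset basis (assumed hash choice)
  let h = BASIS;
  let hasChars = false;
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    if (c === 10) {
      // "\n" ends the current line (possibly an empty one).
      visit(h);
      h = BASIS;
      hasChars = false;
    } else {
      h = Math.imul(h ^ c, 0x01000193) >>> 0;
      hasChars = true;
    }
  }
  if (hasChars) visit(h); // final line without a trailing newline
}

// Number of lines in `next` that are not matched by a line in `prev`,
// treating both files as multisets of line hashes.
function addedLineCount(prev: string, next: string): number {
  const remaining = new Map<number, number>();
  forEachLineHash(prev, (h) => {
    remaining.set(h, (remaining.get(h) ?? 0) + 1);
  });

  let added = 0;
  forEachLineHash(next, (h) => {
    const n = remaining.get(h) ?? 0;
    if (n > 0) remaining.set(h, n - 1); // line already existed: consume one occurrence
    else added++;                        // line is new or changed
  });
  return added;
}

console.log(addedLineCount("a\nb\nc\n", "a\nB\nc\nd\n")); // one changed line plus one appended line
```

Because line order is ignored, a moved line does not count as added, and a hash collision between a new line and an old one would undercount by one. Both are trade-offs of skipping a full LCS/Myers diff.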

Key fast paths:

  • Single-write files (74% here) skip the diff entirely — count newlines directly.
  • First snapshot of a new file — count its line total without a diff.
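
The two fast paths can be sketched as follows. `FileOp`, `producedLoc`, and the injected `diff` callback are hypothetical names; the demo uses a deliberately naive Set-based diff as a stand-in, not the hash-based `addedLineCount`.

```typescript
// Count lines by scanning for "\n": no diff and no split() allocation.
function countLines(text: string): number {
  if (text.length === 0) return 0;
  let n = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 10) n++;
  }
  // A final line without a trailing newline still counts as a line.
  return text.charCodeAt(text.length - 1) === 10 ? n : n + 1;
}

type LineDiff = (prev: string, next: string) => number;

// Hypothetical per-operation record for one file within one request.
interface FileOp {
  payloadText: string; // text carried by the edit operation
  nextState: string;   // file content after the operation (already reconstructed)
}

function producedLoc(initial: string, ops: FileOp[], diff: LineDiff): number {
  // Fast path 1: a file written once cannot be double-counted, so count its payload directly.
  if (ops.length === 1) return countLines(ops[0].payloadText);

  let prev = initial;
  let produced = 0;
  for (const op of ops) {
    // Fast path 2: nothing to compare against (new file), so every line is new.
    produced += prev === "" ? countLines(op.nextState) : diff(prev, op.nextState);
    prev = op.nextState;
  }
  return produced;
}

// Stand-in diff for the demo only: counts lines of `next` absent from `prev`.
const setDiff: LineDiff = (prev, next) => {
  const seen = new Set(prev.split("\n"));
  return next.split("\n").filter((l) => !seen.has(l)).length;
};

const single = producedLoc("old\n", [{ payloadText: "x\ny\n", nextState: "old\nx\ny\n" }], setDiff);
const multi = producedLoc(
  "",
  [
    { payloadText: "a\nb", nextState: "a\nb" },
    { payloadText: "a\nb\nc", nextState: "a\nb\nc" },
  ],
  setDiff,
);
console.log({ single, multi });
```

In the two-write case the naive payload sum would be 5 lines, while the incremental count is 3.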

Effect on the investigated request:

| Measure | Current | With fix |
| --- | --- | --- |
| haco/orchestrator.py | 10,517 | ~1,550 |
| Whole request | 16,727 | ~5,294 |
| Day total, edit-state payloads (2026-06-08) | 91,525 | substantially lower |

This is tool-agnostic: it removes the apply_patch inflation and also corrects the slight string-replace undercount.


Performance

Benchmarked on the heaviest local workspace (53 edit-state files, 761 operations, 12.1 MB), 60 iterations:

| Variant | Time | vs current | LOC |
| --- | --- | --- | --- |
| A: current (newline sum, the bug) | 15.9 ms | 1.0× | 149,321 |
| B: naive diff (split + Map<string>) | 31.8 ms | 2.0× | 94,136 |
| D: fast-path + split-free hash Map<number> | 23.0 ms | 1.5× | 94,155 |
| E: net-growth proxy (near-free) | 12.1 ms | 0.8× | 86,674 |

Variant D adds about 7 ms on the heaviest workspace, with ~57 ms projected across all 82 workspaces (~1 GB). Against a multi-second parse of 953 MB of session JSONL, this is not perceptible.


Relevant Code


Acceptance Criteria

  • Repeated whole-file replacements of the same file in one request do not multiply the file's LOC.
  • A request that rewrites a 1,200-line file three times should not report ~3,600 unique generated LOC.
  • Existing code-block-based counting continues to work for harnesses without edit-session data.
  • The Code Output view makes clear whether it reports raw edit-operation volume or estimated unique produced code.
