Skip to content

[TE-5578] Auto-collect git metadata for plan command#480

Merged
gchan merged 26 commits intomainfrom
gordon/te-5578-auto-collect-git-metadata
Apr 17, 2026
Merged

[TE-5578] Auto-collect git metadata for plan command#480
gchan merged 26 commits intomainfrom
gordon/te-5578-auto-collect-git-metadata

Conversation

@gchan
Copy link
Copy Markdown
Contributor

@gchan gchan commented Apr 14, 2026

Description

When the plan command runs with test selection enabled (--selection-strategy), it now automatically collects git commit metadata and sends it with the API request. This removes the need for customers to manually script metadata collection with --metadata flags.

The auto-collection gathers up to 17 fields in three phases:

  • Commit metadata (single git log call): SHA, parents, author/committer name/email/date, message
  • Diff fields (explicit git merge-base then 4 git diff commands): files changed, numstat, full diff, raw diff
  • Context fields: branch name, base branch, pipeline slug, build UUID

Base branch resolution uses a 3-level fallback: explicit --metadata base_branch=..., $BUILDKITE_PULL_REQUEST_BASE_BRANCH, then auto-detection via remote/HEAD → main → master. The --remote flag (default "origin", configurable via BUILDKITE_TEST_ENGINE_REMOTE) controls which git remote is used.

Each phase degrades gracefully — git failures log debug warnings and skip affected keys without blocking CI. User-provided --metadata values always take precedence over auto-collected values. Empty auto-collected values are filtered at the merge site to avoid sending meaningless keys in the API request.

This also DRYs up shared logic between the new plan path and the existing backfill command (TE-5476): record parsing (parseRecord), diff command execution (runDiffCommands), and struct-to-map conversion (ToMap) are now shared. Both paths use explicit two-arg diff (base commit) rather than triple-dot syntax.

Context

Resolves TE-5578

Changes

  • Add ResolveBaseBranch and CollectPlanMetadata for plan auto-collection
  • Integrate git metadata auto-collection into plan(), gated on --selection-strategy
  • Add --remote flag to plan command under "PREVIEW: TEST SELECTION" category
  • Refactor collectCommitMetadata to parse into CommitMetadata struct via ToMap
  • Extract shared runDiffCommands with two-arg diff form and CommitDiffs.ToMap
  • Extract shared parseRecord from duplicated parsing logic in metadata.go
  • Use explicit git merge-base instead of triple-dot syntax for diff collection
  • Filter empty metadata values at the merge site via mergeNonEmpty
  • Extract MergeMetadata with user-precedence semantics and tests
  • Guard against empty remote in ResolveBaseBranch
  • Route success log messages through debug.Printf, keep warnings on stderr
  • Always include parent_shas in CommitMetadata.ToMap (empty string for root commits)
  • Fix TrimRightTrimSuffix for record separator stripping
  • Use buildRecord test helper with package constants instead of hex escapes
  • Use cmp.Diff with full expected maps across all tests
  • Document auto-collection in README and update backfill docs with shared env var

Verification

Backed by specs. AI assisted.

go test ./internal/git/ ./internal/command/ . -count=1

21 new tests covering ResolveBaseBranch, CollectPlanMetadata, CommitMetadata.ToMap, CommitDiffs.ToMap, and MergeMetadata. All tests use cmp.Diff against complete expected maps.

Deployment

Low risk. Auto-collection is gated behind BKTEC_PREVIEW_SELECTION and only runs when --selection-strategy is set. No change to existing plan or backfill behaviour when selection is not enabled.

Rollback

Yes

gchan added 14 commits April 14, 2026 16:43
…-collection

Add internal/git/auto_metadata.go with two exported functions:

ResolveBaseBranch resolves the base branch ref for diffing against HEAD
using a 3-level fallback: explicit value from --metadata base_branch,
$BUILDKITE_PULL_REQUEST_BASE_BRANCH env var, then DetectDefaultBranch.
Bare branch names are auto-prefixed with <remote>/ and verified with
rev-parse before acceptance.

CollectPlanMetadata collects up to 17 metadata fields in 3 phases:
- Commit metadata (single git log -1 call reusing MetadataFormat)
- Diff fields (4 triple-dot git diff commands against base branch)
- Context fields (branch, base_branch, pipeline_slug, build_uuid)

Each phase degrades gracefully: failures log debug warnings and skip
affected keys without blocking CI. Field names and formats match the
backfill command (TE-5476) for training pipeline consistency.
When selection is active (cfg.SelectionStrategy is set), the plan
command now automatically collects git commit metadata before sending
the API request. This removes the need for customers to manually
script metadata collection with --metadata flags.

The integration resolves the base branch (logging the result to stderr),
calls CollectPlanMetadata, and merges the result into cfg.Metadata.
User-provided --metadata values always take precedence over
auto-collected values, preserving explicit overrides.
16 tests using FakeGitRunner covering:

ResolveBaseBranch (7 tests):
- Explicit bare branch name prefixed with remote
- Explicit value already containing remote prefix used as-is
- BUILDKITE_PULL_REQUEST_BASE_BRANCH env var fallback
- DetectDefaultBranch fallback when no explicit or env var
- Error when all resolution strategies fail
- Explicit ref not found falls through to DetectDefaultBranch
- Explicit fails, env var succeeds

CollectPlanMetadata (9 tests):
- Happy path with all 17 keys populated
- No base branch skips diff fields, retains commit metadata
- Diff command failures don't affect commit metadata
- Git log failure doesn't affect diff or context fields
- Detached HEAD omits branch key
- Key names match CommitMetadata/CommitDiffs JSON tags
- Multiline commit messages preserved correctly
- Buildkite env vars populated when set
- Empty diff output on default branch builds
… and ToMap

Add ToMap() method on CommitMetadata that converts the struct to a flat
map[string]string using the same key names as the JSON tags. ParentSHAs
are joined with spaces; the key is omitted for root commits.

Refactor collectCommitMetadata to parse git log output into a
CommitMetadata struct via a new parseCommitRecord helper, then flatten
it into the metadata map via ToMap(). This gives us type safety and a
single source of truth for field names -- the struct definition and its
JSON tags -- rather than duplicating key name strings between the
backfill and plan paths.

Replace TestCollectPlanMetadata_AllFieldsMatchBackfillFormat (which
manually verified key name consistency) with direct TestCommitMetadata_ToMap
tests, since the struct now structurally guarantees consistency.
Extract the 4 git diff commands (--name-only, --numstat, full diff, --raw)
into a shared runDiffCommands helper that accepts ref args as variadic
strings. Both extractCommitDiffs (backfill, two-arg form) and
collectDiffMetadata (plan, triple-dot form) now call into it, removing
the duplicated diff logic.

Add ToMap() on CommitDiffs mirroring CommitMetadata.ToMap(), and
simplify collectDiffMetadata to runDiffCommands + ToMap().
ToMap() on CommitMetadata and CommitDiffs faithfully returns all keys
including empty values -- it is a struct-to-map conversion, not a
policy layer. Filtering empty values is instead done by mergeNonEmpty
in auto_metadata.go, which skips empty strings when copying into the
metadata map. This avoids sending meaningless keys (e.g. "git_diff":"")
in the API request, since json.Marshal does not omit empty strings
within a map[string]string.
Keep stderr warnings for degraded states (not a git repo, base branch
resolution failed) since these represent lost functionality the user
should know about. Route the success case (auto-detected base branch)
through debug.Printf so it only appears with --debug, avoiding noise
on every plan invocation.
Register a plan-specific --remote flag under the "PREVIEW: TEST
SELECTION" category, defaulting to "origin". This replaces the
hardcoded remote and makes the option discoverable via plan --help.

The plan flag only reads BUILDKITE_TEST_ENGINE_REMOTE (not the
backfill-specific BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE) to avoid
the backfill env var unexpectedly affecting plan behaviour. The
backfill flag retains both env vars for backwards compatibility.
…fArgs helper

Extract a local diffArgs closure that builds the full argument list
for each git diff invocation, pre-allocating the slice to the exact
capacity needed. This removes the fragile pattern of reusing an args
variable across multiple append calls from fresh literals, which is
safe today but could lead to subtle slice aliasing bugs if the code
were later refactored into a loop.
Move the record-to-CommitMetadata parsing into a single parseRecord
function in metadata.go. Both FetchBulkMetadata (backfill, multi-
record) and collectCommitMetadata (plan, single-record) now call it,
eliminating the duplicated field parsing that could drift if the
format changes.
Replace strings.TrimRight (which trims any combination of characters
in the cutset) with strings.TrimSuffix + strings.TrimSpace, which
precisely strips the trailing record separator then whitespace.

Add a buildRecord test helper that joins fields with fieldSeparator
and appends recordSeparator, replacing raw hex escapes (\x1f, \x1e)
throughout the test file with the package constants they represent.
The base elements are "diff" and "--no-ext-diff" (2, not 3).
Update field count from 15 to 17 and attribute empty value filtering
to mergeNonEmpty rather than ToMap.
…ests

Add early return error in ResolveBaseBranch when remote is empty,
preventing the HasPrefix check from matching on "/" which would
silently skip prefixing for all candidates.

Extract MergeMetadata into a testable function that handles the
user-precedence merge logic (existing keys win, empty auto values
are skipped, nil existing map returns auto as-is). This was previously
inline in autoCollectGitMetadata with no direct test coverage.
@gchan gchan requested a review from a team as a code owner April 14, 2026 21:50
@gchan gchan marked this pull request as draft April 14, 2026 21:55
ToMap should be a complete representation of the struct. For root
commits parent_shas is now an empty string rather than absent. The
mergeNonEmpty layer already filters empty values before they reach
the API request.
@gchan gchan force-pushed the gordon/te-5578-auto-collect-git-metadata branch from fdc746d to 2d5bd10 Compare April 14, 2026 22:17
…data

Replace triple-dot syntax (baseBranch...HEAD) with an explicit
git merge-base call followed by two-arg diff (forkPoint, HEAD).
This makes fork-point resolution visible and aligns the plan path
with the backfill path which also uses two-arg diff.

Since both callers now pass exactly two arguments (base and commit),
simplify runDiffCommands from variadic refArgs to named base/commit
parameters and inline the diff calls directly.
@gchan gchan force-pushed the gordon/te-5578-auto-collect-git-metadata branch from 2d5bd10 to e8d0139 Compare April 14, 2026 22:18
gchan added 7 commits April 15, 2026 10:42
Update the README 'Preview: Test Selection' section to document:
- Automatic git metadata collection when --selection-strategy is set
- Base branch resolution fallback chain
- The --remote flag and BUILDKITE_TEST_ENGINE_REMOTE env var
- How explicit --metadata values take precedence over auto-collected

Update the backfill docs to list BUILDKITE_TEST_ENGINE_REMOTE as a
shared env var alongside the existing BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE.
Replace spot-check assertions with a single cmp.Diff against the
complete expected map. This verifies all 17 keys and their values,
catching any missing or incorrect field in one clear diff output.
Replace individual field checks and absence assertions with a single
cmp.Diff against the complete expected map. This catches any extra
or missing keys in one diff output.
Apply consistent test pattern: every test that asserts on the
metadata map now uses cmp.Diff against the complete expected map
rather than spot-checking individual fields. This catches missing,
extra, or incorrect keys in a single diff output.
Log which resolution method was used (explicit --metadata base_branch,
BUILDKITE_PULL_REQUEST_BASE_BRANCH env var, or DetectDefaultBranch)
when resolving the base branch. Useful for diagnosing customer issues
where diffs are unexpected.
Move the auto-collection logic from main.go into internal/command/plan.go
so it lives next to the API request it populates. The function now
accepts cfg and a GitRunner parameter instead of reading the global
cfg directly, keeping it testable.
…t is empty

In CI, the Buildkite agent checks out a specific commit SHA, leaving
the repo in detached HEAD state where git branch --show-current
returns empty. Fall back to the BUILDKITE_BRANCH env var which the
agent always sets.

Split the DetachedHead test into two cases: one where BUILDKITE_BRANCH
is set (verifies fallback populates the branch field) and one where
it's also empty (verifies branch is absent).
@gchan gchan force-pushed the gordon/te-5578-auto-collect-git-metadata branch from f232487 to 520ac52 Compare April 16, 2026 02:12
@gchan
Copy link
Copy Markdown
Contributor Author

gchan commented Apr 16, 2026

Evidence of this working in a local environment: https://github.com/buildkite/buildkite/pull/29016#issuecomment-4257634296 (private repo)

@gchan gchan marked this pull request as ready for review April 16, 2026 05:50
Comment thread internal/command/plan.go Outdated
Comment on lines +171 to +172
fmt.Fprintf(os.Stderr, "Warning: could not resolve base branch for diff metadata. "+
"Set --metadata base_branch=<branch> if your repo uses a non-standard default branch.\n")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no formatting here, maybe use Fprintln? Then we can remove the \n.

Suggested change
fmt.Fprintf(os.Stderr, "Warning: could not resolve base branch for diff metadata. "+
"Set --metadata base_branch=<branch> if your repo uses a non-standard default branch.\n")
fmt.Fprintln(os.Stderr, "Warning: could not resolve base branch for diff metadata. "+
"Set --metadata base_branch=<branch> if your repo uses a non-standard default branch.")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — switched to Fprintln and dropped the trailing \n in 51cc9d9.

@nprizal
Copy link
Copy Markdown
Contributor

nprizal commented Apr 17, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@nprizal
Copy link
Copy Markdown
Contributor

nprizal commented Apr 17, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 520ac52306

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/git/auto_metadata.go Outdated
Comment on lines +53 to +54
if !strings.HasPrefix(ref, remote+"/") {
ref = remote + "/" + ref
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid prefixing already-qualified base refs

When base_branch is passed as a qualified ref (for example upstream/main or refs/heads/release), this logic still prepends the configured remote unless it already starts with <remote>/. That rewrites valid refs into invalid ones like origin/upstream/main, causing ResolveBaseBranch to ignore the explicit override and fall back to default-branch detection, so diff metadata can be computed against the wrong base in multi-remote or fully-qualified-ref workflows.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a780239.

Rather than a contains("/") heuristic (which would misclassify legitimate bare-branch patterns like release/v2 or feature/foo as qualified), ResolveBaseBranch now probes each candidate verbatim first, then falls back to <remote>/<candidate> if the literal value does not resolve. This handles every shape without heuristics:

  • main — verbatim fails, origin/main succeeds
  • release/v2 — verbatim fails, origin/release/v2 succeeds (old behaviour preserved)
  • upstream/main — verbatim succeeds (previously was rewritten to origin/upstream/main and silently dropped)
  • refs/heads/release — verbatim succeeds (previously was rewritten and dropped)
  • origin/main — verbatim succeeds

Regression tests added: TestResolveBaseBranch_ExplicitQualifiedDifferentRemote, TestResolveBaseBranch_ExplicitFullyQualifiedRef, TestResolveBaseBranch_EnvVarQualifiedRef.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ This was AI

Comment thread internal/git/auto_metadata.go Outdated
return
}

record := strings.TrimSpace(strings.TrimSuffix(output, recordSeparator))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Claude here — I was reviewing this PR and think this trim order silently corrupts every commit message on the plan path.

Real git log output ends in \x1e\n (git always appends a newline). TrimSuffix for \x1e sees the trailing \n and no-ops; TrimSpace then strips the \n but leaves \x1e behind — which rides along into fields[8] and survives TrimSpace (not Unicode whitespace). Every message ends up as "...\n\u001e" in the API payload, while the backfill path stores the same commit cleanly.

Tests miss it because buildRecord ends at \x1e with no trailing \n — the one shape where the current order happens to work.

Suggested change
record := strings.TrimSpace(strings.TrimSuffix(output, recordSeparator))
record := strings.TrimSuffix(strings.TrimSpace(output), recordSeparator)

A fixture ending in \x1e\n would lock it in.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with git log and the exising metadata backfill, so I'm not sure if this is valid. Feel free to ignore if you think its irelevant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems legit:

image

Copy link
Copy Markdown
Contributor Author

@gchan gchan Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch — confirmed and fixed in eae355f.

We caught production evidence of this in the Kafka message stream: every plan-path commit message ended in the ASCII record separator byte (U+001E), JSON-encoded into the downstream payload. Exactly the failure mode you described:

  1. TrimSuffix with the record separator saw the trailing \n at the end of real git output, no-oped.
  2. TrimSpace then stripped the \n, leaving the record separator trapped inside fields[8].
  3. Per-field TrimSpace does not touch that byte (U+001E is not Unicode whitespace), so it rode into the commit message value and out the API payload.

Swapped the order to strings.TrimSuffix(strings.TrimSpace(output), recordSeparator) and added TestCollectPlanMetadata_TrailingNewline with a fixture ending in recordSeparator + "\n" (matching real git output) plus an explicit !strings.Contains(got["message"], recordSeparator) assertion. Verified the new test fails against the old code and passes against the fix.

Thanks for the detailed analysis, especially the note about the test fixture being the one shape where the buggy order happens to work.

Copy link
Copy Markdown
Contributor Author

@gchan gchan Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record — the concern was real. Production Kafka data confirmed the trailing ASCII record separator byte (U+001E) on every plan-path commit message. Fixed in eae355f with a regression test using a fixture that matches real git log output.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ Last two messages was AI.

gchan added 3 commits April 17, 2026 15:21
…message

Real `git log -1 --format=...%x1e` output always ends in "\x1e\n":
git appends a trailing newline unconditionally, after the format string's
final placeholder. The previous trim order in `collectCommitMetadata`
was:

    record := strings.TrimSpace(strings.TrimSuffix(output, recordSeparator))

With real output this silently leaves the record separator trapped
inside the last field:

1. `TrimSuffix("\x1e")` scans the end, sees "\n", no-ops.
2. `TrimSpace` strips the "\n", leaving "...message\x1e".
3. `SplitN(record, "\x1f", 9)` puts "...message\x1e" into fields[8].
4. `TrimSpace(fields[8])` does not strip "\x1e" (it is ASCII 0x1E
   record separator, not Unicode whitespace).

The stray "\x1e" rides into `CommitMetadata.Message`, through
`ToMap()`, into `cfg.Metadata["message"]`, into the plan-create API
request, and out the other side into the Kafka message stream (observed
in production as the trailing "\u001e" on the message field). The
backfill path was unaffected because it splits on "\x1e" up front, so
each record reaches `parseRecord` without a trailing separator.

Existing unit tests missed this because the `buildRecord` helper
returns "...message\x1e" with no trailing "\n" -- the exact one
output shape where the buggy order happens to work.

Swap the order to apply `TrimSpace` first, then strip the separator:

    record := strings.TrimSuffix(strings.TrimSpace(output), recordSeparator)

Add `TestCollectPlanMetadata_TrailingNewline` with a fixture ending in
"\x1e\n" (matching real git output) plus an explicit no-stray-
separator assertion, to lock the fix in. Confirmed the test fails against
the old code with the message field showing a trailing "\x1e", and
passes against the new code.

Addresses review feedback on PR #480.
`ResolveBaseBranch` previously rewrote any candidate that did not already
start with "<remote>/" into "<remote>/<candidate>", then probed only the
rewritten form with `git rev-parse --verify`. That blind rewrite mangles
every qualified ref shape a user might pass:

  - "upstream/main"         -> "origin/upstream/main"   (invalid)
  - "refs/heads/release"    -> "origin/refs/heads/release" (invalid)

In both cases rev-parse rejects the rewritten ref, the explicit override
is silently dropped, and diff metadata ends up computed against whatever
`DetectDefaultBranch` finds (or the call errors out if nothing is
detectable). Multi-remote workflows and fully-qualified refs are both
legitimate inputs, and the old behaviour made them impossible to express
through `--metadata base_branch=...` or `BUILDKITE_PULL_REQUEST_BASE_BRANCH`.

Flip the resolution model: probe the candidate verbatim first, then fall
back to "<remote>/<candidate>" if the literal value does not resolve.
This handles every common shape without heuristics:

  - "main"                  verbatim fails, "origin/main" succeeds
  - "release/v2"            verbatim fails, "origin/release/v2" succeeds
  - "upstream/main"         verbatim succeeds
  - "refs/heads/release"    verbatim succeeds
  - "origin/main"           verbatim succeeds

A "contains /" heuristic was rejected because legitimate bare-branch
patterns like "release/v2" or "feature/foo" contain "/" but are still
meant to resolve against the configured remote. The verbatim-then-prefix
order is the narrowest change that keeps those working while unblocking
qualified refs.

Existing tests were audited for semantic drift; all still pass. Add
three new regression tests:

  - TestResolveBaseBranch_ExplicitQualifiedDifferentRemote:
    "upstream/main" with remote=origin must resolve to "upstream/main",
    not "origin/upstream/main".
  - TestResolveBaseBranch_ExplicitFullyQualifiedRef:
    "refs/heads/release" must resolve verbatim.
  - TestResolveBaseBranch_EnvVarQualifiedRef:
    the same handling must apply to BUILDKITE_PULL_REQUEST_BASE_BRANCH,
    not just the explicit --metadata source.

Addresses review feedback on PR #480.
The base-branch resolution warning in `autoCollectGitMetadata` calls
`fmt.Fprintf` but contains no formatting verbs, and embeds a trailing
"\n" in the format string. That is exactly the shape `Fprintln` exists
for: no verbs, implicit newline. Switch to `Fprintln` and drop the
explicit "\n".

Pure style change; no behaviour difference on stderr.

Addresses review feedback on PR #480.
Copy link
Copy Markdown
Contributor

@nprizal nprizal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@gchan gchan merged commit 21f5ff5 into main Apr 17, 2026
3 checks passed
@gchan gchan deleted the gordon/te-5578-auto-collect-git-metadata branch April 17, 2026 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants