feat(hitl): generic ask_user_via_form capability for selected reasoners by AbirAbbas · Pull Request #77 · Agent-Field/SWE-AF

AbirAbbas · 2026-05-26T14:34:10Z

Summary

Generalizes the existing Phase 1.5 plan-approval gate into a reusable
ask_user_via_form substrate. Three reasoners (run_product_manager,
run_issue_advisor, run_replanner) can now emit an ask_user_form
field in their structured output; when populated, the workflow pauses
on the control plane (real app.pause, hours/days if needed) until the
user submits, then re-invokes the reasoner with the answers in
prior_user_responses.
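
The ask/pause/re-invoke cycle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `ReasonerOutput`, `run_with_ask_user_sketch`, and the `ask_user` callable (which stands in for form creation plus `app.pause`) are hypothetical names, and returning the last output unchanged once the cap is hit is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class ReasonerOutput:
    """Hypothetical structured output; only ask_user_form mirrors the PR."""
    result: str
    ask_user_form: Optional[dict] = None


async def run_with_ask_user_sketch(
    reasoner: Callable[[list], Awaitable[ReasonerOutput]],
    ask_user: Callable[[dict], Awaitable[dict]],
    max_iterations: int = 2,
) -> ReasonerOutput:
    """Re-invoke the reasoner until it stops asking or the cap is reached."""
    prior_user_responses: list = []
    output = await reasoner(list(prior_user_responses))
    asks = 0
    while output.ask_user_form is not None and asks < max_iterations:
        # Stand-in for: send the form, pause the workflow, wait for the submit.
        answers = await ask_user(output.ask_user_form)
        prior_user_responses.append(answers)
        asks += 1
        # Pass a copy so the reasoner cannot mutate the accumulated history.
        output = await reasoner(list(prior_user_responses))
    return output


async def _demo_reasoner(prior: list) -> ReasonerOutput:
    # Asks once when it has no answers, then produces a result.
    if not prior:
        return ReasonerOutput(result="", ask_user_form={"title": "Which interpretation?"})
    return ReasonerOutput(result=f"PRD for {prior[-1]['choice']}")


async def _demo_ask_user(form: dict) -> dict:
    return {"choice": "option A"}


final = asyncio.run(run_with_ask_user_sketch(_demo_reasoner, _demo_ask_user))
```

In this sketch the reasoner is called one more time than the number of asks, so a cap of 2 means at most three reasoner invocations per call.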

Why

The existing HITL surface is exactly one moment — Phase 1.5 plan approval.
Everywhere else the agent silently picks a default when the right
answer hinges on user judgment (which failing acceptance criteria are
acceptable as debt? abort or reduce scope? which of multiple plausible
goal interpretations?). This adds an opt-in way for the LLM itself to
escalate those cases.

What's in this PR

New substrate (swe_af/hitl/):

- AskUserForm / AskUserFormField: the Pydantic spec the LLM emits. Covers
  all FormBuilder field types (input, textarea, number, slider, select,
  radio, checkbox, checkbox_group, switch, date).
- build_form_builder(spec): translates the spec into hax.FormBuilder.
- request_user_input_and_pause(...): sends the form via create_request(type="form-builder")
  (wrapped with the same 120s hard timeout the plan-approval gate uses),
  then awaits app.pause(...), then parses the response into AskUserResponse.
- run_with_ask_user(...): generic reasoner wrapper that loops on
  ask_user_form output, threading prior_user_responses back into each
  subsequent invocation. Budget-capped (AskUserBudget) and max-iteration-capped.
- format_prior_user_responses(prior): renders accumulated answers as
  a markdown block so the LLM doesn't re-ask questions already answered.
- build_hax_client_from_env() / approval_webhook_url(app): env-driven
  plumbing so each reasoner self-configures.
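
To make the spec and the prompt-rendering step more concrete, here is a small stdlib-only sketch. It uses a dataclass in place of the Pydantic model the PR describes, and `FormFieldSketch` and `format_prior_user_responses_sketch` are hypothetical names; the exact markdown layout the real `format_prior_user_responses` produces is not shown in this PR, so the headings below are an assumption.

```python
from dataclasses import dataclass, field
from typing import Any

# Field types listed in the PR description.
FIELD_TYPES = frozenset({
    "input", "textarea", "number", "slider", "select",
    "radio", "checkbox", "checkbox_group", "switch", "date",
})


@dataclass
class FormFieldSketch:
    """Hypothetical stand-in for AskUserFormField (the real one is Pydantic)."""
    name: str
    label: str
    type: str = "input"
    options: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject anything outside the supported field types up front.
        if self.type not in FIELD_TYPES:
            raise ValueError(f"unsupported field type: {self.type!r}")


def format_prior_user_responses_sketch(prior: list) -> str:
    """Render accumulated answers as a markdown block for the next prompt."""
    if not prior:
        return ""
    lines = ["## Prior user responses", ""]
    for round_no, answers in enumerate(prior, start=1):
        lines.append(f"### Round {round_no}")
        for key, value in answers.items():
            lines.append(f"- **{key}**: {value}")
        lines.append("")
    return "\n".join(lines).rstrip()


rendered = format_prior_user_responses_sketch([{"scope": "reduce"}])
```

Returning an empty string when there are no prior answers lets a prompt builder concatenate the block unconditionally.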

Initial allowlist (each reasoner gets ask_user_form schema field + prompt guidance):

- run_product_manager: for fundamentally ambiguous goals where two
  interpretations would yield very different PRDs.
- run_issue_advisor: for RETRY_MODIFIED vs ACCEPT_WITH_DEBT trade-offs
  and which failing acceptance criteria are acceptable as debt.
- run_replanner: for ABORT (project-level judgment) and REDUCE_SCOPE
  vs MODIFY_DAG (user's appetite for partial delivery).

Each reasoner caps itself at 2 ask iterations per invocation. Across a
build, total asks are bounded by call-site count (each reasoner
invocation has its own budget — cross-reasoner sharing wasn't feasible
because run_issue_advisor / run_replanner are invoked across
reasoner boundaries via app.call()).
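
The bound described above can be sketched as a per-invocation counter. `AskBudgetSketch` and `worst_case_total_asks` are hypothetical names used only for illustration; the real AskUserBudget may track more than a simple count.

```python
from dataclasses import dataclass


@dataclass
class AskBudgetSketch:
    """Hypothetical per-invocation ask budget (the PR caps it at 2)."""
    max_asks: int = 2
    used: int = 0

    def try_consume(self) -> bool:
        """Return True and count the ask if budget remains, else False."""
        if self.used >= self.max_asks:
            return False
        self.used += 1
        return True


def worst_case_total_asks(invocations: int, per_invocation_cap: int = 2) -> int:
    """Upper bound on asks in a build when every invocation has its own budget."""
    return invocations * per_invocation_cap


budget = AskBudgetSketch()
outcomes = [budget.try_consume() for _ in range(3)]
```

Because budgets are not shared across reasoner boundaries, the build-wide bound is simply the number of reasoner invocations times the per-invocation cap.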

Dependency bump: hax-sdk>=0.2.0 → >=0.2.4 in
requirements.txt, requirements-docker.txt, and pyproject.toml.
Docker pip cache keys on the constraint string; without the floor bump,
cached layers keep installing whatever was first resolved. FormBuilder
and create_form_request were already in 0.2.0; this is purely cache
invalidation.
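
The cache-invalidation argument can be illustrated with a toy model. This is only a sketch under the assumption, taken from the description above, that the cached layer is keyed on the requirement text; `layer_cache_key` is a hypothetical helper and does not reproduce Docker's actual algorithm.

```python
import hashlib


def layer_cache_key(requirements_text: str) -> str:
    """Toy cache key: a hash of the requirement text only (assumption)."""
    return hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()


# Same constraint string -> same key, so a cached install layer is reused
# even if a newer patch release has been published in the meantime.
old_key = layer_cache_key("hax-sdk>=0.2.0\n")
same_key = layer_cache_key("hax-sdk>=0.2.0\n")

# Bumping the floor changes the text, which changes the key and forces
# the install step to run again.
new_key = layer_cache_key("hax-sdk>=0.2.4\n")
```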

What this PR does NOT change

- The existing Phase 1.5 plan-approval gate (type="plan-review-v2").
  Stays as-is; runs alongside the new substrate.
- Default behavior when HAX_API_KEY is unset. build_hax_client_from_env
  returns None; the wrapper short-circuits. Pipeline behavior is
  identical to main.
- Tool calls to the Claude Code harness. We kept the schema-output path
  (LLM emits ask_user_form in its structured response) rather than
  migrating to mid-turn tool calls, because the workflow pause is
  durable (hours/days) and a mid-turn tool would have to hold the LLM
  conversation open across that interval.
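
The opt-in behavior for deployments without HAX_API_KEY can be sketched like this. `HaxClientSketch`, `build_hax_client_from_env_sketch`, and `should_pause_for_user` are hypothetical stand-ins, and treating a whitespace-only key as unset is an assumption rather than something the PR states.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HaxClientSketch:
    """Hypothetical placeholder for the real Hax client object."""
    api_key: str


def build_hax_client_from_env_sketch(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[HaxClientSketch]:
    """Return a client only when HAX_API_KEY is set; otherwise None."""
    source = os.environ if env is None else env
    api_key = source.get("HAX_API_KEY", "").strip()
    if not api_key:
        return None
    return HaxClientSketch(api_key=api_key)


def should_pause_for_user(client: Optional[HaxClientSketch],
                          ask_user_form: Optional[dict]) -> bool:
    """The wrapper only pauses when HITL is enabled and the LLM asked."""
    return client is not None and ask_user_form is not None


disabled = build_hax_client_from_env_sketch({})
enabled = build_hax_client_from_env_sketch({"HAX_API_KEY": "k-123"})
```

Accepting the environment as a parameter keeps the factory easy to test without touching the process environment.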

Test plan

- python -m pytest tests/test_ask_user.py (17/17 pass locally)
- python -m pytest tests/test_hax_create_request_timeout.py (52/52 still pass)
- ruff check swe_af/hitl/ tests/test_ask_user.py <touched reasoner/prompt/schema files> (clean)
- No regressions across 162 planner/advisor/replanner/dag tests
- Verified ruff error count vs origin/main: net zero new findings
- CI green on this PR
- Manual smoke test with HAX_API_KEY set: LLM emits ask_user_form, form renders in Hub, submit, reasoner resumes with answers in prior_user_responses (deferred to follow-up; requires a live Hax + control plane)

Files touched

- requirements.txt, requirements-docker.txt, pyproject.toml: hax-sdk version floor bump
- swe_af/hitl/{__init__,ask_user,wrapper}.py: new
- tests/test_ask_user.py: new
- swe_af/reasoners/schemas.py: PRD.ask_user_form field
- swe_af/execution/schemas.py: IssueAdvisorDecision.ask_user_form, ReplanDecision.ask_user_form
- swe_af/reasoners/pipeline.py: run_product_manager runs through run_with_ask_user
- swe_af/reasoners/execution_agents.py: run_issue_advisor and run_replanner likewise
- swe_af/prompts/{product_manager,issue_advisor,replanner}.py: when-to-ask guidance + prior_user_responses threading

🤖 Generated with Claude Code

Docker pip cache keys on the exact constraint string; >=0.2.0 keeps restoring whatever was first resolved (0.2.0). Bumping the floor forces layer invalidation so downstream Docker builds pick up patch fixes. FormBuilder + create_form_request already exist in 0.2.0; the bump is about cache invalidation, not new functionality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dget)

New swe_af/hitl/ module that lets reasoners pause the workflow and ask the user a structured question via the Hax SDK FormBuilder. Same pause/resume mechanism as the existing Phase 1.5 plan-approval gate, generalized so any reasoner can opt in.

- AskUserForm / AskUserFormField: typed Pydantic spec the LLM emits. Covers all FormBuilder field types (input, textarea, number, slider, select, radio, checkbox, checkbox_group, switch, date).
- build_form_builder(): translates an AskUserForm into hax.FormBuilder.
- request_user_input_and_pause(): wraps create_request(type=form-builder) with the same 120s hard timeout the plan-approval gate uses, then awaits app.pause() and parses the response back into AskUserResponse.
- run_with_ask_user(): generic reasoner wrapper that loops on the LLM's ask_user_form output, threading prior_user_responses back into each subsequent invocation. Budget-capped and max-iteration-capped.
- format_prior_user_responses(): renders accumulated answers as a markdown block for inclusion in the LLM prompt, which keeps the LLM from re-asking questions already answered.
- build_hax_client_from_env() / approval_webhook_url(): env-driven plumbing so each reasoner can self-configure without depending on build()'s setup.

17 unit tests cover form-builder round-trip for all field types, ApprovalResult parsing for submitted/timeout/cancelled/error decisions, and the wrapper's no-ask / one-ask / budget-exhausted / max-iteration / hax-disabled paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
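
The hard timeout around request creation can be sketched with `asyncio.wait_for`. This is a minimal illustration under assumptions: `create_request_with_timeout`, the `send` callable, and the returned status dictionary are hypothetical and do not reflect the actual Hax SDK call or the PR's return types.

```python
import asyncio
from typing import Any, Awaitable, Callable

# The PR mentions a 120s hard timeout around create_request.
CREATE_REQUEST_TIMEOUT_S = 120.0


async def create_request_with_timeout(
    send: Callable[[], Awaitable[Any]],
    timeout_s: float = CREATE_REQUEST_TIMEOUT_S,
) -> dict:
    """Await the (hypothetical) send call, giving up after timeout_s seconds."""
    try:
        request = await asyncio.wait_for(send(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising.
        return {"status": "timeout", "request": None}
    return {"status": "created", "request": request}


async def _fast_send() -> dict:
    return {"id": "req-1"}


async def _slow_send() -> dict:
    await asyncio.sleep(1)
    return {"id": "never-returned"}


created = asyncio.run(create_request_with_timeout(_fast_send))
timed_out = asyncio.run(create_request_with_timeout(_slow_send, timeout_s=0.01))
```

Bounding only the request-creation step keeps a hung network call from blocking the workflow, while the later pause for the human answer can still last hours or days.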

… ambiguous

Wires the ask_user_via_form substrate into run_product_manager: PRD schema gains an optional ask_user_form field, the prompt grows a 'when to ask' section, and the reasoner now runs through run_with_ask_user so an emitted ask_user_form triggers a real app.pause() until the human responds.

Bounded to 2 ask iterations per PM invocation. Falls through to the existing behavior when HAX_API_KEY is unset (no behavioral change for deployments that don't set it).

Use case: the goal references multiple features/pages and priority is unclear, or two architecturally different interpretations are plausible and choosing one forecloses the other. Style preferences / details that can be documented as assumptions stay agent-decided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wires the ask_user_via_form substrate into the two execution-time reasoners that face the highest-stakes ambiguities:

- run_issue_advisor: choosing between RETRY_MODIFIED and ACCEPT_WITH_DEBT, which failing acceptance criteria are acceptable as debt, and whether to ESCALATE_TO_REPLAN; all of these hinge on user judgment that the agent can't infer from failure context alone.
- run_replanner: ABORT is a project-level decision the user almost always wants to weigh in on; REDUCE_SCOPE vs MODIFY_DAG hinges on the user's appetite for partial delivery.

Each reasoner's structured-output schema (IssueAdvisorDecision / ReplanDecision) gains an optional ask_user_form field. Each prompt grows a 'when to ask' section. The reasoners now invoke router.harness() through run_with_ask_user with a per-invocation budget of 2 asks.

Backwards-compatible: with HAX_API_KEY unset, build_hax_client_from_env returns None and the wrapper short-circuits the field, so behavior is identical to before. The existing replanner parse-retry path (2 attempts on unparseable output) is preserved inside the closure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
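
A parse-retry step of the kind mentioned for the replanner might look roughly like the following. `call_with_parse_retry` and the `invoke` callable are hypothetical, the JSON format is assumed for illustration, and what the real code does after the final failed attempt is not stated in this PR; returning None here is an assumption.

```python
import json
from typing import Callable, Optional


def call_with_parse_retry(invoke: Callable[[], str], attempts: int = 2) -> Optional[dict]:
    """Invoke the model up to `attempts` times until its output parses."""
    for _ in range(attempts):
        raw = invoke()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Unparseable output: try again if attempts remain.
        if isinstance(parsed, dict):
            return parsed
    return None


# Demo: the first reply is unparseable, the second is valid.
_replies = iter(["not json", '{"action": "REDUCE_SCOPE"}'])
decision = call_with_parse_retry(lambda: next(_replies))

# Demo: both replies are unparseable, so the sketch gives up.
_bad_replies = iter(["oops", "still not json"])
gave_up = call_with_parse_retry(lambda: next(_bad_replies))
```

Keeping the retry inside the closure passed to the wrapper means each ask iteration gets its own fresh set of parse attempts.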

AbirAbbas · 2026-05-28T18:09:14Z

Manual validation (local, Tier 3 — live LLM)

Exercised the real LLM path on a local control plane + Hax:

✅ PM emitted ask_user_form on a deliberately ambiguous goal.
✅ run_with_ask_user built the Hax form and called app.pause; the workflow suspended and the pause cascaded to the parent plan/build executions (all waiting).
✅ Form rendered in the Hax Hub and was submitted.
⏸️ Resume-after-submit was not closed in this setup: the hosted Hax couldn't deliver its webhook to the local (non-public) control plane. The agent-side pause and form-creation path is confirmed here. The webhook-to-resume hop is exercised in the validation for the environment scout PR (#78), which shares the same substrate: there, a signed webhook was relayed to the CP and the reasoner resumed cleanly.

Budget cap (2 asks/reasoner) + max-iteration behavior remain covered by the 17 unit tests. CI green.


…rchitecture (#78)

* feat(hitl): substrate for the environment scout (services + creds store + schema)

Three new modules under swe_af/hitl/:

- services.py: knowledge base of 9 common third-party services (Railway, Fly.io, Vercel, Supabase, Sentry, Datadog, GitHub, OpenAI, Anthropic) with their env var conventions, mint URLs, permissions hints, and signal files. Plus detect_services_from_repo() for a deterministic static pre-pass the LLM scout can build on.
- credentials_store.py: process-local, execution-scoped dict for the credentials the scout negotiates. Keyed by run_id, thread-safe, isolates concurrent builds, NEVER persists. The full discussion of why this is in-memory (not BuildConfig, not app.memory, not the filesystem) lives in the module docstring.
- scout_schema.py: ScoutResult Pydantic model used as the harness schema. Includes an explicit "scoped_credentials must NEVER round-trip through model_dump unless excluded" comment for callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(hitl): run_environment_scout reasoner + wire into plan() and harness env

Adds the new reasoner that runs once between PM and Architect when HAX is enabled. The scout reads the PRD + repo, identifies third-party services whose absence would block the work, and asks the user for scoped / temporary tokens via a single Hax mega-form. Submitted values are stashed in the in-memory credentials store keyed by run_id; the scout's return payload OMITS scoped_credentials so the secrets never reach the control-plane workflow_execution row.

- swe_af/prompts/environment_scout.py: system prompt + task-prompt builder. Strong guidance on when NOT to ask (purely local PRD, prior answers already cover the question, no genuine PRD-blocking requirement).
- swe_af/reasoners/pipeline.py: @router.reasoner async def run_environment_scout. Same wrapper shape as the three reasoners from PR #77; uses run_with_ask_user with budget=2.
- swe_af/app.py:
  * plan(): Phase 1.5 calls run_environment_scout via app.call BETWEEN PM and architect; guarded so it runs only when HAX_API_KEY is set.
  * build() body wrapped in try/finally so clear_scoped_credentials ALWAYS runs on exit (success or exception). Eliminates secret leakage across builds within the same agent process.
  * app.harness is monkey-patched once at module load to auto-inject stored credentials as env vars on EVERY harness call across the pipeline. Avoids touching the 25+ existing call sites.

Backwards-compatible: with HAX_API_KEY unset, plan() skips the scout and the monkey-patched harness passes os.environ through unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(hitl): 17 unit tests for the environment-scout substrate

Three pillars covered:

- services.py: KNOWN_SERVICES inventory bounds, missing-path safety, file + directory signal detection, prompt-summary rendering.
- credentials_store.py: round-trip, blank/None filtering, isolation between execution_ids, get-returns-copy, concurrent thread safety, inject-into-env layering rules.
- scout closure round-trip: pass 1 emits ask_user_form via the wrapper, pass 2 sees prior_user_responses and returns scoped_credentials; no-services-detected short-circuits the pause; model_dump(exclude={"scoped_credentials"}) actually strips the field.

All tests mock HaxClient + app.pause; no real network, no real harness. Pin a baseline of 8+ services so future trimming is visible in diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
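
The in-memory credentials store and the try/finally cleanup described in these commits can be sketched as below. Only `clear_scoped_credentials` is a name taken from the commit text; `set_scoped_credentials`, `get_scoped_credentials`, `inject_into_env`, and `build_sketch` are hypothetical, and letting stored credentials override the base environment is an assumption about the layering rules.

```python
import threading
from typing import Dict, Mapping, Optional

_LOCK = threading.Lock()
_STORE: Dict[str, Dict[str, str]] = {}  # run_id -> credentials, never persisted


def set_scoped_credentials(run_id: str, creds: Mapping[str, Optional[str]]) -> None:
    """Store non-blank credentials for one run."""
    cleaned = {k: v for k, v in creds.items() if v is not None and v.strip()}
    with _LOCK:
        _STORE.setdefault(run_id, {}).update(cleaned)


def get_scoped_credentials(run_id: str) -> Dict[str, str]:
    """Return a copy so callers cannot mutate the stored values."""
    with _LOCK:
        return dict(_STORE.get(run_id, {}))


def clear_scoped_credentials(run_id: str) -> None:
    """Drop everything stored for one run; safe to call when nothing is stored."""
    with _LOCK:
        _STORE.pop(run_id, None)


def inject_into_env(run_id: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Layer stored credentials over a base environment (assumed precedence)."""
    env = dict(base_env)
    env.update(get_scoped_credentials(run_id))
    return env


def build_sketch(run_id: str, creds: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Mimic build(): credentials are cleared on success or exception."""
    try:
        set_scoped_credentials(run_id, creds)
        return inject_into_env(run_id, {"PATH": "/usr/bin"})
    finally:
        clear_scoped_credentials(run_id)


env = build_sketch("run-1", {"SENTRY_TOKEN": "abc", "EMPTY": " ", "MISSING": None})
leftover = get_scoped_credentials("run-1")
```

Keying by run_id keeps concurrent builds in the same process isolated, and the finally block means a failed build cannot leave secrets behind for the next one.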

AbirAbbas and others added 4 commits May 26, 2026 10:32

AbirAbbas mentioned this pull request May 26, 2026

feat(hitl): environment scout — negotiate scoped credentials before architecture #78

Merged

5 tasks

AbirAbbas merged commit 0a4c3b7 into main May 28, 2026
2 checks passed

AbirAbbas merged 4 commits into main from feat/hitl-ask-user-via-form
