feat(hitl): generic ask_user_via_form capability for selected reasoners #77

Merged
AbirAbbas merged 4 commits into main from feat/hitl-ask-user-via-form on May 28, 2026

@AbirAbbas
Collaborator

Summary

Generalizes the existing Phase 1.5 plan-approval gate into a reusable
ask_user_via_form substrate. Three reasoners (run_product_manager,
run_issue_advisor, run_replanner) can now emit an ask_user_form
field in their structured output; when populated, the workflow pauses
on the control plane (real app.pause, hours/days if needed) until the
user submits, then re-invokes the reasoner with the answers in
prior_user_responses.

Why

The existing HITL surface is exactly one moment — Phase 1.5 plan approval.
Everywhere else the agent silently picks a default when the right
answer hinges on user judgment (which failing acceptance criteria are
acceptable as debt? abort or reduce scope? which of multiple plausible
goal interpretations?). This adds an opt-in way for the LLM itself to
escalate those cases.

What's in this PR

New substrate — swe_af/hitl/:

  • AskUserForm / AskUserFormField — Pydantic spec the LLM emits. Covers
    all FormBuilder field types (input, textarea, number, slider, select,
    radio, checkbox, checkbox_group, switch, date).
  • build_form_builder(spec) — translates the spec into hax.FormBuilder.
  • request_user_input_and_pause(...) — sends form via create_request(type="form-builder")
    (wrapped with the same 120s hard timeout the plan-approval gate uses),
    then await app.pause(...), then parses the response into AskUserResponse.
  • run_with_ask_user(...) — generic reasoner wrapper that loops on
    ask_user_form output, threading prior_user_responses back into each
    subsequent invocation. Budget-capped (AskUserBudget) and max-iteration-capped.
  • format_prior_user_responses(prior) — renders accumulated answers as
    a markdown block so the LLM doesn't re-ask questions already answered.
  • build_hax_client_from_env() / approval_webhook_url(app) — env-driven
    plumbing so each reasoner self-configures.
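
The loop that run_with_ask_user describes can be sketched roughly as follows. This is a minimal illustration, not the code from swe_af/hitl/wrapper.py: ReasonerOutput, the callable signatures, the try_spend method, and the max_iterations default are all assumptions made for the example. Only the names AskUserBudget, ask_user_form, and prior_user_responses come from the PR.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class AskUserBudget:
    """Hypothetical budget: how many asks this invocation may still spend."""
    remaining: int = 2

    def try_spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


@dataclass
class ReasonerOutput:
    """Stand-in for a structured output such as PRD or ReplanDecision."""
    result: str
    ask_user_form: Optional[dict] = None  # populated when the LLM wants input


# Assumed callable shapes; the real signatures are not shown in this PR text.
Reasoner = Callable[[list], Awaitable[ReasonerOutput]]
AskFn = Callable[[dict], Awaitable[dict]]


async def run_with_ask_user_sketch(
    reasoner: Reasoner,
    ask: Optional[AskFn],
    budget: AskUserBudget,
    max_iterations: int = 3,
) -> ReasonerOutput:
    prior_user_responses: list = []
    output = await reasoner(prior_user_responses)
    for _ in range(max_iterations):
        form = output.ask_user_form
        # Stop when nothing is asked, HITL is disabled, or the budget is spent.
        if form is None or ask is None or not budget.try_spend():
            break
        answers = await ask(form)  # the durable pause would happen in here
        prior_user_responses.append({"form": form, "answers": answers})
        output = await reasoner(prior_user_responses)
    # Assumption: an unanswered form is dropped rather than passed downstream.
    output.ask_user_form = None
    return output
```

Passing ask=None models the HAX-disabled path: the reasoner runs exactly once and its output is returned with the form field cleared.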

Initial allowlist (each reasoner gets ask_user_form schema field + prompt guidance):

  • run_product_manager — for fundamentally ambiguous goals where two
    interpretations would yield very different PRDs.
  • run_issue_advisor — for RETRY_MODIFIED vs ACCEPT_WITH_DEBT trade-offs
    and which failing acceptance criteria are acceptable as debt.
  • run_replanner — for ABORT (project-level judgment) and REDUCE_SCOPE
    vs MODIFY_DAG (user's appetite for partial delivery).

Each reasoner caps itself at 2 ask iterations per invocation. Across a
build, total asks are bounded by call-site count (each reasoner
invocation has its own budget — cross-reasoner sharing wasn't feasible
because run_issue_advisor / run_replanner are invoked across
reasoner boundaries via app.call()).

Dependency bump: hax-sdk floor raised from >=0.2.0 to >=0.2.4 in
requirements.txt, requirements-docker.txt, and pyproject.toml.
Docker pip cache keys on the constraint string; without the floor bump,
cached layers keep installing whatever was first resolved. FormBuilder
and create_form_request were already in 0.2.0; this is purely cache
invalidation.
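
The cache-invalidation argument can be pictured with a toy model. This is not Docker's actual cache algorithm; it only assumes, for illustration, that a layer is reused whenever a digest of the exact requirements text is unchanged. The function name layer_cache_key is hypothetical.

```python
import hashlib


def layer_cache_key(requirements_text: str) -> str:
    """Toy cache key: a digest of the exact requirements text (an assumption)."""
    return hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()


# An unchanged constraint string maps to the same key, so the cached layer
# (and whatever version it first resolved) keeps being reused.
old_key = layer_cache_key("hax-sdk>=0.2.0\n")
same_key = layer_cache_key("hax-sdk>=0.2.0\n")

# Raising the floor changes the text, hence the key, and forces a fresh install.
new_key = layer_cache_key("hax-sdk>=0.2.4\n")
```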

What this PR does NOT change

  • The existing Phase 1.5 plan-approval gate (type="plan-review-v2").
    Stays as-is; runs alongside the new substrate.
  • Default behavior when HAX_API_KEY is unset. build_hax_client_from_env
    returns None; the wrapper short-circuits. Pipeline behavior is
    identical to main.
  • Tool calls to the Claude Code harness. We kept the schema-output path
    (LLM emits ask_user_form in its structured response) rather than
    migrating to mid-turn tool calls, because the workflow pause is
    durable (hours/days) and a mid-turn tool would have to hold the LLM
    conversation open across that interval.
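
The HAX_API_KEY short-circuit might look like the sketch below. FakeHaxClient is a stand-in, since the real Hax SDK client constructor is not shown in this PR, and treating a whitespace-only key as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class FakeHaxClient:
    """Stand-in for the real Hax SDK client (constructor not shown in the PR)."""
    api_key: str


def build_hax_client_from_env_sketch(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[FakeHaxClient]:
    """Return a client when HAX_API_KEY is set, otherwise None."""
    source = os.environ if env is None else env
    api_key = source.get("HAX_API_KEY", "").strip()
    if not api_key:
        # HITL disabled: callers skip the ask path and behave like main.
        return None
    return FakeHaxClient(api_key=api_key)
```

A caller can then gate the whole ask path on `client is None`, which keeps the disabled case free of any Hax calls.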

Test plan

  • python -m pytest tests/test_ask_user.py — 17/17 pass locally
  • python -m pytest tests/test_hax_create_request_timeout.py — 52/52 still pass
  • ruff check swe_af/hitl/ tests/test_ask_user.py <touched reasoner/prompt/schema files> — clean
  • No regressions across 162 planner/advisor/replanner/dag tests
  • Verified ruff error count vs origin/main — net zero new findings
  • CI green on this PR
  • Manual smoke test with HAX_API_KEY set: LLM emits ask_user_form, form renders in Hub, submit, reasoner resumes with answers in prior_user_responses (deferred to follow-up — requires a live Hax + control plane)

Files touched

  • requirements.txt, requirements-docker.txt, pyproject.toml — pin bump
  • swe_af/hitl/{__init__,ask_user,wrapper}.py — new
  • tests/test_ask_user.py — new
  • swe_af/reasoners/schemas.py: PRD.ask_user_form field
  • swe_af/execution/schemas.py: IssueAdvisorDecision.ask_user_form, ReplanDecision.ask_user_form
  • swe_af/reasoners/pipeline.py: run_product_manager runs through run_with_ask_user
  • swe_af/reasoners/execution_agents.py: run_issue_advisor and run_replanner likewise
  • swe_af/prompts/{product_manager,issue_advisor,replanner}.py — when-to-ask guidance + prior_user_responses threading
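
For the prior_user_responses threading, a rendering helper in the spirit of format_prior_user_responses could look like this. The entry shape (a dict with "title" and "answers") and the exact markdown layout are assumptions; the PR states only that answers are rendered as a markdown block so the LLM does not re-ask.

```python
def format_prior_user_responses_sketch(prior: list) -> str:
    """Render accumulated answers as a markdown block for the next prompt."""
    if not prior:
        return ""  # nothing to thread on the first invocation
    lines = ["## Prior user responses", ""]
    for round_number, entry in enumerate(prior, start=1):
        title = entry.get("title", "Untitled form")
        lines.append(f"### Round {round_number}: {title}")
        for field_name, value in entry.get("answers", {}).items():
            lines.append(f"- **{field_name}**: {value}")
        lines.append("")
    lines.append("Do not re-ask questions that are already answered above.")
    return "\n".join(lines)
```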

🤖 Generated with Claude Code

AbirAbbas and others added 4 commits May 26, 2026 10:32
Docker pip cache keys on the exact constraint string; >=0.2.0 keeps
restoring whatever was first resolved (0.2.0). Bumping the floor forces
layer invalidation so downstream Docker builds pick up patch fixes.

FormBuilder + create_form_request already exist in 0.2.0; the bump is
about cache invalidation, not new functionality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dget)

New swe_af/hitl/ module that lets reasoners pause the workflow and ask
the user a structured question via the Hax SDK FormBuilder. Same
pause/resume mechanism as the existing Phase 1.5 plan-approval gate —
generalized so any reasoner can opt in.

  - AskUserForm / AskUserFormField: typed Pydantic spec the LLM emits.
    Covers all FormBuilder field types (input, textarea, number, slider,
    select, radio, checkbox, checkbox_group, switch, date).
  - build_form_builder(): translates an AskUserForm into hax.FormBuilder.
  - request_user_input_and_pause(): wraps create_request(type=form-builder)
    with the same 120s hard timeout the plan-approval gate uses, then
    awaits app.pause() and parses the response back into AskUserResponse.
  - run_with_ask_user(): generic reasoner wrapper that loops on the
    LLM's ask_user_form output, threading prior_user_responses back into
    each subsequent invocation. Budget-capped and max-iteration-capped.
  - format_prior_user_responses(): renders accumulated answers as a
    markdown block for inclusion in the LLM prompt — keeps the LLM from
    re-asking questions already answered.
  - build_hax_client_from_env() / approval_webhook_url(): env-driven
    plumbing so each reasoner can self-configure without depending on
    build()'s setup.
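
The 120s hard timeout around request creation can be sketched with asyncio.wait_for. The helper name and the assumption that request creation is awaitable are illustrative; the PR states only the 120-second figure and that it matches the plan-approval gate.

```python
import asyncio
from typing import Any, Awaitable, Callable

HARD_TIMEOUT_SECONDS = 120.0  # figure stated in the PR


async def create_request_with_timeout(
    create_request: Callable[[], Awaitable[Any]],
    timeout: float = HARD_TIMEOUT_SECONDS,
) -> Any:
    """Fail fast if the form request cannot be created in time.

    Only request creation is bounded here; the later pause is meant to be
    durable and is deliberately not covered by this timeout.
    """
    try:
        return await asyncio.wait_for(create_request(), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise RuntimeError(f"form request not created within {timeout}s") from exc
```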

17 unit tests cover form-builder round-trip for all field types,
ApprovalResult parsing for submitted/timeout/cancelled/error decisions,
and the wrapper's no-ask / one-ask / budget-exhausted /
max-iteration / hax-disabled paths.
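
The decision parsing those tests exercise might be shaped like the sketch below. The payload keys "decision" and "values" and the AskUserResponseSketch fields are hypothetical; only the four decision names come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping

KNOWN_DECISIONS = ("submitted", "timeout", "cancelled", "error")


@dataclass
class AskUserResponseSketch:
    """Stand-in for AskUserResponse; field names are assumptions."""
    submitted: bool
    decision: str
    answers: dict = field(default_factory=dict)


def parse_ask_user_response(raw: Mapping[str, Any]) -> AskUserResponseSketch:
    """Map a resume payload onto a typed response, defaulting to 'error'."""
    decision = str(raw.get("decision", "error")).lower()
    if decision not in KNOWN_DECISIONS:
        decision = "error"  # unknown decisions are treated conservatively
    if decision == "submitted":
        values = raw.get("values") or {}
        return AskUserResponseSketch(True, decision, dict(values))
    # timeout / cancelled / error carry no answers
    return AskUserResponseSketch(False, decision)
```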

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… ambiguous

Wires the ask_user_via_form substrate into run_product_manager: PRD
schema gains an optional ask_user_form field, the prompt grows a
'when to ask' section, and the reasoner now runs through
run_with_ask_user so an emitted ask_user_form triggers a real
app.pause() until the human responds.

Bounded to 2 ask iterations per PM invocation. Falls through to the
existing behavior when HAX_API_KEY is unset (no behavioral change for
deployments that don't set it).

Use case: the goal references multiple features/pages and priority is
unclear, or two architecturally different interpretations are plausible
and choosing one forecloses the other. Style preferences / details that
can be documented as assumptions stay agent-decided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the ask_user_via_form substrate into the two execution-time
reasoners that face the highest-stakes ambiguities:

  - run_issue_advisor: choosing between RETRY_MODIFIED and
    ACCEPT_WITH_DEBT, which failing acceptance criteria are acceptable
    as debt, and whether to ESCALATE_TO_REPLAN — all of these hinge on
    user judgment that the agent can't infer from failure context alone.
  - run_replanner: ABORT is a project-level decision the user almost
    always wants to weigh in on; REDUCE_SCOPE vs MODIFY_DAG hinges on
    the user's appetite for partial delivery.

Each reasoner's structured-output schema (IssueAdvisorDecision /
ReplanDecision) gains an optional ask_user_form field. Each prompt
grows a 'when to ask' section. The reasoners now invoke
router.harness() through run_with_ask_user with a per-invocation
budget of 2 asks.

Backwards-compatible: with HAX_API_KEY unset, build_hax_client_from_env
returns None and the wrapper short-circuits the field — behavior is
identical to before.

The existing replanner parse-retry path (2 attempts on unparseable
output) is preserved inside the closure.
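
A parse-retry of this kind can be sketched as below. The helper is synchronous and uses JSON purely for illustration; the real closure wraps router.harness() and its parsing rules are not shown in this PR.

```python
import json
from typing import Callable, Optional


def call_with_parse_retry(
    invoke: Callable[[], str],
    max_attempts: int = 2,
) -> Optional[dict]:
    """Call invoke() up to max_attempts times until its output parses."""
    for _ in range(max_attempts):
        raw = invoke()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output: try again
        if isinstance(parsed, dict):
            return parsed
    return None  # every attempt was unparseable
```

Keeping the retry inside the closure passed to the wrapper means each ask iteration gets its own fresh set of parse attempts.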

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas
Collaborator Author

Manual validation (local, Tier 3 — live LLM)

Exercised the real LLM path on a local control plane + Hax:

  • ✅ PM emitted ask_user_form on a deliberately ambiguous goal.
  • run_with_ask_user built the Hax form and called app.pause; the workflow suspended and the pause cascaded to the parent plan/build executions (all waiting).
  • ✅ Form rendered in the Hax Hub and was submitted.
  • ⏸️ Resume-after-submit was not closed in this setup: the hosted Hax couldn't deliver its webhook to the local (non-public) control plane. The agent-side pause + form-creation path is confirmed here; the webhook→resume hop is exercised in the validation for PR #78 ("feat(hitl): environment scout — negotiate scoped credentials before architecture"), which uses the same shared substrate. There, a signed webhook was relayed to the CP and the reasoner resumed cleanly.

Budget cap (2 asks/reasoner) + max-iteration behavior remain covered by the 17 unit tests. CI green.

@AbirAbbas AbirAbbas merged commit 0a4c3b7 into main May 28, 2026
2 checks passed
AbirAbbas added a commit that referenced this pull request May 28, 2026
…ness env

Adds the new reasoner that runs once between PM and Architect when HAX is
enabled. The scout reads the PRD + repo, identifies third-party services
whose absence would block the work, and asks the user for scoped /
temporary tokens via a single Hax mega-form. Submitted values are stashed
in the in-memory credentials store keyed by run_id; the scout's return
payload OMITS scoped_credentials so the secrets never reach the control-
plane workflow_execution row.

  - swe_af/prompts/environment_scout.py — system prompt + task-prompt
    builder. Strong guidance on when NOT to ask (purely local PRD, prior
    answers already cover the question, no genuine PRD-blocking
    requirement).
  - swe_af/reasoners/pipeline.py — @router.reasoner async def
    run_environment_scout. Same wrapper shape as the three reasoners
    from PR #77; uses run_with_ask_user with budget=2.
  - swe_af/app.py:
    * plan() — Phase 1.5 calls run_environment_scout via app.call BETWEEN
      PM and architect; guarded so it runs only when HAX_API_KEY is set.
    * build() body wrapped in try/finally so clear_scoped_credentials
      ALWAYS runs on exit (success or exception). Eliminates secret
      leakage across builds within the same agent process.
    * app.harness is monkey-patched once at module load to auto-inject
      stored credentials as env vars on EVERY harness call across the
      pipeline. Avoids touching the 25+ existing call sites.

Backwards-compatible: with HAX_API_KEY unset, plan() skips the scout and
the monkey-patched harness passes os.environ through unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AbirAbbas added a commit that referenced this pull request May 28, 2026
…rchitecture (#78)

* feat(hitl): substrate for the environment scout (services + creds store + schema)

Three new modules under swe_af/hitl/:

  - services.py — knowledge base of 9 common third-party services (Railway,
    Fly.io, Vercel, Supabase, Sentry, Datadog, GitHub, OpenAI, Anthropic)
    with their env var conventions, mint URLs, permissions hints, and
    signal files. Plus detect_services_from_repo() for a deterministic
    static pre-pass the LLM scout can build on.
  - credentials_store.py — process-local, execution-scoped dict for the
    credentials the scout negotiates. Keyed by run_id, thread-safe,
    isolates concurrent builds, NEVER persists. The full discussion of
    why this is in-memory (not BuildConfig, not app.memory, not the
    filesystem) lives in the module docstring.
  - scout_schema.py — ScoutResult Pydantic model used as the harness
    schema. Includes an explicit "scoped_credentials must NEVER round-
    trip through model_dump unless excluded" comment for callers.
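
A process-local store with the properties listed for credentials_store.py could be sketched like this. Only clear_scoped_credentials is a name that appears in this PR; set_scoped_credentials and get_scoped_credentials are hypothetical, as is the merge-on-set behavior.

```python
import threading
from typing import Mapping, Optional

_lock = threading.Lock()
_store: dict = {}  # run_id -> {ENV_VAR: value}; lives only in process memory


def set_scoped_credentials(run_id: str, credentials: Mapping[str, Optional[str]]) -> None:
    """Store non-blank credentials for one run, merging with earlier ones."""
    cleaned = {
        name: value
        for name, value in credentials.items()
        if value is not None and value.strip()  # drop None and blank values
    }
    with _lock:
        _store.setdefault(run_id, {}).update(cleaned)


def get_scoped_credentials(run_id: str) -> dict:
    """Return a copy so callers cannot mutate the stored mapping."""
    with _lock:
        return dict(_store.get(run_id, {}))


def clear_scoped_credentials(run_id: str) -> None:
    """Forget everything stored for this run; safe to call repeatedly."""
    with _lock:
        _store.pop(run_id, None)
```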

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(hitl): run_environment_scout reasoner + wire into plan() and harness env


* test(hitl): 17 unit tests for the environment-scout substrate

Three pillars covered:

  - services.py — KNOWN_SERVICES inventory bounds, missing-path safety,
    file + directory signal detection, prompt-summary rendering.
  - credentials_store.py — round-trip, blank/None filtering, isolation
    between execution_ids, get-returns-copy, concurrent thread safety,
    inject-into-env layering rules.
  - scout closure round-trip — pass 1 emits ask_user_form via the
    wrapper, pass 2 sees prior_user_responses and returns
    scoped_credentials; no-services-detected short-circuits the pause;
    model_dump(exclude={"scoped_credentials"}) actually strips the field.

All tests mock HaxClient + app.pause; no real network, no real harness.
Pin a baseline of 8+ services so future trimming is visible in diff.
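
The file and directory signal detection these tests cover might work along the lines below. The SIGNALS table is a three-entry illustration, not the real KNOWN_SERVICES inventory, whose contents are not shown in this PR; the function name carries a _sketch suffix for the same reason.

```python
from pathlib import Path

# Illustrative signals only: each service maps to files or directories
# whose presence in the repo root suggests the service is in use.
SIGNALS = {
    "vercel": ("vercel.json",),
    "fly.io": ("fly.toml",),
    "github": (".github",),
}


def detect_services_from_repo_sketch(repo_path: str) -> list:
    """Deterministic static pre-pass: list services whose signals exist."""
    root = Path(repo_path)
    if not root.is_dir():
        return []  # missing-path safety: never raise on a bad path
    found = [
        name
        for name, signals in SIGNALS.items()
        if any((root / signal).exists() for signal in signals)
    ]
    return sorted(found)
```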

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>