From f5d12b9b2f7a4d1310f3ba5caab227dd84600026 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nikola=20Forr=C3=B3?=
Date: Wed, 20 May 2026 16:16:37 +0200
Subject: [PATCH] Research async building architecture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Nikola Forró
Assisted-by: Claude Opus 4.6 via Claude Code
---
 research/ymir-async-building/index.md | 421 ++++++++++++++++++++++++++
 1 file changed, 421 insertions(+)
 create mode 100644 research/ymir-async-building/index.md

diff --git a/research/ymir-async-building/index.md b/research/ymir-async-building/index.md
new file mode 100644
index 0000000..f0ac5c8
--- /dev/null
+++ b/research/ymir-async-building/index.md
@@ -0,0 +1,421 @@
+# Ymir: Async Building — Solving the Copr Build Blocking Problem
+
+## Problem Statement
+
+The backport/rebase agent currently blocks the entire pod while waiting for
+Copr builds to finish. Builds can take up to 3 hours, and the cherry-pick fix
+loop can trigger up to 10 rebuild attempts — meaning a single task can
+monopolize a pod for 30+ hours. During this time the pod is idle except for
+polling Copr every 30 seconds.
+
+## Current Architecture
+
+### Workflow (backport agent)
+
+```
+Redis brpop (pick task)
+  → change Jira status
+  → fork & prepare dist-git
+  → run backport agent (Claude applies patches)
+  → run build agent (submit SRPM to Copr, poll until done)  ← BLOCKING
+      ├─ SUCCESS → update release → commit → MR → Jira
+      ├─ TIMEOUT → proceed anyway
+      └─ FAILURE (decrement attempts_remaining)
+           ├─ cherry-pick workflow → fix_build_error loop  ← BLOCKING (repeated)
+           └─ git-am workflow → full reset & retry  ← BLOCKING (repeated)
+                (back to fork_and_prepare_dist_git — re-clones, re-patches, rebuilds)
+  → post-processing (log agent, MR, labels, Jira comment)
+```
+
+### Workflow (rebase agent)
+
+The rebase agent has the same blocking pattern. On build failure it always
+does a full pipeline reset (like backport's git-am path):
+
+```
+Redis brpop (pick task)
+  → fork & prepare dist-git
+  → run rebase agent (Claude performs rebase)
+  → run build agent (submit SRPM to Copr, poll until done)  ← BLOCKING
+      ├─ SUCCESS → update release → commit → MR → Jira
+      ├─ TIMEOUT → proceed anyway
+      └─ FAILURE (decrement attempts_remaining)
+           → full reset & retry  ← BLOCKING (repeated)
+             (back to fork_and_prepare_dist_git — re-clones, re-rebases, rebuilds)
+  → post-processing (log agent, MR, labels, Jira comment)
+```
+
+Up to 10 full re-runs (`MAX_BUILD_ATTEMPTS`), no incremental fix loop.
+
+### Where time is spent
+
+| Phase                              | Typical duration      | Notes                           |
+| ---------------------------------- | --------------------- | ------------------------------- |
+| Patch application (backport agent) | 2–10 min              | LLM calls + tool use            |
+| SRPM generation                    | < 1 min               |                                 |
+| **Copr build polling**             | **10 min – 3 hours**  | `asyncio.sleep(30)` loop        |
+| Build error analysis               | 1–2 min               | Downloads logs, LLM analysis    |
+| Fix attempt (if cherry-pick)       | 5–15 min              | Fresh agent + rebuild           |
+| **Full retry (if git-am)**         | **12 min – 3+ hours** | **Re-clone, re-patch, rebuild** |
+| Post-processing                    | 1–3 min               |                                 |
+
+The two failure paths use independent retry counters (`attempts_remaining` and
+`incremental_fix_attempts`, both default to 10) that don't interact:
+
+- **Backport (cherry-pick)**: up to **11 builds** (1 initial + 10 fix attempts).
+- **Backport (git-am)**: up to **10 full pipeline re-runs** (re-clone,
+  re-patch, rebuild each time) — worst case ~32 hours blocking a pod.
+- **Rebase**: up to **10 full pipeline re-runs** (same as backport git-am
+  path) — worst case ~32 hours blocking a pod.
+
+The Copr polling (`ymir/tools/privileged/copr.py:186–229`) uses `asyncio.sleep`
+which is non-blocking at the event-loop level, but the workflow is
+single-task-per-pod so nothing else runs.
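The ~32 hour worst case can be roughly reproduced from the upper bounds in the timing table. The short calculation below is one plausible reconstruction, not necessarily how the figure was derived: it assumes every attempt pays the maximum duration of each phase, and the constant names other than `MAX_BUILD_ATTEMPTS` are illustrative.

```python
# All durations in minutes; upper bounds read off the timing table.
MAX_BUILD_ATTEMPTS = 10
PATCH_APPLICATION = 10   # upper bound of "2–10 min"
SRPM_GENERATION = 1      # "< 1 min", rounded up
COPR_BUILD = 3 * 60      # upper bound of "10 min – 3 hours"
ERROR_ANALYSIS = 2       # upper bound of "1–2 min"

# Assumption: each failed attempt pays for every phase exactly once.
per_attempt = PATCH_APPLICATION + SRPM_GENERATION + COPR_BUILD + ERROR_ANALYSIS

# git-am and rebase paths: every attempt is a full pipeline re-run.
full_rerun_hours = MAX_BUILD_ATTEMPTS * per_attempt / 60

# cherry-pick path: 1 initial build + 10 fix builds, counting build time only.
cherry_pick_build_hours = (1 + MAX_BUILD_ATTEMPTS) * COPR_BUILD / 60

print(f"full re-run worst case: ~{full_rerun_hours:.0f} h")
print(f"cherry-pick build time alone: {cherry_pick_build_hours:.0f} h")
```

Under these assumptions almost all of the time is the Copr build itself, which is why the options discussed later focus on the build wait and not on the LLM phases.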
+
+### Key code locations
+
+- Copr build tool (polling loop): `ymir/tools/privileged/copr.py` — `BuildPackageTool._run()`
+- Build agent: `ymir/agents/build_agent.py`
+- Backport workflow state machine: `ymir/agents/backport_agent.py:1486–1509`
+- Fix build error loop (cherry-pick): `ymir/agents/backport_agent.py:1195–1280`
+- Git-am full retry (back to `fork_and_prepare_dist_git`): `ymir/agents/backport_agent.py:1329–1330`
+- Attempt counter decrement: `ymir/agents/backport_agent.py:1316–1322`
+- Reset on retry (does NOT reset `attempts_remaining`): `ymir/agents/backport_agent.py:1086–1089`
+- Rebase workflow (same pattern): `ymir/agents/rebase_agent.py`
+- Agent runner (memory management): `ymir/agents/reasoning_agent/_runner.py`
+
+### Resource constraints
+
+| Resource               | Current                            |
+| ---------------------- | ---------------------------------- |
+| Namespace quota        | 14Gi requests / 16Gi limits        |
+| Backport agents (×2)   | 2Gi req / 5Gi limit each           |
+| Rebase agents (×2)     | 512Mi req / 1Gi limit each         |
+| Rebuild agents (×2)    | 512Mi req / 1Gi limit each         |
+| Triage agent (×1)      | 512Mi req / 1Gi limit each         |
+| Phoenix                | 256Mi req / 1Gi limit              |
+| **Total allocated**    | **~6.8Gi requests / ~16Gi limits** |
+| **Remaining headroom** | **~7.2Gi requests / ~0Gi limits**  |
+
+The namespace is nearly at its limit budget. Any solution that adds pods or
+increases per-pod memory needs a quota increase.
+
+### What consumes memory in a backport pod
+
+- **Source processing** — the dominant consumer. `rhpkg sources` downloads
+  compressed sources and `rhpkg prep` unpacks them. Large packages like
+  Firefox/Thunderbird have ~750MB compressed sources that expand to ~4GB
+  uncompressed. Both CPU and memory spike during these operations. The SE
+  deployment already requires `cpu: 1000m, memory: 6Gi` limits for
+  backport pods (2 of each kind) just to avoid OOM kills.
+- Git repos cloned to PVC (`/git-repos/{issue}/`) — disk-backed, but repo
+  operations spike RSS
+- BeeAI `UnconstrainedMemory` — conversation history grows with each LLM
+  call and tool result, never trimmed during a run
+- Multiple sub-agents created sequentially (backport, build, fix, log) — each
+  holds its own conversation state until garbage collected
+
+### Agent state and serialization
+
+- Workflow state is a Pydantic `BaseModel` — trivially serializable via
+  `model_dump_json()`.
+- Agent conversation history (`UnconstrainedMemory`) can be serialized via
+  `BaseMemory.to_json_safe()` and `Message.to_plain()` — all message content
+  types are Pydantic models with `model_dump()`. Deserialization requires a
+  small custom utility (dispatch on `role` to construct the right `Message`
+  subclass), since the framework doesn't provide a `from_json()` method, but
+  the constructors already accept dicts internally. BeeAI also offers
+  alternative memory classes (`SlidingMemory`, `TokenMemory`,
+  `SummarizeMemory`) that could help bound conversation size for long-running
+  fix loops.
+- Each `fix_build_error` call creates a **fresh agent** with no memory of prior
+  fix attempts — context is passed entirely through the prompt template (build
+  error message, upstream patches, repo path). This means conversation
+  continuity across the build wait is **not required**.
+
+## Alternative: Skip Build Verification Entirely
+
+Instead of optimizing the build wait, eliminate it — open the MR without a
+Copr build and rely on CI (GreenWave/OSCI gating) to catch failures. The
+preliminary testing agent already monitors these CI results on opened MRs.
+
+**Advantages:**
+
+- Eliminates the blocking problem completely — no architectural changes needed
+- MRs opened faster, CI runs in parallel across the dist-git infrastructure
+
+**Challenges:**
+
+- The cherry-pick fix loop in the backport agent uses Copr builds as
+  **inline feedback** — the agent builds, reads the error, fixes, rebuilds.
+  Without this, the agent can't iteratively fix build failures. The MR would
+  be opened with potentially broken code.
+- Someone (human or agent) would need to react to CI failures after the MR is
+  opened, adding a separate feedback loop outside the current workflow.
+- For the git-am and rebase workflows, skipping the build is more viable since
+  their retry loop restarts from scratch anyway — the build result doesn't
+  guide the fix, it just triggers a blind retry.
+
+**Verdict:** viable for git-am/rebase workflows where the build doesn't guide
+fixes. Not viable for the cherry-pick workflow without a way to close the
+feedback loop post-MR (e.g. a separate agent that reacts to CI failures). Could
+be offered as a configurable option per workflow type.
+
+## Proposed Solutions
+
+### Option A: Async Task Multiplexing (In-Process Concurrency)
+
+Run multiple tasks concurrently within the same pod using `asyncio`.
+
+```
+Pod main loop:
+  - Maintain a pool of N concurrent tasks (e.g. N=2–3)
+  - When a slot opens, brpop the next task
+  - Each task runs as an independent asyncio.Task
+  - Build polling (asyncio.sleep) yields the event loop to other tasks
+```
+
+**Implementation sketch:**
+
+```python
+import asyncio
+
+MAX_CONCURRENT_TASKS = 2
+sem = asyncio.Semaphore(MAX_CONCURRENT_TASKS)
+
+async def process_task(payload):
+    try:
+        task = Task.model_validate_json(payload)
+        # ... existing workflow ...
+    finally:
+        sem.release()  # free the slot even if the workflow raises
+
+active = set()  # strong references keep running tasks from being garbage collected
+while True:
+    await sem.acquire()  # wait for a free slot before taking a task off the queue
+    element = await fix_await(redis.brpop([backport_queue], timeout=30))
+    if not element:
+        sem.release()  # queue was empty; return the slot and poll again
+        continue
+    _, payload = element
+    t = asyncio.create_task(process_task(payload))
+    active.add(t)
+    t.add_done_callback(active.discard)
+```
+
+**Advantages:**
+
+- Minimal code changes — workflows are already async
+- Copr polling already uses `asyncio.sleep`, yielding the event loop — this
+  includes builds triggered inside the cherry-pick fix agent's LLM session,
+  since the `build_package` tool call goes through the same async polling loop
+- Single pod, single process — simpler operations
+
+**Challenges:**
+
+- **Memory**: each concurrent task needs its own source extraction + agent
+  conversation. Source processing is the dominant consumer — large packages
+  (Firefox, Thunderbird) need ~4Gi just for unpacked sources. With N=2, a
+  backport pod could need ~12Gi; with N=3, ~18Gi. The SE deployment already
+  runs backport pods at 6Gi limits for single tasks.
+- **MCP Gateway**: the MCP protocol multiplexes requests by ID over a single
+  SSE session (`_response_streams` dict keyed by `RequestId`), so concurrent
+  tool calls from the same pod work out of the box. The gateway tools
+  (e.g. `BuildPackageTool`) are stateless. No gateway changes needed.
+- **Error blast radius**: one task's crash could affect others in the same
+  process.
+
+**Memory mitigation:** eagerly clear agent conversation history after each
+sub-agent completes. However, the main memory pressure comes from source
+processing, not agent state — conversation history cleanup won't help with
+a 4Gi Firefox source extraction.
+
+**Verdict:** viability depends heavily on the package mix. For small packages
+(N=2 feasible), this is a quick win. For large packages, even N=2 may exceed
+memory limits. Doesn't fundamentally solve the problem.
+
+### Option B: Split Build-Wait from Fix into Separate Workflow Phases
+
+Decouple the workflow into two independently-schedulable phases.
+
+```
+Phase 1 — Patch & Submit:
+  1. Fork dist-git, apply patches
+  2. Generate SRPM
+  3. Submit to Copr (fire and forget — just get build_id)
+  4. Serialize workflow state to Redis
+  5. Return to queue loop immediately
+
+Build Monitor (UMB consumer or lightweight polling service):
+  - One small pod (~256Mi)
+  - Subscribes to UMB topic /topic/VirtualTopic.devel.copr (build.end)
+    and matches build IDs against active tasks
+  - Alternatively, polls all active Copr builds every 30s in a single loop
+  - When a build completes → push result to "build_complete" queue
+
+Phase 2 — Verify & Fix:
+  1. Pop from build_complete queue
+  2. Deserialize workflow state
+  3. If build succeeded → post-processing (update release, MR, Jira)
+  4. If build failed → run fix agent → new SRPM → submit → back to monitor
+```
+
+**Advantages:**
+
+- Pod is never idle — always processing real work
+- Memory freed between phases (no state held in RAM during builds)
+- Scales naturally: one pod processes many tasks per hour
+- Clean separation of concerns
+- State serialization is straightforward (Pydantic models → JSON in Redis)
+
+**Challenges:**
+
+- Agent conversation history is lost between phases. However, the fix agent
+  already creates a **fresh agent** per attempt with full context in its prompt,
+  so nothing is actually lost.
+- Upstream repo on PVC (`/git-repos/{issue}-upstream/`) must survive between
+  phases — this already works.
+- Two workflow entry points instead of one — more complex orchestration.
+- Needs a new build monitor service (small but new component).
+
+Copr does not support outgoing webhooks, but the internal Copr instance
+publishes build completion events on the UMB (Unified Message Bus) under
+`/topic/VirtualTopic.devel.copr` with topic `build.end`. Messages include
+build ID, status (`SUCCEEDED`/failed), project, owner, and chroot. A UMB
+consumer could replace polling entirely for Options B/C.
+
+Note: this option only covers builds triggered at the **workflow step** level
+(`run_build_agent`). The cherry-pick fix agent calls `build_package` as a tool
+**during its LLM session** — that build still blocks inside `fix_agent.run()`.
+Splitting that would require intercepting tool calls mid-agent-run, which the
+BeeAI framework doesn't support. In practice this means the fix loop still
+blocks, but only the fix iterations (not the initial build or the git-am/rebase
+retries).
+
+**Verdict:** addresses the main source of blocking (the initial build and
+git-am/rebase retries). The cherry-pick fix loop's internal builds remain
+blocking but are shorter (the fix agent only rebuilds after making targeted
+changes).
+
+### Option C: Context Switching with Externally-Stored State
+
+When the agent reaches a build wait, serialize the **entire agent state**
+(including conversation history) to external storage, release the pod, and
+resume later. Redis/Valkey is the natural choice since it's already deployed
+with a persistent 2Gi NFS-backed volume. S3 is another option but would
+require new infrastructure. This pattern (storing chat history in external
+storage during context switches) is used by other teams in their projects.
+
+```
+1. Agent runs until build_package is called
+2. build_package returns immediately with build_id
+3. Serialize to Redis:
+   - Pydantic workflow State
+   - BeeAI conversation history (all messages)
+   - Git state (commit hash, branch, dirty files)
+4. Push task into "waiting_for_build" sorted set in Redis
+5. Pod picks up next task
+6. Monitor detects build completion → push to "resume" queue
+7. Pod loads state from Redis, reconstructs agent, continues
+```
+
+This is the only option that could handle the cherry-pick fix loop's internal
+builds — serializing mid-agent-run when `build_package` is called as a tool,
+then resuming the agent with its conversation history intact after the build
+completes.
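The serialize and resume steps (3 and 7) can be illustrated with a small round trip. This is a minimal sketch using stand-in types: the three message dataclasses, `dump_snapshot`, and `load_snapshot` are hypothetical and are not BeeAI classes or APIs. It only shows the idea of dispatching on `role` when rebuilding messages; git state and runner state are left out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class UserMessage:
    text: str
    role: str = "user"


@dataclass
class AssistantMessage:
    text: str
    role: str = "assistant"


@dataclass
class ToolMessage:
    text: str
    role: str = "tool"


# Dispatch table: the stored `role` decides which class to rebuild.
MESSAGE_TYPES = {
    "user": UserMessage,
    "assistant": AssistantMessage,
    "tool": ToolMessage,
}


def dump_snapshot(workflow_state: dict, messages: list, build_id: int) -> str:
    """Serialize what is needed to resume after the build (step 3)."""
    return json.dumps(
        {
            "build_id": build_id,
            "workflow_state": workflow_state,
            "messages": [asdict(m) for m in messages],
        }
    )


def load_snapshot(raw: str) -> tuple[dict, list, int]:
    """Rebuild the state and typed messages from stored JSON (step 7)."""
    data = json.loads(raw)
    messages = [MESSAGE_TYPES[m["role"]](text=m["text"]) for m in data["messages"]]
    return data["workflow_state"], messages, data["build_id"]


history = [
    UserMessage("Backport the upstream fix"),
    AssistantMessage("Calling build_package"),
    ToolMessage("build submitted"),
]
raw = dump_snapshot({"package": "example", "attempts_remaining": 9}, history, 42)
state, restored, build_id = load_snapshot(raw)
```

In a real implementation the JSON string would be written to Redis under a key derived from the task, and the message payloads would come from the framework's own serialization helpers instead of `asdict`.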
+
+**Advantages:**
+
+- True suspend/resume — conversation history preserved
+- Pod utilization maximized, including during cherry-pick fix loop builds
+- No new storage infrastructure: state lives in the existing Redis/Valkey
+  (the shared build monitor pod is still needed)
+- Validated pattern (used by other teams)
+
+**Challenges:**
+
+- **BeeAI agent reconstruction**: conversation history is serializable via
+  `to_json_safe()`/`to_plain()` (deserialization needs a small custom
+  utility). The harder part is reconstructing `ReasoningAgentRunner` state —
+  beyond messages, `ReasoningAgentRunState` holds an `iteration` counter,
+  `steps` list (with `Tool` instances and `ToolOutput` objects — not trivially
+  serializable), and `usage`/`cost` tracking. The runner itself also holds
+  retry counters (`_iteration_error_counter`, `_global_error_counter`) and
+  tool call cycle detection state. Tool references, LLM config, and
+  requirement evaluations are deterministic from config and can be re-created.
+- **Prompt caching**: after restoring state, the LLM prompt cache will be cold
+  (cache miss on the first call), increasing cost and latency for that
+  request. This is a cost tradeoff, not a correctness issue — cache control
+  injection points are recomputed dynamically on each LLM call based on
+  message positions.
+- **Complexity**: most complex option, requires deep framework-level changes.
+
+**State size estimate:**
+
+- Workflow State JSON: ~1–5 KB
+- Agent conversation (up to 255 iterations × ~2 KB per message pair): ~500 KB
+- Well within Redis capacity (2Gi PVC, current usage is small queues).
+
+The same serialization mechanism could also be used to **pass context between
+agents in the end-to-end workflow** (triage → backport/rebase). Currently the
+triage agent's conversation history — its reasoning about the issue, upstream
+research, CVE analysis — is discarded when the task is enqueued. Only a small
+structured payload (`BackportData`/`RebaseData` with package name, patch URLs,
+justification) survives the handoff via `Task(metadata=state.model_dump())`.
+With serialized conversation history in Redis, the downstream agent could load
+the triage agent's research as additional context, potentially making better
+decisions about patch application and conflict resolution.
+
+**Verdict:** most elegant but requires work to serialize/reconstruct full agent
+state. The conversation history part is solvable now; the runner state
+reconstruction is the remaining gap. Worth investigating if scale demands grow
+beyond what Option B handles.
+
+### Build Monitor Service (shared component for Options B and C)
+
+Options B and C both need a way to detect when Copr builds finish. This would
+be a dedicated lightweight service (~256Mi pod) — either a UMB consumer
+listening on `/topic/VirtualTopic.devel.copr` (`build.end`) or a polling loop.
+The UMB approach is event-driven and near-instant.
+
+This is not a standalone solution — the agent pod still blocks waiting for the
+result without one of the options above.
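The polling variant of such a monitor can be sketched as a single loop that checks every tracked build and hands finished ones to a callback. This is a rough sketch under stated assumptions: `monitor_builds`, `fetch_status`, `publish`, and the state names in `TERMINAL_STATES` are hypothetical, and the demo uses scripted statuses where a real service would query Copr and push to the `build_complete` queue.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical terminal states; the real Copr state names may differ.
TERMINAL_STATES = frozenset({"succeeded", "failed"})


async def monitor_builds(
    active: dict[int, str],
    fetch_status: Callable[[int], Awaitable[str]],
    publish: Callable[[str, int, str], Awaitable[None]],
    interval: float = 30.0,
) -> None:
    """Poll all tracked builds in one loop until none remain.

    `active` maps build_id -> task_id and is mutated in place. A long-running
    service would keep looping and pick up newly registered builds instead of
    returning when the mapping is empty.
    """
    while active:
        for build_id, task_id in list(active.items()):
            status = await fetch_status(build_id)
            if status in TERMINAL_STATES:
                await publish(task_id, build_id, status)
                del active[build_id]
        if active:
            await asyncio.sleep(interval)


async def demo() -> list[tuple[str, int, str]]:
    # Scripted status source: each build reports "running" once, then finishes.
    scripted = {
        101: iter(["running", "succeeded"]),
        102: iter(["running", "failed"]),
    }
    completed: list[tuple[str, int, str]] = []

    async def fetch_status(build_id: int) -> str:
        return next(scripted[build_id])

    async def publish(task_id: str, build_id: int, status: str) -> None:
        completed.append((task_id, build_id, status))

    tracked = {101: "TASK-1", 102: "TASK-2"}
    await monitor_builds(tracked, fetch_status, publish, interval=0)
    return completed


result = asyncio.run(demo())
```

Because all builds share one loop, the cost of waiting no longer scales with the number of agent pods; a UMB consumer would replace the `fetch_status` polling with message handling but could keep the same `publish` step.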
+
+## Comparison Matrix
+
+|                             | Option A: Async Multiplexing  | Option B: Split Phases           | Option C: Redis State             |
+| --------------------------- | ----------------------------- | -------------------------------- | --------------------------------- |
+| **Effort**                  | Low                           | Medium                           | High                              |
+| **Memory impact**           | N× current per pod            | Neutral (freed between phases)   | Neutral (state in Redis)          |
+| **Pod utilization**         | Improved (N tasks)            | Optimal (never idle)             | Optimal (never idle)              |
+| **New infrastructure**      | None                          | Build monitor pod (UMB consumer) | Monitor pod (uses existing Redis) |
+| **Conversation continuity** | Preserved (in-memory)         | Not needed (fresh agent)         | Preserved (serialized)            |
+| **Framework changes**       | None                          | Workflow refactoring             | Agent serialization layer         |
+| **Quota impact**            | Need ~22Gi+ limits            | Need ~17Gi limits (+monitor)     | Need ~17Gi limits (+monitor)      |
+| **Risk**                    | Memory pressure, blast radius | Orchestration complexity         | Framework coupling                |
+
+## Recommendation
+
+### Short-term: Option A with N=2
+
+- Add semaphore-based task pool to the main loop
+- Run 2 tasks concurrently per backport pod
+- Profile actual RSS first; if two concurrent tasks stay well under the
+  current 5Gi limit, this is easy
+- Increase pod memory to ~8Gi if needed (requires quota bump to ~22Gi)
+- Estimated effort: 1–2 days
+
+### Medium-term: Option B (Split Phases)
+
+- Refactor workflow into "submit" and "verify" phases
+- Add lightweight Copr build monitor service
+- State serialization via Pydantic `model_dump_json()` to Redis
+- Conversation history intentionally not preserved — fix agent already gets
+  full context through prompt templates
+- Estimated effort: 1–2 weeks
+
+### Long-term: Option C (if needed)
+
+- Only justified if workload grows significantly
+- Build serialization layer for BeeAI agents, store state in existing Redis
+- Depends on framework support
+
+## Open Questions
+
+1. **What is the actual RSS of a backport pod during source processing?**
+   Feedback from the SE deployment shows large packages (Firefox, Thunderbird)
+   need ~4Gi just for source extraction, making even N=2 concurrency
+   challenging. Profiling smaller packages would clarify the typical case.
+2. **What's the namespace quota expansion path?** Determines feasibility of
+   Option A at higher concurrency.
+3. **How large are typical agent conversations?** Affects state size
+   estimates and memory profiling for Option A.