Skip to content

fix: raise ContainerNotRunning when container dies after timeout#807

Open
GopalGB wants to merge 1 commit intoSWE-agent:mainfrom
GopalGB:fix/container-timeout-silent-failure
Open

fix: raise ContainerNotRunning when container dies after timeout#807
GopalGB wants to merge 1 commit intoSWE-agent:mainfrom
GopalGB:fix/container-timeout-silent-failure

Conversation

@GopalGB
Copy link
Copy Markdown

@GopalGB GopalGB commented Apr 4, 2026

Summary

  • Adds ContainerNotRunning exception (subclass of InterruptAgentFlow) that terminates the agent loop cleanly when the Docker container dies after container_timeout expires
  • Detects container-dead markers ("No such container", "is not running") in both subprocess output and exception messages
  • Prevents the agent from silently burning API calls after the container is gone

Problem

When container_timeout expires, the container's sleep process exits and Docker removes it (--rm). Subsequent docker exec calls fail, but execute() catches the exception and returns it as a normal observation. The model keeps issuing commands until step_limit or cost_limit is exhausted, wasting API calls and cost.

Solution

ContainerNotRunning extends InterruptAgentFlow, so the agent's existing run() loop handles it identically to LimitsExceeded or Submitted: it adds the exit message, saves the trajectory, and stops.

Test plan

  • Unit test: ContainerNotRunning raised when subprocess returns dead-container output (non-zero returncode)
  • Unit test: ContainerNotRunning raised when subprocess raises exception with dead-container message
  • Integration test: ContainerNotRunning raised after real container_timeout expires (requires Docker, marked @pytest.mark.slow)
  • All existing tests pass

Fixes #803

When a Docker container exits after container_timeout, subsequent
docker exec calls fail silently and the agent wastes API calls until
step_limit or cost_limit is exhausted. This detects container-dead
markers in both the subprocess output and exception messages, raising
ContainerNotRunning (an InterruptAgentFlow subclass) so the agent's
run loop terminates cleanly and saves the trajectory.

Fixes SWE-agent#803
@GopalGB
Copy link
Copy Markdown
Author

GopalGB commented May 7, 2026

Friendly ping — this PR has been clean (mergeStateStatus: CLEAN, pre-commit.ci passed) for a month. The fix raises ContainerNotRunning instead of a misleading exception when the container dies after timeout, surfacing the real failure mode to operators. Happy to address any review feedback. cc maintainers.

@klieret
Copy link
Copy Markdown
Member

klieret commented May 7, 2026

Hi, thanks for opening this, I'm currently catching up with issues & PRs, we were in a big grind to release programbench. Will take a look soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

container_timeout expiry causes silent failure loop instead of clean agent exit

2 participants