Reap hung builds with a per-job timeout #267

Merged: ericmj merged 2 commits into main from runner-job-timeout on Jun 14, 2026

ericmj (Member) commented Jun 14, 2026

A job that hangs — a stuck docker build, docker push, wget, etc. — blocks its runner task indefinitely and keeps counting its weight against the shared concurrency budget. Once enough builds wedge to saturate the budget, the agent stops pulling any new :shared jobs (Erlang/Elixir/OTP builds) while separate-budget DockerManifest jobs keep flowing — so the agent looks alive but builds nothing until the wedged processes are killed by hand.

Two layers:

Per-job timeout (Bob.Script). Every shell-out job (erlang/elixir/manifest/otp/clean) runs through Bob.Script, so the whole script is wrapped in GNU timeout there — bounding the job no matter where it hangs, not just in docker build. Run without --foreground, timeout signals the entire process group, so the script's docker children die with it (verified on the agent: a grandchild process is reaped, no orphan); killing the docker client also makes the daemon cancel the build. timeout ships only on Linux, so non-Linux dev runs the script unwrapped.
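The wrapping step described above can be sketched as follows. This is a minimal illustration in Python, not Bob's actual Elixir code: `wrap_with_timeout`, `JOB_TIMEOUT_SECONDS`, and the injectable `which` parameter are hypothetical names chosen for the example. It assumes only the standard GNU `timeout DURATION COMMAND [ARG]...` invocation.

```python
import shutil

# 3h, the threshold named in the description (constant name is hypothetical).
JOB_TIMEOUT_SECONDS = 3 * 60 * 60


def wrap_with_timeout(argv, seconds=JOB_TIMEOUT_SECONDS, which=shutil.which):
    """Return argv prefixed with GNU `timeout` when the binary is on PATH.

    `which` is injectable so the lookup can be faked in tests.
    """
    timeout_bin = which("timeout")
    if timeout_bin is None:
        # No `timeout` available (e.g. a non-Linux dev machine): run unwrapped.
        return list(argv)
    # Deliberately no --foreground: timeout then signals its whole process
    # group, so processes started by the script are signalled as well.
    return [timeout_bin, f"{seconds}s", *argv]
```

A caller could pass the result straight to `subprocess.run(wrap_with_timeout(["./build.sh"]))`. GNU `timeout` exits with status 124 when the command times out, so a caller can tell a timeout apart from an ordinary build failure.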

Runner per-task backstop (Bob.Runner). The master already reaps stale running rows after 3h (Bob.Queue.Maintenance), but nothing reaps the task on the agent, so a leak would persist if a task wedged outside the script. Bob.Runner now arms a per-task timer at the same threshold; on fire it kills the task to free its weight, reports the job failed, and pulls fresh work. It fires before the master sweep's tick-granular detection, so it wins the race and avoids spawning a duplicate via requeue.
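The backstop behaviour (kill the task when the timer fires, free its weight, report the job failed) can be modelled in a few lines. This is a toy sketch in Python with asyncio, not Bob.Runner itself: the `Runner` class, its `free` counter, and the `results` list standing in for reporting to the master are all assumptions made for illustration.

```python
import asyncio


class Runner:
    """Toy runner: each job holds `weight` units of a shared budget while it runs."""

    def __init__(self, budget, limit_seconds):
        self.free = budget
        self.limit_seconds = limit_seconds
        self.results = []  # (job name, status); stands in for reporting to the master

    async def run(self, name, weight, job):
        self.free -= weight
        try:
            # wait_for cancels the job if the per-task timer fires first.
            await asyncio.wait_for(job(), timeout=self.limit_seconds)
            self.results.append((name, "success"))
        except asyncio.TimeoutError:
            self.results.append((name, "failed"))
        finally:
            # The weight is released whether the job finished or was killed.
            self.free += weight


async def demo():
    runner = Runner(budget=2, limit_seconds=0.05)

    async def hung():
        await asyncio.sleep(3600)  # simulates a wedged build

    async def quick():
        await asyncio.sleep(0)

    await asyncio.gather(
        runner.run("otp-21", 1, hung),
        runner.run("elixir", 1, quick),
    )
    return runner
```

Running `asyncio.run(demo())` leaves the full budget free again, with the hung job recorded as failed and the quick job as successful, which is the property the backstop is meant to guarantee.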

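All three timeouts (the script's GNU timeout wrapper, the runner's per-task timer, and the master's stale-row sweep) are 3h. This PR does not fix the underlying OTP-21 parallel-make deadlock that triggered this; those builds now fail cleanly instead of wedging the agent.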

A build that hangs (e.g. a wedged `docker build`) blocks its runner
task forever and keeps counting its weight against the shared
concurrency budget. Once enough builds wedge to saturate the budget the
agent stops pulling any new :shared jobs (Erlang/Elixir/OTP builds)
while separate-budget DockerManifest jobs keep flowing, so the agent
looks alive but builds nothing.

The master already reaps stale running rows after 3h
(Bob.Queue.Maintenance), but nothing reaps the task on the agent, so
the weight leak persists. Arm a per-task timer in Bob.Runner mirroring
that threshold: on fire, kill the task to free its weight, report the
job failed, and pull fresh work. The runner's exact-3h timer fires
before the master sweep's tick-granular detection, so it wins the race
and avoids spawning a duplicate build via requeue.

ericmj changed the title from "Time out runner tasks that outlive the master's stale-job threshold" to "Reap hung builds: per-task runner timeout + docker build timeout" on Jun 14, 2026

A job can hang anywhere in its script, at a stuck `docker build`,
`docker push`, or `wget`, and every shell-out job
(erlang/elixir/manifest/otp/clean) runs through Bob.Script. Wrap the
whole script in GNU `timeout` there. Run without --foreground, it signals
the entire process group, so the script's `docker` children die with it;
killing the `docker` client also makes the daemon cancel the build. The
limit sits at the same 3h as the runner's per-task backstop. `timeout`
ships only on Linux, so non-Linux dev runs the script unwrapped.

ericmj force-pushed the runner-job-timeout branch from f2683b6 to 443c859 on June 14, 2026 19:10
ericmj changed the title from "Reap hung builds: per-task runner timeout + docker build timeout" to "Reap hung builds with a per-job timeout" on Jun 14, 2026
ericmj merged commit fce3afc into main on Jun 14, 2026
4 checks passed
ericmj deleted the runner-job-timeout branch on June 14, 2026 19:16