Reap hung builds with a per-job timeout #267
Merged
A build that hangs (e.g. a wedged `docker build`) blocks its runner task forever and keeps counting its weight against the shared concurrency budget. Once enough builds wedge to saturate the budget the agent stops pulling any new :shared jobs (Erlang/Elixir/OTP builds) while separate-budget DockerManifest jobs keep flowing, so the agent looks alive but builds nothing. The master already reaps stale running rows after 3h (Bob.Queue.Maintenance), but nothing reaps the task on the agent, so the weight leak persists. Arm a per-task timer in Bob.Runner mirroring that threshold: on fire, kill the task to free its weight, report the job failed, and pull fresh work. The runner's exact-3h timer fires before the master sweep's tick-granular detection, so it wins the race and avoids spawning a duplicate build via requeue.
A job can hang anywhere (a stuck `docker build`, `docker push`, or `wget`), not just in `docker build`, and every shell-out job (erlang/elixir/manifest/otp/clean) runs through Bob.Script. Wrap the whole script in GNU `timeout` there. Run without `--foreground`, it signals the entire process group, so the script's `docker` children die with it; killing the `docker` client also makes the daemon cancel the build. The limit sits at the same 3h threshold as the runner's per-task backstop. `timeout` ships only on Linux, so non-Linux dev runs the script unwrapped.
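The wrap-or-run-unwrapped decision can be sketched in shell. This is a minimal illustration, not the code in Bob.Script: `run_with_timeout` is a hypothetical helper name, and the limit is passed in seconds (10800 s = 3h) purely for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not Bob.Script's API): wrap a command in `timeout`
# when the binary exists, otherwise run the command unwrapped.
run_with_timeout() {
  local limit_seconds="$1"
  shift
  if command -v timeout >/dev/null 2>&1; then
    # timeout exits with status 124 if the command is still running at the deadline.
    timeout "$limit_seconds" "$@"
  else
    # No timeout binary on this machine: no deadline is enforced.
    "$@"
  fi
}

# A command that finishes well inside the limit keeps its own exit status.
fast_status=0
run_with_timeout 10800 true || fast_status=$?

# A command that outlives the limit is killed (when timeout is available).
slow_status=0
run_with_timeout 1 sleep 5 || slow_status=$?

echo "fast=$fast_status slow=$slow_status"
```

The caller can treat status 124 as "job timed out" and report the job as failed; on a machine without `timeout` the second call simply runs to completion.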
A job that hangs (a stuck `docker build`, `docker push`, `wget`, etc.) blocks its runner task indefinitely and keeps counting its `weight` against the shared concurrency budget. Once enough builds wedge to saturate the budget, the agent stops pulling any new `:shared` jobs (Erlang/Elixir/OTP builds) while separate-budget `DockerManifest` jobs keep flowing, so the agent looks alive but builds nothing until the wedged processes are killed by hand.

Two layers:

- Per-job `timeout` (`Bob.Script`). Every shell-out job (erlang/elixir/manifest/otp/clean) runs through `Bob.Script`, so the whole script is wrapped in GNU `timeout` there, bounding the job no matter where it hangs, not just in `docker build`. Run without `--foreground`, `timeout` signals the entire process group, so the script's `docker` children die with it (verified on the agent: a grandchild process is reaped, no orphan); killing the `docker` client also makes the daemon cancel the build. `timeout` ships only on Linux, so non-Linux dev runs the script unwrapped.
- Runner per-task backstop (`Bob.Runner`). The master already reaps stale `running` rows after 3h (`Bob.Queue.Maintenance`), but nothing reaps the task on the agent, so a leak would persist if a task wedged outside the script. `Bob.Runner` now arms a per-task timer at the same threshold; on fire it kills the task to free its weight, reports the job failed, and pulls fresh work. It fires before the master sweep's tick-granular detection, so it wins the race and avoids spawning a duplicate via requeue.

All three timeouts are 3h. Does not fix the underlying OTP-21 parallel-`make` deadlock that triggered this; those builds now fail cleanly instead of wedging the agent.
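The process-group behaviour can be checked with a small stand-in experiment. This is a sketch assuming GNU coreutils `timeout` on Linux; the inner `bash -c` plays the role of the job script and the backgrounded subshell plays the role of a `docker` child. None of it is taken from the PR's code.

```shell
#!/usr/bin/env bash
# Demonstration only; assumes GNU timeout (coreutils) is installed.
workdir="$(mktemp -d)"

# The grandchild would create "$workdir/survived" after 2s if it outlived
# the 1s deadline. "$workdir" is passed to the inner bash as $0.
status=0
timeout 1 bash -c '(sleep 2; touch "$0/survived") & wait' "$workdir" || status=$?

# Wait past the moment the grandchild would have written its marker file.
sleep 3
if [ -e "$workdir/survived" ]; then
  orphaned=yes
else
  orphaned=no
fi
rm -rf "$workdir"

echo "status=$status orphaned=$orphaned"
```

On a Linux host with GNU coreutils this should print `status=124 orphaned=no`: the deadline was hit and the grandchild died with the rest of the group. Adding `--foreground` to the `timeout` call is expected to leave the grandchild running, because in that mode only the direct child is signalled.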