[2.0] Add Generals.io bot arena task #142

Merged: joyemang33 merged 1 commit into main from codex/generals-io-bot-task on Jun 9, 2026
@joyemang33

Contributor

Summary

Add a new Frontier-CS 2.0 generals_io_bot task: agents submit patches to a Generals.io bot implementation and are evaluated in a Harborized arena against multiple baseline bots.

This PR includes:

  • A self-contained 2.0/problems/generals_io_bot task with task config, statement, evaluator, reference patch, Harbor app files, and upstream license credit.
  • Task-specific agent/judge Docker images for the Generals runtime and bundled bot dependencies.
  • Patch-based submission flow where an empty patch is a valid baseline submission.
  • Full-baseline intermediate and final evaluation, scored by win rate plus a speed tiebreaker.
  • Minimal 2.0 judge/template support needed by this task: empty submissions, final reruns in a child process, and a task-configurable async start method.
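
As a rough illustration of "win rate plus a speed tiebreaker", the sketch below ranks submissions by win rate first and uses the mean length of won games only to break ties. The `MatchResult` type, its `turns` field, and the choice of mean turns in wins as the speed measure are assumptions made for illustration. The task's actual evaluator is not shown in this description and may combine the two signals differently, for example into a single numeric score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchResult:
    """Hypothetical record of one arena game (not the task's real schema)."""
    won: bool
    turns: int  # assumed speed measure: game length in turns


def summarize(results):
    """Return (win_rate, mean_turns_in_wins) for a list of MatchResult."""
    if not results:
        return 0.0, float("inf")
    wins = [r for r in results if r.won]
    win_rate = len(wins) / len(results)
    # With no wins there is no speed to measure, so rank such bots last on ties.
    mean_win_turns = sum(r.turns for r in wins) / len(wins) if wins else float("inf")
    return win_rate, mean_win_turns


def ranking_key(results):
    """Sort key: higher win rate first, then faster wins as the tiebreaker."""
    win_rate, mean_win_turns = summarize(results)
    return (-win_rate, mean_win_turns)
```

Passing `ranking_key` to `sorted` orders a list of per-bot result lists from best to worst. Speed only matters when two bots have exactly the same win rate, which is what makes it a tiebreaker rather than a second objective.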

Please read CONTRIBUTING.md before submitting.

Type of Change

  • New research problem
  • New algorithmic problem
  • New Frontier-CS 2.0 problem
  • Bug fix
  • Documentation update
  • Other:

Testing

  • Built the Generals.io task-specific agent and judge Docker images.
  • Generated the Harbor task through the Frontier-CS 2.0 adapter.
  • Ran direct evaluator smoke tests for empty/reference-style submissions.
  • Ran judge HTTP submission smoke tests.
  • Ran a full Harbor trial with Codex:
{
  "reward": 0.3664351851851852,
  "score": 36.64351851851852,
  "score_unbounded": 36.64351851851852,
  "trial_status": "scored",
  "agent_status": "AgentTimeoutError",
  "successful_submissions": 2
}
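
The reported values suggest that `reward` is `score` divided by 100 and that the score was not clipped in this trial. The following sketch checks both relations on the result above. The factor of 100 is inferred from these numbers rather than taken from the evaluator, and `reward_consistent` is a hypothetical helper name.

```python
import json
import math

# Trial result copied from the PR description.
TRIAL_JSON = """
{
  "reward": 0.3664351851851852,
  "score": 36.64351851851852,
  "score_unbounded": 36.64351851851852,
  "trial_status": "scored",
  "agent_status": "AgentTimeoutError",
  "successful_submissions": 2
}
"""


def reward_consistent(trial, scale=100.0):
    """Check that reward * scale matches score (scale=100 is an assumption)."""
    return math.isclose(trial["reward"] * scale, trial["score"], rel_tol=1e-9)


trial = json.loads(TRIAL_JSON)
print("reward matches score / 100:", reward_consistent(trial))
print("score was not clipped:", trial["score"] == trial["score_unbounded"])
```

Both checks hold for this trial. Note that the trial was still scored even though the agent hit `AgentTimeoutError`, since two submissions had already succeeded.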

@joyemang33 joyemang33 marked this pull request as ready for review June 9, 2026 02:34
@joyemang33 joyemang33 merged commit 048689b into main Jun 9, 2026
1 check passed