wave 3: online RL with hidden state — POMDP, curiosity, hierarchical (5 stubs)#7
Merged
0bserver07 merged 6 commits into main, May 8, 2026
Conversation
… 3-partition env (Schmidhuber 1991)

A 1-D environment partitioned into three regions: A deterministic (K=4), B random (K=8, N(0,0.5) noise), C learnable-but-unlearned (K=128 fixed pseudo-random targets). At each step the agent picks a region; a per-region tabular world model M[r][c] is updated with an EMA; curiosity = max(0, mean(err[t-2W:t-W]) - mean(err[t-W:t])); the policy is softmax(beta*curiosity) with an eps-uniform fallback and a 200-step uniform burn-in.

Headline, reproduced on 10/10 seeds: visits(C) > visits(B) > visits(A), roughly 43% / 33% / 24% at the default config. Wallclock ~0.5 s for 5000 steps.

Files:
- curiosity_three_regions.py — env + model + curiosity policy + CLI
- visualize_curiosity_three_regions.py — 6 PNGs in viz/
- make_curiosity_three_regions_gif.py — ~460 KB GIF, 39 frames at 10 fps
- curiosity_three_regions.gif — the animation
- viz/ — static visualisations
- README.md — 8-section README

Deviations from Schmidhuber 1991 (FKI-149-91 not retrievable in full; reconstructed from the IJCNN 1991 abstract, the 2010 creativity-theory review, and the 2020 Miraculous Year retrospective): tabular per-context predictor instead of an online RNN; cycling-counter contexts instead of 1-D position; discrete region-selector actions instead of motor outputs; no separate confidence module C. All documented in README §Deviations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
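The curiosity rule and region policy quoted above fit in a few lines of numpy. A minimal sketch, assuming a flat per-region error history; `curiosity_bonus`, `region_policy`, and the default `W`/`beta`/`eps` values are illustrative, not the repo's API:

```python
import numpy as np

def curiosity_bonus(errors, W=50):
    """Curiosity = recent drop in mean prediction error (learning
    progress), clipped at zero, per the rule quoted above."""
    if len(errors) < 2 * W:
        return 0.0  # burn-in: not enough history for two windows
    older = np.mean(errors[-2 * W:-W])
    recent = np.mean(errors[-W:])
    return max(0.0, older - recent)

def region_policy(bonuses, beta=5.0, eps=0.1):
    """Softmax over per-region curiosity, mixed with an eps-uniform
    fallback so low-bonus regions are still occasionally revisited."""
    b = np.asarray(bonuses, dtype=float)
    p = np.exp(beta * (b - b.max()))  # max-shift for numerical stability
    p /= p.sum()
    return (1.0 - eps) * p + eps / len(p)
```

A region whose error is still falling gets a positive bonus, while region B's irreducible noise yields no learning progress, which is the mechanism behind the visits(C) > visits(B) headline.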
…e sequence (Schmidhuber, Zhao, Wiering 1997)

Sequence of 4 POM maze tasks (5x5, fixed layout, 4 different goal cells). Tabular softmax policy over (4 wall-sensors x 1 memory bit) -> 6 actions (N/S/E/W + set-mem-0 + set-mem-1). REINFORCE updates are batched as candidate modifications; the SSA stack rolls back modifications whose post-push reward rate is dominated by the older-tag forward rate (with a min_test_window guard so the rate estimate isn't sampling-noise-dominated).

Three regimes compared on identical compute: ssa (continual + SSA filter), no_ssa (continual + no filter), restart (re-init each task).

Headline (seed 0): SSA 1.0 solve rate on tasks 0-2, 0.7 on task 3, while no_ssa hits 0.0 on task 2 (dragged into a stuck policy by carried-over task-1 bias). 10-seed mean tail solve rate: SSA 0.83 vs no_ssa 0.70, a +19% relative improvement. Full reproduction in 1.7 s on M-series CPU.

Files:
- ssa_bias_transfer_mazes.py — maze + policy + REINFORCE + SSA, CLI
- visualize_ssa_bias_transfer_mazes.py — static PNGs (incl. 10-seed sweep)
- make_ssa_bias_transfer_mazes_gif.py — stack-evolution animation (1.9 MB)
- README.md — 8 sections per SPEC issue #1
- viz/*.png — 7 visualisation PNGs
- ssa_bias_transfer_mazes.gif — 80-frame stack-evolution animation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
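The SSA rollback test described above might look like the following sketch. The stack-entry layout (push time, cumulative reward at push, saved parameter snapshot) and the function name are assumptions for illustration:

```python
def ssa_check(stack, total_reward, t, min_test_window=200):
    """Success-Story check: pop stack entries whose reward-per-step since
    their push is dominated by the rate since the previous (older) entry.
    Each entry is (t_push, reward_at_push, saved_params). Returns the
    popped entries (newest first) so their saved params can be restored."""
    popped = []
    while len(stack) >= 2:
        t1, r1, _ = stack[-1]           # newest modification
        t0, r0, _ = stack[-2]           # the older tag it must beat
        if t - t1 < min_test_window:    # rate estimate still noise-dominated
            break
        rate_new = (total_reward - r1) / (t - t1)
        rate_old = (total_reward - r0) / (t - t0)
        if rate_new > rate_old:         # success story holds; stop popping
            break
        popped.append(stack.pop())      # roll back the dominated modification
    return popped
```

The `min_test_window` guard is the piece that prevents pop-avalanches driven by a handful of unlucky episodes right after a push.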
…tinuous obstacle avoidance (Schmidhuber 1991)

Hierarchical decomposition with two networks: a sub-goal generator (C_high) emits K=2 way-points per arena from start/goal/obstacle features; a deliberately obstacle-blind low-level policy (C_low) walks straight at whatever target it is given. The cost-of-the-path model is closed-form (a line-integral Gaussian penalty over obstacle distances), so dJ/d(sub-goal) is exact and the gradient flows directly back into C_high.

Headline (seed 0, 200 evaluation arenas, 7 s wallclock):
- C_high + C_low: 99.0% success, 1.0% collision
- Direct: 0.0% success, 100.0% collision

10-seed sweep mean: 98.5% +/- 1.1%. The Direct baseline is 0.0% on every seed because every arena has an obstacle anchored on the start->goal diagonal.

Files: subgoal_obstacle_avoidance.py (env + nets + train + eval + CLI), README.md (8 sections), make_subgoal_obstacle_avoidance_gif.py (648 KB GIF), visualize_subgoal_obstacle_avoidance.py (5 PNGs in viz/). Stub problem.py removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
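A minimal numpy sketch of the cost-model idea: path cost as polyline length plus a Gaussian obstacle penalty accumulated along sampled points. The repo differentiates its line-integral form exactly; here a central finite difference stands in for dJ/d(sub-goal), and all names and constants are illustrative:

```python
import numpy as np

def path_cost(start, goal, waypoint, obstacle, sigma=0.3, n=50):
    """Cost of the start -> waypoint -> goal polyline: its length plus a
    Gaussian obstacle penalty at n sampled points per segment."""
    pts = np.vstack([np.linspace(start, waypoint, n),
                     np.linspace(waypoint, goal, n)])
    length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    d2 = np.sum((pts - obstacle) ** 2, axis=1)
    return length + np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))

def grad_waypoint(f, w, h=1e-5):
    """Central finite-difference dJ/d(waypoint); the repo's closed-form
    line integral admits an exact gradient instead."""
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        e = np.zeros_like(w, dtype=float)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2.0 * h)
    return g
```

Because J is a smooth function of the way-point, this gradient can be pushed straight back into C_high's outputs, which is what lets the obstacle-blind C_low stay simple.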
…r POM (Wiering & Schmidhuber 1997)

POM environment:
- 9×5 zigzag maze, 29 free cells, 8 wall-mask observations, BFS optimal = 28 steps (matches the paper's 28-step claim).
- Alternating-direction corridors, so the dominant "corridor middle" observation requires opposite optimal actions in different rows -- the partial-observability trap that defeats memoryless Q-learning.
- Smaller than the paper's 62-cell maze (deviation, see README §Deviations).

HQ agent:
- M=5 ordered sub-agents (configurable) with per-sub-agent Q-tables and HQ-tables, plus the paper's control-transfer unit (sub-goal-match firing).
- SARSA(λ) Q-update with a replacing eligibility trace per sub-agent; the bootstrap target switches to the next sub-agent's Q at transfer.
- HQ.1, HQ.2, HQ.3 update rules from the paper for sub-goal scoring.
- Max-Boltzmann action selection, Max-Random sub-goal selection, linear p_max ramp 0 → 1 across training.
- Flat Q(λ) baseline (single Q-table) for comparison.

Visualisations:
- viz/maze.png, viz/learning_curves.png, viz/hq_tables.png, viz/q_tables.png, viz/subagent_trajectory.png
- hq_learning_pomdp.gif (40-frame training animation, 604 KB, under the 2 MB target).

Reproducibility:
- python3 hq_learning_pomdp.py --seed 0 is deterministic across runs (verified bit-identical headline numbers across two invocations).
- 21 s wallclock on M-series laptop CPU.

Honest reproduction gap: the paper claims HQ reaches the optimal 28 steps while flat Q fails entirely; on our small (29-cell) reproduction both methods reach the goal during stochastic training and both fail under fully greedy evaluation, so the headline gap does not cleanly reproduce. The implementation is faithful to the paper's algorithm (Q.1/Q.2 with eligibility, HQ.1/HQ.2/HQ.3, Max-Boltzmann, Max-Random); the small maze and laptop-budget training schedule are the likely culprits. Documented in README §Deviations and §Open questions; five follow-up experiments are listed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
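The SARSA(λ) step with a replacing trace and the transfer-time bootstrap switch can be sketched as follows, under assumed tabular shapes; the function name and default hyper-parameters are illustrative:

```python
import numpy as np

def sarsa_lambda_update(Q, E, s, a, r, s2, a2,
                        alpha=0.1, gamma=0.95, lam=0.9, q_next=None):
    """One SARSA(lambda) step with a replacing eligibility trace.
    At a control transfer, pass q_next = the NEXT sub-agent's Q[s2, a2]
    so the bootstrap target switches sub-agents, as described above."""
    E *= gamma * lam          # decay all traces
    E[s, a] = 1.0             # replacing (not accumulating) trace
    bootstrap = Q[s2, a2] if q_next is None else q_next
    delta = r + gamma * bootstrap - Q[s, a]
    Q += alpha * delta * E    # broadcast the TD error along the trace
    return Q, E
```

The only HQ-specific twist relative to flat SARSA(λ) is the `q_next` override at the moment the control-transfer unit fires.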
…ber 1991)
Pure-numpy implementation of the Schmidhuber 1991 NIPS-3 controller-through-
model recipe on a 5x5 T-maze. The agent observes only its 4 wall booleans
plus a 1-bit indicator that is non-zero only at the start cell at t=0:
+1 when the flag is at the top end of the T-junction, -1 when it is at
the bottom end. A memoryless agent cannot disambiguate at the T-junction.
Architecture: vanilla tanh RNN world-model M (40-d hidden) and recurrent
controller C (24-d hidden), both with hand-coded BPTT and Adam. Training
follows phase 1 (M on random + scripted rollouts) then 4 cycles of
phase-2 BPTT-through-M for C plus an M-refresh on C-driven rollouts
(Ha & Schmidhuber 2018 World Models pattern).
Two implementation tweaks were load-bearing: (a) straight-through estimator
on M's action input -- without it, soft action probs saturate as C peaks
and gradient on off-actions vanishes, capping C at 50% (feedforward). With
ST, C reaches 100% on solved seeds. (b) Indicator passed to M as a
side-input -- C's recurrent latch must work from observations alone, but
M's reward predictions are unreliable across 5 latch steps with vanilla
recurrence; passing the indicator separately to M only keeps reward
supervision correct without weakening the POMDP burden on C.
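Tweak (a) in a hand-coded-BPTT setting amounts to a two-line pair of hooks; this is a minimal sketch with illustrative names, not the repo's API:

```python
import numpy as np

def st_forward(probs, rng):
    """Forward: feed the world model M a hard one-hot action sampled
    from C's soft action probabilities."""
    a = rng.choice(len(probs), p=probs)
    return np.eye(len(probs))[a], a

def st_backward(grad_wrt_onehot):
    """Backward: pass M's gradient w.r.t. the one-hot input straight
    through to the soft probs, unchanged -- equivalent to feeding
    probs + stop_gradient(onehot - probs) in an autodiff framework."""
    return grad_wrt_onehot
```

Because the backward pass ignores the sampling, off-peak actions keep receiving gradient even after C's distribution has sharpened, which is exactly the failure mode the commit message describes without ST.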
Results (seed 0): recurrent C 100% (200/200) at mean 6 steps; feed-forward
C 0%; random walk 3.5%; wallclock 32s incl. baseline. Multi-seed 6/10
solve perfectly, 4/10 plateau at 50% (feedforward equivalent), 0/10 fail
entirely thanks to best-C snapshot-across-cycles. Held-out M MSE 0.0038.
Files: pomdp_flag_maze.py, make_pomdp_flag_maze_gif.py,
visualize_pomdp_flag_maze.py, README.md (8 sections), pomdp_flag_maze.gif
(102 KB), viz/{maze_layout, agent_paths, hidden_state, training_curves,
results_table}.png. problem.py removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(5 stubs)

Octopus merge of 5 wave-3 stubs per SPEC issue #1.
- wave-3-local/curiosity-three-regions: curiosity-driven model-building (3-partition env)
- wave-3-local/subgoal-obstacle-avoidance: sub-goal generation (2-D continuous arena)
- wave-3-local/pomdp-flag-maze: recurrent C+M for POMDP flag maze
- wave-3-local/ssa-bias-transfer-mazes: success-story algorithm (multi-task POM)
- wave-3-local/hq-learning-pomdp: hierarchical Q(lambda) on POMDP

All 5 verified by a separate audit subagent: numpy-only, deterministic, branch protocol followed (no wave-3-local on remote), all 8 README sections present, algorithmic faithfulness confirmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit Report — PR #7 wave 3 (5 stubs)

Wave 3 verdict: APPROVE. Independent review by a separate Explore subagent. All 5 stubs are algorithmically correct, deterministic, numpy-only, and branch-protocol-compliant.

Per-stub verdicts
Cross-cut findings
Algorithmic faithfulness (3 deep dives)
Honest gap-reporting
Two notes (non-blocking)
Reproduce results

agent-0bserver07 (Claude Code) on behalf of Yad — wave-3 audit subagent
0bserver07 added a commit that referenced this pull request on May 8, 2026
wave 3: online RL with hidden state — POMDP, curiosity, hierarchical (5 stubs)
Wave 3 — online RL with hidden state
Five stubs implementing Schmidhuber's 1991-1997 POMDP / curiosity / hierarchical-RL lineage per SPEC issue #1. Octopus-merged from 5 local-only wave-3-local/<slug> branches:
- curiosity-three-regions
- subgoal-obstacle-avoidance
- pomdp-flag-maze
- ssa-bias-transfer-mazes
- hq-learning-pomdp

Audit verdict (separate Explore subagent)
APPROVE for the wave with two notes:
pomdp-flag-maze: commit authored as agent-pomdp-flag-maze-builder@anthropic.com instead of agent-0bserver07 <agent-0bserver07@users.noreply.github.com>, despite the worktree's git config being set correctly. The subagent appears to have overridden the per-worktree identity. The other 4 commits have the correct identity. Code is correct and reproducible; the attribution drift can be fixed via amend + force-push if you want, but is non-blocking. Flagged for v1.5 protocol tightening.

What the audit actually verified (pass)
- No wave-3-local/* branches on origin (per the corrected post-wave-1 protocol).
- No __pycache__ committed.

Per-stub deviations (in each stub's §Deviations)
- min_test_window=200 guard against pop-avalanches; small 5×5 mini-POMDP — the restart baseline matches SSA at this scale (documented).
- gamma^Δt · HV ≤ R_goal bound prevents per-corridor specialization on small mazes.

Citation gaps (in each stub's §Open questions)
All five 1991/1997 papers (FKI-149-91, ICANN-91, NIPS-3 chapter, ML 28, Adaptive Behavior 6(2)) are partially unretrievable. Reconstructions from secondary sources: 2015 Deep Learning in NN survey §6.7-§6.10 (curiosity, hierarchical RL, POMDP), Schmidhuber 1991 Curious model-building control systems IJCNN, the 2020 Miraculous Year 1990–1991 retrospective.
Branch protocol (continued from wave 2)

Each teammate worked on a local-only wave-3-local/<slug> branch — not pushed to origin. The lead consolidated via a local octopus merge into wave/3-rl-hidden-state, the only remote branch from this wave.
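Under this protocol, the consolidation step would look roughly like the following (the base branch and exact invocation are assumptions; `git merge` with more than one head produces an octopus merge commit):

```shell
# Sketch of the wave-3 consolidation; only the wave branch is pushed.
git checkout -b wave/3-rl-hidden-state main
git merge \
  wave-3-local/curiosity-three-regions \
  wave-3-local/subgoal-obstacle-avoidance \
  wave-3-local/pomdp-flag-maze \
  wave-3-local/ssa-bias-transfer-mazes \
  wave-3-local/hq-learning-pomdp       # >2 parents => octopus merge
git push origin wave/3-rl-hidden-state  # wave-3-local/* stay local-only
```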
7 + 5 + 5 = 17/50 v1 stubs done (34%). 7 waves remaining = 33 stubs.
agent-0bserver07 (Claude Code) on behalf of Yad