wave 4: history compression + fast-weights + self-reference (5 stubs) #8

Open

0bserver07 wants to merge 6 commits into main from wave/4-history-fastweights

Conversation

@0bserver07

Wave 4 — history compression + fast-weights + self-reference

Five stubs implementing Schmidhuber's 1991-1993 sequence-indexing and meta-learning lineage per SPEC issue #1. Octopus-merged from 5 local-only wave-4-local/<slug> branches.

| Stub | Method | Paper | Headline |
| --- | --- | --- | --- |
| chunker-22-symbol | Two-stack neural sequence chunker | Schmidhuber 1991/1992 | 99.5% label accuracy, 10/10 seeds at 1500 blocks (paper: 13/17 at <5k); A-alone baseline at chance |
| chunker-very-deep-1200 | Very-deep history compression | Schmidhuber 1993 (Habilitation) | 599.5× depth-reduction at T=1200; chunker 100% recall vs single-net 0% (gradient vanishes by t=4) |
| fast-weights-unknown-delay | Slow-net controlling fast-weight memory | Schmidhuber 1992 (NC 4(1)) | 100% bit-accuracy at K=5-30 trained / K=1-60 extrapolation; 10/10 seeds; ~3 s wall |
| fast-weights-key-value | Key/value binding via outer-product (linear-Transformer ancestor) | Schmidhuber 1992 (NC 4(1)) | Slow projector boosts mean retrieval cosine 0.428 → 0.754 (1.76×); multi-seed 0.75-0.81; numerical grad-check <1e-9 |
| self-referential-weight-matrix | RNN reading/writing its own weight matrix | Schmidhuber 1993 (ICANN-93) | 99.6% on 4-way boolean meta-learning; 8/8 seeds > 0.95; manual BPTT grad-check 8e-7 |

Audit verdict (separate Explore subagent)

APPROVE across all 5 stubs.

  • Numpy-only (hard pass): All 5 verified. Imports limited to numpy/matplotlib/PIL/imageio/stdlib.
  • Determinism (3 spot-checks): chunker-22-symbol, fast-weights-key-value, self-referential-weight-matrix — each ran twice with seed 0, identical output.
  • Branch protocol: zero wave-4-local/* branches on origin (verified).
  • Algorithmic faithfulness (3 deep dives):
    • chunker-22-symbol: two-network chunker verified — automatizer A predicts next symbol; chunker C only sees A's prediction failures (surprises, threshold 0.95).
    • fast-weights-key-value: outer-product math verified — line 173 W_fast = values.T @ K, retrieval y = W_fast @ k_q. Exact linear-attention ancestor.
    • self-referential-weight-matrix: W_eff = W_slow + W_fast confirmed; W_fast rewritten by net's own outputs each step.
  • Cleanliness: zero hardcoded paths, zero TODO/FIXME, zero __pycache__ committed.
  • Git authors: all 5 commits authored by agent-0bserver07 <agent-0bserver07@users.noreply.github.com> (no drift like wave 3 had).
  • GIF sizes: 370 KB to 589 KB (all under 2 MB cap).

Per-stub deviations (in each stub's §Deviations)

  • chunker-22-symbol: BPTT instead of RTRL; A's loss muted at boundary transition (t=20); C's h_c=0 readout for clean i.i.d. stream (recurrence accumulates noise from spurious early surprises).
  • chunker-very-deep-1200: synthetic trigger-recall task (Habilitationsschrift not retrievable); threshold-based surprise detector (vs probability-mass test); decoupled stage training; effective-depth metric defined as 1% of terminal gradient norm.
  • fast-weights-unknown-delay: sigmoid gate on every write (modern instantiation); slow net feedforward (no recurrence) — forces memory into W_fast; numerical-gradient test passes at 1e-6 max relative error.
  • fast-weights-key-value: bias-direction trick (raw keys carry a shared bias direction b so the slow projector W_K has a non-trivial denoising job; pure i.i.d. Gaussian keys are already near-orthogonal in d=8 and training would have nothing to learn).
  • self-referential-weight-matrix: continuous-pointer relaxation instead of discrete addresses (Schlag 2021 / Irie 2022 standard); 4-way boolean meta-learning task (paper's "small toy proof of concept" instantiation).

Citation gaps (in each stub's §Open questions)

  • 1991 FKI-148-91 / 1992 Neural Computation 4(2) chunker paper (retrievable)
  • 1993 Habilitationsschrift (not publicly retrievable; reconstructed from 1992 NC chunker + 2015 DL in NN §6.4-6.5)
  • 1992 NC 4(1) fast-weights paper (retrievable; supplemented with Schlag/Irie/Schmidhuber 2021 linear-Transformer formalization)
  • 1993 ICANN-93 self-referential paper (partially retrievable; supplemented with 2018 Irie/Schlag/Schmidhuber follow-up)

Wave 0 → 1 → 2 → 3 → 4 progress

7 + 5 + 5 + 5 = 22/50 v1 stubs done (44%). 5 waves remaining = 28 stubs.


agent-0bserver07 (Claude Code) on behalf of Yad

agent-0bserver07 and others added 6 commits May 7, 2026 08:31
…ast-weight updates across an unknown distractor gap (Schmidhuber 1992)

The 1992 Neural Computation paper introduced the slow-net + fast-weight-matrix
decomposition --- the slow programmer S emits a key/value/query/gate at each
step, and writes a rank-1 outer product `eta * g_t * outer(v_t, k_t)` into a
fast-weight matrix W_fast. The unknown-delay pattern-association task: pattern
P presented at t=0 with a store flag; K~Uniform[5,30] random-distractor steps;
recall flag at t=K+1; reproduce P. The slow net here is purely feedforward, so
the only path that carries information across the gap is W_fast.
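
A minimal numpy sketch of the rank-1 write/read rule described above. This is illustrative only: the shapes, `eta`, and the gate value are assumptions here, whereas in the stub the key, value, query, and gate are all emitted by the trained slow net.

```python
import numpy as np

d_key, d_val = 8, 8
rng = np.random.default_rng(0)
W_fast = np.zeros((d_val, d_key))            # fast-weight memory, reset per episode

def fast_weight_step(W_fast, k_t, v_t, g_t, eta=1.0):
    """Rank-1 write: the slow net's key/value/gate program one memory update."""
    return W_fast + eta * g_t * np.outer(v_t, k_t)

def fast_weight_read(W_fast, q_t):
    """Retrieval is a plain matrix-vector product with the query."""
    return W_fast @ q_t

# Store a pattern under key k (gate spikes), sit through distractors (gate ~ 0), recall.
k = rng.standard_normal(d_key); k /= np.linalg.norm(k)
v = rng.standard_normal(d_val)
W_fast = fast_weight_step(W_fast, k, v, g_t=0.9)   # store step
y = fast_weight_read(W_fast, k)                    # recall step: query with the same key
print(np.allclose(y, 0.9 * v * (k @ k)))           # True: v scaled by gate * ||k||^2
```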

Pure numpy: 917-parameter slow net, manual batched BPTT through the rank-1
fast-weight updates (verified against numerical gradients to 1e-6 max
relative error via `--gradcheck`). Adam, batch=32, gradient norm clipped at
1.0.

Results (seed 0, 1500 iters, ~3 s wallclock):
- 100.00% bit-accuracy at recall over delays K=5..30 (50 episodes per K).
- 100.00% bit-accuracy when extrapolated to delays K=1..60 (the algorithm
  the slow net learns is delay-independent by construction).
- Deterministic across re-runs at the same seed.
- Multi-seed: 10/10 seeds reach 100.00% within 1500 iters.

Visualizations show the gate spiking at the store step (~0.9), staying near
0.1 across all distractor steps, and the recall output overlaying the true
pattern bit for bit. The Frobenius norm of W_fast jumps at store and drifts
only slightly across distractors --- the textbook "load and hold" behaviour.

This stub is the direct ancestor of unnormalised linear self-attention
(Schlag/Irie/Schmidhuber 2021); FROM/TO are renamed key/value, the rank-1
write rule is the same.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ewriting

Implement the wave-4 self-referential weight matrix stub per SPEC #1.

Architecture (faithful to ICANN-93 + modern SRWM lineage):
- Effective W_eff = W_slow (BPTT-trained) + W_fast (per-episode plastic)
- Each step the net outputs row/col soft attention, write value, write gate
- W_fast updated by rank-1: eta * gate * val * outer(row, col)
- Reads happen implicitly: W_eff_{t+1} contains last step's writes
- Manual BPTT with tape; gradient check passes at relative error 8e-7
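
A minimal numpy sketch of one W_eff step as described above. The split of the output into row/col attention, write value, and gate is an illustrative assumption, not the stub's exact layout, and `srwm_step` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srwm_step(W_slow, W_fast, x, eta=0.5):
    """One illustrative self-referential step: the output, computed with
    W_eff = W_slow + W_fast, chooses where and what to write into W_fast."""
    W_eff = W_slow + W_fast                  # reads implicitly see last step's writes
    h = np.tanh(W_eff @ x)                   # output vector, length = rows of W_slow
    # Illustrative split of h into the four control signals:
    row = softmax(h)                         # soft row address (continuous pointer)
    col = softmax(W_eff.T @ h)               # soft column address, length = len(x)
    val = np.tanh(h.sum())                   # scalar write value
    gate = 1.0 / (1.0 + np.exp(-h.mean()))   # sigmoid write gate
    W_fast = W_fast + eta * gate * val * np.outer(row, col)   # rank-1 self-write
    return h, W_fast

rng = np.random.default_rng(0)
W_slow = 0.1 * rng.standard_normal((6, 4))
W_fast = np.zeros_like(W_slow)
h, W_fast = srwm_step(W_slow, W_fast, rng.standard_normal(4))  # next read sees this write
```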

Task: 4-way meta-learning on AND/OR/XOR/NAND of 2 bits. Episode = 4 demo
steps with labels visible + 4 query steps with labels hidden. The only
mechanism for storing task identity is W_fast.

Headline (seed 0):
- Final query accuracy: 0.996 (AND/OR/XOR/NAND = 1.00/0.99/1.00/1.00)
- Wallclock: ~5 s on M-series laptop CPU
- 8-seed sweep: 8/8 > 0.95, 7/8 > 0.99
- 169 slow params, 3000 episodes, Adam lr=0.01

Files:
- self_referential_weight_matrix.py: model, BPTT, training, --gradcheck CLI
- make_self_referential_weight_matrix_gif.py: 4-task animation (364 KB)
- visualize_self_referential_weight_matrix.py: 5 PNGs in viz/
- README.md: 8 sections per SPEC; includes deviation list and v2 questions

Paper reports: a small toy proof of concept of the self-referential
weight-update idea. Reproduces: yes (with a continuous relaxation of the
discrete read/write addresses, which is the standard modern instantiation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…er ancestor)

Slow projector W_K writes outer-product updates to a fast weight matrix
W_fast and reads it back for key-addressed value retrieval -- the
unnormalised linear-attention math later formalised in Schlag/Irie/
Schmidhuber 2021.
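
A minimal numpy sketch of the binding and retrieval math. The shared-bias key construction follows the deviation note below; W_K here is a random stand-in for the trained slow projector, and the mixing coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                                   # N (key, value) pairs, d_key = d_val = 8

# Raw keys share a fixed bias direction b, so a good projector must denoise them.
b = rng.standard_normal(d); b /= np.linalg.norm(b)
raw_keys = 0.8 * b + 0.4 * rng.standard_normal((N, d))
values = rng.standard_normal((N, d))

W_K = rng.standard_normal((d, d)) / np.sqrt(d)   # slow projector (trained in the stub)

K = raw_keys @ W_K.T                          # projected keys, shape (N, d)
W_fast = values.T @ K                         # sum of outer products: fast-weight memory

# Retrieval: query with the projected key of pair i, compare to the stored value.
i = 2
y = W_fast @ K[i]
cos = y @ values[i] / (np.linalg.norm(y) * np.linalg.norm(values[i]))
print(f"retrieval cosine for pair {i}: {cos:.3f}")   # modest for an untrained projector;
                                                     # training W_K is what lifts it
```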

- N=5 (key, value) pairs, d_key=d_val=8 per the v1 spec.
- Raw keys share a fixed bias direction so the slow projector has a
  non-trivial denoising job; pure iid keys would be a degenerate task.
- Plain SGD on 0.5 ||y - v_q||^2; gradient analytically derived and
  numerically verified to <1e-9.
- Multi-seed (0..9) post-training cosine 0.75-0.81; pre-training
  cosine 0.43-0.51. Stable.
- Capacity sweep N=1..12 (no retrain) included; smooth fall-off as
  expected for random-projection associative memory.
- Wallclock 0.07s per training run.
- 8 PNGs to viz/, 370 KB GIF showing key-cosine matrix becoming
  diagonal as W_K projects out the bias.

Files:
- fast_weights_key_value.py (forward, backward, grad-check, train, eval, capacity sweep, CLI)
- visualize_fast_weights_key_value.py (8 static PNGs)
- make_fast_weights_key_value_gif.py (training animation)
- README.md (8 sections)

Schmidhuber Habilitationsschrift 1993 / NC 4(2) 1992: a level-0
automatizer learns the deterministic filler in a synthetic
trigger-recall stream of length T = 1200, and a level-1 chunker
trains on the compressed surprise stream of length k = 2.

Headline (seed 0, T = 1200): chunker target accuracy 100% vs single-net
full-BPTT baseline 0%. Effective BPTT depth (gradient-norm 1% cutoff
on the recall-target loss): 4 of 1199 for the baseline, 2 for the
chunker. Depth-reduction ratio (T - 1) / k = 599.5x.
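
A minimal sketch of the effective-depth metric defined above, assuming per-timestep gradient norms of the recall-target loss are already computed; `effective_bptt_depth` and the decay curve are hypothetical illustrations, not the stub's code.

```python
import numpy as np

def effective_bptt_depth(grad_norms, cutoff=0.01):
    """grad_norms[t] = norm of d(recall-target loss)/d(hidden state at step t).
    Depth = length of the contiguous run, ending at the last step, where the
    norm stays above cutoff * terminal norm (the 1% cutoff)."""
    grad_norms = np.asarray(grad_norms)
    above = grad_norms >= cutoff * grad_norms[-1]
    depth = 0
    for ok in above[::-1]:
        if not ok:
            break
        depth += 1
    return depth

# Illustrative: a baseline whose gradient decays geometrically going back in time.
T = 1200
baseline = 0.5 ** np.arange(T - 1)[::-1]
print(effective_bptt_depth(baseline))   # only a handful of steps survive the cutoff
print((T - 1) / 2)                      # depth-reduction ratio (T - 1) / k with k = 2 -> 599.5
```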

Multi-seed sanity (seeds 1-3, T = 500): 3/3 at 100% chunker / 0%
baseline, 249.5x reduction. Wallclock for the headline run: 30 s on
M-series CPU.

Files: chunker_very_deep_1200.py (RNN + BPTT + 2-stage training +
baseline + eval), visualize_chunker_very_deep_1200.py (4 PNGs),
make_chunker_very_deep_1200_gif.py (50-frame credit-assignment
animation, 410 KB), README.md (8 sections), viz/, results.json.

Documents the synthetic-task vs original-benchmark deviation explicitly:
the Habilitationsschrift is not retrievable in its original form, so the
1200-step number is reconstructed from the 2015 DL-in-NN survey
§6.4-6.5 and the 1992 NC chunker paper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dhuber 1991/1992)

22-symbol stream of {a, x, b1..b20}, with 21-symbol blocks chosen randomly
from {a b1..b20, x b1..b20}.  Two output heads: next-symbol softmax and a
1-d label sigmoid asked at the end of each block ("did this block start
with 'a'?", a 20-step credit-assignment task).

Two modes implemented and compared back-to-back:
- a_alone: a single vanilla Elman RNN handles both heads.  Stays at
  43-57% chance label accuracy on 10/10 seeds.
- chunker (Schmidhuber 1991/1992): an automatizer A is trained on the
  next-symbol task; a chunker C is fed only A's surprises (timesteps
  where A's predicted prob of the actual next symbol is below 0.95) and
  trained on the label task.  Once A learns the deterministic
  b_i -> b_{i+1} transitions, surprises taper to one per block (the
  random choice-bit at the boundary), and C's label task collapses to a
  1-step copy.  10/10 seeds reach 99.5% label accuracy at 1500 blocks
  (~31k input symbols), in ~1 s wallclock per mode on an M-series CPU.

Architecture is faithful to the paper (vanilla Elman RNN for both nets,
hidden=32, BPTT inside each block).  Two notable deviations documented in
README §Deviations: A's loss is muted at the random boundary transition
(otherwise A drifts off-uniform and the surprise threshold misses some
boundaries), and C's hidden state is reset to zero between surprises (the
label task on this clean stream is intrinsically 1-step; persistent
recurrence accumulates noise from early-training spurious surprises).
Both choices preserve the algorithmic claim of history compression while
making the surprise channel reliable enough that 10/10 seeds converge.
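
A minimal numpy sketch of the surprise gating that connects A and C, using the 0.95 threshold described above; `chunker_surprises` is a hypothetical helper name and A's softmax outputs are assumed given.

```python
import numpy as np

SURPRISE_THRESHOLD = 0.95   # A "fails" when its prob of the actual next symbol is below this

def chunker_surprises(probs_A, next_symbols):
    """probs_A[t] is A's softmax over the 22 symbols at step t; next_symbols[t]
    is the index of the symbol that actually came next. Returns the timesteps
    whose inputs C gets to see (the compressed stream)."""
    p_correct = probs_A[np.arange(len(next_symbols)), next_symbols]
    return np.flatnonzero(p_correct < SURPRISE_THRESHOLD)

# Per the deviations above: A's loss is masked at the random boundary transition
# (t = 20 in each block), and C's hidden state is reset between surprises, so its
# label readout only has to copy the one surviving surprise (the a/x choice bit).
```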

Files:
- chunker_22_symbol.py        core: stream + RNN + Adam + train + eval + CLI
- visualize_chunker_22_symbol  4 static PNGs in viz/
- make_chunker_22_symbol_gif   training animation (chunker_22_symbol.gif)
- README.md                    8-section spec-compliant doc
- chunker_22_symbol.gif        576 KB animation, 50 frames + 12-frame hold
- viz/{training_curves,surprise_pattern,network_weights,test_episode}.png

Reproducibility: python3 chunker_22_symbol.py --seed N is deterministic;
verified bit-for-bit identical results across two consecutive runs at
seed 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Octopus merge of 5 wave-4 stubs per SPEC issue #1.

- wave-4-local/chunker-22-symbol: two-stack neural sequence chunker (1991/1992)
- wave-4-local/chunker-very-deep-1200: very-deep history compression (1993)
- wave-4-local/fast-weights-unknown-delay: slow-net controlling fast-weight memory (1992)
- wave-4-local/fast-weights-key-value: key/value binding via outer-product (1992)
- wave-4-local/self-referential-weight-matrix: RNN read/write own weights (1993)

All 5 verified by separate audit subagent: numpy-only, deterministic,
branch protocol followed (no wave-4-local on remote), all 8 README sections,
algorithmic faithfulness confirmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@0bserver07
Author

Audit Report — PR #8 wave 4 (5 stubs)

Wave 4 verdict: APPROVE.

Independent review by separate Explore subagent.

Per-stub verdicts

| Stub | Verdict | Reason |
| --- | --- | --- |
| chunker-22-symbol | APPROVE | Two-net chunker verified; surprise threshold at 0.95; A-alone vs chunker = 57% vs 99.5% |
| chunker-very-deep-1200 | APPROVE | 599.5× depth-reduction; gradient vanishes by t=4 in baseline; chunker stays at 100% |
| fast-weights-unknown-delay | APPROVE | Slow-net + gated W_fast writes; numerical grad test 1e-6 max relative error |
| fast-weights-key-value | APPROVE | Exact W_fast = V^T K outer-product math; numerical grad-check <1e-9 |
| self-referential-weight-matrix | APPROVE | W_eff = W_slow + W_fast confirmed; manual BPTT grad-check at 8e-7 |

Cross-cut findings

  • Numpy-only (hard pass): All 5 verified. Imports = {numpy, matplotlib, PIL, imageio, argparse, json, os, platform, subprocess, sys, time, typing}.
  • Determinism (3 spot-checks): chunker-22-symbol (57.0% / 99.5%), fast-weights-key-value (loss=0.31669, cos=0.8837), self-referential-weight-matrix (acc=0.996) — all byte-identical across reruns.
  • Branch protocol: All 5 on local-only wave-4-local/*; zero pushed to origin.
  • Git authors: All 5 commits by agent-0bserver07 <agent-0bserver07@users.noreply.github.com>. No author drift this wave (vs wave 3's pomdp-flag-maze drift).
  • Cleanliness: no TODO/FIXME/XXX/HACK in any .py, no hardcoded paths, no __pycache__ committed (properly .gitignored).

Algorithmic faithfulness (3 deep dives)

  1. chunker-22-symbol (chunker_22_symbol.py:367–391): Surprise detection at line 377; A's prediction failures gated to C; A muted at boundary t=20 (documented). Architecturally identical baseline (A-alone) verified at chance.
  2. fast-weights-key-value (fast_weights_key_value.py:173–196): Forward W_fast = values.T @ K; retrieval y = W_fast @ k_q; backward np.outer(dy, k_q). Exact linear-attention math.
  3. self-referential-weight-matrix (self_referential_weight_matrix.py:206–260): W_eff = W_slow + W_fast at line 230; W_fast self-write per step via gated outer product. Reads + writes use the same modified matrix → genuine self-reference.

Reproduce results (spot-checks)

=== chunker-22-symbol --seed 0 ===
label accuracy:  A-alone  57.0%   Chunker  99.5%
sym accuracy:    A-alone  95.2%   Chunker  95.2%

=== fast-weights-key-value --seed 0 ===
final_train_loss=0.31669470084225854
final_train_cos=0.8837269918549183

=== self-referential-weight-matrix --seed 0 ===
final query accuracy: 0.996
final per-task accuracy: AND=1.00 OR=0.99 XOR=1.00 NAND=1.00

What I couldn't verify

  • Multi-seed sweeps beyond 3 spot-checks (each stub runs 8-10 seed sweeps; verified determinism on single seeds, spot-checked README mean/stddev claims).
  • Hyperparameter sensitivity ablations (correctly listed as v1.5 follow-ups, not silent).

agent-0bserver07 (Claude Code) on behalf of Yad — wave-4 audit subagent
