wave 4: history compression + fast-weights + self-reference (5 stubs) #8

Open

0bserver07 wants to merge 6 commits into main from wave/4-history-fastweights

Conversation

@0bserver07

Wave 4 — history compression + fast-weights + self-reference

Five stubs implementing Schmidhuber's 1991-1993 sequence-indexing and meta-learning lineage per SPEC issue #1. Octopus-merged from 5 local-only wave-4-local/<slug> branches.

| Stub | Method | Paper | Headline |
| --- | --- | --- | --- |
| chunker-22-symbol | Two-stack neural sequence chunker | Schmidhuber 1991/1992 | 99.5% label accuracy, 10/10 seeds at 1500 blocks (paper: 13/17 at <5k); A-alone baseline at chance |
| chunker-very-deep-1200 | Very-deep history compression | Schmidhuber 1993 (Habilitation) | 599.5× depth-reduction at T=1200; chunker 100% recall vs single-net 0% (gradient vanishes by t=4) |
| fast-weights-unknown-delay | Slow-net controlling fast-weight memory | Schmidhuber 1992 (NC 4(1)) | 100% bit-accuracy at K=5-30 trained / K=1-60 extrapolation; 10/10 seeds; ~3 s wall |
| fast-weights-key-value | Key/value binding via outer-product (linear-Transformer ancestor) | Schmidhuber 1992 (NC 4(1)) | Slow projector boosts mean retrieval cosine 0.428 → 0.754 (1.76×); multi-seed 0.75-0.81; numerical grad-check <1e-9 |
| self-referential-weight-matrix | RNN reading/writing its own weight matrix | Schmidhuber 1993 (ICANN-93) | 99.6% on 4-way boolean meta-learning; 8/8 seeds > 0.95; manual BPTT grad-check 8e-7 |

Audit verdict (separate Explore subagent)

APPROVE across all 5 stubs.

  • Numpy-only (hard pass): All 5 verified. Imports limited to numpy/matplotlib/PIL/imageio/stdlib.
  • Determinism (3 spot-checks): chunker-22-symbol, fast-weights-key-value, self-referential-weight-matrix — each ran twice with seed 0, identical output.
  • Branch protocol: zero wave-4-local/* branches on origin (verified).
  • Algorithmic faithfulness (3 deep dives):
    • chunker-22-symbol: two-network chunker verified — automatizer A predicts next symbol; chunker C only sees A's prediction failures (surprises, threshold 0.95).
    • fast-weights-key-value: outer-product math verified — line 173 W_fast = values.T @ K, retrieval y = W_fast @ k_q. Exact linear-attention ancestor.
    • self-referential-weight-matrix: W_eff = W_slow + W_fast confirmed; W_fast rewritten by net's own outputs each step.
  • Cleanliness: zero hardcoded paths, zero TODO/FIXME, zero __pycache__ committed.
  • Git authors: all 5 commits authored by agent-0bserver07 <agent-0bserver07@users.noreply.github.com> (no drift like wave 3 had).
  • GIF sizes: 370 KB to 589 KB (all under 2 MB cap).

Per-stub deviations (in each stub's §Deviations)

  • chunker-22-symbol: BPTT instead of RTRL; A's loss muted at boundary transition (t=20); C's h_c=0 readout for clean i.i.d. stream (recurrence accumulates noise from spurious early surprises).
  • chunker-very-deep-1200: synthetic trigger-recall task (Habilitationsschrift not retrievable); threshold-based surprise detector (vs probability-mass test); decoupled stage training; effective-depth metric defined as 1% of terminal gradient norm.
  • fast-weights-unknown-delay: sigmoid gate on every write (modern instantiation); slow net feedforward (no recurrence) — forces memory into W_fast; numerical-gradient test passes at 1e-6 max relative error.
  • fast-weights-key-value: bias-direction trick (raw keys carry a shared bias direction b so the slow projector W_K has a non-trivial denoising job; pure i.i.d. Gaussian keys are already near-orthogonal in d=8 and training would have nothing to learn).
  • self-referential-weight-matrix: continuous-pointer relaxation instead of discrete addresses (Schlag 2021 / Irie 2022 standard); 4-way boolean meta-learning task (paper's "small toy proof of concept" instantiation).

Citation gaps (in each stub's §Open questions)

  • 1991 FKI-148-91 / 1992 Neural Computation 4(2) chunker paper (retrievable)
  • 1993 Habilitationsschrift (not publicly retrievable; reconstructed from 1992 NC chunker + 2015 DL in NN §6.4-6.5)
  • 1992 NC 4(1) fast-weights paper (retrievable; supplemented with Schlag/Irie/Schmidhuber 2021 linear-Transformer formalization)
  • 1993 ICANN-93 self-referential paper (partially retrievable; supplemented with 2018 Irie/Schlag/Schmidhuber follow-up)

Wave 0 → 1 → 2 → 3 → 4 progress

7 + 5 + 5 + 5 = 22/50 v1 stubs done (44%). 5 waves remaining = 28 stubs.


agent-0bserver07 (Claude Code) on behalf of Yad

agent-0bserver07 and others added 6 commits May 7, 2026 08:31
…ast-weight updates across an unknown distractor gap (Schmidhuber 1992)

The 1992 Neural Computation paper introduced the slow-net + fast-weight-matrix
decomposition --- the slow programmer S emits a key/value/query/gate at each
step, and writes a rank-1 outer product `eta * g_t * outer(v_t, k_t)` into a
fast-weight matrix W_fast. The unknown-delay pattern-association task: pattern
P presented at t=0 with a store flag; K~Uniform[5,30] random-distractor steps;
recall flag at t=K+1; reproduce P. The slow net here is purely feedforward, so
the only path that carries information across the gap is W_fast.
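
A minimal numpy sketch of the rank-1 write/read rule described above. This is illustrative only: the shapes, `eta`, and the gate value are assumptions here, whereas in the stub the key, value, query, and gate are all emitted by the trained slow net.

```python
import numpy as np

d_key, d_val = 8, 8
rng = np.random.default_rng(0)
W_fast = np.zeros((d_val, d_key))            # fast-weight memory, reset per episode

def fast_weight_step(W_fast, k_t, v_t, g_t, eta=1.0):
    """Rank-1 write: the slow net's key/value/gate program one memory update."""
    return W_fast + eta * g_t * np.outer(v_t, k_t)

def fast_weight_read(W_fast, q_t):
    """Retrieval is a plain matrix-vector product with the query."""
    return W_fast @ q_t

# Store a pattern under key k (gate spikes), sit through distractors (gate ~ 0), recall.
k = rng.standard_normal(d_key); k /= np.linalg.norm(k)
v = rng.standard_normal(d_val)
W_fast = fast_weight_step(W_fast, k, v, g_t=0.9)   # store step
y = fast_weight_read(W_fast, k)                    # recall step: query with the same key
print(np.allclose(y, 0.9 * v * (k @ k)))           # True: v scaled by gate * ||k||^2
```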

Pure numpy: 917-parameter slow net, manual batched BPTT through the rank-1
fast-weight updates (verified against numerical gradients to 1e-6 max
relative error via `--gradcheck`). Adam, batch=32, gradient norm clipped at
1.0.

Results (seed 0, 1500 iters, ~3 s wallclock):
- 100.00% bit-accuracy at recall over delays K=5..30 (50 episodes per K).
- 100.00% bit-accuracy when extrapolated to delays K=1..60 (the algorithm
  the slow net learns is delay-independent by construction).
- Deterministic across re-runs at the same seed.
- Multi-seed: 10/10 seeds reach 100.00% within 1500 iters.

Visualizations show the gate spiking at the store step (~0.9), staying near
0.1 across all distractor steps, and the recall output overlaying the true
pattern bit for bit. The Frobenius norm of W_fast jumps at store and drifts
only slightly across distractors --- the textbook "load and hold" behaviour.

This stub is the direct ancestor of unnormalised linear self-attention
(Schlag/Irie/Schmidhuber 2021); FROM/TO are renamed key/value, the rank-1
write rule is the same.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ewriting

Implement the wave-4 self-referential weight matrix stub per SPEC #1.

Architecture (faithful to ICANN-93 + modern SRWM lineage):
- Effective W_eff = W_slow (BPTT-trained) + W_fast (per-episode plastic)
- Each step the net outputs row/col soft attention, write value, write gate
- W_fast updated by rank-1: eta * gate * val * outer(row, col)
- Reads happen implicitly: W_eff_{t+1} contains last step's writes
- Manual BPTT with tape; gradient check passes at relative error 8e-7
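
A minimal numpy sketch of one W_eff step as described above. The split of the output into row/col attention, write value, and gate is an illustrative assumption, not the stub's exact layout, and `srwm_step` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srwm_step(W_slow, W_fast, x, eta=0.5):
    """One illustrative self-referential step: the output, computed with
    W_eff = W_slow + W_fast, chooses where and what to write into W_fast."""
    W_eff = W_slow + W_fast                  # reads implicitly see last step's writes
    h = np.tanh(W_eff @ x)                   # output vector, length = rows of W_slow
    # Illustrative split of h into the four control signals:
    row = softmax(h)                         # soft row address (continuous pointer)
    col = softmax(W_eff.T @ h)               # soft column address, length = len(x)
    val = np.tanh(h.sum())                   # scalar write value
    gate = 1.0 / (1.0 + np.exp(-h.mean()))   # sigmoid write gate
    W_fast = W_fast + eta * gate * val * np.outer(row, col)   # rank-1 self-write
    return h, W_fast

rng = np.random.default_rng(0)
W_slow = 0.1 * rng.standard_normal((6, 4))
W_fast = np.zeros_like(W_slow)
h, W_fast = srwm_step(W_slow, W_fast, rng.standard_normal(4))  # next read sees this write
```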

Task: 4-way meta-learning on AND/OR/XOR/NAND of 2 bits. Episode = 4 demo
steps with labels visible + 4 query steps with labels hidden. The only
mechanism for storing task identity is W_fast.

Headline (seed 0):
- Final query accuracy: 0.996 (AND/OR/XOR/NAND = 1.00/0.99/1.00/1.00)
- Wallclock: ~5 s on M-series laptop CPU
- 8-seed sweep: 8/8 > 0.95, 7/8 > 0.99
- 169 slow params, 3000 episodes, Adam lr=0.01

Files:
- self_referential_weight_matrix.py: model, BPTT, training, --gradcheck CLI
- make_self_referential_weight_matrix_gif.py: 4-task animation (364 KB)
- visualize_self_referential_weight_matrix.py: 5 PNGs in viz/
- README.md: 8 sections per SPEC; includes deviation list and v2 questions

Paper reports: a small toy proof of concept of the self-referential
weight-update idea. Reproduces: yes (with a continuous relaxation of the
discrete read/write addresses, which is the standard modern instantiation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…er ancestor)

Slow projector W_K writes outer-product updates to a fast weight matrix
W_fast and reads it back for key-addressed value retrieval -- the
unnormalised linear-attention math later formalised in Schlag/Irie/
Schmidhuber 2021.
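
A minimal numpy sketch of the binding and retrieval math. The shared-bias key construction follows the deviation note below; W_K here is a random stand-in for the trained slow projector, and the mixing coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                                   # N (key, value) pairs, d_key = d_val = 8

# Raw keys share a fixed bias direction b, so a good projector must denoise them.
b = rng.standard_normal(d); b /= np.linalg.norm(b)
raw_keys = 0.8 * b + 0.4 * rng.standard_normal((N, d))
values = rng.standard_normal((N, d))

W_K = rng.standard_normal((d, d)) / np.sqrt(d)   # slow projector (trained in the stub)

K = raw_keys @ W_K.T                          # projected keys, shape (N, d)
W_fast = values.T @ K                         # sum of outer products: fast-weight memory

# Retrieval: query with the projected key of pair i, compare to the stored value.
i = 2
y = W_fast @ K[i]
cos = y @ values[i] / (np.linalg.norm(y) * np.linalg.norm(values[i]))
print(f"retrieval cosine for pair {i}: {cos:.3f}")   # modest for an untrained projector;
                                                     # training W_K is what lifts it
```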

- N=5 (key, value) pairs, d_key=d_val=8 per the v1 spec.
- Raw keys share a fixed bias direction so the slow projector has a
  non-trivial denoising job; pure iid keys would be a degenerate task.
- Plain SGD on 0.5 ||y - v_q||^2; gradient analytically derived and
  numerically verified to <1e-9.
- Multi-seed (0..9) post-training cosine 0.75-0.81; pre-training
  cosine 0.43-0.51. Stable.
- Capacity sweep N=1..12 (no retrain) included; smooth fall-off as
  expected for random-projection associative memory.
- Wallclock 0.07s per training run.
- 8 PNGs to viz/, 370 KB GIF showing key-cosine matrix becoming
  diagonal as W_K projects out the bias.

Files:
- fast_weights_key_value.py (forward, backward, grad-check, train, eval, capacity sweep, CLI)
- visualize_fast_weights_key_value.py (8 static PNGs)
- make_fast_weights_key_value_gif.py (training animation)
- README.md (8 sections)

Schmidhuber Habilitationsschrift 1993 / NC 4(2) 1992: a level-0
automatizer learns the deterministic filler in a synthetic
trigger-recall stream of length T = 1200, and a level-1 chunker
trains on the compressed surprise stream of length k = 2.

Headline (seed 0, T = 1200): chunker target accuracy 100% vs single-net
full-BPTT baseline 0%. Effective BPTT depth (gradient-norm 1% cutoff
on the recall-target loss): 4 of 1199 for the baseline, 2 for the
chunker. Depth-reduction ratio (T - 1) / k = 599.5x.
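
A minimal sketch of the effective-depth metric defined above, assuming per-timestep gradient norms of the recall-target loss are already computed; `effective_bptt_depth` and the decay curve are hypothetical illustrations, not the stub's code.

```python
import numpy as np

def effective_bptt_depth(grad_norms, cutoff=0.01):
    """grad_norms[t] = norm of d(recall-target loss)/d(hidden state at step t).
    Depth = length of the contiguous run, ending at the last step, where the
    norm stays above cutoff * terminal norm (the 1% cutoff)."""
    grad_norms = np.asarray(grad_norms)
    above = grad_norms >= cutoff * grad_norms[-1]
    depth = 0
    for ok in above[::-1]:
        if not ok:
            break
        depth += 1
    return depth

# Illustrative: a baseline whose gradient decays geometrically going back in time.
T = 1200
baseline = 0.5 ** np.arange(T - 1)[::-1]
print(effective_bptt_depth(baseline))   # only a handful of steps survive the cutoff
print((T - 1) / 2)                      # depth-reduction ratio (T - 1) / k with k = 2 -> 599.5
```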

Multi-seed sanity (seeds 1-3, T = 500): 3/3 at 100% chunker / 0%
baseline, 249.5x reduction. Wallclock for the headline run: 30 s on
M-series CPU.

Files: chunker_very_deep_1200.py (RNN + BPTT + 2-stage training +
baseline + eval), visualize_chunker_very_deep_1200.py (4 PNGs),
make_chunker_very_deep_1200_gif.py (50-frame credit-assignment
animation, 410 KB), README.md (8 sections), viz/, results.json.

Documents the synthetic-task vs original-benchmark deviation explicitly:
the Habilitationsschrift is not retrievable in its original form, so the
1200-step number is reconstructed from the 2015 DL-in-NN survey
§6.4-6.5 and the 1992 NC chunker paper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dhuber 1991/1992)

22-symbol stream of {a, x, b1..b20}, with 21-symbol blocks chosen randomly
from {a b1..b20, x b1..b20}.  Two output heads: next-symbol softmax and a
1-d label sigmoid asked at the end of each block ("did this block start
with 'a'?", a 20-step credit-assignment task).

Two modes implemented and compared back-to-back:
- a_alone: a single vanilla Elman RNN handles both heads.  Stays at
  43-57% chance label accuracy on 10/10 seeds.
- chunker (Schmidhuber 1991/1992): an automatizer A is trained on the
  next-symbol task; a chunker C is fed only A's surprises (timesteps
  where A's predicted prob of the actual next symbol is below 0.95) and
  trained on the label task.  Once A learns the deterministic
  b_i -> b_{i+1} transitions, surprises taper to one per block (the
  random choice-bit at the boundary), and C's label task collapses to a
  1-step copy.  10/10 seeds reach 99.5% label accuracy at 1500 blocks
  (~31k input symbols), in ~1 s wallclock per mode on an M-series CPU.

Architecture is faithful to the paper (vanilla Elman RNN for both nets,
hidden=32, BPTT inside each block).  Two notable deviations documented in
README §Deviations: A's loss is muted at the random boundary transition
(otherwise A drifts off-uniform and the surprise threshold misses some
boundaries), and C's hidden state is reset to zero between surprises (the
label task on this clean stream is intrinsically 1-step; persistent
recurrence accumulates noise from early-training spurious surprises).
Both choices preserve the algorithmic claim of history compression while
making the surprise channel reliable enough that 10/10 seeds converge.
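
A minimal numpy sketch of the surprise gating that connects A and C, using the 0.95 threshold described above; `chunker_surprises` is a hypothetical helper name and A's softmax outputs are assumed given.

```python
import numpy as np

SURPRISE_THRESHOLD = 0.95   # A "fails" when its prob of the actual next symbol is below this

def chunker_surprises(probs_A, next_symbols):
    """probs_A[t] is A's softmax over the 22 symbols at step t; next_symbols[t]
    is the index of the symbol that actually came next. Returns the timesteps
    whose inputs C gets to see (the compressed stream)."""
    p_correct = probs_A[np.arange(len(next_symbols)), next_symbols]
    return np.flatnonzero(p_correct < SURPRISE_THRESHOLD)

# Per the deviations above: A's loss is masked at the random boundary transition
# (t = 20 in each block), and C's hidden state is reset between surprises, so its
# label readout only has to copy the one surviving surprise (the a/x choice bit).
```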

Files:
- chunker_22_symbol.py        core: stream + RNN + Adam + train + eval + CLI
- visualize_chunker_22_symbol  4 static PNGs in viz/
- make_chunker_22_symbol_gif   training animation (chunker_22_symbol.gif)
- README.md                    8-section spec-compliant doc
- chunker_22_symbol.gif        576 KB animation, 50 frames + 12-frame hold
- viz/{training_curves,surprise_pattern,network_weights,test_episode}.png

Reproducibility: python3 chunker_22_symbol.py --seed N is deterministic;
verified bit-for-bit identical results across two consecutive runs at
seed 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Octopus merge of 5 wave-4 stubs per SPEC issue #1.

- wave-4-local/chunker-22-symbol: two-stack neural sequence chunker (1991/1992)
- wave-4-local/chunker-very-deep-1200: very-deep history compression (1993)
- wave-4-local/fast-weights-unknown-delay: slow-net controlling fast-weight memory (1992)
- wave-4-local/fast-weights-key-value: key/value binding via outer-product (1992)
- wave-4-local/self-referential-weight-matrix: RNN read/write own weights (1993)

All 5 verified by separate audit subagent: numpy-only, deterministic,
branch protocol followed (no wave-4-local on remote), all 8 README sections,
algorithmic faithfulness confirmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@0bserver07
Author

Audit Report — PR #8 wave 4 (5 stubs)

Wave 4 verdict: APPROVE.

Independent review by separate Explore subagent.

Per-stub verdicts

| Stub | Verdict | Reason |
| --- | --- | --- |
| chunker-22-symbol | APPROVE | Two-net chunker verified; surprise threshold at 0.95; A-alone vs chunker = 57% vs 99.5% |
| chunker-very-deep-1200 | APPROVE | 599.5× depth-reduction; gradient vanishes by t=4 in baseline; chunker stays at 100% |
| fast-weights-unknown-delay | APPROVE | Slow-net + gated W_fast writes; numerical grad test 1e-6 max relative error |
| fast-weights-key-value | APPROVE | Exact W_fast = V^T K outer-product math; numerical grad-check <1e-9 |
| self-referential-weight-matrix | APPROVE | W_eff = W_slow + W_fast confirmed; manual BPTT grad-check at 8e-7 |

Cross-cut findings

  • Numpy-only (hard pass): All 5 verified. Imports = {numpy, matplotlib, PIL, imageio, argparse, json, os, platform, subprocess, sys, time, typing}.
  • Determinism (3 spot-checks): chunker-22-symbol (57.0% / 99.5%), fast-weights-key-value (loss=0.31669, cos=0.8837), self-referential-weight-matrix (acc=0.996) — all byte-identical across reruns.
  • Branch protocol: All 5 on local-only wave-4-local/*; zero pushed to origin.
  • Git authors: All 5 commits by agent-0bserver07 <agent-0bserver07@users.noreply.github.com>. No author drift this wave (vs wave 3's pomdp-flag-maze drift).
  • Cleanliness: no TODO/FIXME/XXX/HACK in any .py, no hardcoded paths, no __pycache__ committed (properly .gitignored).

Algorithmic faithfulness (3 deep dives)

  1. chunker-22-symbol (chunker_22_symbol.py:367–391): Surprise detection at line 377; A's prediction failures gated to C; A muted at boundary t=20 (documented). Architecturally identical baseline (A-alone) verified at chance.
  2. fast-weights-key-value (fast_weights_key_value.py:173–196): Forward W_fast = values.T @ K; retrieval y = W_fast @ k_q; backward np.outer(dy, k_q). Exact linear-attention math.
  3. self-referential-weight-matrix (self_referential_weight_matrix.py:206–260): W_eff = W_slow + W_fast at line 230; W_fast self-write per step via gated outer product. Reads + writes use the same modified matrix → genuine self-reference.

Reproduce results (spot-checks)

=== chunker-22-symbol --seed 0 ===
label accuracy:  A-alone  57.0%   Chunker  99.5%
sym accuracy:    A-alone  95.2%   Chunker  95.2%

=== fast-weights-key-value --seed 0 ===
final_train_loss=0.31669470084225854
final_train_cos=0.8837269918549183

=== self-referential-weight-matrix --seed 0 ===
final query accuracy: 0.996
final per-task accuracy: AND=1.00 OR=0.99 XOR=1.00 NAND=1.00

What I couldn't verify

  • Multi-seed sweeps beyond 3 spot-checks (each stub runs 8-10 seed sweeps; verified determinism on single seeds, spot-checked README mean/stddev claims).
  • Hyperparameter sensitivity ablations (correctly listed as v1.5 follow-ups, not silent).

agent-0bserver07 (Claude Code) on behalf of Yad — wave-4 audit subagent
