wave 4: history compression + fast-weights + self-reference (5 stubs) #8
Open
0bserver07 wants to merge 6 commits into main from
Conversation
…ast-weight updates across an unknown distractor gap (Schmidhuber 1992)

The 1992 Neural Computation paper introduced the slow-net + fast-weight-matrix decomposition: the slow programmer S emits a key/value/query/gate at each step and writes a rank-1 outer product `eta * g_t * outer(v_t, k_t)` into a fast-weight matrix W_fast.

The unknown-delay pattern-association task: pattern P presented at t=0 with a store flag; K ~ Uniform[5, 30] random-distractor steps; recall flag at t=K+1; reproduce P. The slow net here is purely feedforward, so the only path that carries information across the gap is W_fast.

Pure numpy: 917-parameter slow net, manual batched BPTT through the rank-1 fast-weight updates (verified against numerical gradients to 1e-6 max relative error via `--gradcheck`). Adam, batch=32, gradient norm clipped at 1.0.

Results (seed 0, 1500 iters, ~3 s wallclock):
- 100.00% bit-accuracy at recall over delays K=5..30 (50 episodes per K).
- 100.00% bit-accuracy when extrapolated to delays K=1..60 (the algorithm the slow net learns is delay-independent by construction).
- Deterministic across re-runs at the same seed.
- Multi-seed: 10/10 seeds reach 100.00% within 1500 iters.

Visualizations show the gate spiking at the store step (~0.9), staying near 0.1 across all distractor steps, and the recall output overlaying the true pattern bit for bit. The Frobenius norm of W_fast jumps at store and drifts only slightly across distractors: the textbook "load and hold" behaviour.

This stub is the direct ancestor of unnormalised linear self-attention (Schlag/Irie/Schmidhuber 2021); FROM/TO are renamed key/value, and the rank-1 write rule is the same.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
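For concreteness, a minimal numpy sketch of the load-and-hold behaviour described above. The dimensions, `eta`, and the hand-set gate values are illustrative stand-ins, not the stub's trained parameters (in the stub, key/value/gate are outputs of the trained slow net):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # pattern width (illustrative, not the stub's exact size)
eta = 0.5        # fast-weight learning rate (assumed value)

W_fast = np.zeros((d, d))                       # per-episode plastic memory

# Store step: the trained slow net emits key k, value v (= pattern P) and a
# gate that spikes (~0.9); here those outputs are hand-set for illustration.
k = rng.standard_normal(d); k /= np.linalg.norm(k)
P = rng.integers(0, 2, d).astype(float)
W_fast += eta * 0.9 * np.outer(P, k)            # rank-1 write: eta*g*outer(v,k)

# K ~ Uniform[5,30] distractor steps: the gate sits near 0.1 and the write
# values are near zero, so W_fast only drifts slightly ("hold").
for _ in range(int(rng.integers(5, 31))):
    v_noise = 0.01 * rng.standard_normal(d)
    W_fast += eta * 0.1 * np.outer(v_noise, k)

# Recall step: read back along the stored key direction and threshold.
y = W_fast @ k                                  # ~ eta * 0.9 * P + small drift
print(np.array_equal((y > 0.5 * eta * 0.9).astype(float), P))  # True
```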
…ewriting

Implement the wave-4 self-referential weight matrix stub per SPEC #1.

Architecture (faithful to ICANN-93 + the modern SRWM lineage):
- Effective W_eff = W_slow (BPTT-trained) + W_fast (per-episode plastic)
- Each step the net outputs row/col soft attention, a write value, and a write gate
- W_fast updated by rank-1: eta * gate * val * outer(row, col)
- Reads happen implicitly: W_eff_{t+1} contains the previous step's writes
- Manual BPTT with tape; gradient check passes at relative error 8e-7

Task: 4-way meta-learning on AND/OR/XOR/NAND of 2 bits. Episode = 4 demo steps with labels visible + 4 query steps with labels hidden. The only mechanism for storing task identity is W_fast.

Headline (seed 0):
- Final query accuracy: 0.996 (AND/OR/XOR/NAND = 1.00/0.99/1.00/1.00)
- Wallclock: ~5 s on M-series laptop CPU
- 8-seed sweep: 8/8 > 0.95, 7/8 > 0.99
- 169 slow params, 3000 episodes, Adam lr=0.01

Files:
- self_referential_weight_matrix.py: model, BPTT, training, --gradcheck CLI
- make_self_referential_weight_matrix_gif.py: 4-task animation (364 KB)
- visualize_self_referential_weight_matrix.py: 5 PNGs in viz/
- README.md: 8 sections per SPEC; includes deviation list and v2 questions

Paper reports: a small-toy proof of concept of the self-referential weight-update idea. Reproduces: yes (continuous relaxation of the discrete read/write addresses, which is the standard modern instantiation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
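A toy sketch of one self-referential step under the scheme above. The shapes, `eta`, and the stand-in output heads are assumptions; in the stub, all four write quantities are output heads of the net itself, trained by BPTT:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srwm_write(W_fast, row_logits, col_logits, val, gate, eta=0.1):
    """Rank-1 self-referential write: eta * gate * val * outer(row, col)."""
    return W_fast + eta * gate * val * np.outer(softmax(row_logits),
                                                softmax(col_logits))

n = 6
rng = np.random.default_rng(0)
W_slow = 0.1 * rng.standard_normal((n, n))   # BPTT-trained, fixed within episode
W_fast = np.zeros((n, n))                    # plastic, reset each episode

x = rng.standard_normal(n)
h = np.tanh((W_slow + W_fast) @ x)           # forward pass through W_eff

# In the stub these four quantities come from the net's own output heads;
# here they are stand-ins derived from h just to exercise the update.
W_fast = srwm_write(W_fast, row_logits=h, col_logits=h[::-1],
                    val=float(h.sum()), gate=0.8)

h_next = np.tanh((W_slow + W_fast) @ x)      # step t+1 reads step t's write
```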
…er ancestor)

Slow projector W_K writes outer-product updates to a fast-weight matrix W_fast and reads it back for key-addressed value retrieval: the unnormalised linear-attention math later formalised in Schlag/Irie/Schmidhuber 2021.

- N=5 (key, value) pairs, d_key=d_val=8 per the v1 spec.
- Raw keys share a fixed bias direction so the slow projector has a non-trivial denoising job; pure iid keys would be a degenerate task.
- Plain SGD on 0.5 ||y - v_q||^2; gradient analytically derived and numerically verified to <1e-9.
- Multi-seed (0..9): post-training cosine 0.75-0.81 vs pre-training cosine 0.43-0.51. Stable.
- Capacity sweep N=1..12 (no retrain) included; smooth fall-off as expected for a random-projection associative memory.
- Wallclock: 0.07 s per training run.
- 8 PNGs in viz/; 370 KB GIF showing the key-cosine matrix becoming diagonal as W_K projects out the bias.

Files:
- fast_weights_key_value.py (forward, backward, grad-check, train, eval, capacity sweep, CLI)
- visualize_fast_weights_key_value.py (8 static PNGs)
- make_fast_weights_key_value_gif.py (training animation)
- README.md (8 sections)
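The storage/retrieval math in isolation, as a short numpy sketch. The learned projector W_K is deliberately omitted here to show why it is needed: with biased raw keys the readout suffers cross-talk and the retrieved cosine stays well below 1, much like the pre-training cosines quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_key, d_val = 5, 8, 8                # per the v1 spec above

# Raw keys share a fixed bias direction (the learned projector W_K's
# denoising job); W_K is omitted here to isolate the storage/retrieval math.
b = rng.standard_normal(d_key)
K = rng.standard_normal((N, d_key)) + b  # N biased raw keys, one per row
V = rng.standard_normal((N, d_val))      # N values

W_fast = V.T @ K                         # sum_i outer(v_i, k_i): the memory
y = W_fast @ K[2]                        # key-addressed retrieval for pair 2

cos = y @ V[2] / (np.linalg.norm(y) * np.linalg.norm(V[2]))
print(f"cosine(y, v_2) = {cos:.2f}")     # < 1: cross-talk from correlated keys
```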
Schmidhuber Habilitationsschrift 1993 / NC 4(2) 1992: a level-0 automatizer learns the deterministic filler in a synthetic trigger-recall stream of length T = 1200, and a level-1 chunker trains on the compressed surprise stream of length k = 2.

Headline (seed 0, T = 1200): chunker target accuracy 100% vs single-net full-BPTT baseline 0%. Effective BPTT depth (gradient-norm 1% cutoff on the recall-target loss): 4 of 1199 for the baseline, 2 for the chunker. Depth-reduction ratio (T - 1) / k = 599.5x.

Multi-seed sanity (seeds 1-3, T = 500): 3/3 at 100% chunker / 0% baseline, 249.5x reduction. Wallclock for the headline run: 30 s on M-series CPU.

Files: chunker_very_deep_1200.py (RNN + BPTT + 2-stage training + baseline + eval), visualize_chunker_very_deep_1200.py (4 PNGs), make_chunker_very_deep_1200_gif.py (50-frame credit-assignment animation, 410 KB), README.md (8 sections), viz/, results.json.

Documents the synthetic-task vs original-benchmark deviation explicitly: the Habilitationsschrift is not retrievable in original form; the 1200-step number is reconstructed via the 2015 DL-in-NN survey secs. 6.4-6.5 and the 1992 NC chunker paper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
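A schematic of the two-stage idea, not the stub's code: level 0 absorbs the deterministic filler, and level 1 backpropagates only through the k surviving surprise steps, so its BPTT depth is k instead of T-1. `predict_proba` is a hypothetical level-0 interface, and the 0.95 threshold is borrowed from the sibling 22-symbol stub:

```python
# Level 0 predicts the next symbol; everything it predicts confidently is
# dropped, so level 1 sees a stream of length k rather than T.
def surprise_stream(stream, predict_proba, threshold=0.95):
    """Keep (time, symbol) pairs where level 0 failed to predict the next
    symbol. predict_proba and the threshold are hypothetical stand-ins."""
    return [(t, stream[t + 1])
            for t in range(len(stream) - 1)
            if predict_proba(stream[:t + 1])[stream[t + 1]] < threshold]

T, k = 1200, 2               # headline stream length and surprise count above
print((T - 1) / k)           # 599.5: the reported depth-reduction ratio
```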
…dhuber 1991/1992)
22-symbol stream of {a, x, b1..b20}, with 21-symbol blocks chosen randomly
from {a b1..b20, x b1..b20}. Two output heads: next-symbol softmax and a
1-d label sigmoid asked at the end of each block ("did this block start
with 'a'?", a 20-step credit-assignment task).
Two modes implemented and compared back-to-back:
- a_alone: a single vanilla Elman RNN handles both heads. Stays at
  chance-level label accuracy (43-57%) on 10/10 seeds.
- chunker (Schmidhuber 1991/1992): an automatizer A is trained on the
next-symbol task; a chunker C is fed only A's surprises (timesteps
where A's predicted prob of the actual next symbol is below 0.95) and
trained on the label task. Once A learns the deterministic
b_i -> b_{i+1} transitions, surprises taper to one per block (the
random choice-bit at the boundary), and C's label task collapses to a
1-step copy. 10/10 seeds reach 99.5% label accuracy at 1500 blocks
(~31k input symbols; stream construction sketched below), in ~1 s
wallclock per mode on an M-series CPU.
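A minimal numpy sketch of the stream and label construction above; the integer encoding of the symbols is an assumption, not the stub's actual representation:

```python
import numpy as np

def make_stream(n_blocks, rng):
    """21-symbol blocks per the description above. Symbol ints: 0='a',
    1='x', 2..21='b1'..'b20'. The label query at block end ("did this
    block start with 'a'?") sits 20 steps after the informative symbol."""
    symbols, labels = [], []
    for _ in range(n_blocks):
        start = int(rng.integers(0, 2))      # 0 -> 'a', 1 -> 'x'
        symbols.extend([start] + list(range(2, 22)))
        labels.append(1 if start == 0 else 0)
    return np.array(symbols), np.array(labels)

rng = np.random.default_rng(0)
stream, labels = make_stream(1500, rng)
print(stream.shape)                          # (31500,): the ~31k symbols above
```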
Architecture is faithful to the paper (vanilla Elman RNN for both nets,
hidden=32, BPTT inside each block). Two notable deviations documented in
README §Deviations: A's loss is muted at the random boundary transition
(otherwise A drifts off-uniform and the surprise threshold misses some
boundaries), and C's hidden state is reset to zero between surprises (the
label task on this clean stream is intrinsically 1-step; persistent
recurrence accumulates noise from early-training spurious surprises).
Both choices preserve the algorithmic claim of history compression while
making the surprise channel reliable enough that 10/10 seeds converge.
Files:
- chunker_22_symbol.py: core (stream + RNN + Adam + train + eval + CLI)
- visualize_chunker_22_symbol.py: 4 static PNGs in viz/
- make_chunker_22_symbol_gif.py: training animation (chunker_22_symbol.gif)
- README.md: 8-section spec-compliant doc
- chunker_22_symbol.gif: 576 KB animation, 50 frames + 12-frame hold
- viz/{training_curves,surprise_pattern,network_weights,test_episode}.png
Reproducibility: python3 chunker_22_symbol.py --seed N is deterministic;
verified bit-for-bit identical results across two consecutive runs at
seed 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Octopus merge of 5 wave-4 stubs per SPEC issue #1.

- wave-4-local/chunker-22-symbol: two-stack neural sequence chunker (1991/1992)
- wave-4-local/chunker-very-deep-1200: very-deep history compression (1993)
- wave-4-local/fast-weights-unknown-delay: slow net controlling fast-weight memory (1992)
- wave-4-local/fast-weights-key-value: key/value binding via outer product (1992)
- wave-4-local/self-referential-weight-matrix: RNN reads/writes its own weights (1993)

All 5 verified by a separate audit subagent: numpy-only, deterministic, branch protocol followed (no wave-4-local on remote), all 8 README sections present, algorithmic faithfulness confirmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Audit Report — PR #8 wave 4 (5 stubs)

Wave 4 verdict: APPROVE. Independent review by a separate Explore subagent.

Per-stub verdicts
Cross-cut findings
Algorithmic faithfulness (3 deep dives)
Reproduce results (spot-checks)
What I couldn't verify
agent-0bserver07 (Claude Code) on behalf of Yad — wave-4 audit subagent
Wave 4 — history compression + fast-weights + self-reference
Five stubs implementing Schmidhuber's 1991-1993 sequence-indexing and meta-learning lineage per SPEC issue #1. Octopus-merged from 5 local-only `wave-4-local/<slug>` branches:

- chunker-22-symbol
- chunker-very-deep-1200
- fast-weights-unknown-delay
- fast-weights-key-value
- self-referential-weight-matrix

Audit verdict (separate Explore subagent)
APPROVE across all 5 stubs.
- No `wave-4-local/*` branches left on origin (verified).
- fast-weights-key-value: `W_fast = values.T @ K`, retrieval `y = W_fast @ k_q`. Exact linear-attention ancestor.
- self-referential-weight-matrix: `W_eff = W_slow + W_fast` confirmed; W_fast rewritten by the net's own outputs each step.
- No `__pycache__` committed.
- Commits authored as `agent-0bserver07 <agent-0bserver07@users.noreply.github.com>` (no drift like wave 3 had).

Per-stub deviations (in each stub's §Deviations)
- chunker-22-symbol: `h_c=0` readout for the clean i.i.d. stream (recurrence accumulates noise from spurious early surprises).
- fast-weights-key-value: keys share a fixed bias direction `b`, so the slow projector W_K has a non-trivial denoising job (pure i.i.d. Gaussian keys are near-orthogonal in d=8 and training would have nothing to do).

Citation gaps (in each stub's §Open questions)
Wave 0 → 1 → 2 → 3 → 4 progress
7 + 5 + 5 + 5 = 22/50 v1 stubs done (44%). 5 waves remaining = 28 stubs.
agent-0bserver07 (Claude Code) on behalf of Yad