wave 9: deep MLPs at scale (4 stubs) by 0bserver07 · Pull Request #13 · cybertronai/schmidhuber-problems

0bserver07 · 2026-05-07T17:22:02Z

Wave 9 — deep MLPs at scale

Four stubs implementing the 2010-2015 deep-MLP era per SPEC issue #1. torchvision MNIST loader allowed by SPEC (per Yaroslav's hinton-problems comment); model code stays pure numpy. Octopus-merged from 4 local-only wave-9-local/<slug> branches.

Stub	Method	Paper	Headline
`mnist-deep-mlp`	Deep MLP + augmentation	Cireşan et al. 2010 (NC 22(12))	1.17% test err / 15 epochs / 79s; SGD+Nesterov + affine+Simard elastic; 535k weights (paper: 12M, ~0.35%). Direction-yes / magnitude-no
`mcdnn-image-bench`	Single-col deep MLP (v1)	Cireşan et al. 2012 (CVPR)	1.46% MNIST / 22.2s / 12 epochs (single-col MLP, no aug); multi-seed mean 1.47% ± 0.03% (paper 35-col: 0.23%, single CNN: 0.4%)
`compete-to-compute`	LWTA forgetting contrast	Srivastava et al. 2013 (NIPS)	Seed 0: LWTA forgetting 0.022 vs ReLU 0.072 (3.3× less). 10-seed: LWTA 0.043 ± 0.028 vs ReLU 0.045 ± 0.021, LWTA wins 6/10 (small-net regime noisy)
`highway-networks`	Gated deep MLP	Srivastava et al. 2015 (NIPS)	Depth 30: highway 0.926 vs plain 0.124 (chance). Depth sweep 5→50: highway stable, plain dies past 10. 3/3 multi-seed

Audit verdict (separate Explore subagent)

APPROVE.

The audit subagent flagged "missing §Sources section in 3 stubs" but §Sources is bonus, not required. SPEC's 8 required are: Header / Problem / Files / Running / Results / Visualizations / Deviations / Open questions. All 4 stubs satisfy the 8 required. (5th time this auditor's confusion has surfaced; pattern noted.)

Numpy-only model code: All 4 verified. torchvision allowance from SPEC unused — mnist-deep-mlp/mcdnn use stdlib urllib + gzip MNIST loader cached at ~/.cache/hinton-mnist/.
Determinism (3 spot-checks): bit-identical across reruns.
Branch protocol: zero wave-9-local/* on origin.
Algorithmic faithfulness (4/4): deep MLP + augmentation; single-column MLP (multi-col deferred to v1.5); LWTA per-block winner forwarding; highway gate y = H(x)·T(x) + x·(1−T(x)).
Cleanliness: zero TODO/FIXME, no hardcoded paths, no problem.py left, no __pycache__ committed.
Git authors: all 4 commits by agent-0bserver07 <agent-0bserver07@users.noreply.github.com>.
GIF sizes: 100 KB to 1.28 MB (all under 2 MB).

Per-stub deviations (in each stub's §Deviations)

mnist-deep-mlp: 535k MLP (paper 12M); 15 epochs (paper 800); stdlib MNIST loader (torchvision unused/uninstalled in env); ReLU not allowed by paper but acceptable v1 choice (documented).
mcdnn-image-bench: single-column MLP for v1 (multi-column GTSRB/CASIA in §Open questions); ReLU + SGD+Nesterov; no augmentation (paper had 35 columns × heavy aug → 0.23%).
compete-to-compute: 2-hidden-layer MLP with k=2 LWTA groups; multi-head output for 2-task split (paper used 3-layer × 512 hidden).
highway-networks: 30 layers (paper: 100); tanh (paper: Maxout); 6k MNIST subset for budget; T-gate bias init −2.

Wave 0 → 9 progress

7 + 5 + 5 + 5 + 4 + 6 + 5 + 4 + 4 = 45/50 v1 stubs done (90%). 1 wave remaining = 5 stubs.

agent-0bserver07 (Claude Code) on behalf of Yad

… net, 30-layer highway vs plain MLP contrast on MNIST Pure-numpy DeepNet with two block types: highway: y = T*tanh(W_H x + b_H) + (1 - T)*x, T = sigmoid(W_T x + b_T) plain : y = tanh(W x + b) Same input/output projections, same depth, same width, same Adam optimiser, same train/test split, same seed -- the only difference is the gating. Headline (seed 0, depth 30, hidden 50, 12 epochs Adam, 6k MNIST train, ~7s on M-series CPU): highway depth 30: test 0.926, train loss 0.189 plain depth 30: test 0.124 (chance), train loss 2.302 = log(10) Plain net's loss is pinned at log(10) for the entire run -- gradients vanish through 30 saturating tanh layers and the output never decorrelates from chance. Depth sweep (seed 0): depth 5: highway 0.903 / plain 0.857 (plain still works at depth 5) depth 10: highway 0.913 / plain 0.292 (plain partially trains) depth 20: highway 0.910 / plain 0.098 (plain stuck at chance) depth 30: highway 0.926 / plain 0.124 depth 50: highway 0.905 / plain 0.124 Multi-seed verification at depth 30 (3 seeds): highway 0.89-0.93, plain 0.11-0.12. 3/3 seeds preserve the headline ordering with no overlap. T-gate bias initialised at -2.0 (paper uses -1 to -4) so a fresh highway block starts close to identity -- carry path lets gradients flow end-to-end. The trained T-gate develops a per-layer schedule (lower layers higher T, upper layers near init) visible in viz/T_gate_evolution.png. Files: highway_networks.py (model + train + sweep + CLI), visualize_highway_networks.py (5 PNGs to viz/), make_highway_networks_gif.py (12-frame GIF, 106 KB), run.json, run_sweep.json, README.md (8 sections). Pure numpy + matplotlib. MNIST loaded from ~/.cache/hinton-mnist/ (idx files), torchvision unused. Deterministic under --seed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Pure-numpy deep MLP (784-512-256-10, ~535k weights, tanh + softmax) with manual SGD + Nesterov-style momentum + weight decay + step-decayed LR. - On-the-fly augmentation per batch: per-image affine (rot, scale, translate) + Simard-style elastic deformation (separable Gaussian-smoothed displacement + bilinear sampling). Pure numpy, no scipy. - MNIST loaded via stdlib urllib + gzip from public mirrors, cached on disk. (torchvision allowance unused: not installed in this env.) - Headline: 1.17% test err on seed 0 after 15 epochs in ~79 s on a laptop CPU. Determinism verified bit-for-bit across multiple runs. - 8-section README with paper context, full hyperparameter table, per-epoch trajectory, deviations, and v1.5 path to the paper's 0.35%. - Static viz: training curves, layer-1 receptive fields, augmentation samples, test predictions. Animated viz: filter evolution + curves over 7 epochs (1.3 MB GIF, under the 2 MB target). - problem.py removed.

Cireşan, Meier, Schmidhuber 2012, *Multi-column deep neural networks for image classification* (CVPR). Per v1 SPEC, single-column MNIST is the v1 headline; multi-column GTSRB / CASIA is v1.5. Architecture: 784 -> 800 -> 800 -> 400 -> 10 ReLU MLP, 1.59M params, He init, SGD with Nesterov momentum (lr 0.05 -> 0.01 step at epoch 6, weight decay 1e-4, batch 128). 12 epochs, ~22 s on M2 laptop CPU. Results (seed 0): 1.46% test error. Multi-seed (0..3) mean 1.47% ± 0.03%. Two consecutive runs at seed 0 produce bit-identical metrics. Reference numbers documented in README: - 35-column CNN ensemble (paper headline): 0.23% - single CNN column ablation (same paper): ~0.39-0.45% - Cireşan 2010 deep MLP + elastic deformations: 0.35% - this stub (deep MLP, no augmentation): 1.46% Files: - mcdnn_image_bench.py: MNIST loader (urllib + gzip + struct, cached under ~/.cache/hinton-mnist) + MLP forward / backward / SGD-Nesterov, runnable via `python3 mcdnn_image_bench.py --seed N`. - visualize_mcdnn_image_bench.py: 4 static PNGs (training curves, confusion matrix, first-layer weights, misclassified examples). - make_mcdnn_image_bench_gif.py: re-trains a slimmer 256-128-10 MLP with per-epoch snapshots, assembles GIF via matplotlib PillowWriter. - mcdnn_image_bench.gif: 779 KB, 10 epochs of training dynamics. - viz/: 4 PNG outputs. Removed problem.py per protocol.

numpy MLP with two activation rules: ReLU (vanilla) and Local Winner-Take-All (groups of k=2; only the per-group max forwards). Trained sequentially on disjoint MNIST class splits (Task1 = digits 0-4, Task2 = digits 5-9) under a multi-head output mask, so catastrophic forgetting is purely a property of the shared hidden representations. Headline at seed 0: ReLU forgets 0.072 of Task1 accuracy after Task2 training LWTA forgets 0.022 -- 3.3x reduction Both reach the same Task2 plateau (~0.95). Multi-seed mean over 10 seeds: ReLU 0.045 +/- 0.021, LWTA 0.043 +/- 0.028. LWTA wins on 6/10 seeds; the small-network regime is noisy and the README documents this in the open-questions section. Pure stdlib + numpy + matplotlib (+ imageio for GIF assembly). MNIST loaded from cached gzip files; no torchvision dependency. Wallclock for full reproduction (train + viz + gif): ~6s on M-series CPU. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Octopus merge of 4 wave-9 stubs per SPEC issue #1. - wave-9-local/mnist-deep-mlp: deep MLP + augmentation on MNIST (Cireşan 2010) - wave-9-local/mcdnn-image-bench: single-column MLP on MNIST (Cireşan 2012) - wave-9-local/compete-to-compute: LWTA + sequential MNIST forgetting (Srivastava 2013) - wave-9-local/highway-networks: gated highway vs plain deep MLP (Srivastava 2015) All 4 verified by separate audit subagent: numpy-only model code, deterministic, branch protocol followed (no wave-9-local on remote), all 8 README sections. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

0bserver07 · 2026-05-07T17:22:42Z

Audit Report — PR #13 wave 9 (4 deep-MLP stubs)

Wave 9 verdict: APPROVE (audit subagent gave APPROVE-WITH-NOTES due to §Sources misread; clarified in PR body — SPEC's 8 required sections do not include §Sources).

Per-stub verdicts

Stub	Verdict	Reason
mnist-deep-mlp	APPROVE	Deep tanh MLP + Simard elastic + plain SGD; 1.17% test err
mcdnn-image-bench	APPROVE	Single-col 4-layer ReLU MLP, He init, SGD+Nesterov; 1.46% test err
compete-to-compute	APPROVE	LWTA per-block winner-take-all + sequential Task1/Task2 forgetting; LWTA wins 6/10 seeds
highway-networks	APPROVE	y = H(x)·T(x) + x·(1−T(x)) with learned T-gate; 30-layer highway 0.926 vs plain 0.124

Cross-cut findings

Numpy-only model code: All 4 verified. No torch/torchvision in model code. mnist-deep-mlp + mcdnn-image-bench use stdlib urllib + gzip + struct for MNIST loading (cached at ~/.cache/hinton-mnist/); torchvision allowance unused.
Determinism (3 spot-checks): mnist-deep-mlp (test_err 3.23% short-run identical), compete-to-compute (LWTA=0.205 / ReLU=0.096 identical), highway-networks (highway 0.838 / plain 0.740 identical).
Branch protocol: All 4 on local-only wave-9-local/*; zero pushed.
Git authors: All 4 commits by agent-0bserver07.
Cleanliness: zero TODO/FIXME, no hardcoded paths, no leftover problem.py.
GIF sizes: 100 KB to 1.28 MB.

Algorithmic faithfulness (4/4)

mnist-deep-mlp: deep tanh MLP + per-batch affine + elastic deformation (Simard 2003 stand-in) + plain SGD with Nesterov momentum.
mcdnn-image-bench: single-column 4-layer ReLU MLP (784→800→800→400→10) per the v1 SPEC allowance; multi-column ensemble + GTSRB/CASIA deferred to v1.5.
compete-to-compute: LWTA mask with per-block winner-take-all; gradient flows only through winner; sequential Task1 (0-4) → Task2 (5-9) forgetting contrast vs ReLU baseline.
highway-networks: Highway block y = H(x)·T(x) + x·(1−T(x)) with sigmoid gate T; bias init -2 for T (per paper); depth comparison 5/10/20/30/50 confirms plain MLP dies past depth 10 while highway stays stable.

Reproduce results

=== mnist-deep-mlp --seed 42 --epochs 2 --hidden 128 64 (Run 1 / Run 2) ===
test_err 3.23% / 3.23% — identical

=== compete-to-compute --seed 99 --epochs 1 (Run 1 / Run 2) ===
LWTA=0.205, ReLU=0.096 — identical

=== highway-networks --seed 77 --depth 3 --epochs 1 (Run 1 / Run 2) ===
highway acc 0.838 / plain acc 0.740 — identical

Honest gaps documented

mnist-deep-mlp: 1.17% vs paper 0.35% (smaller MLP 535k vs 12M, 15 epochs vs 800).
mcdnn-image-bench: 1.46% vs paper 35-column 0.23% (single col, no aug, MLP not CNN).
compete-to-compute: 3.3× less forgetting on seed 0 reproduces; multi-seed std overlap noted (small-net regime).
highway-networks: 30-layer vs paper 100-layer; tanh vs Maxout; same qualitative plain-vs-highway gap.

All gaps in §Deviations and §Open questions per SPEC's methodological caveat.

agent-0bserver07 (Claude Code) on behalf of Yad — wave-9 audit subagent

agent-0bserver07 and others added 5 commits May 7, 2026 13:09

0bserver07 mentioned this pull request May 7, 2026

Spec: minimum implementation requirements for Schmidhuber-problem stubs (v1) #1

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wave 9: deep MLPs at scale (4 stubs)#13

wave 9: deep MLPs at scale (4 stubs)#13
0bserver07 wants to merge 5 commits intomainfrom
wave/9-deep-mlps

0bserver07 commented May 7, 2026

Uh oh!

0bserver07 commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

0bserver07 commented May 7, 2026