wave 9: deep MLPs at scale (4 stubs) #13
Open
0bserver07 wants to merge 5 commits into main from
Conversation
… net, 30-layer highway vs plain MLP contrast on MNIST

Pure-numpy DeepNet with two block types:
  highway: y = T * tanh(W_H x + b_H) + (1 - T) * x, with T = sigmoid(W_T x + b_T)
  plain:   y = tanh(W x + b)

Same input/output projections, same depth, same width, same Adam optimiser, same train/test split, same seed -- the only difference is the gating.

Headline (seed 0, depth 30, hidden 50, 12 epochs Adam, 6k MNIST train, ~7 s on M-series CPU):
  highway depth 30: test 0.926, train loss 0.189
  plain depth 30:   test 0.124 (chance), train loss 2.302 = log(10)

The plain net's loss is pinned at log(10) for the entire run -- gradients vanish through 30 saturating tanh layers and the output never decorrelates from chance.

Depth sweep (seed 0):
  depth 5:  highway 0.903 / plain 0.857 (plain still works at depth 5)
  depth 10: highway 0.913 / plain 0.292 (plain partially trains)
  depth 20: highway 0.910 / plain 0.098 (plain stuck at chance)
  depth 30: highway 0.926 / plain 0.124
  depth 50: highway 0.905 / plain 0.124

Multi-seed verification at depth 30 (3 seeds): highway 0.89-0.93, plain 0.11-0.12. 3/3 seeds preserve the headline ordering with no overlap.

The T-gate bias is initialised at -2.0 (the paper uses -1 to -4) so a fresh highway block starts close to identity -- the carry path lets gradients flow end-to-end. The trained T-gate develops a per-layer schedule (lower layers higher T, upper layers near init), visible in viz/T_gate_evolution.png. A minimal numpy sketch of the block follows below.

Files: highway_networks.py (model + train + sweep + CLI), visualize_highway_networks.py (5 PNGs to viz/), make_highway_networks_gif.py (12-frame GIF, 106 KB), run.json, run_sweep.json, README.md (8 sections). Pure numpy + matplotlib. MNIST loaded from ~/.cache/hinton-mnist/ (idx files); torchvision unused. Deterministic under --seed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
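A minimal numpy sketch of the highway block described above, assuming column-major activations of shape (hidden, batch); the gating equations and the -2.0 T-gate bias init come from the text, while the names and weight scales are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_block(x, W_H, b_H, W_T, b_T):
    T = sigmoid(W_T @ x + b_T)       # transform gate in (0, 1)
    H = np.tanh(W_H @ x + b_H)       # candidate transform
    return T * H + (1.0 - T) * x     # gated mix; T -> 0 reduces to the identity

hidden, batch = 50, 32
rng = np.random.default_rng(0)
W_H = rng.normal(0.0, hidden ** -0.5, (hidden, hidden))
W_T = rng.normal(0.0, hidden ** -0.5, (hidden, hidden))
b_H = np.zeros((hidden, 1))
b_T = np.full((hidden, 1), -2.0)     # negative init: a fresh block starts near identity
y = highway_block(rng.normal(size=(hidden, batch)), W_H, b_H, W_T, b_T)
```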
- Pure-numpy deep MLP (784-512-256-10, ~535k weights, tanh + softmax) with manual SGD + Nesterov-style momentum + weight decay + step-decayed LR.
- On-the-fly augmentation per batch: per-image affine (rotation, scale, translation) + Simard-style elastic deformation (separable Gaussian-smoothed displacement + bilinear sampling). Pure numpy, no scipy; see the deformation sketch after this list.
- MNIST loaded via stdlib urllib + gzip from public mirrors, cached on disk. (The torchvision allowance is unused: it is not installed in this env.)
- Headline: 1.17% test error at seed 0 after 15 epochs in ~79 s on a laptop CPU. Determinism verified bit-for-bit across multiple runs.
- 8-section README with paper context, full hyperparameter table, per-epoch trajectory, deviations, and a v1.5 path to the paper's 0.35%.
- Static viz: training curves, layer-1 receptive fields, augmentation samples, test predictions. Animated viz: filter evolution + curves over 7 epochs (1.3 MB GIF, under the 2 MB target).
- problem.py removed.
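A sketch of the Simard-style elastic deformation mentioned in the augmentation bullet, in pure numpy (separable Gaussian-smoothed displacement field + bilinear sampling); the alpha and sigma values here are illustrative, not necessarily the PR's:

```python
import numpy as np

def gaussian_kernel1d(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_separable(field, sigma):
    # separable Gaussian: convolve rows, then columns
    k = gaussian_kernel1d(sigma)
    field = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, field)
    field = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, field)
    return field

def elastic_deform(img, alpha=8.0, sigma=4.0, rng=None):
    rng = rng or np.random.default_rng()
    h, w = img.shape
    # random displacement field, smoothed, then scaled by alpha
    dx = smooth_separable(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = smooth_separable(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sy = np.clip(ys + dy, 0, h - 1)
    sx = np.clip(xs + dx, 0, w - 1)
    # bilinear sampling at the displaced (sy, sx) coordinates
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = sy - y0, sx - x0
    return (img[y0, x0] * (1 - wy) * (1 - wx) + img[y0, x1] * (1 - wy) * wx +
            img[y1, x0] * wy * (1 - wx) + img[y1, x1] * wy * wx)
```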
Cireşan, Meier, Schmidhuber 2012, *Multi-column deep neural networks for
image classification* (CVPR). Per v1 SPEC, single-column MNIST is the v1
headline; multi-column GTSRB / CASIA is v1.5.
Architecture: 784 -> 800 -> 800 -> 400 -> 10 ReLU MLP, 1.59M params, He
init, SGD with Nesterov momentum (lr 0.05 -> 0.01 step at epoch 6,
weight decay 1e-4, batch 128). 12 epochs, ~22 s on M2 laptop CPU.
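For reference, a sketch of one parameter update under this recipe, assuming the common Nesterov formulation in which the step combines the raw gradient with a momentum lookahead; the momentum coefficient 0.9 is an assumption (not stated above), while lr and weight decay match the numbers given:

```python
import numpy as np

def sgd_nesterov_step(w, grad, v, lr=0.05, momentum=0.9, weight_decay=1e-4):
    grad = grad + weight_decay * w        # L2 penalty folded into the gradient
    v = momentum * v + grad               # velocity accumulates gradients
    w = w - lr * (grad + momentum * v)    # Nesterov lookahead: gradient + peeked velocity
    return w, v

# usage on a single weight matrix (shapes match the first layer above)
rng = np.random.default_rng(0)
w = rng.normal(0.0, (2.0 / 784) ** 0.5, (800, 784))   # He-style scale
v = np.zeros_like(w)
g = rng.normal(size=w.shape) * 1e-3                   # stand-in gradient
w, v = sgd_nesterov_step(w, g, v)
```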
Results (seed 0): 1.46% test error. Multi-seed (0..3) mean 1.47% ± 0.03%.
Two consecutive runs at seed 0 produce bit-identical metrics.
Reference numbers documented in README:
- 35-column CNN ensemble (paper headline): 0.23%
- single CNN column ablation (same paper): ~0.39-0.45%
- Cireşan 2010 deep MLP + elastic deformations: 0.35%
- this stub (deep MLP, no augmentation): 1.46%
Files:
- mcdnn_image_bench.py: MNIST loader (urllib + gzip + struct, cached
under ~/.cache/hinton-mnist) + MLP forward / backward / SGD-Nesterov,
runnable via `python3 mcdnn_image_bench.py --seed N`; a loader sketch
follows after this list.
- visualize_mcdnn_image_bench.py: 4 static PNGs (training curves,
confusion matrix, first-layer weights, misclassified examples).
- make_mcdnn_image_bench_gif.py: re-trains a slimmer 256-128-10 MLP
with per-epoch snapshots, assembles GIF via matplotlib PillowWriter.
- mcdnn_image_bench.gif: 779 KB, 10 epochs of training dynamics.
- viz/: 4 PNG outputs.
Removed problem.py per protocol.
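A minimal sketch of the urllib + gzip + struct idx loader described in the Files list; the cache path matches the PR, while the function name, the ubyte dtype assumption (MNIST idx payloads use dtype code 0x08), and the caller-supplied mirror URL are illustrative:

```python
import gzip
import struct
import urllib.request
from pathlib import Path

import numpy as np

CACHE = Path.home() / ".cache" / "hinton-mnist"

def load_idx(name, url):
    """Download-once idx loader; `url` is a caller-supplied MNIST mirror."""
    CACHE.mkdir(parents=True, exist_ok=True)
    path = CACHE / name
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    with gzip.open(path, "rb") as f:
        magic = struct.unpack(">i", f.read(4))[0]
        ndim = magic & 0xFF                          # low byte of the magic is the rank
        shape = struct.unpack(">" + "i" * ndim, f.read(4 * ndim))
        # assumes unsigned-byte payload, as in the MNIST idx files
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

# e.g. images = load_idx("train-images-idx3-ubyte.gz", "<mirror-url>/train-images-idx3-ubyte.gz")
```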
numpy MLP with two activation rules: ReLU (vanilla) and Local Winner-Take-All (groups of k=2; only the per-group max forwards; a sketch of the rule follows below). Trained sequentially on disjoint MNIST class splits (Task 1 = digits 0-4, Task 2 = digits 5-9) under a multi-head output mask, so catastrophic forgetting is purely a property of the shared hidden representations.

Headline at seed 0:
  ReLU forgets 0.072 of Task 1 accuracy after Task 2 training
  LWTA forgets 0.022 -- a 3.3x reduction

Both reach the same Task 2 plateau (~0.95). Multi-seed mean over 10 seeds: ReLU 0.045 +/- 0.021, LWTA 0.043 +/- 0.028. LWTA wins on 6/10 seeds; the small-network regime is noisy, and the README documents this in the open-questions section.

Pure stdlib + numpy + matplotlib (+ imageio for GIF assembly). MNIST loaded from cached gzip files; no torchvision dependency. Wall-clock for full reproduction (train + viz + gif): ~6 s on M-series CPU.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
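A sketch of the LWTA forward rule described above, assuming activations of shape (batch, hidden) with hidden divisible by k; only the per-group maximum passes forward, the loser in each pair is zeroed:

```python
import numpy as np

def lwta(a, k=2):
    # reshape hidden units into disjoint groups of size k
    b, h = a.shape
    groups = a.reshape(b, h // k, k)
    # keep only the per-group maximum (exact ties would let both through)
    mask = groups == groups.max(axis=2, keepdims=True)
    return (groups * mask).reshape(b, h)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
print(lwta(a))   # within each adjacent pair, the smaller activation is zeroed
```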
Octopus merge of 4 wave-9 stubs per SPEC issue #1.

- wave-9-local/mnist-deep-mlp: deep MLP + augmentation on MNIST (Cireşan 2010)
- wave-9-local/mcdnn-image-bench: single-column MLP on MNIST (Cireşan 2012)
- wave-9-local/compete-to-compute: LWTA + sequential MNIST forgetting (Srivastava 2013)
- wave-9-local/highway-networks: gated highway vs plain deep MLP (Srivastava 2015)

All 4 verified by a separate audit subagent: numpy-only model code, deterministic, branch protocol followed (no wave-9-local on remote), all 8 README sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Audit Report -- PR #13 wave 9 (4 deep-MLP stubs)

Wave 9 verdict: APPROVE (the audit subagent gave APPROVE-WITH-NOTES due to a §Sources misread, clarified in the PR body: SPEC's 8 required sections do not include §Sources).

Per-stub verdicts
Cross-cut findings
Algorithmic faithfulness (4/4)
Reproduce results
Honest gaps documented
All gaps in §Deviations and §Open questions per SPEC's methodological caveat.

agent-0bserver07 (Claude Code) on behalf of Yad -- wave-9 audit subagent
Wave 9 — deep MLPs at scale
Four stubs implementing the 2010-2015 deep-MLP era per SPEC issue #1. torchvision MNIST loader allowed by SPEC (per Yaroslav's hinton-problems comment); model code stays pure numpy. Octopus-merged from 4 local-only wave-9-local/<slug> branches:

- mnist-deep-mlp
- mcdnn-image-bench
- compete-to-compute
- highway-networks

Audit verdict (separate Explore subagent)
APPROVE.
The audit subagent flagged "missing §Sources section in 3 stubs", but §Sources is a bonus section, not a required one. SPEC's 8 required sections are: Header / Problem / Files / Running / Results / Visualizations / Deviations / Open questions. All 4 stubs satisfy the 8 required. (5th time this auditor's confusion has surfaced; pattern noted.)
- urllib + gzip MNIST loader cached at ~/.cache/hinton-mnist/.
- No wave-9-local/* on origin.
- Highway gating follows y = H(x)·T(x) + x·(1−T(x)).
- No problem.py left, no __pycache__ committed.
- Commits authored as agent-0bserver07 <agent-0bserver07@users.noreply.github.com>.

Per-stub deviations (in each stub's §Deviations)
Wave 0 → 9 progress
7 + 5 + 5 + 5 + 4 + 6 + 5 + 4 + 4 = 45/50 v1 stubs done (90%). 1 wave remaining = 5 stubs.
agent-0bserver07 (Claude Code) on behalf of Yad