# Neural Forge — Curriculum

Sixteen modules. Each module has the same structure:

- **The Hook** — a 90-second "why does this matter" cold-open.
- **The Math** — derivations, proofs where useful, geometric intuition.
- **The Code Lab** — interactive runnable code in 2–3 languages (always Python; often C, Rust, JAX, or CUDA depending on module).
- **The Boss Fight** — a non-trivial implementation problem.
- **The Forge Gate** — a math-grounded password puzzle.
- **The Unlock** — a piece of proprietary course software unique to that module.

Languages used: **Python (NumPy, PyTorch, JAX), C, Rust, JavaScript, Mojo, Triton, CUDA**. Each student picks a primary language; every example is provided in that language plus a Python reference.

---

## Phase I — Foundations (Modules 1–4)

### Module 1 — *The Vector Awakens*
**Math:** Vectors in ℝⁿ, dot product, norms, projections, basis, span, linear independence, geometric interpretation of matrix-vector multiply.
**Why it matters:** Embeddings are vectors. Similarity is the dot product. Attention is a dot product. If you understand vectors, you have a foothold.
**Code Lab:** Implement `dot`, `norm`, `cosine_similarity`, `project` from scratch in Python *and* C. Visualize them on a live 3D canvas.
**Boss Fight:** Given 50 word embeddings, find the 5 closest pairs by cosine similarity without any library function for similarity.
**Forge Gate puzzle:** "The matrix M = [[2, 1], [1, 3]] acts on v = [3, 2]. Type the 2-vector Mv as 'a,b'." (Answer: `8,9`)
**Unlocks:** `MatrixLab` — a live matrix-vector visualizer with draggable basis vectors.
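
As a rough illustration of the Module 1 Code Lab, the four functions can be sketched in plain Python with no libraries. The helper `matvec` is an addition for this sketch (it is not one of the functions named above) and is used only to check the Forge Gate arithmetic.

```python
import math

def dot(u, v):
    """Sum of elementwise products."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length."""
    return math.sqrt(dot(v, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (norm(u) * norm(v))

def project(u, v):
    """Orthogonal projection of u onto the line spanned by v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * b for b in v]

def matvec(M, v):
    """Matrix-vector product: one dot product per row of M (helper for this sketch)."""
    return [dot(row, v) for row in M]

print(matvec([[2, 1], [1, 3]], [3, 2]))  # [8, 9]
print(project([3, 2], [1, 0]))           # [3.0, 0.0]
```

Writing `matvec` as a list of dot products makes the row-wise reading of matrix-vector multiplication explicit.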

### Module 2 — *The Calculus of Slopes*
**Math:** Limits, derivatives, gradient as a vector, partial derivatives, chain rule, directional derivatives, Jacobian, Hessian, Taylor expansion to second order.
**Why it matters:** Training a network = walking downhill on a loss surface. The gradient is which way is down.
**Code Lab:** Finite-difference gradient. Compare to symbolic. Build a tiny `autograd` in 80 lines.
**Boss Fight:** Implement reverse-mode autodiff for scalar functions composed of `+`, `*`, `exp`, `log`, `sin`. Pass a 6-function test suite.
**Forge Gate puzzle:** "f(x,y) = x²y + e^y. What is ∂f/∂y at (2, 0)? Round to 3 decimals." (Answer: `5.000`)
**Unlocks:** `GradientScope` — interactive loss surface explorer with live gradient arrows.
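
One way the finite-difference check from the Module 2 Code Lab might look, using the Forge Gate function as the test case. The names `grad_fd` and `grad_exact` and the step size `h` are choices made for this sketch, not part of the course material.

```python
import math

def f(x, y):
    return x**2 * y + math.exp(y)

def grad_fd(func, point, h=1e-6):
    """Central-difference estimate of the gradient of func at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        down = list(point)
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2 * h))
    return grad

def grad_exact(x, y):
    """Hand-derived partials: df/dx = 2xy, df/dy = x^2 + e^y."""
    return [2 * x * y, x**2 + math.exp(y)]

numeric = grad_fd(f, [2.0, 0.0])
exact = grad_exact(2.0, 0.0)
print(exact)  # [0.0, 5.0]
```

The numeric estimate should agree with the exact gradient to several decimal places; a large gap usually points to a mistake in the hand derivation.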

### Module 3 — *Matrices and Their Souls*
**Math:** Matrix multiplication as composition of linear maps, rank, null space, eigenvalues/vectors, SVD, condition number, positive definiteness.
**Why it matters:** Weight matrices have personalities. Their spectrum predicts training stability, gradient explosion/vanishing, expressivity.
**Code Lab:** Power iteration for top eigenvector. SVD-by-hand for 2×2. Compare your implementation to `np.linalg.svd`.
**Boss Fight:** Given a 4×4 matrix, compute its SVD without `np.linalg.svd`. Match to 1e-4.
**Forge Gate puzzle:** "The 2×2 matrix [[4, 1], [2, 3]] has two real eigenvalues. Their product is the determinant. What is it?" (Answer: `10`)
**Unlocks:** `SpectraScope` — eigenvalue/SVD visualizer for arbitrary matrices, with stability heatmap.
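
A minimal sketch of power iteration, run on the Forge Gate matrix. It assumes the matrix has a single dominant eigenvalue and that the starting vector is not orthogonal to its eigenvector; the function names and the step count are illustrative.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def power_iteration(M, steps=100):
    """Estimate the dominant eigenvalue and a unit eigenvector of a square matrix."""
    v = [1.0] + [0.0] * (len(M) - 1)
    for _ in range(steps):
        w = matvec(M, v)
        length = math.sqrt(sum(x * x for x in w))
        v = [x / length for x in w]
    # Rayleigh quotient v.Mv (v has unit length at this point).
    eigenvalue = sum(a * b for a, b in zip(v, matvec(M, v)))
    return eigenvalue, v

M = [[4.0, 1.0], [2.0, 3.0]]
top, vec = power_iteration(M)
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(round(top, 6), round(det / top, 6))  # 5.0 2.0
```

Because the determinant is the product of the eigenvalues, dividing it by the dominant eigenvalue recovers the second one for a 2×2 matrix.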

### Module 4 — *Probability for the Loss Function*
**Math:** PMFs, PDFs, expectation, variance, Bayes' rule, KL divergence, cross-entropy, MLE vs MAP, softmax as Gibbs distribution, Jensen's inequality.
**Why it matters:** Every loss function in deep learning is a likelihood in disguise. Cross-entropy *is* maximum likelihood.
**Code Lab:** Implement KL divergence. Show that minimizing cross-entropy ≡ minimizing KL(empirical || model). Visualize softmax temperature.
**Boss Fight:** Derive (with pen and paper) and then implement the gradient of cross-entropy w.r.t. softmax logits. Show it equals `softmax(z) - y`.
**Forge Gate puzzle:** "softmax([1, 2, 3]) — what is the 3rd component, rounded to 3 decimals?" (Answer: `0.665`)
**Unlocks:** `LossLab` — try any loss function on any prediction/target pair, see gradient flow.
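
The `softmax(z) - y` identity from the Boss Fight can be checked numerically. The sketch below assumes a target `y` that sums to 1 and uses a central-difference gradient as the reference; the function names and the `temperature` parameter are illustrative.

```python
import math

def softmax(z, temperature=1.0):
    """Numerically stable softmax: subtract the max before exponentiating."""
    scaled = [x / temperature for x in z]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """-sum_i y_i * log softmax(z)_i for a target distribution y."""
    p = softmax(z)
    return -sum(t * math.log(q) for t, q in zip(y, p))

def grad_analytic(z, y):
    return [p - t for p, t in zip(softmax(z), y)]

def grad_numeric(z, y, h=1e-6):
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += h
        down = list(z)
        down[i] -= h
        grad.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * h))
    return grad

z, y = [1.0, 2.0, 3.0], [0.0, 0.0, 1.0]
print(round(softmax(z)[2], 3))  # 0.665
```

Raising `temperature` flattens the distribution toward uniform, while lowering it concentrates mass on the largest logit.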

---

## Phase II — From Neuron to Network (Modules 5–8)

### Module 5 — *The Perceptron, Reborn*
**Math:** Single neuron as `σ(wᵀx + b)`, decision boundary geometry, why ReLU works, the universal approximation theorem (statement, not proof).
**Code Lab:** Train a perceptron on linearly separable data in raw NumPy. Then on XOR — fail, fail, fail. Then with one hidden layer — succeed.
**Boss Fight:** Build a 2-layer MLP from scratch (no torch, no autograd) that learns XOR and the spiral dataset.
**Forge Gate puzzle:** "The minimum number of hidden units needed to solve XOR with a feedforward net using ReLU is N. What is N?" (Answer: `2`)
**Unlocks:** `NeuronForge` — drag-and-drop neuron playground with live decision boundary.
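
To make the Forge Gate answer concrete, here is one hand-built network with two hidden ReLU units and a linear output that computes XOR. The weights are chosen by hand for illustration, not learned, and other choices work equally well.

```python
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    """Two hidden ReLU units and a linear output; weights picked by hand."""
    h1 = relu(x1 + x2)        # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

A single linear unit cannot produce this table, which is why the perceptron in the Code Lab fails on XOR until a hidden layer is added.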

### Module 6 — *Backprop, Step by Step*
**Math:** Backprop as the chain rule applied to a computational graph. Forward mode vs reverse mode. Memory–compute tradeoff. Why we use reverse mode for scalar losses.
**Code Lab:** A 100-line `micrograd`-style tensor lib. Forward & backward. Train MLP on MNIST in your own framework.
**Boss Fight:** Implement backprop *in C* for a 3-layer MLP. No Python allowed. Match Python output to 1e-5.
**Forge Gate puzzle:** "In a 3-layer MLP with widths [784, 256, 10], how many learnable parameters are there (no bias)? Answer as one integer." (Answer: `203264`)
**Unlocks:** `BackpropArena` — step debugger that walks through backprop one node at a time on a graph you draw.
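
A very small sketch of reverse-mode differentiation on a computational graph, in the spirit of the `micrograd`-style library mentioned above. It supports only `+` and `*` on scalars; the class name `Value` and its fields are assumptions of this sketch rather than the course's actual API.

```python
class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this node was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological order, so a node's grad is complete before it is propagated.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated over paths

x, y = Value(3.0), Value(2.0)
out = x * y + x * x   # d/dx = y + 2x = 8, d/dy = x = 3
out.backward()
print(x.grad, y.grad)  # 8.0 3.0
```

One backward pass yields the gradient with respect to every input at once, which is the reason reverse mode suits scalar losses with many parameters.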

### Module 7 — *Optimizers — Inside SGD, Momentum, Adam*
**Math:** SGD vs full-batch. Momentum as moving average. RMSProp as adaptive learning rate. Adam = Momentum + RMSProp. Bias correction. Convergence guarantees (sketch).
**Code Lab:** Implement SGD, SGD+Momentum, Adam in 40 lines each. Train MLP. Plot loss curves.
**Boss Fight:** Reproduce Figure 2 of the Adam paper on a toy 2D loss surface. Annotate why each optimizer follows its trajectory.
**Forge Gate puzzle:** "Adam's default β₁ × default β₂ = ? (4 decimal places)" (Answer: `0.8991`)
**Unlocks:** `OptiTrack` — race three optimizers on any loss surface you sketch.
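
The "Adam = Momentum + RMSProp" summary can be sketched for a single scalar parameter as follows. The `state` dictionary layout and the toy quadratic loss are assumptions made for illustration; the default β₁, β₂, and ε values are the commonly cited ones.

```python
import math

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; state holds m, v, t."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # momentum term
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # RMSProp term
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3).
w, state = 0.0, {"m": 0.0, "v": 0.0, "t": 0}
for _ in range(5000):
    w = adam_step(w, 2 * (w - 3.0), state, lr=0.01)
print(round(w, 2))
```

With bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, because `m_hat / sqrt(v_hat)` reduces to the sign of the gradient.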

### Module 8 — *Regularization, Initialization, Normalization*
**Math:** L1 / L2 / dropout / data aug. Xavier and He init derivations. BatchNorm, LayerNorm, RMSNorm. The variance-preservation argument.
**Code Lab:** Show vanishing gradients with bad init. Show that He init fixes it. Implement LayerNorm by hand.
**Boss Fight:** Build a 20-layer MLP. Make it trainable. (Hint: pick *two* of init/normalization/residual carefully.)
**Forge Gate puzzle:** "He initialization sets weights to N(0, σ²) where σ² = c/fan_in. What is c for ReLU?" (Answer: `2`)
**Unlocks:** `DeepStack` — visualize gradient norms layer-by-layer in arbitrary depth networks.
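
A by-hand LayerNorm, with RMSNorm alongside for comparison, might look like the sketch below. It operates on a single feature vector as a Python list; the parameter names `gamma`, `beta`, and `eps` follow common usage but are assumptions here.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # biased variance (divide by n)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma=None, eps=1e-5):
    """RMSNorm skips the mean subtraction and divides by the root mean square."""
    n = len(x)
    gamma = gamma or [1.0] * n
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```

Unlike BatchNorm, both functions use statistics from a single example, so they behave identically at training and inference time.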

---

## Phase III — Sequences and Attention (Modules 9–12)

### Module 9 — *Embeddings — Words Become Vectors*
**Math:** One-hot vs distributed representation. Skip-gram derivation. Negative sampling math. Cosine similarity revisited. The embedding matrix as a soft lookup.
**Code Lab:** Train word2vec from scratch on a tiny corpus. Visualize with t-SNE.
**Boss Fight:** Given an embedding matrix, solve the analogy "king - man + woman = ?". Top-5 nearest neighbors must include "queen".
**Forge Gate puzzle:** "If your vocab is 50,257 tokens and your embedding dim is 768, how many parameters in the embedding table?" (Answer: `38597376`)
**Unlocks:** `EmbedExplorer` — drag a word, see its 50 nearest neighbors live in a t-SNE/UMAP plot.
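
The analogy search in the Boss Fight amounts to vector arithmetic followed by a nearest-neighbor lookup. The sketch below uses tiny hand-made 3-dimensional embeddings, invented purely for illustration; real embeddings are learned and far noisier.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings with axes (royal, male, female), made up for this sketch.
emb = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.2, 0.1, 0.1],
}

def analogy(a, b, c, k=1):
    """Return the k words nearest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(emb[w], target), reverse=True)[:k]

print(analogy("king", "man", "woman"))  # ['queen']
print(50257 * 768)                      # 38597376
```

Excluding the three query words from the candidates is a common convention; without it, the nearest neighbor is often one of the inputs.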

### Module 10 — *Sequences — RNNs, LSTMs, and Why They Lost*
**Math:** Recurrence relations as dynamical systems. Vanishing/exploding gradients in BPTT. LSTM gating. The path-dependence problem that motivated attention.
**Code Lab:** Implement a vanilla RNN and an LSTM in NumPy. Train on a counting task.
**Boss Fight:** Train an LSTM to add two binary numbers character-by-character. ≥95% accuracy.
**Forge Gate puzzle:** "An LSTM cell has 4 gates, each with its own weight matrix and bias vector. If hidden_dim = h and input_dim = d, give the total parameter count as an expression in h and d, then evaluate at h=128, d=64." (Answer: `98816`, i.e. `4*(h*(h+d) + h)`)
**Unlocks:** `SeqLab` — visualize hidden state evolution token by token in any RNN you train.
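
The Forge Gate count can be written as a small function. The sketch assumes each of the four gates has one weight matrix acting on the concatenated hidden state and input, plus one bias vector; the function name is illustrative.

```python
def lstm_param_count(h, d, bias=True):
    """Parameters in one LSTM cell: 4 gates, each an h x (h + d) matrix plus an optional bias."""
    per_gate = h * (h + d) + (h if bias else 0)
    return 4 * per_gate

print(lstm_param_count(128, 64))              # 98816
print(lstm_param_count(128, 64, bias=False))  # 98304
```

Library implementations can differ from this count. PyTorch's `torch.nn.LSTM`, for example, keeps separate input-hidden and hidden-hidden bias vectors, which adds another `4*h` parameters per layer.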

### Module 11 — *Attention Is All You Need*
**Math:** From "weighted average of values" to **softmax(QKᵀ/√d_k)V**. Why the √d_k. Multi-head as parallel subspaces. Causal masking. Positional encoding (sinusoidal *and* RoPE — derive RoPE).
**Code Lab:** Implement scaled dot-product attention. Then multi-head. Then a full encoder block. In PyTorch *and* in raw NumPy.
**Boss Fight:** Write your own multi-head attention layer that exactly matches `torch.nn.MultiheadAttention` outputs (same weights).
**Forge Gate puzzle:** "In an attention layer with d_model=512, num_heads=8, each head has d_k = ?" (Answer: `64`)
**Unlocks:** `AttentionMap` — feed any text, see live attention patterns per head per layer.
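
Scaled dot-product attention with an optional causal mask can be sketched in pure Python on lists of row vectors. This is a single-head, unbatched illustration with no learned projections; the function signature is an assumption of the sketch.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors; returns (output, weights)."""
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V, causal=True)
print(out[0])  # [1.0, 2.0]
```

Under the causal mask the first position can only see itself, so its output is exactly the first value vector. Dividing by √d_k keeps the score variance roughly independent of the head dimension, which stops the softmax from saturating as d_k grows.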

### Module 12 — *The Transformer Block, Built From Parts*
**Math:** Pre-LN vs Post-LN. The full transformer block: Attn → Add&Norm → FFN → Add&Norm (the Post-LN ordering; Pre-LN moves each norm in front of its sublayer). Why the FFN is 4× wide. SwiGLU and GeGLU.
**Code Lab:** Assemble a full GPT-style block. Stack 6 of them. Train on Shakespeare. Generate text.
**Boss Fight:** Implement a complete decoder-only transformer in **one language of the student's choice that isn't Python**. Run it to a perplexity target.
**Forge Gate puzzle:** "GPT-2 small has 12 layers, d_model=768, 4× FFN. How many parameters are in the FFN of ONE block (ignoring biases)? Answer as integer." (Answer: `4718592`, i.e. `2 * 768 * 3072`)
**Unlocks:** `BlockBuilder` — Lego-style transformer block assembler with live forward-pass tracing.
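
The Forge Gate arithmetic extends naturally to a whole block. The sketch below counts weights only, ignoring biases and normalization parameters, and assumes a plain two-matrix FFN with four square attention projections; the function name is illustrative.

```python
def block_params(d_model, ffn_mult=4):
    """Weight counts for one transformer block, ignoring biases and norm parameters."""
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    d_ff = ffn_mult * d_model
    ffn = 2 * d_model * d_ff       # up-projection and down-projection
    return {"attn": attn, "ffn": ffn, "total": attn + ffn}

print(block_params(768))  # {'attn': 2359296, 'ffn': 4718592, 'total': 7077888}
```

With a 4× FFN, the feed-forward weights account for two thirds of the block. A gated FFN such as SwiGLU has a third matrix, so this count would need adjusting for that variant.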

---

## Phase IV — Modern LLMs and Shipping Research (Modules 13–16)

### Module 13 — *Scaling Laws and the Modern LM Recipe*
**Math:** Chinchilla scaling law derivation. Compute-optimal training. Tokens-per-parameter ratios. Loss as power law in compute.
**Code Lab:** Fit a power law to provided loss-vs-compute data. Predict the loss at 10× compute. Be within 5%.
**Boss Fight:** Given a $50 GPU budget, design the *largest* trainable transformer that hits a target loss on TinyStories. Justify with scaling math.
**Forge Gate puzzle:** "Chinchilla recommends approximately N tokens per parameter for compute-optimal training. What is N? (the famous number)" (Answer: `20`)
**Unlocks:** `ScaleScope` — plug in your hyperparameters, see predicted final loss curves.
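
The power-law fit in the Code Lab can be approached as a straight-line fit in log-log space. The sketch below assumes a pure power law with no irreducible-loss term and uses synthetic, noise-free data invented for illustration; it is not the provided course dataset.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line in log-log space: log L = log a - b * log C."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope   # L is approximately a * C ** (-b)

# Synthetic data generated from L = 10 * C^-0.05 (made up for this sketch).
compute = [1e15, 1e16, 1e17, 1e18]
loss = [10 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
predicted = a * (10 * compute[-1]) ** (-b)   # extrapolate to 10x the largest budget
print(round(a, 3), round(b, 3))  # 10.0 0.05
```

On real loss curves an additive constant for irreducible loss usually matters, and fitting it requires a nonlinear optimizer rather than this closed-form regression.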

### Module 14 — *Efficient Attention — FlashAttention, KV Cache, MQA, GQA*
**Math:** Memory hierarchy of a GPU. Why standard attention is bandwidth-bound. FlashAttention's tiling argument. KV cache during inference. Multi-Query and Grouped-Query Attention savings.
**Code Lab:** Implement KV cache for your transformer. Measure speedup. Implement GQA. Optionally: write a FlashAttention kernel in Triton.
**Boss Fight:** Take your Module 12 transformer. Make inference 5× faster with KV cache + GQA. Measure with `torch.profiler`.
**Forge Gate puzzle:** "GPT-3's MHA has 96 heads. If we converted to GQA with 8 KV heads, what's the KV-cache compression ratio?" (Answer: `12`)
**Unlocks:** `KernelLab` — Triton playground with side-by-side benchmark vs CUDA reference.
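
The KV-cache saving from GQA follows from a simple size formula. The sketch below uses the GPT-3 175B shape (96 layers, 96 heads, head dimension 128); the 2048-token context, batch size of 1, and 2-byte values are assumptions chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer (the factor 2 is K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

mha = kv_cache_bytes(96, 96, 128, 2048)   # one KV head per query head
gqa = kv_cache_bytes(96, 8, 128, 2048)    # 8 shared KV heads
print(mha / 2**30, gqa / 2**30, mha // gqa)  # 9.0 0.75 12
```

The ratio depends only on the number of KV heads, so it stays at 12 for any sequence length or batch size.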

### Module 15 — *Training a Real LM — Data, Tokenizers, Pretraining, Finetuning, RLHF*
**Math:** BPE tokenization. Token frequencies and Zipf. Cross-entropy on a token stream. Reward modeling and PPO objective (full derivation).
**Code Lab:** Train your own BPE tokenizer. Train your transformer on 1B tokens of OpenWebText. Then SFT on a small chat dataset. Then DPO on a preference dataset.
**Boss Fight:** Ship a 100M-parameter chat-style model that can answer 10 held-out questions sensibly. Tested by an LLM-as-judge rubric.
**Forge Gate puzzle:** "DPO replaces the RL step with a closed-form loss: -log(sigmoid(β·(r_chosen - r_rejected))), where r_y = log(π_θ(y|x)/π_ref(y|x)) for a response y to prompt x. What is the default β in the paper? (decimal)" (Answer: `0.1`)
**Unlocks:** `TrainerHub` — full pretraining+SFT+DPO pipeline launcher with live loss + sample generation.
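
The DPO loss from the Forge Gate can be sketched for a single preference pair, taking sequence log-probabilities as inputs. The function signature and the example log-probabilities are hypothetical; in practice the inputs would be summed token log-probabilities from the policy and the frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (r_chosen - r_rejected)) with r = log pi_theta - log pi_ref."""
    r_chosen = logp_chosen - ref_logp_chosen
    r_rejected = logp_rejected - ref_logp_rejected
    margin = beta * (r_chosen - r_rejected)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical sequence log-probabilities, made up for this sketch.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))     # 0.6931 (policy equals reference)
print(dpo_loss(-8.0, -14.0, -10.0, -12.0) < math.log(2))  # True (chosen response now favored)
```

When the policy matches the reference, the margin is zero and the loss is log 2. Training lowers the loss by raising the chosen response's log-ratio relative to the rejected one.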

### Module 16 — *The Capstone — Ship a Novel Research LM*
**No password. The student is now a peer.**
**The Capstone:** Pick a research direction. Examples we offer:
- Train a Mixture-of-Experts variant of your Module 15 model.
- Implement Mamba/SSM and compare to your transformer at matched compute.
- Apply RLHF with a custom reward model focused on a niche (poetry, code, math).
- Write a paper-length writeup with an ablation table and a benchmark.

**Deliverables:** Code, trained weights, 4-page writeup in NeurIPS-workshop format, a live demo. The course "graduates" the student when an automated reviewer (using a frozen LLM judge with rubric) and one human (the course operator) sign off.
**Unlocks:** `ResearchLab` — the final tool. A full training cluster front-end (works on Modal/Lambda/cloud), experiment tracker, and a "submit to arXiv" stub.

---

## A note on "flavor"

The course adapts to the student's stated interests. At signup the student picks two flavors from:

- **The Hacker** — extra C / Rust / CUDA challenges, kernel-level optimizations.
- **The Mathematician** — extra proofs, optional deep dives into measure theory, optimization theory.
- **The Linguist** — extra material on tokenization, low-resource languages, multilingual models.
- **The Artist** — projects geared toward poetry, music, image-conditioned LMs.
- **The Scientist** — extra emphasis on benchmarking, ablations, reproducibility.
- **The Builder** — extra emphasis on deployment, quantization, edge inference.

The interactive course shell pulls in the appropriate side-quest tracks at each module.