MLP Language Model
Goal: Build a character-level language model with an embedding table and MLP hidden layer. Bengio et al. 2003 — the paper that launched neural language models. Inspired by Karpathy’s makemore Part 2.
Prerequisites: Language Models, Embeddings, Neurons and Activation Functions, 16 - Bigram Language Model
From Bigrams to Context
The bigram model only sees 1 character. The MLP model sees a context window of characters, embeds each into a learned vector, and predicts the next character from the concatenated embeddings.
[c1, c2, c3] → embed each → [e1 ⊕ e2 ⊕ e3] → hidden layer → output
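The shape bookkeeping in that pipeline can be sketched standalone with a few tensor ops (a toy random table here, not the trained one):

```python
import torch

torch.manual_seed(0)
vocab_size, block_size, n_embed = 27, 3, 10
C = torch.randn(vocab_size, n_embed)   # embedding table: one row per character
idx = torch.tensor([[5, 13, 13]])      # one context window of 3 character indices
emb = C[idx]                           # (1, 3, 10): each character becomes a vector
concat = emb.view(1, -1)               # (1, 30): e1 ⊕ e2 ⊕ e3, fed to the hidden layer
print(emb.shape, concat.shape)
```

Indexing a tensor with an index tensor (`C[idx]`) is all an embedding lookup is; `view` then flattens the three vectors into one row.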
Setup
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import random
import os
# Same dataset as tutorial 16; fall back to a small sample if names.txt is missing
if os.path.exists('names.txt'):
    words = open('names.txt', 'r').read().splitlines()
else:
    words = [
        "emma", "olivia", "ava", "sophia", "isabella", "mia", "charlotte", "amelia",
        "harper", "evelyn", "abigail", "emily", "elizabeth", "sofia", "avery",
        "ella", "scarlett", "grace", "chloe", "victoria", "riley", "aria", "lily",
        "aurora", "zoey", "nora", "luna", "hannah", "penelope", "layla",
        "eleanor", "stella", "violet", "hazel", "savannah", "audrey",
        "brooklyn", "bella", "claire", "skylar", "lucy", "paisley", "everly",
        "anna", "caroline", "genesis", "emilia", "kennedy", "maya", "willow",
    ]
chars = sorted(set(''.join(words)))
stoi = {c: i+1 for i, c in enumerate(chars)}
stoi['.'] = 0
itos = {i: c for c, i in stoi.items()}
vocab_size = len(stoi)
Build the Dataset
block_size = 3 # context length: predict from 3 previous characters
def build_dataset(words):
    X, Y = [], []
    for word in words:
        context = [0] * block_size  # start with '...'
        for ch in word + '.':
            ix = stoi[ch]
            X.append(context[:])
            Y.append(ix)
            context = context[1:] + [ix]  # slide window
    return torch.tensor(X), torch.tensor(Y)
# Train/val/test split
random.seed(42)
random.shuffle(words)
n1, n2 = int(0.8 * len(words)), int(0.9 * len(words))
X_train, Y_train = build_dataset(words[:n1])
X_val, Y_val = build_dataset(words[n1:n2])
X_test, Y_test = build_dataset(words[n2:])
print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
# Show a few examples
for i in range(5):
    context = ''.join(itos[c.item()] for c in X_train[i])
    target = itos[Y_train[i].item()]
    print(f" '{context}' → '{target}'")
The Model
# Hyperparameters
n_embed = 10 # embedding dimension (each char → 10d vector)
n_hidden = 200 # hidden layer size
torch.manual_seed(42)
# Parameters
C = torch.randn(vocab_size, n_embed) # embedding table
W1 = torch.randn(block_size * n_embed, n_hidden) * (5/3) / (block_size * n_embed)**0.5 # Kaiming
b1 = torch.randn(n_hidden) * 0.01
W2 = torch.randn(n_hidden, vocab_size) * 0.01 # small init for output
b2 = torch.randn(vocab_size) * 0
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True
print(f"Total parameters: {sum(p.nelement() for p in parameters)}")
Why these initializations?
- C (embeddings): random is fine, they’ll be learned
- W1: Kaiming init — gain / sqrt(fan_in) keeps activations from exploding/vanishing
- W2: small — we want initial predictions to be roughly uniform (low confidence)
- b2: zero — no bias toward any character initially
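A quick standalone check of why the Kaiming scale matters for tanh: feed (assumed) unit-Gaussian inputs through naive vs scaled weights and count saturated units (|h| > 0.99), whose gradients are nearly zero.

```python
import torch

torch.manual_seed(42)
fan_in, n_hidden = 30, 200          # block_size * n_embed = 30
x = torch.randn(1000, fan_in)       # stand-in inputs, roughly unit Gaussian
W_naive = torch.randn(fan_in, n_hidden)                         # pre-activations have std ≈ sqrt(30)
W_scaled = torch.randn(fan_in, n_hidden) * (5/3) / fan_in**0.5  # Kaiming scale for tanh
sat_naive = (torch.tanh(x @ W_naive).abs() > 0.99).float().mean().item()
sat_scaled = (torch.tanh(x @ W_scaled).abs() > 0.99).float().mean().item()
print(f"saturated units: naive {sat_naive:.0%}, Kaiming {sat_scaled:.0%}")
```

With naive init, well over half the tanh units sit in the flat tails; the Kaiming scale keeps most of them in the useful range.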
Training
losses_train = []
batch_size = 32
for step in range(20000):
    # Mini-batch
    ix = torch.randint(0, len(X_train), (batch_size,))
    X_batch, Y_batch = X_train[ix], Y_train[ix]
    # Forward pass
    emb = C[X_batch]                                      # (batch, block_size, n_embed)
    h_pre = emb.view(-1, block_size * n_embed) @ W1 + b1  # (batch, n_hidden)
    h = torch.tanh(h_pre)                                 # activation
    logits = h @ W2 + b2                                  # (batch, vocab_size)
    loss = F.cross_entropy(logits, Y_batch)
    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # Update
    lr = 0.1 if step < 10000 else 0.01
    for p in parameters:
        p.data -= lr * p.grad
    if step % 2000 == 0:
        losses_train.append(loss.item())
        print(f"Step {step:5d} | Loss: {loss.item():.4f}")
# Final losses
with torch.no_grad():
    emb = C[X_train]; h = torch.tanh(emb.view(-1, block_size * n_embed) @ W1 + b1); logits = h @ W2 + b2
    train_loss = F.cross_entropy(logits, Y_train).item()
    emb = C[X_val]; h = torch.tanh(emb.view(-1, block_size * n_embed) @ W1 + b1); logits = h @ W2 + b2
    val_loss = F.cross_entropy(logits, Y_val).item()
print(f"\nTrain loss: {train_loss:.4f}, Val loss: {val_loss:.4f}")
Visualize the Learned Embeddings
The embedding table maps each character to an n_embed-dimensional vector (10 here). Similar characters should cluster together:
# If n_embed > 2, project to 2D with PCA
from sklearn.decomposition import PCA
emb_np = C.detach().numpy()
if n_embed > 2:
    emb_2d = PCA(n_components=2).fit_transform(emb_np)
else:
    emb_2d = emb_np
plt.figure(figsize=(8, 8))
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], s=50)
for i in range(vocab_size):
    plt.annotate(itos[i], emb_2d[i], fontsize=12, ha='center', va='center')
plt.title("Learned character embeddings")
plt.grid(alpha=0.3)
plt.show()
Vowels often cluster together. Common consonants form their own groups.
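You can also quantify clustering directly with cosine similarity between embedding rows. A sketch below uses a random stand-in table and a hard-coded a–z vocabulary so it runs on its own; swap in the trained C and the stoi from above, and vowel pairs like ('a', 'e') should then score higher than vowel–consonant pairs like ('a', 'x'):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = torch.randn(27, 10)  # stand-in for the trained embedding table
stoi = {c: i + 1 for i, c in enumerate('abcdefghijklmnopqrstuvwxyz')}
stoi['.'] = 0

def emb_sim(a, b):
    # Cosine similarity between the embedding vectors of two characters
    return F.cosine_similarity(C[stoi[a]].unsqueeze(0), C[stoi[b]].unsqueeze(0)).item()

print(f"sim('a','e') = {emb_sim('a', 'e'):+.3f}")
print(f"sim('a','x') = {emb_sim('a', 'x'):+.3f}")
```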
Sample from the Model
def sample(n=20):
    names = []
    for _ in range(n):
        name = []
        context = [0] * block_size
        while True:
            emb = C[torch.tensor(context)]  # (block_size, n_embed)
            h = torch.tanh(emb.view(1, -1) @ W1 + b1)
            logits = h @ W2 + b2
            probs = F.softmax(logits, dim=1)
            ix = torch.multinomial(probs, 1).item()
            if ix == 0:
                break
            name.append(itos[ix])
            context = context[1:] + [ix]
        names.append(''.join(name))
    return names
print("Generated names:")
for name in sample(15):
    print(f" {name}")
Compare: Bigram vs MLP
# The bigram model from tutorial 16 gets ~2.45 NLL
# This MLP should get ~2.1-2.2 NLL — better because it sees 3 chars of context
print(f"MLP model val loss: {val_loss:.4f}")
print(f"Bigram baseline: ~2.45")
print(f"Improvement: ~{2.45 - val_loss:.2f}")
Hyperparameter Exploration
# What matters most? Try varying:
experiments = {
    "block_size": [2, 3, 5, 8],       # more context
    "n_embed": [2, 5, 10, 20],        # richer embeddings
    "n_hidden": [64, 128, 200, 300],  # more capacity
}
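A minimal sweep driver might look like the sketch below. It assumes a hypothetical train(block_size, n_embed, n_hidden) helper that wraps the training loop above and returns the final val loss (stubbed here so the sketch is self-contained); the experiments grid is repeated for the same reason.

```python
def train(block_size=3, n_embed=10, n_hidden=200):
    # Stub for illustration: replace the body with the training loop above
    # and return the resulting validation loss.
    return 2.2

experiments = {  # same grid as above
    "block_size": [2, 3, 5, 8],
    "n_embed": [2, 5, 10, 20],
    "n_hidden": [64, 128, 200, 300],
}

results = {}
for name, values in experiments.items():
    for v in values:
        # Vary one hyperparameter at a time, keeping the others at their defaults
        kwargs = {"block_size": 3, "n_embed": 10, "n_hidden": 200, name: v}
        results[(name, v)] = train(**kwargs)

best = min(results, key=results.get)
print(f"best: {best} (val loss {results[best]:.4f})")
```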
# Run each and track val loss. You'll find:
# - block_size: big gains up to ~5, then diminishing
# - n_embed: big gains up to ~10, then diminishing
# - n_hidden: big gains up to ~200, then overfitting
Exercises
- Deeper network: Add a second hidden layer. Does it help? Watch out for vanishing gradients — you may need better initialization or batchnorm (18 - Activations and Initialization Deep Dive).
- Context window sweep: Train models with block_size 1, 2, 3, 5, 8, 10. Plot val loss vs context. Where do returns diminish?
- Learning rate finder: Before training, run 1000 steps with lr exponentially increasing from 1e-4 to 1. Plot loss vs lr. The optimal lr is just before the loss spikes.
- Embedding dimension vs vocab size: With 27 chars, do you need 10 dimensions? Try n_embed=2 and visualize the 2D embeddings directly (no PCA needed).
- Replace tanh with ReLU: Does it train faster? Watch for dead neurons.
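For the learning-rate-finder exercise, one possible sketch is below. It uses freshly initialized parameters and random stand-in data so it runs on its own; substitute the real parameters and mini-batches from the tutorial.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
# Stand-in data and fresh parameters (the finder must start from an untrained model)
X = torch.randint(0, 27, (1000, 3))
Y = torch.randint(0, 27, (1000,))
C = torch.randn(27, 10, requires_grad=True)
W1 = torch.randn(30, 200, requires_grad=True)
b1 = torch.randn(200, requires_grad=True)
W2 = torch.randn(200, 27, requires_grad=True)
b2 = torch.randn(27, requires_grad=True)
parameters = [C, W1, b1, W2, b2]

lrs = 10 ** torch.linspace(-4, 0, 1000)  # exponentially spaced from 1e-4 to 1
lr_losses = []
for lr in lrs:
    ix = torch.randint(0, len(X), (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(32, -1) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in parameters:
            p -= lr * p.grad
    lr_losses.append(loss.item())
# Plot lr_losses against lrs (log x-axis); pick the lr just before the loss spikes.
```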
Next: 18 - Activations and Initialization Deep Dive — diagnose what goes wrong inside deep networks.