MLP Language Model

Goal: Build a character-level language model with an embedding table and MLP hidden layer. Bengio et al. 2003 — the paper that launched neural language models. Inspired by Karpathy’s makemore Part 2.

Prerequisites: Language Models, Embeddings, Neurons and Activation Functions, 16 - Bigram Language Model


From Bigrams to Context

The bigram model conditions on only the single previous character. The MLP model sees a context window of several characters, embeds each into a learned vector, and predicts the next character from the concatenated embeddings.

[c1, c2, c3] → embed each → [e1 ⊕ e2 ⊕ e3] → hidden layer → output
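The data flow above can be checked with a few lines of tensor shapes (a sketch; the sizes match the hyperparameters used later in this tutorial):

```python
import torch

# Shape walk-through of: context indices → embed each → concatenate
vocab_size, n_embed, block_size = 27, 10, 3
C = torch.randn(vocab_size, n_embed)     # embedding table, one row per character

context = torch.tensor([[5, 13, 1]])     # one example: three character indices
emb = C[context]                         # (1, 3, 10): one vector per character
concat = emb.view(1, -1)                 # (1, 30): e1 ⊕ e2 ⊕ e3 concatenated
print(emb.shape, concat.shape)           # torch.Size([1, 3, 10]) torch.Size([1, 30])
```

Indexing the table with a batch of index rows produces one embedding per index, and `view` flattens the per-character vectors into a single input row for the hidden layer.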

Setup

import os
import random

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
 
# Same dataset as tutorial 16 (small fallback list if names.txt is missing)
if os.path.exists('names.txt'):
    words = open('names.txt', 'r').read().splitlines()
else:
    words = [
        "emma", "olivia", "ava", "sophia", "isabella", "mia", "charlotte", "amelia",
        "harper", "evelyn", "abigail", "emily", "elizabeth", "sofia", "avery",
        "ella", "scarlett", "grace", "chloe", "victoria", "riley", "aria", "lily",
        "aurora", "zoey", "nora", "luna", "hannah", "penelope", "layla",
        "eleanor", "stella", "violet", "hazel", "savannah", "audrey",
        "brooklyn", "bella", "claire", "skylar", "lucy", "paisley", "everly",
        "anna", "caroline", "genesis", "emilia", "kennedy", "maya", "willow",
    ]
 
chars = sorted(set(''.join(words)))
stoi = {c: i+1 for i, c in enumerate(chars)}
stoi['.'] = 0
itos = {i: c for c, i in stoi.items()}
vocab_size = len(stoi)

Build the Dataset

block_size = 3  # context length: predict from 3 previous characters
 
def build_dataset(words):
    X, Y = [], []
    for word in words:
        context = [0] * block_size  # start with '...'
        for ch in word + '.':
            ix = stoi[ch]
            X.append(context[:])
            Y.append(ix)
            context = context[1:] + [ix]  # slide window
    return torch.tensor(X), torch.tensor(Y)
 
# Train/val/test split
random.seed(42)
random.shuffle(words)
n1, n2 = int(0.8 * len(words)), int(0.9 * len(words))
 
X_train, Y_train = build_dataset(words[:n1])
X_val, Y_val = build_dataset(words[n1:n2])
X_test, Y_test = build_dataset(words[n2:])
 
print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
 
# Show a few examples
for i in range(5):
    context = ''.join(itos[c.item()] for c in X_train[i])
    target = itos[Y_train[i].item()]
    print(f"  '{context}' → '{target}'")

The Model

# Hyperparameters
n_embed = 10      # embedding dimension (each char → 10d vector)
n_hidden = 200    # hidden layer size
 
torch.manual_seed(42)
 
# Parameters
C = torch.randn(vocab_size, n_embed)                              # embedding table
W1 = torch.randn(block_size * n_embed, n_hidden) * (5/3) / (block_size * n_embed)**0.5  # Kaiming
b1 = torch.randn(n_hidden) * 0.01
W2 = torch.randn(n_hidden, vocab_size) * 0.01                     # small init for output
b2 = torch.zeros(vocab_size)                                      # zero init for output bias
 
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True
 
print(f"Total parameters: {sum(p.nelement() for p in parameters)}")

Why these initializations?

  • C (embeddings): random is fine, they’ll be learned
  • W1: Kaiming init — gain / sqrt(fan_in) keeps activations from exploding/vanishing
  • W2: small — we want initial predictions to be roughly uniform (low confidence)
  • b2: zero — no bias toward any character initially
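The effect of the Kaiming scaling can be verified directly (a quick illustrative check with arbitrary sizes, not part of the model above): without it, pre-activations have a standard deviation of about sqrt(fan_in), which drives tanh deep into saturation.

```python
import torch

torch.manual_seed(0)
fan_in = 30
x = torch.randn(1000, fan_in)                     # unit-variance inputs

W_naive = torch.randn(fan_in, 200)                # no scaling
W_kaiming = torch.randn(fan_in, 200) * (5/3) / fan_in**0.5  # tanh gain / sqrt(fan_in)

print((x @ W_naive).std())    # ~ sqrt(30) ≈ 5.5: tanh would saturate
print((x @ W_kaiming).std())  # ~ 5/3 ≈ 1.7: within tanh's useful range
```

Saturated tanh units output ±1 almost everywhere, so their local gradient (1 − tanh²) is near zero and learning stalls; keeping the pre-activation std near 1 avoids that at initialization.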

Training

losses_train = []
batch_size = 32
 
for step in range(20000):
    # Mini-batch
    ix = torch.randint(0, len(X_train), (batch_size,))
    X_batch, Y_batch = X_train[ix], Y_train[ix]
 
    # Forward pass
    emb = C[X_batch]                           # (batch, block_size, n_embed)
    h_pre = emb.view(-1, block_size * n_embed) @ W1 + b1  # (batch, n_hidden)
    h = torch.tanh(h_pre)                      # activation
    logits = h @ W2 + b2                       # (batch, vocab_size)
    loss = F.cross_entropy(logits, Y_batch)
 
    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
 
    # Update
    lr = 0.1 if step < 10000 else 0.01
    for p in parameters:
        p.data -= lr * p.grad
 
    if step % 2000 == 0:
        losses_train.append(loss.item())
        print(f"Step {step:5d} | Loss: {loss.item():.4f}")
 
# Final losses (evaluation only, so no gradients needed)
@torch.no_grad()
def split_loss(X, Y):
    emb = C[X]
    h = torch.tanh(emb.view(-1, block_size * n_embed) @ W1 + b1)
    logits = h @ W2 + b2
    return F.cross_entropy(logits, Y).item()

train_loss = split_loss(X_train, Y_train)
val_loss = split_loss(X_val, Y_val)
print(f"\nTrain loss: {train_loss:.4f}, Val loss: {val_loss:.4f}")

Visualize the Learned Embeddings

The embedding table maps each character to an n_embed-dimensional vector (10-d here). Projected down to 2D, similar characters should cluster together:

# If n_embed > 2, project to 2D with PCA
from sklearn.decomposition import PCA
 
emb_np = C.detach().numpy()
if n_embed > 2:
    emb_2d = PCA(n_components=2).fit_transform(emb_np)
else:
    emb_2d = emb_np
 
plt.figure(figsize=(8, 8))
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], s=50)
for i in range(vocab_size):
    plt.annotate(itos[i], emb_2d[i], fontsize=12, ha='center', va='center')
plt.title("Learned character embeddings")
plt.grid(alpha=0.3)
plt.show()

Vowels often cluster together. Common consonants form their own groups.


Sample from the Model

@torch.no_grad()  # sampling is inference only; no gradients needed
def sample(n=20):
    names = []
    for _ in range(n):
        name = []
        context = [0] * block_size
        while True:
            emb = C[torch.tensor(context)]  # (block_size, n_embed)
            h = torch.tanh(emb.view(1, -1) @ W1 + b1)
            logits = h @ W2 + b2
            probs = F.softmax(logits, dim=1)
            ix = torch.multinomial(probs, 1).item()
            if ix == 0:
                break
            name.append(itos[ix])
            context = context[1:] + [ix]
        names.append(''.join(name))
    return names
 
print("Generated names:")
for name in sample(15):
    print(f"  {name}")

Compare: Bigram vs MLP

# The bigram model from tutorial 16 reaches ~2.45 NLL on the full names.txt dataset
# This MLP should reach ~2.1-2.2 NLL, better because it conditions on 3 chars of context
print(f"MLP model val loss: {val_loss:.4f}")
print(f"Bigram baseline:    ~2.45")
print(f"Improvement:        ~{2.45 - val_loss:.2f}")

Hyperparameter Exploration

# What matters most? Try varying:
experiments = {
    "block_size": [2, 3, 5, 8],       # more context
    "n_embed":    [2, 5, 10, 20],      # richer embeddings
    "n_hidden":   [64, 128, 200, 300], # more capacity
}
# Run each and track val loss. You'll find:
# - block_size: big gains up to ~5, then diminishing
# - n_embed: big gains up to ~10, then diminishing
# - n_hidden: big gains up to ~200, then overfitting
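A sweep like this can be sketched with a small harness (hypothetical helper, illustrative sizes and step counts; synthetic random data is used here so the snippet runs on its own, so swap in X_train/Y_train from above for real results):

```python
import torch
import torch.nn.functional as F

def train_and_eval(Xtr, Ytr, Xva, Yva, vocab_size, block_size, n_embed, n_hidden,
                   steps=500, lr=0.1):
    """Train a fresh copy of the MLP for a few steps, return val loss."""
    g = torch.Generator().manual_seed(42)
    C = torch.randn(vocab_size, n_embed, generator=g)
    W1 = torch.randn(block_size * n_embed, n_hidden, generator=g) * (5/3) / (block_size * n_embed)**0.5
    b1 = torch.randn(n_hidden, generator=g) * 0.01
    W2 = torch.randn(n_hidden, vocab_size, generator=g) * 0.01
    b2 = torch.zeros(vocab_size)
    params = [C, W1, b1, W2, b2]
    for p in params:
        p.requires_grad = True
    for _ in range(steps):
        ix = torch.randint(0, len(Xtr), (32,), generator=g)
        emb = C[Xtr[ix]]
        h = torch.tanh(emb.view(-1, block_size * n_embed) @ W1 + b1)
        loss = F.cross_entropy(h @ W2 + b2, Ytr[ix])
        for p in params:
            p.grad = None
        loss.backward()
        for p in params:
            p.data -= lr * p.grad
    with torch.no_grad():
        emb = C[Xva]
        h = torch.tanh(emb.view(-1, block_size * n_embed) @ W1 + b1)
        return F.cross_entropy(h @ W2 + b2, Yva).item()

# Example sweep over n_hidden (random placeholder data, 27-char vocab)
Xtr = torch.randint(0, 27, (1000, 3)); Ytr = torch.randint(0, 27, (1000,))
Xva = torch.randint(0, 27, (200, 3));  Yva = torch.randint(0, 27, (200,))
for n_hidden in [64, 200]:
    vl = train_and_eval(Xtr, Ytr, Xva, Yva, 27, 3, 10, n_hidden)
    print(f"n_hidden={n_hidden}: val loss {vl:.3f}")
```

Re-seeding inside the helper keeps each setting's initialization comparable, so differences in val loss reflect the hyperparameter rather than initialization luck.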

Exercises

  1. Deeper network: Add a second hidden layer. Does it help? Watch out for vanishing gradients — you may need better initialization or batchnorm (18 - Activations and Initialization Deep Dive).

  2. Context window sweep: Train models with block_size 1, 2, 3, 5, 8, 10. Plot val loss vs context. Where do returns diminish?

  3. Learning rate finder: Before training, run 1000 steps with lr exponentially increasing from 1e-4 to 1. Plot loss vs lr. The optimal lr is just before the loss spikes.

  4. Embedding dimension vs vocab size: With 27 chars, do you need 10 dimensions? Try n_embed=2 and visualize the 2D embeddings directly (no PCA needed).

  5. Replace tanh with ReLU: Does it train faster? Watch for dead neurons.
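The learning-rate finder in exercise 3 can be sketched like this (a hypothetical setup using random data and a single linear layer standing in for the MLP, just to show the sweep mechanics):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
X = torch.randint(0, 27, (500, 3))       # random placeholder contexts
Y = torch.randint(0, 27, (500,))         # random placeholder targets
C = torch.randn(27, 10, requires_grad=True)
W = torch.randn(30, 27, requires_grad=True)

lrs = 10 ** torch.linspace(-4, 0, 1000)  # 1e-4 → 1, exponentially spaced
losses = []
for lr in lrs:
    ix = torch.randint(0, len(X), (32,))
    emb = C[X[ix]].view(-1, 30)          # embed and concatenate
    loss = F.cross_entropy(emb @ W, Y[ix])
    for p in (C, W):
        p.grad = None
    loss.backward()
    for p in (C, W):
        p.data -= lr.item() * p.grad     # each step uses a larger lr
    losses.append(loss.item())

# plt.plot(lrs, losses); plt.xscale('log')  # pick the lr just before the spike
```

The plot is typically flat at tiny learning rates, drops through a sweet spot, then blows up; a good training lr sits near the bottom of the drop, before the blow-up.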


Next: 18 - Activations and Initialization Deep Dive — diagnose what goes wrong inside deep networks.