Mixture of Experts

What

A sparse architecture where each input is processed by only a subset of the model’s parameters. A router (gating network) decides which “expert” sub-networks to activate.

input → router → selects top-k experts → each expert processes input → combine outputs

Why it matters

  • Scale model capacity (total parameters) without scaling compute proportionally
  • A 100B parameter MoE model might only use 20B parameters per token
  • Mixtral 8x7B, DeepSeek-V2/V3/R1, Llama 4, Gemini 1.5/2.x all use MoE (GPT-4 is widely reported to use MoE but OpenAI has not officially confirmed)
  • A large share of recent open-weight releases — including most new frontier-scale models — use MoE

Core concepts

Expert

A feedforward network (one of many in each layer). In transformers, experts replace the feedforward (FFN) sub-layer in each transformer block. Each expert has the same architecture but different learned weights — like having many specialized brains.

Router/Gate

A small network (usually a linear layer + softmax) that assigns input tokens to experts:

import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
    
    def forward(self, x):
        # x: [batch, seq_len, d_model]
        logits = self.gate(x)  # [batch, seq_len, n_experts]
        probs = F.softmax(logits, dim=-1)
        return probs

Top-k routing

Each token goes to the top k experts (typically k=2). The rest get zero weight:

import torch

def top_k_routing(router_probs, k=2):
    top_k_vals, top_k_idx = torch.topk(router_probs, k, dim=-1)
    # Normalize top-k weights
    top_k_vals = top_k_vals / top_k_vals.sum(dim=-1, keepdim=True)
    return top_k_vals, top_k_idx
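Putting the router and top-k routing together, a full MoE layer can be sketched as below. This is an illustrative, unoptimized version: the Python loop over experts stands in for the batched token dispatch that real implementations use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sketch of a top-k MoE feedforward layer (not a production dispatch)."""
    def __init__(self, d_model, d_ff, n_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        probs = F.softmax(self.gate(x), dim=-1)                 # [B, S, n_experts]
        top_vals, top_idx = torch.topk(probs, self.k, dim=-1)   # [B, S, k]
        top_vals = top_vals / top_vals.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Naive dispatch: weight each expert's output by its (renormalized) gate value
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e)                     # [B, S, k] — where expert e was picked
            weight = (top_vals * mask).sum(dim=-1)    # [B, S] — 0 for tokens that skip e
            if mask.any():
                out = out + weight.unsqueeze(-1) * expert(x)
        return out
```

Output shape matches input shape, so the layer drops into a transformer block wherever a dense FFN would go.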

Expert specialization

Experts can develop emergent specialization:

  • Some experts activate strongly for code
  • Some for specific languages
  • Some for math or reasoning

This is observed but not guaranteed — it’s an emergent property, not explicitly designed.

Load balancing

The router can collapse into always selecting the same few experts (routing collapse), leaving most experts idle. Two mitigation strategies:

Auxiliary loss (standard)

Add a penalty term that encourages equal expert utilization:

aux_loss = n_experts * sum(f_i * P_i for i in experts)  # ~1.0 when perfectly balanced
# f_i: fraction of tokens routed to expert i; P_i: mean router probability for expert i
total_loss = main_loss + λ * aux_loss
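The Switch-Transformer-style version of this loss can be sketched as follows, with f_i and P_i as defined above (exact formulations vary slightly between papers):

```python
import torch

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Sketch of a Switch-style auxiliary load-balancing loss.

    router_probs: [n_tokens, n_experts] softmax outputs of the router
    expert_idx:   [n_tokens] top-1 expert chosen for each token
    Returns a value close to 1.0 when routing is perfectly balanced.
    """
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    # P_i: mean router probability mass assigned to expert i
    P = router_probs.mean(dim=0)
    return n_experts * torch.sum(f * P)
```

Because both f and P sum to 1, a perfectly uniform router gives f_i = P_i = 1/n_experts, and the loss evaluates to exactly 1.0.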

Auxiliary-loss-free load balancing (DeepSeek-V3)

Avoids the quality degradation caused by auxiliary loss. Uses a bias term added to router logits after each routing step — bias is adjusted dynamically to balance load without affecting the main loss.
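A minimal sketch of that bias update; `update_rate` is an assumed hyperparameter standing in for the paper's bias update speed, and the real method only uses the bias for expert *selection*, not for the combine weights:

```python
import torch

def update_routing_bias(bias, expert_load, update_rate=0.001):
    """Sketch of auxiliary-loss-free balancing via per-expert bias nudging.

    bias:        [n_experts] added to router logits when picking top-k experts
    expert_load: [n_experts] number of tokens routed to each expert last step
    Overloaded experts get their bias pushed down; underloaded ones get it
    pushed up, steering future routing toward balance with no extra loss term.
    """
    mean_load = expert_load.float().mean()
    return bias + update_rate * torch.sign(mean_load - expert_load.float())
```

Because the gradient never sees this adjustment, balancing no longer competes with the language-modeling objective — which is the quality argument made for this approach.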

DeepSeek-V3 architecture

  • 671B total, 37B active per token
  • Multi-head Latent Attention (MLA): compresses KV cache via low-rank projections
  • Auxiliary-loss-free load balancing: avoids quality degradation from auxiliary loss
  • Multi-Token Prediction (MTP): predicts multiple future tokens, densifies training signal
  • Trained on 14.8T tokens for only 2.788M H800 GPU hours (~$5.6M, assuming $2 per GPU-hour)
  • arXiv:2412.19437
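The core idea of MLA — caching one small latent vector per token and reconstructing K and V from it — can be sketched as follows. Dimensions here are illustrative, and the real MLA additionally handles rotary position embeddings separately:

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Toy sketch of MLA's KV compression via low-rank projections."""
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress hidden state
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct K
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct V

    def forward(self, h):
        # Only c needs to be cached per token: d_latent << 2 * d_model,
        # versus caching full K and V in standard multi-head attention.
        c = self.down(h)
        return self.up_k(c), self.up_v(c)
```

With these toy sizes the cache shrinks from 2 × 512 floats per token to 64, an ~16x reduction in KV memory.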

MoE implementations

Mixtral 8x7B (Mistral)

  • 8 experts per layer, top-2 routing
  • Each expert is a standard FFN (up_proj, gate_proj, down_proj)
  • Active: ~12.9B params per token, total: 46.7B
  • Routed: each token uses only 2 of 8 experts

DBRX (Databricks)

  • 16 fine-grained experts, 4 active per token; 132B total, 36B active
  • Uses more, smaller experts for better specialization

Jamba (AI21)

  • Hybrid: combines MoE (experts) with standard attention layers
  • Not every layer is MoE — allows mixing expert and dense layers

Llama 4 (Meta)

  • Interleaves dense and MoE layers; each MoE layer pairs a shared expert with routed experts
  • Scout: 16 experts (109B total / 17B active); Maverick: 128 experts (400B total / 17B active)

DeepSeek-V2/V3

  • Multi-head Latent Attention (MLA) reduces KV cache dramatically
  • Custom CUDA kernels for efficient MoE dispatch
  • DeepSeek-V2: 236B total params, 21B active per token

Tradeoffs

Pro                                   Con
More capacity per FLOP                More total parameters (memory)
Fast inference (only subset active)   Communication overhead in distributed settings
Can specialize experts                Training instability, routing collapse
Better parallel efficiency at scale   Load balancing requires careful tuning

MoE vs Dense

At equal compute (active parameters), an MoE model typically outperforms a dense one, because its total capacity is larger. The advantage is economic: MoE gives you a larger effective model at the same inference cost as a smaller dense model.

The catch: MoE models need more total memory (all experts must fit in memory even though only some are active). This makes MoE most attractive when you can fit the full model in memory but want more effective capacity.
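The memory side of this tradeoff is easy to quantify with Mixtral's published parameter counts (assuming fp16 weights at 2 bytes per parameter):

```python
# Rough memory math: Mixtral 8x7B vs. a dense model with the same per-token compute
total_params  = 46.7e9   # all experts must be resident in memory
active_params = 12.9e9   # parameters actually used per token
fp16_bytes = 2

moe_memory_gb   = total_params  * fp16_bytes / 1e9   # ~93 GB of weights
dense_memory_gb = active_params * fp16_bytes / 1e9   # ~26 GB of weights

# The MoE model needs ~3.6x the weight memory of a compute-matched dense model
ratio = moe_memory_gb / dense_memory_gb
```

This is why MoE shines when memory is plentiful relative to compute, and why techniques like expert offloading and quantization matter more for MoE deployments.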

Fine-tuning MoE

MoE models require special fine-tuning care:

  1. Lower learning rate: MoE models are more sensitive to LR
  2. Per-expert adapters: fine-tune only a small adapter per expert rather than all expert weights
  3. Prefix tuning / LoRA: Often works better than full fine-tuning for MoE
  4. Router dropout: Drop experts during fine-tuning to improve generalization

Key papers

  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)
  • Mixtral of Experts (Jiang et al., 2024)
  • DeepSeek-V3 Technical Report (DeepSeek-AI, 2024) · arXiv:2412.19437
  • DBRX: An Open, Free, and Powerful Mixture-of-Experts LLM (Team DBRX, 2024)
  • ST-MoE: Stable and Transferable Mixture of Experts (Zoph et al., 2022)