Mixture of Experts

What

A sparse architecture where each input is processed by only a subset of the model’s parameters. A router (gating network) decides which “expert” sub-networks to activate.

input → router → selects top-k experts → each expert processes input → combine outputs

Why it matters

  • Scale model capacity (total parameters) without scaling compute proportionally
  • A 100B parameter MoE model might only use 20B parameters per token
  • Mixtral 8x7B, DeepSeek-V2/V3/R1, Llama 4, Gemini 1.5/2.x all use MoE (GPT-4 is widely reported to use MoE but OpenAI has not officially confirmed)
  • A large share of recent open-weight releases — including most new frontier-scale models — use MoE

Core concepts

Expert

A feedforward network (one of many in each layer). In transformers, experts replace the feedforward (FFN) sub-layer in each transformer block. Each expert has the same architecture but different learned weights — like having many specialized brains.

Router/Gate

A small network (usually a linear layer + softmax) that assigns input tokens to experts:

import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
    
    def forward(self, x):
        # x: [batch, seq_len, d_model]
        logits = self.gate(x)  # [batch, seq_len, n_experts]
        probs = F.softmax(logits, dim=-1)
        return probs

Top-k routing

Each token goes to the top k experts (typically k=2). The rest get zero weight:

import torch

def top_k_routing(router_probs, k=2):
    top_k_vals, top_k_idx = torch.topk(router_probs, k, dim=-1)
    # Normalize top-k weights
    top_k_vals = top_k_vals / top_k_vals.sum(dim=-1, keepdim=True)
    return top_k_vals, top_k_idx
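Putting the router and top-k routing together, a full MoE layer can be sketched as below. This is an illustrative, unoptimized version: the Python loop over experts stands in for the batched token dispatch that real implementations use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sketch of a top-k MoE feedforward layer (not a production dispatch)."""
    def __init__(self, d_model, d_ff, n_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        probs = F.softmax(self.gate(x), dim=-1)                 # [B, S, n_experts]
        top_vals, top_idx = torch.topk(probs, self.k, dim=-1)   # [B, S, k]
        top_vals = top_vals / top_vals.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Naive dispatch: weight each expert's output by its (renormalized) gate value
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e)                     # [B, S, k] — where expert e was picked
            weight = (top_vals * mask).sum(dim=-1)    # [B, S] — 0 for tokens that skip e
            if mask.any():
                out = out + weight.unsqueeze(-1) * expert(x)
        return out
```

Output shape matches input shape, so the layer drops into a transformer block wherever a dense FFN would go.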

Expert specialization

Experts can develop emergent specialization:

  • Some experts activate strongly for code
  • Some for specific languages
  • Some for math or reasoning

This is observed but not guaranteed — it’s an emergent property, not explicitly designed.

Load balancing

The router can collapse into always selecting the same few experts (routing collapse), leaving most experts idle. Two mitigation strategies:

Auxiliary loss (standard)

Add a penalty term that encourages equal expert utilization:

aux_loss = n_experts * sum(f_i * P_i for i in experts)  # ~1.0 when perfectly balanced
# f_i: fraction of tokens routed to expert i; P_i: mean router probability for expert i
total_loss = main_loss + λ * aux_loss
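The Switch-Transformer-style version of this loss can be sketched as follows, with f_i and P_i as defined above (exact formulations vary slightly between papers):

```python
import torch

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Sketch of a Switch-style auxiliary load-balancing loss.

    router_probs: [n_tokens, n_experts] softmax outputs of the router
    expert_idx:   [n_tokens] top-1 expert chosen for each token
    Returns a value close to 1.0 when routing is perfectly balanced.
    """
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    # P_i: mean router probability mass assigned to expert i
    P = router_probs.mean(dim=0)
    return n_experts * torch.sum(f * P)
```

Because both f and P sum to 1, a perfectly uniform router gives f_i = P_i = 1/n_experts, and the loss evaluates to exactly 1.0.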

Auxiliary-loss-free load balancing (DeepSeek-V3)

Avoids the quality degradation caused by auxiliary loss. Uses a bias term added to router logits after each routing step — bias is adjusted dynamically to balance load without affecting the main loss.
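A minimal sketch of that bias update; `update_rate` is an assumed hyperparameter standing in for the paper's bias update speed, and the real method only uses the bias for expert *selection*, not for the combine weights:

```python
import torch

def update_routing_bias(bias, expert_load, update_rate=0.001):
    """Sketch of auxiliary-loss-free balancing via per-expert bias nudging.

    bias:        [n_experts] added to router logits when picking top-k experts
    expert_load: [n_experts] number of tokens routed to each expert last step
    Overloaded experts get their bias pushed down; underloaded ones get it
    pushed up, steering future routing toward balance with no extra loss term.
    """
    mean_load = expert_load.float().mean()
    return bias + update_rate * torch.sign(mean_load - expert_load.float())
```

Because the gradient never sees this adjustment, balancing no longer competes with the language-modeling objective — which is the quality argument made for this approach.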

DeepSeek-V3 architecture

  • 671B total, 37B active per token
  • Multi-head Latent Attention (MLA): compresses KV cache via low-rank projections
  • Auxiliary-loss-free load balancing: avoids quality degradation from auxiliary loss
  • Multi-Token Prediction (MTP): predicts multiple future tokens, densifies training signal
  • Trained on 14.8T tokens for only 2.788M H800 GPU hours (~$5.6M, assuming $2 per GPU-hour)
  • arXiv:2412.19437
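The core idea of MLA — caching one small latent vector per token and reconstructing K and V from it — can be sketched as follows. Dimensions here are illustrative, and the real MLA additionally handles rotary position embeddings separately:

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Toy sketch of MLA's KV compression via low-rank projections."""
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress hidden state
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct K
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct V

    def forward(self, h):
        # Only c needs to be cached per token: d_latent << 2 * d_model,
        # versus caching full K and V in standard multi-head attention.
        c = self.down(h)
        return self.up_k(c), self.up_v(c)
```

With these toy sizes the cache shrinks from 2 × 512 floats per token to 64, an ~16x reduction in KV memory.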

MoE implementations

Mixtral 8x7B (Mistral)

  • 8 experts per layer, top-2 routing
  • Each expert is a standard FFN (up_proj, gate_proj, down_proj)
  • Active: ~12.9B params per token, total: 46.7B
  • Routed: each token uses only 2 of 8 experts

DBRX (Databricks)

  • 16 fine-grained experts, 4 active per token; 132B total, 36B active
  • Uses more, smaller experts for better specialization

Jamba (AI21)

  • Hybrid: combines MoE (experts) with standard attention layers
  • Not every layer is MoE — allows mixing expert and dense layers

Llama 4 (Meta)

  • Interleaves dense and MoE layers; each MoE layer pairs a shared expert with routed experts
  • Scout: 16 experts (109B total / 17B active); Maverick: 128 experts (400B total / 17B active)

DeepSeek-V2/V3

  • Multi-head Latent Attention (MLA) reduces KV cache dramatically
  • Custom CUDA kernels for efficient MoE dispatch
  • DeepSeek-V2: 236B total params, 21B active per token

Tradeoffs

Pro                                   Con
More capacity per FLOP                More total parameters (memory)
Fast inference (only subset active)   Communication overhead in distributed settings
Can specialize experts                Training instability, routing collapse
Better parallel efficiency at scale   Load balancing requires careful tuning

MoE vs Dense

At equal compute (active parameters), an MoE model typically outperforms a dense one, because its total capacity is larger. The advantage is economic: MoE gives you a larger effective model at the same inference cost as a smaller dense model.

The catch: MoE models need more total memory (all experts must fit in memory even though only some are active). This makes MoE most attractive when you can fit the full model in memory but want more effective capacity.
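The memory side of this tradeoff is easy to quantify with Mixtral's published parameter counts (assuming fp16 weights at 2 bytes per parameter):

```python
# Rough memory math: Mixtral 8x7B vs. a dense model with the same per-token compute
total_params  = 46.7e9   # all experts must be resident in memory
active_params = 12.9e9   # parameters actually used per token
fp16_bytes = 2

moe_memory_gb   = total_params  * fp16_bytes / 1e9   # ~93 GB of weights
dense_memory_gb = active_params * fp16_bytes / 1e9   # ~26 GB of weights

# The MoE model needs ~3.6x the weight memory of a compute-matched dense model
ratio = moe_memory_gb / dense_memory_gb
```

This is why MoE shines when memory is plentiful relative to compute, and why techniques like expert offloading and quantization matter more for MoE deployments.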

Fine-tuning MoE

MoE models require special fine-tuning care:

  1. Lower learning rate: MoE models are more sensitive to LR
  2. Per-expert adapters: fine-tune only a small adapter per expert rather than all expert weights
  3. Prefix tuning / LoRA: Often works better than full fine-tuning for MoE
  4. Router dropout: Drop experts during fine-tuning to improve generalization

Key papers

  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)
  • Mixtral of Experts (Jiang et al., 2024)
  • DeepSeek-V3 Technical Report (DeepSeek-AI, 2024) · arXiv:2412.19437
  • DBRX: An Open, Free, and Powerful Mixture-of-Experts LLM (Team DBRX, 2024)
  • ST-MoE: Stable and Transferable Mixture of Experts (Zoph et al., 2022)