Vision Transformers

What

Apply the transformer architecture to images by splitting them into patches and treating each patch like a token. Sounds crude, works extremely well at scale. Now dominant in computer vision for large models.

ViT: how it works

“An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020):

  1. Split the image (224x224) into non-overlapping 16x16 patches — 224/16 = 14 per side, so 14 x 14 = 196 patches
  2. Flatten each patch into a vector (16 * 16 * 3 = 768 dims for RGB)
  3. Linear projection to embedding dimension
  4. Add learnable position embeddings (so the model knows where each patch is)
  5. Prepend a [CLS] token
  6. Feed through a standard transformer encoder
  7. MLP classification head on the [CLS] output

image (224x224x3)
  → 196 patches (16x16x3 each)
  → linear projection → 196 tokens of dim 768
  → + position embeddings
  → transformer encoder (12 layers)
  → [CLS] token → MLP head → class prediction
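The patch-to-token pipeline above can be sketched in a few lines of NumPy. This is an illustrative shape check, not a real implementation: the projection matrix, [CLS] token, and position embeddings are randomly initialized here, where a trained ViT would learn them.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = patchify(img)                    # (196, 768): 14x14 patches, 16*16*3 dims each

embed_dim = 768
W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02  # learned in practice
tokens = patches @ W_proj                  # (196, 768) patch embeddings

cls = np.zeros((1, embed_dim))             # learned [CLS] token in practice
pos = rng.standard_normal((197, embed_dim)) * 0.02  # learned position embeddings
seq = np.concatenate([cls, tokens]) + pos  # (197, 768) -> transformer encoder
```

The sequence `seq` is what the 12-layer encoder consumes; the classification head reads only the first row (the [CLS] position) of the encoder output.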

No convolutions, no pooling, no inductive bias about locality. The model learns spatial relationships purely from data.

The data hunger problem

ViT trained on ImageNet alone (1.3M images) is worse than a good CNN. It needs large-scale pretraining — the original paper used JFT-300M (300M images). Without strong inductive biases (locality, translation invariance), the model needs more data to learn what CNNs get for free from their architecture.

DeiT: data-efficient training

DeiT (Touvron et al., 2021) showed you can train ViT well on ImageNet alone using:

  • Strong data augmentation (RandAugment, Mixup, CutMix)
  • Knowledge distillation from a CNN teacher
  • Careful regularization (stochastic depth, repeated augmentation)

This made ViTs practical without Google-scale datasets.
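The distillation ingredient can be sketched as a combined loss: standard cross-entropy on the true label plus a KL term pulling the student toward the CNN teacher's softened outputs. Note this is the *soft* distillation variant for illustration; DeiT's best results came from hard-label distillation through a dedicated distillation token. Temperature and weighting values below are arbitrary.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def deit_soft_loss(student_logits, teacher_logits, label, T=3.0, alpha=0.5):
    """Sketch of soft distillation: (1-alpha)*CE(label) + alpha*T^2*KL(teacher || student)."""
    ce = -np.log(softmax(student_logits)[label] + 1e-12)   # hard-label cross-entropy
    p_t = softmax(teacher_logits, T)                       # softened teacher distribution
    p_s = softmax(student_logits, T)                       # softened student distribution
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum() * T * T
    return (1 - alpha) * ce + alpha * kl

student = np.array([2.0, 0.5, -1.0])   # toy logits
teacher = np.array([1.5, 0.8, -0.5])
loss = deit_soft_loss(student, teacher, label=0)
```

Both terms are non-negative, so the loss is too; training minimizes it with the usual augmentation and regularization recipe on top.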

Swin Transformer

Swin (Liu et al., 2021) adds back some CNN-like inductive biases:

  • Hierarchical feature maps (like CNN stages) — patch merging reduces resolution
  • Shifted window attention — attend within local windows, shift between layers
  • Linear complexity with image size (regular ViT is quadratic)

This makes it a general-purpose backbone for detection, segmentation, and other dense tasks — not just classification.
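The window mechanics can be sketched with NumPy reshapes: partition the feature map into non-overlapping windows (attention runs independently inside each, giving the linear cost), then cyclically shift the map before the next layer so windows straddle the previous boundaries. This is a shape-level sketch; real Swin implementations also mask attention across the wrapped-around edges after the shift.

```python
import numpy as np

def window_partition(x, win=7):
    """Split an (H, W, C) feature map into (num_windows, win*win, C) groups.
    Self-attention is computed independently within each window."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)

def cyclic_shift(x, win=7):
    """Roll the map by win//2 so the next layer's windows cross
    the previous layer's window boundaries."""
    return np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

feat = np.arange(56 * 56 * 4, dtype=float).reshape(56, 56, 4)  # toy 56x56 stage-1 map
wins = window_partition(feat)                  # (64, 49, 4): 8x8 windows of 49 tokens
wins_shifted = window_partition(cyclic_shift(feat))
```

Because each window holds a fixed 49 tokens regardless of image size, attention cost grows with the number of windows — i.e., linearly in pixels — rather than quadratically as in plain ViT.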

ViT vs CNN

Aspect              | CNN                                            | ViT
Inductive bias      | Strong (locality, translation invariance)      | Minimal
Data efficiency     | Good with small datasets                       | Needs lots of data (or DeiT tricks)
Scalability         | Plateaus at extreme scale                      | Keeps improving with more data/compute
Global context      | Limited (large receptive fields require depth) | Every patch attends to every other patch
Speed (small scale) | Fast                                           | Slower (attention overhead)
Speed (large scale) | Diminishing returns                            | Efficient scaling
Dense tasks         | Natural multi-scale features                   | Needs Swin-like modifications

At small scale and limited data: CNNs still win. At large scale with plenty of data: ViTs dominate.

Key papers

  • An Image is Worth 16x16 Words (Dosovitskiy et al., 2020) — original ViT
  • Training Data-Efficient Image Transformers (Touvron et al., 2021) — DeiT
  • Swin Transformer (Liu et al., 2021)