Vision Transformers
What
Apply the transformer architecture to images by splitting them into patches and treating each patch like a token. Sounds crude, works extremely well at scale. Now dominant in computer vision for large models.
ViT: how it works
“An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020):
- Split image (224x224) into non-overlapping patches (16x16) — gives 196 patches
- Flatten each patch into a vector (16 * 16 * 3 = 768 dims for RGB)
- Linear projection to embedding dimension
- Add learnable position embeddings (so the model knows where each patch is)
- Prepend a [CLS] token
- Feed through a standard transformer encoder
- MLP classification head on the [CLS] output
```
image (224x224x3)
  → 196 patches (16x16x3 each)
  → linear projection → 196 tokens of dim 768
  → + position embeddings
  → transformer encoder (12 layers)
  → [CLS] token → MLP head → class prediction
```
No convolutions, no pooling, no inductive bias about locality. The model learns spatial relationships purely from data.
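The tokenization pipeline above can be sketched in a few lines of numpy. This is a shape-level illustration only: the projection matrix, [CLS] token, and position embeddings are random stand-ins for what would be learned parameters in a real ViT.

```python
import numpy as np

# Sizes from the note: 224x224 RGB image, 16x16 non-overlapping patches.
IMG, P, C = 224, 16, 3
n = IMG // P                      # 14 patches per side -> 14 * 14 = 196 total

image = np.random.rand(IMG, IMG, C)

# Split into PxP patches and flatten each into a 16*16*3 = 768-dim vector.
patches = (image.reshape(n, P, n, P, C)
                .transpose(0, 2, 1, 3, 4)   # (14, 14, 16, 16, 3)
                .reshape(n * n, P * P * C)) # (196, 768)

# Linear projection to the embedding dimension (768 here, matching the note).
W = np.random.rand(P * P * C, 768) * 0.02
tokens = patches @ W              # (196, 768)

# Prepend a [CLS] token and add position embeddings (random stand-ins
# for what are learnable parameters in the real model).
cls_token = np.zeros((1, 768))
tokens = np.concatenate([cls_token, tokens])   # (197, 768)
tokens += np.random.rand(197, 768) * 0.02      # position embeddings
```

From here, `tokens` is just a sequence of 197 vectors, so a standard transformer encoder applies unchanged.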
The data hunger problem
ViT trained on ImageNet alone (1.3M images) is worse than a good CNN. It needs large-scale pretraining — the original paper used JFT-300M (300M images). Without strong inductive biases (locality, translation invariance), the model needs more data to learn what CNNs get for free from their architecture.
DeiT: data-efficient training
DeiT (Touvron et al., 2021) showed you can train ViT well on ImageNet alone using:
- Strong data augmentation (RandAugment, Mixup, CutMix)
- Knowledge distillation from a CNN teacher
- Careful regularization (stochastic depth, repeated augmentation)
This made ViTs practical without Google-scale datasets.
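DeiT's hard-label distillation can be sketched as follows. Everything here is a toy stand-in (random logits, 10 classes, equal loss weights): the real recipe adds a dedicated distillation token to the ViT and uses a pretrained CNN's predictions as the teacher.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the labeled class.
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Hypothetical batch: 4 examples, 10 classes, random logits.
rng = np.random.default_rng(0)
student_cls_logits = rng.normal(size=(4, 10))   # head on the [CLS] token
student_dist_logits = rng.normal(size=(4, 10))  # head on the distillation token
teacher_logits = rng.normal(size=(4, 10))       # CNN teacher's outputs
labels = rng.integers(0, 10, size=4)

# Hard-label distillation: the teacher's argmax acts as a second set of labels.
teacher_labels = teacher_logits.argmax(axis=-1)
loss = 0.5 * cross_entropy(student_cls_logits, labels) \
     + 0.5 * cross_entropy(student_dist_logits, teacher_labels)
```

The student thus learns from two signals at once: the ground-truth label through the [CLS] head and the teacher's decision through the distillation head.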
Swin Transformer
Swin (Liu et al., 2021) adds back some CNN-like inductive biases:
- Hierarchical feature maps (like CNN stages) — patch merging reduces resolution
- Shifted window attention — attend within local windows, shift between layers
- Linear complexity with image size (regular ViT is quadratic)
This makes it a general-purpose backbone for detection, segmentation, and other dense tasks — not just classification.
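The shifted-window idea can be illustrated with a small numpy sketch. Sizes are hypothetical (an 8x8 token map, 4x4 windows); the point is only to show how alternating layers re-partition the map so tokens near window borders get to interact.

```python
import numpy as np

# Hypothetical sizes: an 8x8 map of tokens, 4x4 attention windows.
H = Wd = 8
WIN = 4
x = np.arange(H * Wd).reshape(H, Wd)   # token ids, for illustration

def window_partition(x, win):
    # Split an HxW map into non-overlapping win x win windows.
    h, w = x.shape
    return (x.reshape(h // win, win, w // win, win)
             .transpose(0, 2, 1, 3)
             .reshape(-1, win, win))

# Layer k: attend within regular windows.
regular = window_partition(x, WIN)     # (4, 4, 4): four windows

# Layer k+1: cyclically shift by win//2 before partitioning, so tokens
# that sat on a window border in layer k now share a window.
shifted = window_partition(
    np.roll(x, shift=(-WIN // 2, -WIN // 2), axis=(0, 1)), WIN)
```

Because attention is computed only inside each fixed-size window, cost grows linearly with the number of windows (i.e. with image size), rather than quadratically with the total token count as in a plain ViT.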
ViT vs CNN
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation invariance) | Minimal |
| Data efficiency | Good with small datasets | Needs lots of data (or DeiT tricks) |
| Scalability | Plateaus at extreme scale | Keeps improving with more data/compute |
| Global context | Limited (large receptive fields require many layers) | Every patch attends to every other patch |
| Speed (small) | Fast | Slower (attention overhead) |
| Speed (large scale) | Diminishing returns | Efficient scaling |
| Dense tasks | Natural multi-scale features | Needs Swin-like modifications |
At small scale and limited data: CNNs still win. At large scale with plenty of data: ViTs dominate.
Key papers
- An Image is Worth 16x16 Words (Dosovitskiy et al., 2020) — original ViT
- Training Data-Efficient Image Transformers (Touvron et al., 2021) — DeiT
- Swin Transformer (Liu et al., 2021)
Links
- Transformers — the architecture ViT borrows from NLP
- Convolutional Neural Networks — what ViT is replacing
- Image Classification — the task ViT was first applied to
- Transfer Learning — pretrain large, fine-tune small
- Attention Mechanism — the core of transformer-based vision