Vision Transformers

What

Apply the transformer architecture to images by splitting them into patches and treating each patch like a token. Sounds crude, works extremely well at scale. Now dominant in computer vision for large models.

ViT: how it works

“An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020):

  1. Split the image (224x224) into non-overlapping 16x16 patches — 224/16 = 14 per side, so 14 x 14 = 196 patches
  2. Flatten each patch into a vector (16 * 16 * 3 = 768 dims for RGB)
  3. Linear projection to embedding dimension
  4. Add learnable position embeddings (so the model knows where each patch is)
  5. Prepend a [CLS] token
  6. Feed through a standard transformer encoder
  7. MLP classification head on the [CLS] output

image (224x224x3)
  → 196 patches (16x16x3 each)
  → linear projection → 196 tokens of dim 768
  → + position embeddings
  → transformer encoder (12 layers)
  → [CLS] token → MLP head → class prediction
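The patch-to-token pipeline above can be sketched in a few lines of NumPy. This is an illustrative shape check, not a real implementation: the projection matrix, [CLS] token, and position embeddings are randomly initialized here, where a trained ViT would learn them.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = patchify(img)                    # (196, 768): 14x14 patches, 16*16*3 dims each

embed_dim = 768
W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02  # learned in practice
tokens = patches @ W_proj                  # (196, 768) patch embeddings

cls = np.zeros((1, embed_dim))             # learned [CLS] token in practice
pos = rng.standard_normal((197, embed_dim)) * 0.02  # learned position embeddings
seq = np.concatenate([cls, tokens]) + pos  # (197, 768) -> transformer encoder
```

The sequence `seq` is what the 12-layer encoder consumes; the classification head reads only the first row (the [CLS] position) of the encoder output.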

No convolutions, no pooling, no inductive bias about locality. The model learns spatial relationships purely from data.

The data hunger problem

ViT trained on ImageNet alone (1.3M images) is worse than a good CNN. It needs large-scale pretraining — the original paper used JFT-300M (300M images). Without strong inductive biases (locality, translation invariance), the model needs more data to learn what CNNs get for free from their architecture.

DeiT: data-efficient training

DeiT (Touvron et al., 2021) showed you can train ViT well on ImageNet alone using:

  • Strong data augmentation (RandAugment, Mixup, CutMix)
  • Knowledge distillation from a CNN teacher
  • Careful regularization (stochastic depth, repeated augmentation)

This made ViTs practical without Google-scale datasets.
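The distillation ingredient can be sketched as a combined loss: standard cross-entropy on the true label plus a KL term pulling the student toward the CNN teacher's softened outputs. Note this is the *soft* distillation variant for illustration; DeiT's best results came from hard-label distillation through a dedicated distillation token. Temperature and weighting values below are arbitrary.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def deit_soft_loss(student_logits, teacher_logits, label, T=3.0, alpha=0.5):
    """Sketch of soft distillation: (1-alpha)*CE(label) + alpha*T^2*KL(teacher || student)."""
    ce = -np.log(softmax(student_logits)[label] + 1e-12)   # hard-label cross-entropy
    p_t = softmax(teacher_logits, T)                       # softened teacher distribution
    p_s = softmax(student_logits, T)                       # softened student distribution
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum() * T * T
    return (1 - alpha) * ce + alpha * kl

student = np.array([2.0, 0.5, -1.0])   # toy logits
teacher = np.array([1.5, 0.8, -0.5])
loss = deit_soft_loss(student, teacher, label=0)
```

Both terms are non-negative, so the loss is too; training minimizes it with the usual augmentation and regularization recipe on top.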

Swin Transformer

Swin (Liu et al., 2021) adds back some CNN-like inductive biases:

  • Hierarchical feature maps (like CNN stages) — patch merging reduces resolution
  • Shifted window attention — attend within local windows, shift between layers
  • Linear complexity with image size (regular ViT is quadratic)

This makes it a general-purpose backbone for detection, segmentation, and other dense tasks — not just classification.
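The window mechanics can be sketched with NumPy reshapes: partition the feature map into non-overlapping windows (attention runs independently inside each, giving the linear cost), then cyclically shift the map before the next layer so windows straddle the previous boundaries. This is a shape-level sketch; real Swin implementations also mask attention across the wrapped-around edges after the shift.

```python
import numpy as np

def window_partition(x, win=7):
    """Split an (H, W, C) feature map into (num_windows, win*win, C) groups.
    Self-attention is computed independently within each window."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)

def cyclic_shift(x, win=7):
    """Roll the map by win//2 so the next layer's windows cross
    the previous layer's window boundaries."""
    return np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

feat = np.arange(56 * 56 * 4, dtype=float).reshape(56, 56, 4)  # toy 56x56 stage-1 map
wins = window_partition(feat)                  # (64, 49, 4): 8x8 windows of 49 tokens
wins_shifted = window_partition(cyclic_shift(feat))
```

Because each window holds a fixed 49 tokens regardless of image size, attention cost grows with the number of windows — i.e., linearly in pixels — rather than quadratically as in plain ViT.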

ViT vs CNN

Aspect              | CNN                                            | ViT
Inductive bias      | Strong (locality, translation invariance)      | Minimal
Data efficiency     | Good with small datasets                       | Needs lots of data (or DeiT tricks)
Scalability         | Plateaus at extreme scale                      | Keeps improving with more data/compute
Global context      | Limited (large receptive fields require depth) | Every patch attends to every other patch
Speed (small scale) | Fast                                           | Slower (attention overhead)
Speed (large scale) | Diminishing returns                            | Efficient scaling
Dense tasks         | Natural multi-scale features                   | Needs Swin-like modifications

At small scale and limited data: CNNs still win. At large scale with plenty of data: ViTs dominate.

Key papers

  • An Image is Worth 16x16 Words (Dosovitskiy et al., 2020) — original ViT
  • Training Data-Efficient Image Transformers (Touvron et al., 2021) — DeiT
  • Swin Transformer (Liu et al., 2021)