Multimodal Models

What

Models that process and reason over multiple types of data — text, images, audio, video — within a single architecture. Rather than separate models for each modality, multimodal models share a common representation space, enabling cross-modal reasoning: “describe this image,” “find the timestamp in this video where X happens,” “transcribe this audio and summarize it.”

Why It Matters

  • Richer understanding: real-world tasks rarely involve text alone. Documents have images, meetings have audio, instructions reference physical objects
  • Zero-shot transfer: CLIP enables image classification, search, and detection without task-specific training data
  • Foundation for agents: computer-use agents (Claude) and GUI automation require vision + language reasoning
  • Closing the open-source gap: open VLMs (InternVL, Qwen2-VL) now rival commercial models on benchmarks, democratizing multimodal AI

How It Works

Vision-Language Model Architecture

The dominant pattern for combining vision and language:

Image → Vision encoder (ViT/SigLIP) → Projection layer → [visual tokens]
                                                              ↓
Text  → Tokenizer → [text tokens] ─────────────────→ [visual + text tokens]
                                                              ↓
                                                     Language model (decoder)
                                                              ↓
                                                         Response
  1. Vision encoder: splits the image into patches (e.g., 14x14 or 16x16 pixels) and processes them through a Vision Transformer (ViT). Output: a sequence of visual embeddings
  2. Projection layer: maps visual embeddings into the LLM’s embedding space (linear layer or small MLP). This aligns the visual and textual representations
  3. LLM decoder: processes the combined sequence of visual and text tokens. The language model doesn’t know the difference — it just sees a long sequence of embeddings
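The three steps above can be sketched with toy shapes. All dimensions here are assumptions for illustration (a 224x224 image, 16-pixel patches, a 768-dim vision encoder, a 4096-dim LLM), not any particular model's:

```python
import numpy as np

image_size, patch_size = 224, 16
vision_dim, llm_dim = 768, 4096

# 1. Vision encoder: one embedding per patch
num_patches = (image_size // patch_size) ** 2        # 14 x 14 = 196 visual tokens
visual_embeddings = np.random.randn(num_patches, vision_dim)

# 2. Projection layer: a single linear map from vision space to LLM space
W_proj = np.random.randn(vision_dim, llm_dim) * 0.02
visual_tokens = visual_embeddings @ W_proj           # (196, 4096)

# 3. Text tokens, embedded by the LLM's own embedding table
text_tokens = np.random.randn(12, llm_dim)           # e.g. "describe this image"

# The decoder just sees one long sequence of embeddings
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (208, 4096)
```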

For video: extract frames at intervals, encode each frame, concatenate the visual tokens. Context length becomes the bottleneck.
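A quick back-of-the-envelope check of that bottleneck. The tokens-per-frame and context-window numbers are assumptions; real values vary by model and frame resolution:

```python
tokens_per_frame = 256     # assumed; depends on model and frame resolution
context_window = 128_000   # assumed; model-dependent

def video_tokens(duration_s: float, fps: float) -> int:
    """Visual tokens for uniformly sampled frames, before any text."""
    return int(duration_s * fps) * tokens_per_frame

print(video_tokens(60, 1))                  # 15,360 tokens for one minute at 1 fps
print(context_window // tokens_per_frame)   # ~500 frames fill the whole window
```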

CLIP — The Foundation

Contrastive Language-Image Pretraining (Radford et al., 2021) is the backbone of modern VLMs.

Training: take 400M image-text pairs from the internet. Train two encoders (image and text) so that matching pairs have high cosine similarity and non-matching pairs have low similarity.

          Image encoder          Text encoder
              ↓                       ↓
        image_embedding          text_embedding
              ↓                       ↓
         cosine_similarity(image_emb, text_emb)
              ↓
    Maximize for matching pairs, minimize for non-matching

Contrastive loss (simplified): given a batch of N image-text pairs, CLIP computes an NxN similarity matrix. The diagonal (matching pairs) should be high; off-diagonal (non-matching) should be low. This is InfoNCE loss applied to both rows (image-to-text) and columns (text-to-image).
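The symmetric InfoNCE objective can be written as a short NumPy sketch. Batch size and temperature here are illustrative; real CLIP uses a learned temperature and very large batches:

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of N matching image-text pairs.

    img_emb, txt_emb: (N, d) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img_emb @ txt_emb.T / temperature   # (N, N) similarity matrix

    def cross_entropy(l):
        # the correct "class" for row i is column i (the diagonal)
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # image-to-text over rows, text-to-image over columns
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Identical embeddings per pair -> strong diagonal -> loss near zero
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
print(clip_loss(emb, emb))  # close to zero
```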

What CLIP enables: zero-shot image classification (compare image embedding to text embeddings of class names), image search (query with text, rank by similarity), and foundation encoders for all downstream VLMs.

Key Models (as of early 2026)

| Model | Modalities | Architecture | Notes |
|---|---|---|---|
| CLIP | Image + text | Dual encoder (contrastive) | Foundation, zero-shot classification |
| GPT-4o | Text + image + audio | Native joint transformer | Natively multimodal, not bolted-on |
| Claude 4.x | Text + image + computer use | Vision encoder + LLM | Strong vision, unique GUI interaction |
| Gemini 2.5 Pro | Text + image + audio + video | Joint transformer | 1M token context, Deep Think reasoning |
| LLaVA-OneVision | Image + video + text | SigLIP + Qwen2 | Open-source, excels at multi-image |
| Qwen2.5-VL | Image + video + text | ViT + Qwen2.5 | Dynamic resolution, 3B-72B |
| InternVL 2.5 | Image + text | InternViT + InternLM | Rivals GPT-4o, fully open, 2B-78B |
| Whisper | Audio → text | Encoder-decoder transformer | Robust speech recognition, multilingual |

Native vs Adapter Multimodal

Adapter approach (LLaVA, InternVL): train a vision encoder and LLM separately, then connect them with a projection layer. Cheaper to train, can leverage existing LLMs. Most open-source VLMs use this.

Native multimodal (GPT-4o, Gemini): train a single model on all modalities from scratch. Better cross-modal reasoning, more expensive to train. Only feasible at frontier lab scale.

Code Example

CLIP: Zero-Shot Image Classification

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
 
# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 
# Load an image (or use a URL)
image = Image.open("photo.jpg")
 
# Define candidate labels
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car",
          "a photo of a building", "a photo of food"]
 
# Compute similarity
inputs = processor(text=labels, images=image, return_tensors="pt",
                   padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits_per_image  # shape: [1, num_labels]
    probs = logits.softmax(dim=1)
 
# Results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.3f}")
# No training on these specific classes -- pure zero-shot!
CLIP: Text-to-Image Search

import torch
import numpy as np
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
 
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 
def embed_image(image_path):
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)[0].numpy()
 
def embed_text(text):
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)[0].numpy()
 
# Build an image index
image_paths = ["cat.jpg", "dog.jpg", "car.jpg", "building.jpg"]
image_embeddings = np.stack([embed_image(p) for p in image_paths])
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)
 
# Search with text
query = "a fluffy animal"
query_emb = embed_text(query)
query_emb /= np.linalg.norm(query_emb)
 
similarities = image_embeddings @ query_emb
ranked = np.argsort(-similarities)
for i in ranked:
    print(f"{image_paths[i]}: similarity = {similarities[i]:.3f}")

Key Tradeoffs

| Decision | Option A | Option B |
|---|---|---|
| Architecture | Adapter VLM (cheap, modular) | Native multimodal (better fusion, expensive) |
| Vision encoder | ViT-B (fast, smaller) | ViT-L/SigLIP (more accurate, more tokens) |
| Image resolution | Fixed patches (simple) | Dynamic resolution (handles varied sizes) |
| Video handling | Uniform frame sampling (simple) | Keyframe extraction (efficient, lossy) |

Common Pitfalls

  • Hallucinated visual details: VLMs confidently describe objects that aren’t in the image. Always validate critical visual claims
  • Resolution blindness: small text, fine details, and distant objects in images are often missed. High-res input or tiling helps
  • Token budget for video: a 60-second video at 1 fps with 256 tokens per frame = 15,360 visual tokens. Context length fills fast. Sample frames strategically
  • CLIP’s limitations: CLIP understands concepts but struggles with spatial relationships (“the red ball is to the left of the blue ball”), counting, and negation
  • Training data bias: CLIP was trained on internet image-text pairs, inheriting their biases and Western-centric distribution

Exercises

  1. Use CLIP to build a simple image search engine: embed 100 images from a folder, store embeddings in a numpy array, search by text query. Measure retrieval accuracy on known queries
  2. Compare CLIP ViT-B/32 vs ViT-L/14 on zero-shot classification of CIFAR-10. How much does the larger model improve?
  3. Use a VLM API (Claude or GPT-4o) to describe 10 images and evaluate the descriptions for hallucinations. What types of errors are most common?
  4. Implement a simple adapter-style VLM: freeze a pre-trained CLIP vision encoder, add a linear projection, and train it to generate captions using a small GPT-2

Self-Test Questions

  1. Explain the CLIP contrastive training objective. Why does it enable zero-shot classification?
  2. What is the difference between adapter-style VLMs (LLaVA) and native multimodal models (GPT-4o)?
  3. How are images converted to tokens in a Vision Transformer? What determines the number of visual tokens?
  4. Why is video understanding harder than image understanding for multimodal models?
  5. What are CLIP’s known failure modes? Give two examples where it struggles