Multimodal Models

What

Models that process and reason over multiple types of data — text, images, audio, video — within a single architecture. Rather than separate models for each modality, multimodal models share a common representation space, enabling cross-modal reasoning: “describe this image,” “find the timestamp in this video where X happens,” “transcribe this audio and summarize it.”

Why It Matters

  • Richer understanding: real-world tasks rarely involve text alone. Documents have images, meetings have audio, instructions reference physical objects
  • Zero-shot transfer: CLIP enables image classification, search, and detection without task-specific training data
  • Foundation for agents: computer-use agents (Claude) and GUI automation require vision + language reasoning
  • Closing the open-source gap: open VLMs (InternVL, Qwen2-VL) now rival commercial models on benchmarks, democratizing multimodal AI

How It Works

Vision-Language Model Architecture

The dominant pattern for combining vision and language:

Image → Vision encoder (ViT/SigLIP) → Projection layer → [visual tokens]
                                                              ↓
Text  → Tokenizer → [text tokens] ─────────────────→ [visual + text tokens]
                                                              ↓
                                                     Language model (decoder)
                                                              ↓
                                                         Response
  1. Vision encoder: splits the image into patches (e.g., 14x14 or 16x16 pixels) and processes them through a Vision Transformer (ViT). Output: a sequence of visual embeddings
  2. Projection layer: maps visual embeddings into the LLM’s embedding space (linear layer or small MLP). This aligns the visual and textual representations
  3. LLM decoder: processes the combined sequence of visual and text tokens. The language model doesn’t know the difference — it just sees a long sequence of embeddings
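The three steps above can be sketched with toy shapes. All dimensions here are assumptions for illustration (a 224x224 image, 16-pixel patches, a 768-dim vision encoder, a 4096-dim LLM), not any particular model's:

```python
import numpy as np

image_size, patch_size = 224, 16
vision_dim, llm_dim = 768, 4096

# 1. Vision encoder: one embedding per patch
num_patches = (image_size // patch_size) ** 2        # 14 x 14 = 196 visual tokens
visual_embeddings = np.random.randn(num_patches, vision_dim)

# 2. Projection layer: a single linear map from vision space to LLM space
W_proj = np.random.randn(vision_dim, llm_dim) * 0.02
visual_tokens = visual_embeddings @ W_proj           # (196, 4096)

# 3. Text tokens, embedded by the LLM's own embedding table
text_tokens = np.random.randn(12, llm_dim)           # e.g. "describe this image"

# The decoder just sees one long sequence of embeddings
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (208, 4096)
```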

For video: extract frames at intervals, encode each frame, concatenate the visual tokens. Context length becomes the bottleneck.
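A quick back-of-the-envelope check of that bottleneck. The tokens-per-frame and context-window numbers are assumptions; real values vary by model and frame resolution:

```python
tokens_per_frame = 256     # assumed; depends on model and frame resolution
context_window = 128_000   # assumed; model-dependent

def video_tokens(duration_s: float, fps: float) -> int:
    """Visual tokens for uniformly sampled frames, before any text."""
    return int(duration_s * fps) * tokens_per_frame

print(video_tokens(60, 1))                  # 15,360 tokens for one minute at 1 fps
print(context_window // tokens_per_frame)   # ~500 frames fill the whole window
```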

CLIP — The Foundation

Contrastive Language-Image Pretraining (Radford et al., 2021) is the backbone of modern VLMs.

Training: take 400M image-text pairs from the internet. Train two encoders (image and text) so that matching pairs have high cosine similarity and non-matching pairs have low similarity.

          Image encoder          Text encoder
              ↓                       ↓
        image_embedding          text_embedding
              ↓                       ↓
         cosine_similarity(image_emb, text_emb)
              ↓
    Maximize for matching pairs, minimize for non-matching

Contrastive loss (simplified): given a batch of N image-text pairs, CLIP computes an NxN similarity matrix. The diagonal (matching pairs) should be high; off-diagonal (non-matching) should be low. This is InfoNCE loss applied to both rows (image-to-text) and columns (text-to-image).
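The symmetric InfoNCE objective can be written as a short NumPy sketch. Batch size and temperature here are illustrative; real CLIP uses a learned temperature and very large batches:

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of N matching image-text pairs.

    img_emb, txt_emb: (N, d) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img_emb @ txt_emb.T / temperature   # (N, N) similarity matrix

    def cross_entropy(l):
        # the correct "class" for row i is column i (the diagonal)
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # image-to-text over rows, text-to-image over columns
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Identical embeddings per pair -> strong diagonal -> loss near zero
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
print(clip_loss(emb, emb))  # close to zero
```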

What CLIP enables: zero-shot image classification (compare image embedding to text embeddings of class names), image search (query with text, rank by similarity), and foundation encoders for all downstream VLMs.

Key Models (as of early 2026)

| Model | Modalities | Architecture | Notes |
|---|---|---|---|
| CLIP | Image + text | Dual encoder (contrastive) | Foundation, zero-shot classification |
| GPT-4o | Text + image + audio | Native joint transformer | Natively multimodal, not bolted-on |
| Claude 4.x | Text + image + computer use | Vision encoder + LLM | Strong vision, unique GUI interaction |
| Gemini 2.5 Pro | Text + image + audio + video | Joint transformer | 1M token context, Deep Think reasoning |
| LLaVA-OneVision | Image + video + text | SigLIP + Qwen2 | Open-source, excels at multi-image |
| Qwen2.5-VL | Image + video + text | ViT + Qwen2.5 | Dynamic resolution, 3B-72B |
| InternVL 2.5 | Image + text | InternViT + InternLM | Rivals GPT-4o, fully open, 2B-78B |
| Whisper | Audio → text | Encoder-decoder transformer | Robust speech recognition, multilingual |

Native vs Adapter Multimodal

Adapter approach (LLaVA, InternVL): train a vision encoder and LLM separately, then connect them with a projection layer. Cheaper to train, can leverage existing LLMs. Most open-source VLMs use this.

Native multimodal (GPT-4o, Gemini): train a single model on all modalities from scratch. Better cross-modal reasoning, more expensive to train. Only feasible at frontier lab scale.

Code Example

CLIP: Zero-Shot Image Classification

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
 
# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 
# Load an image (or use a URL)
image = Image.open("photo.jpg")
 
# Define candidate labels
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car",
          "a photo of a building", "a photo of food"]
 
# Compute similarity
inputs = processor(text=labels, images=image, return_tensors="pt",
                   padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits_per_image  # shape: [1, num_labels]
    probs = logits.softmax(dim=1)
 
# Results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.3f}")
# No training on these specific classes -- pure zero-shot!
CLIP: Text-to-Image Search

import torch
import numpy as np
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
 
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 
def embed_image(image_path):
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)[0].numpy()
 
def embed_text(text):
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)[0].numpy()
 
# Build an image index
image_paths = ["cat.jpg", "dog.jpg", "car.jpg", "building.jpg"]
image_embeddings = np.stack([embed_image(p) for p in image_paths])
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)
 
# Search with text
query = "a fluffy animal"
query_emb = embed_text(query)
query_emb /= np.linalg.norm(query_emb)
 
similarities = image_embeddings @ query_emb
ranked = np.argsort(-similarities)
for i in ranked:
    print(f"{image_paths[i]}: similarity = {similarities[i]:.3f}")

Key Tradeoffs

| Decision | Option A | Option B |
|---|---|---|
| Architecture | Adapter VLM (cheap, modular) | Native multimodal (better fusion, expensive) |
| Vision encoder | ViT-B (fast, smaller) | ViT-L/SigLIP (more accurate, more tokens) |
| Image resolution | Fixed patches (simple) | Dynamic resolution (handles varied sizes) |
| Video handling | Uniform frame sampling (simple) | Keyframe extraction (efficient, lossy) |

Common Pitfalls

  • Hallucinated visual details: VLMs confidently describe objects that aren’t in the image. Always validate critical visual claims
  • Resolution blindness: small text, fine details, and distant objects in images are often missed. High-res input or tiling helps
  • Token budget for video: a 60-second video at 1 fps with 256 tokens per frame = 15,360 visual tokens. Context length fills fast. Sample frames strategically
  • CLIP’s limitations: CLIP understands concepts but struggles with spatial relationships (“the red ball is to the left of the blue ball”), counting, and negation
  • Training data bias: CLIP was trained on internet image-text pairs, inheriting their biases and Western-centric distribution

Exercises

  1. Use CLIP to build a simple image search engine: embed 100 images from a folder, store embeddings in a numpy array, search by text query. Measure retrieval accuracy on known queries
  2. Compare CLIP ViT-B/32 vs ViT-L/14 on zero-shot classification of CIFAR-10. How much does the larger model improve?
  3. Use a VLM API (Claude or GPT-4o) to describe 10 images and evaluate the descriptions for hallucinations. What types of errors are most common?
  4. Implement a simple adapter-style VLM: freeze a pre-trained CLIP vision encoder, add a linear projection, and train it to generate captions using a small GPT-2

Self-Test Questions

  1. Explain the CLIP contrastive training objective. Why does it enable zero-shot classification?
  2. What is the difference between adapter-style VLMs (LLaVA) and native multimodal models (GPT-4o)?
  3. How are images converted to tokens in a Vision Transformer? What determines the number of visual tokens?
  4. Why is video understanding harder than image understanding for multimodal models?
  5. What are CLIP’s known failure modes? Give two examples where it struggles