From AlexNet to World Models: The Evolution of Multi-Modal Neural Networks

This is a long post. It assumes you know roughly what a neural network is — a stack of layers, each multiplying its input by a matrix of learned weights and applying a nonlinearity — and that they are trained by gradient descent on a loss function. It does not assume you know what a transformer, an LLM, a VLM, or a “world model” is. We will build all of those up from scratch.

The throughline is one question that the field has answered differently in each era: how should a machine represent the world so that the representation is useful? AlexNet answered “a hierarchy of learned image features.” CLIP answered “a shared space where pictures and the words that describe them land in the same place.” The vision-language models behind GPT-4V answered “feed image features into a language model and let it talk.” And world models — JEPA, Dreamer, LeWorldModel — answer “a representation good enough to predict what happens next.” That last answer is where a lot of the field now thinks general intelligence has to come from, and it is where we are headed.

Table of contents

Part 0 — What “multi-modal” even means
Part 1 — The vision foundation: AlexNet and the CNN era
Part 2 — Teaching vision to talk: the first multi-modal models
- Image captioning: CNN sees, RNN speaks
- Visual Question Answering
Part 3 — The transformer: one architecture to rule them all
Part 4 — CLIP: aligning vision and language by agreement
Part 5 — Learning to see without labels: the self-supervised detour
- Family 1: contrastive and self-distillation
- Family 2: masked reconstruction (the generative approach)
Part 6 — Vision-Language Models (VLMs): giving an LLM eyes
Part 7 — World models: from perceiving to predicting
Part 8 — JEPA: predicting in representation space
Part 9 — Stepping back: the throughline and the open questions
Reference timeline

Part 0 — What “multi-modal” even means

A modality is a kind of sensory or symbolic input: pixels (images, video), text, audio, depth maps, robot joint angles, and so on. A multi-modal model is one that consumes more than one of these, or that learns a representation in one modality that is shaped by another.

Why is this hard? Because the modalities are structured completely differently. Text is a one-dimensional sequence of discrete symbols drawn from a finite vocabulary. An image is a two-dimensional grid of continuous-valued pixels with strong local correlations (neighboring pixels are similar) and no obvious “vocabulary.” Video adds a third, temporal axis. The entire history below is, in large part, the story of inventing representations general enough that these different structures can be processed by the same mechanism — and the field’s eventual convergence on one mechanism, the transformer, for all of them.

Let me flag the four big conceptual leaps before we start, so you have signposts:

Learned features beat hand-designed ones (AlexNet, 2012). Before this, computer vision used hand-engineered feature detectors. AlexNet showed that letting the network learn its features from raw pixels, given enough data and compute, wins overwhelmingly.
One architecture for everything (Transformer 2017, ViT 2020). The transformer, invented for translation, turned out to process images, audio, and actions just as well once you chop them into tokens.
Align modalities by training them to agree (CLIP, 2021). Instead of labeling images with a fixed class list, train on raw image–caption pairs from the web and pull matching pairs together in a shared space. This unlocked open-vocabulary vision.
Predict, don’t reconstruct (JEPA / world models, 2022–2026). To understand the world, a model should predict the consequences of what it sees — but in an abstract representation space, not pixel-by-pixel. This is the current research frontier.

Part 1 — The vision foundation: AlexNet and the CNN era

Why 2012 was a turning point

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) asks a model to classify an image into one of 1,000 categories. In 2011, the best systems — built on hand-crafted features like SIFT and Fisher vectors fed into a support vector machine — had a top-5 error rate around 26%. In 2012, a deep convolutional neural network called AlexNet (Krizhevsky, Sutskever & Hinton, NeurIPS 2012) scored 15.3% top-5 error, demolishing the runner-up at 26.2%. That ten-point gap is the moment deep learning took over computer vision.

Reader question: what is “top-5 error”? The model outputs a ranked list of guesses. Top-5 error is the fraction of test images where the correct label is not among the model’s five highest-scoring guesses. Top-1 error is the stricter version: correct label must be the single highest guess.
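As a minimal sketch of that definition (the score matrix and labels below are made-up toy values, not ImageNet results), top-k error can be computed like this:

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy example: 3 images, 6 classes.
scores = np.array([
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.12, 0.08, 0.10, 0.20, 0.20],
    [0.06, 0.04, 0.10, 0.20, 0.50, 0.10],
])
labels = np.array([1, 2, 0])

top1 = top_k_error(scores, labels, k=1)  # only the first image is right
top5 = top_k_error(scores, labels, k=5)  # the second image still misses
```

With six classes, top-5 is very forgiving: only the image whose true label is ranked last counts as an error.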

What a convolution actually computes

A convolutional layer is the key building block. Instead of connecting every input pixel to every neuron (which for a $224\times224\times3$ image would be ~150,000 inputs per neuron, with separate weights for each spatial position), a convolution slides a small bank of learned filters across the image. Each filter is a tiny weight tensor — say $11\times11\times3$ — and at every spatial location it computes a dot product with the patch underneath it:

$$y_{i,j,k} = \sigma\!\left( b_k + \sum_{u,v,c} W_{u,v,c,k}\, x_{\,i+u,\; j+v,\; c} \right).$$

Here $x$ is the input image (or the previous layer’s output), $W_{:,:,:,k}$ is the $k$-th filter, $b_k$ its bias, and $\sigma$ a nonlinearity. The same $W_{:,:,:,k}$ is reused at every position $(i,j)$. This is weight sharing, and it encodes a powerful prior: a feature worth detecting in one part of the image (an edge, a texture, an eye) is worth detecting everywhere. Weight sharing is why a CNN needs far fewer parameters than a fully-connected net and why it generalizes to translated inputs.
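The formula can be written out as a deliberately slow NumPy loop. This is a sketch under assumptions: stride 1, no padding, and ReLU standing in for $\sigma$; the function name and the toy shapes are illustrative only.

```python
import numpy as np

def conv2d_naive(x, W, b):
    """x: (H, W_in, C) input; W: (kh, kw, C, K) filter bank; b: (K,) biases.

    Returns activations of shape (H - kh + 1, W_in - kw + 1, K).
    """
    H, W_in, C = x.shape
    kh, kw, _, K = W.shape
    out = np.zeros((H - kh + 1, W_in - kw + 1, K))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]  # the (kh, kw, C) window under the filters
            # Dot product of the patch with every filter at once, plus the bias.
            out[i, j, :] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2])) + b
    return np.maximum(out, 0.0)  # sigma assumed to be ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
W = rng.normal(size=(3, 3, 3, 4))
b = rng.normal(size=4)
y = conv2d_naive(x, W, b)  # shape (6, 6, 4)
```

The same `W` is applied at every `(i, j)`, which is the weight sharing described above: the layer has `kh * kw * C * K + K` parameters no matter how large the image is.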

Stack many convolutional layers and a hierarchy emerges: early layers learn edges and color blobs, middle layers learn textures and parts (a wheel, an eye), late layers learn whole objects. This hierarchy is learned, not designed — that is the entire point.

Inside AlexNet

AlexNet was 8 learned layers: 5 convolutional, 3 fully-connected, ending in a 1000-way softmax. ~60 million parameters and ~650,000 neurons. Three ingredients made it work where earlier deep nets had stalled:

ReLU activations. Instead of the saturating $\tanh$ or sigmoid, AlexNet used the rectified linear unit $\sigma(x) = \max(0, x)$. Its gradient is 1 for positive inputs, so it does not “saturate” and kill the gradient signal, letting deep networks train several times faster. ReLU and close variants such as GELU are now the default nonlinearities across deep learning.
Dropout. During training, each neuron in the fully-connected layers is zeroed out with probability 0.5 on each forward pass. This prevents neurons from co-adapting (relying on specific other neurons being present) and acts as a strong regularizer — effectively training an ensemble of sub-networks that share weights. Without it, AlexNet badly overfit.
GPUs. The whole thing was trained on two NVIDIA GTX 580 GPUs (3 GB each) for about a week. The model was literally split across the two cards because it did not fit on one. This was the moment the field learned that compute — not just cleverness — was a primary input to progress.

A fourth, quieter ingredient: data augmentation (random crops, horizontal flips, color jitter) and local response normalization, both used to fight overfitting on “only” 1.2 million training images.
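The ReLU and dropout ingredients can be sketched in a few lines. One assumption to flag: this uses the now-common "inverted" dropout, which rescales the surviving activations during training so that nothing changes at test time; the original AlexNet instead scaled outputs at test time.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise, so it never shrinks toward 0
    # the way tanh or sigmoid gradients do for large |x|.
    return (x > 0).astype(float)

def dropout(x, p_drop=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)

acts = np.ones(10_000)
dropped = dropout(acts, p_drop=0.5, rng=np.random.default_rng(0))
```

Each forward pass samples a different mask, which is what makes training resemble an ensemble of weight-sharing sub-networks.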

Reader question: why did this not happen in the 1990s? The ideas (convolution, backprop) existed — LeCun’s LeNet-5 recognized digits in 1998. What was missing was (a) a large labeled dataset (ImageNet, 1.2M images, released 2009), and (b) enough cheap parallel compute (GPUs). AlexNet was less a new idea than the first time the old idea was given what it needed.

The CNN dynasty: VGG, Inception, ResNet

AlexNet kicked off four years of rapid architectural progress, each entry pushing ImageNet error down:

VGGNet (Simonyan & Zisserman, 2014) asked: what if we just go deeper with a dead-simple recipe? Stack only $3\times3$ convolutions (the smallest filter that has a notion of left/right, up/down) 16–19 layers deep. Two stacked $3\times3$ convolutions have the same receptive field as one $5\times5$ but with fewer parameters and an extra nonlinearity. VGG-16 hit ~7.3% top-5 error and became the default backbone for transfer learning for years, despite being huge (138M parameters).
GoogLeNet / Inception (Szegedy et al., 2014) went the other way: be deep but efficient. Its “Inception module” runs $1\times1$, $3\times3$, and $5\times5$ convolutions in parallel and concatenates them, using $1\times1$ convolutions as cheap “bottlenecks” to reduce channel count before the expensive big filters. 22 layers, ~6.7% top-5 error, with about 12× fewer parameters than AlexNet (and far fewer still than VGG).
Batch Normalization (Ioffe & Szegedy, 2015) was a training trick, not an architecture, but it mattered enormously. It normalizes each layer’s pre-activations to zero mean and unit variance within each mini-batch, then rescales with learned parameters $\gamma, \beta$: $\hat{x} = \gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$. This stabilizes and dramatically speeds up training of deep nets, and it became near-universal.
ResNet (He et al., 2015) cracked the depth barrier. Naively stacking more layers had hurt accuracy: a 56-layer plain net did worse than a 20-layer one, even on the training set, so the cause was not overfitting but an optimization failure the authors called the degradation problem. ResNet’s fix is the residual connection: instead of asking a block to compute a target function $H(x)$, ask it to compute the residual $F(x) = H(x) - x$ and add the input back: $y = F(x) + x$. If the optimal thing is to do nothing, the block just has to drive $F$ toward zero, which is easy. Gradients flow directly through the identity shortcut. This let them train 152 layers and reach 3.57% top-5 error, below the ~5% human benchmark on ImageNet. The residual/skip connection is arguably the single most important architectural idea of the decade; every transformer uses it.
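Three of the claims in this list are easy to check numerically. The sketch below is illustrative only: the channel count `C = 64` is a hypothetical choice, and the residual block uses dense layers instead of convolutions to keep it short.

```python
import numpy as np

def conv_weights(k, c_in, c_out):
    """Weight count of one k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch norm over a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def residual_block(x, W1, W2):
    """y = F(x) + x, with F a two-layer ReLU network (a simplification)."""
    f = np.maximum(x @ W1, 0.0) @ W2
    return f + x

C = 64
two_3x3 = 2 * conv_weights(3, C, C)  # 18 * C^2 weights, 5 x 5 receptive field
one_5x5 = conv_weights(5, C, C)      # 25 * C^2 weights, same receptive field

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 8))
x_hat = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))

# If the second weight matrix is zero, F(x) = 0 and the block is exactly the identity.
y = residual_block(x, rng.normal(size=(8, 8)), np.zeros((8, 8)))
```

The last line is the point of the residual trick: "do nothing" is reachable by pushing weights toward zero rather than by learning an identity map.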

Reader question: why do I care about ResNet in a post about multi-modal models? Two reasons. First, every later vision-language model needs a vision encoder, and for years that encoder was a ResNet (CLIP shipped both ResNet and ViT variants). Second, the residual connection is load-bearing inside the transformer, which is load-bearing inside every modern model in this post. The CNN era was not a detour; it built the parts.

The crucial habit this era taught: transfer learning

Here is the idea that makes everything downstream possible. A CNN trained on ImageNet learns, in its early and middle layers, general-purpose visual features — edges, textures, shapes — that are useful far beyond the 1000 ImageNet classes. So you can take a pretrained CNN, lop off its final classification layer, and reuse the rest as a feature extractor for a new task with far less data. Pretrain on a big generic dataset, then adapt to a specific task became the dominant paradigm. Every model from here on is some version of this idea.

Part 2 — Teaching vision to talk: the first multi-modal models

By 2015 vision was solved well enough that researchers asked: can a model describe an image in words, or answer questions about it? This required, for the first time, joining a vision model to a language model.

Image captioning: CNN sees, RNN speaks

The template, set by Show and Tell (Vinyals et al., 2015), was an encoder–decoder. A pretrained CNN encodes the image into a fixed-length feature vector; that vector initializes a recurrent neural network (RNN) — specifically an LSTM — which generates the caption one word at a time, each step conditioning on the previous word and its hidden state. Training maximizes the likelihood of the human-written caption:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{1:t-1},\, \text{CNN}(I)\right).$$
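In code, this loss is a sum of per-step cross-entropies. The sketch below assumes the decoder has already produced one row of unnormalized scores per caption position (each row conditioned on the image and the previous words); `step_logits` and `caption_ids` are hypothetical names.

```python
import numpy as np

def caption_nll(step_logits, caption_ids):
    """Negative log-likelihood of a caption.

    step_logits: (T, V) scores over the vocabulary at each of T steps.
    caption_ids: length-T sequence of the human-written word indices.
    """
    step_logits = np.asarray(step_logits, dtype=float)
    caption_ids = np.asarray(caption_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = step_logits - step_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(caption_ids)), caption_ids].sum()

# A model that is uniform over a 4-word vocabulary pays log(4) per word.
loss_uniform = caption_nll(np.zeros((3, 4)), [0, 1, 2])
```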

Show, Attend and Tell (Xu et al., 2015) added the idea that would soon eat the entire field: attention. Rather than squashing the image into one vector, keep the CNN’s spatial grid of feature vectors and, at each word, compute a weighted average over grid locations, letting the model “look at” the relevant part of the image as it generates each word. When it says “frisbee,” it attends to the frisbee. This is one of the earliest uses of soft, learned attention in a multi-modal setting.
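A minimal sketch of that weighted average follows. The scoring function here is a small additive (tanh) network, a common choice that is assumed for illustration rather than taken from the paper; all weight names and sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(grid_feats, hidden, W_f, W_h, v):
    """Soft attention over L grid locations.

    grid_feats: (L, D) CNN features, one row per spatial location.
    hidden:     (H,) decoder state before emitting the next word.
    Returns the context vector (D,) and the attention weights (L,).
    """
    scores = np.tanh(grid_feats @ W_f + hidden @ W_h) @ v  # one score per location
    alpha = softmax(scores)                                # weights sum to 1
    context = alpha @ grid_feats                           # weighted average of features
    return context, alpha

rng = np.random.default_rng(0)
L, D, H, A = 196, 32, 16, 8  # e.g. a 14 x 14 grid; sizes are arbitrary here
context, alpha = attend(
    rng.normal(size=(L, D)), rng.normal(size=H),
    rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A),
)
```

Reshaping `alpha` back to the grid gives the heat maps that show where the model "looks" for each word.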

Visual Question Answering

VQA (Antol et al., 2015) framed a harder task: given an image and a free-form question (“What color is the bus?”), produce an answer. The standard architecture encoded the image with a CNN, the question with an LSTM, fused the two (often just by element-wise multiplication of their vectors), and classified into a fixed answer vocabulary. VQA mattered because it forced genuine joint reasoning over both modalities and gave the field a benchmark (VQAv2) that drove progress for years. It also exposed how models cheat — answering “tennis” to “What sport?” from language priors alone, ignoring the image — which motivated more careful multi-modal training.

These models worked, but they were brittle and task-specific. The next two inventions — the transformer and contrastive pretraining — changed the substrate entirely.

Part 3 — The transformer: one architecture to rule them all

Everything modern runs on the transformer, so we need its mechanics. It comes from “Attention Is All You Need” (Vaswani et al., 2017), a paper about machine translation that accidentally provided the universal backbone for AI.

Self-attention, concretely

The input is a sequence of vectors (tokens) $x_1, \dots, x_n$. The transformer’s core operation, self-attention, lets every token gather information from every other token. From each token we compute three projections (a query, a key, and a value) via learned matrices:

$$\mathbf{q}_i = W_Q x_i, \qquad \mathbf{k}_i = W_K x_i, \qquad \mathbf{v}_i = W_V x_i.$$

Token $i$ then decides how much to attend to token $j$ by the similarity of $i$’s query and $j$’s key, normalized by softmax over all $j$, and reads out a weighted sum of values. Stacked into matrices $Q, K, V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$

The $\sqrt{d_k}$ divisor keeps the dot products from growing large and saturating the softmax. This is run several times in parallel with different projection matrices — multi-head attention — so different heads can specialize (one tracks syntax, another long-range references). Around each attention block and each position-wise feed-forward block sit a residual connection (yes, the ResNet idea) and layer normalization.
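Single-head self-attention fits in a dozen lines of NumPy. This sketch stores tokens as rows, so the projections read `X @ W` rather than $W x_i$; it omits multi-head splitting, the residual connection, and layer normalization, and the sizes are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d_model) token matrix. Returns outputs (n, d_v) and weights (n, n)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    # Row i holds how strongly token i attends to every token j.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
out, weights = self_attention(
    X,
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_k)),
)
```

The `(n, n)` weight matrix is where the quadratic cost in sequence length comes from.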

Reader question: why was this such a big deal versus RNNs? RNNs process a sequence step by step, so token 100 can only see token 1 through a long chain of hidden states: information decays, and the sequential dependency means you cannot parallelize across positions during training. Self-attention connects any two positions in a single step ($O(1)$ path length) and processes all positions in parallel. That parallelism is exactly what lets you train on internet-scale data on thousands of GPUs. The cost is that attention is $O(n^2)$ (quadratic) in sequence length, which is why “efficient attention” is still an active research area.

Because self-attention is permutation-equivariant (shuffle the input tokens and the outputs are shuffled the same way, so it has no inherent notion of order), transformers add positional encodings to the input so the model knows token order. The original used fixed sinusoids; modern models use learned or rotary (RoPE) encodings.
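Both halves of that paragraph can be checked directly. The sketch below builds the fixed sinusoidal encoding from the original transformer paper (assuming an even `d_model`) and uses a stripped-down attention with no learned projections, purely to show that shuffling the input only shuffles the output.

```python
import numpy as np

def sinusoidal_positions(n, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def attention_no_proj(X):
    """Self-attention with identity projections (illustration only)."""
    s = X @ X.T / np.sqrt(X.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
perm = np.array([2, 0, 5, 1, 4, 3])

# Without positions: shuffling the tokens just shuffles the outputs.
order_blind = np.allclose(attention_no_proj(X[perm]), attention_no_proj(X)[perm])

# With positions added after the shuffle, the same tokens now produce different outputs.
pe = sinusoidal_positions(6, 8)
order_aware = not np.allclose(
    attention_no_proj(X[perm] + pe), attention_no_proj(X + pe)[perm]
)
```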

Why this matters for multi-modality

The transformer does not care what its tokens mean. Words, image patches, audio frames, robot actions — anything you can turn into a sequence of vectors, it will happily attend over. This modality-agnosticism is precisely why the field converged on it. The remaining question was: how do you turn an image into tokens?

Vision Transformer (ViT): images as sequences of patches

ViT (Dosovitskiy et al., 2020) gave the brutally simple answer. Cut the image into a grid of fixed-size patches (e.g. $16\times16$ pixels), flatten each patch, linearly project it into a vector (“patch embedding”), add a positional encoding, and feed the resulting sequence of patch-tokens into a standard transformer. A $224\times224$ image becomes $14\times14 = 196$ tokens. Prepend a special [CLS] token whose final representation is used for classification.
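The patch-to-token step is mostly a reshape. In this sketch the embedding width `d_model = 64` and the random matrices standing in for learned parameters are assumptions chosen to keep the example small.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch x patch tiles."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of patch"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
tokens = patchify(image)  # (196, 768): 14 x 14 patches of 16 * 16 * 3 values

d_model = 64                                                    # hypothetical width
E = rng.normal(scale=0.02, size=(tokens.shape[1], d_model))     # patch embedding
pos = rng.normal(scale=0.02, size=(tokens.shape[0] + 1, d_model))  # position table
cls = np.zeros((1, d_model))                                    # [CLS] token

# Prepend [CLS], then add positions: this is what the transformer receives.
sequence = np.concatenate([cls, tokens @ E], axis=0) + pos      # (197, 64)
```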

ViT’s headline finding: with enough data, a transformer with no convolutions and almost no visual prior matches or beats the best CNNs. On small datasets CNNs still won (their built-in priors about locality help when data is scarce), but pretrained on ~300M images (Google’s JFT-300M), ViT-Huge surpassed ResNets while using less compute to train. The lesson echoed AlexNet: given enough data, learned-and-general beats designed-and-specific.

This was the unlock for multi-modality. Once images and text are both just “sequences of tokens fed to transformers,” joining them becomes natural.

Part 4 — CLIP: aligning vision and language by agreement

Now the inflection point that created the modern multi-modal era.

The problem with labels

Every model so far was trained on a fixed label set — ImageNet’s 1000 classes, VQA’s answer vocabulary. To recognize a new concept you needed new labeled data. Meanwhile, the internet is full of images paired with natural-language captions — essentially free, web-scale, open-vocabulary supervision. The question: can we train on raw image–caption pairs instead of curated labels?

Contrastive language–image pretraining

CLIP (Radford et al., 2021) did exactly that, on 400 million image–text pairs scraped from the web. It has two encoders: an image encoder (ViT or ResNet) and a text encoder (a transformer). Each maps its input to a vector in a shared embedding space. Training uses a contrastive objective: within a batch of $N$ pairs, there are $N$ correct image–text matches and $N^2 - N$ incorrect ones. CLIP maximizes the cosine similarity of the $N$ correct pairs and minimizes it for the rest, via a symmetric cross-entropy (InfoNCE) loss. For L2-normalized image embeddings $I_i$ and text embeddings $T_j$ (so that the inner product is the cosine similarity), with temperature $\tau$:

$$\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\langle I_i, T_i\rangle / \tau)}{\sum_{j=1}^{N}\exp(\langle I_i, T_j\rangle / \tau)},$$

and symmetrically for text→image; the total loss averages the two. In plain terms: pull a picture and its true caption together; push it away from every other caption in the batch.
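The symmetric loss is short enough to write out. This is a NumPy sketch, not CLIP's training code: the temperature is fixed at an illustrative 0.07 (CLIP learns it), and the embeddings are toy one-hot vectors.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of N matching (image, text) embedding pairs."""
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    T = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = I @ T.T / tau                 # (N, N); the diagonal holds the true pairs
    idx = np.arange(len(logits))
    img_to_txt = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its caption
    txt_to_img = -log_softmax(logits, axis=0)[idx, idx].mean()  # each caption picks its image
    return 0.5 * (img_to_txt + txt_to_img)

emb = np.eye(4)
aligned = clip_loss(emb, emb)                       # every pair agrees: loss near 0
shuffled = clip_loss(emb, np.roll(emb, 1, axis=0))  # every caption mismatched: large loss
```

Larger batches add more columns to `logits`, that is, more wrong captions each image must be told apart from.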

Reader question: why does this produce something useful? Because to tell the right caption from 32,000 wrong ones (CLIP used batch sizes up to 32,768), the model must learn what images and text actually mean, not surface statistics. The shared space ends up organized semantically: photos of dogs cluster near the text “a photo of a dog.”

Zero-shot classification, for free

The killer trick: CLIP can classify into any set of categories you describe in words, with no task-specific training. To classify an image among “cat,” “dog,” “car,” you embed the prompts “a photo of a cat,” “a photo of a dog,” “a photo of a car,” embed the image, and pick the caption with highest cosine similarity. CLIP matched the accuracy of a fully-supervised ResNet-50 on ImageNet (76.2% top-1) without seeing a single ImageNet training label — and, crucially, it was far more robust to distribution shift (sketches, adversarial renditions) than supervised models, which tend to latch onto dataset-specific shortcuts.
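Mechanically, zero-shot classification is a nearest-neighbor lookup in the shared space. In the sketch below the embeddings are hand-written toy vectors standing in for the outputs of CLIP's two encoders; no real model is called.

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, class_names):
    """Return the class whose prompt embedding has the highest cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = txt @ img
    return class_names[int(np.argmax(sims))], sims

class_names = ["cat", "dog", "car"]
# Stand-ins for text_encoder("a photo of a cat"), etc.
prompt_embs = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
# Stand-in for image_encoder(photo_of_a_dog).
image_emb = np.array([0.2, 0.9, 0.1])

label, sims = zero_shot_classify(image_emb, prompt_embs, class_names)
```

Changing the task means changing only the list of prompts; no weights are updated.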

ALIGN (Jia et al., 2021) showed the same recipe scales with noisier data: 1.8 billion image–alt-text pairs with minimal filtering, matching or beating CLIP. The lesson: scale of weakly-aligned data beats careful curation.

CLIP’s frozen image encoder became the de-facto “eyes” bolted onto language models in the VLM era. When you hear that a model “uses a CLIP vision encoder,” this is what it means.

Part 5 — Learning to see without labels: the self-supervised detour

Before VLMs, one more thread — it is the direct ancestor of JEPA, so we cannot skip it. CLIP needs paired text. Can a model learn good visual features from images alone, with no labels and no captions? This is self-supervised learning (SSL): invent a task where the supervision comes from the data itself. Two families emerged, and the tension between them sets up the whole world-model debate.

Family 1: contrastive and self-distillation

SimCLR (Chen et al., 2020) and MoCo (He et al., 2019): take an image, make two random augmented “views” of it, and train so the two views of the same image have similar embeddings while views of different images are pushed apart — contrastive learning, but image-only (no text).
DINO (Caron et al., 2021) and DINOv2 (Oquab et al., 2023): self-distillation with no labels. A “student” network is trained to match the output of a “teacher” network (an exponential moving average of the student) on different views. Remarkably, DINO’s attention maps segment objects with no segmentation supervision at all. DINOv2 scaled this to produce general-purpose visual features that rival CLIP without any text — and these DINO features are exactly what several world models (DINO-WM, V-JEPA’s relatives) build on.
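The "exponential moving average" teacher is a one-line update applied after every student step. A minimal sketch, with parameters held in a plain dictionary and a momentum value chosen only for illustration:

```python
def ema_update(teacher, student, momentum=0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student, per parameter."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

teacher = {"w": 0.0}
student = {"w": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

The teacher receives no gradients; it changes only through this slow average, which gives the student a stable target to match.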

Family 2: masked reconstruction (the generative approach)

BEiT (Bao et al., 2021) and Masked Autoencoders (MAE) (He et al., 2021): borrow the trick that made BERT work in NLP. Mask out a large fraction of image patches (MAE masks 75%) and train the model to reconstruct what is missing: BEiT predicts discrete visual tokens for the masked patches, while MAE regresses their raw pixels. MAE uses an asymmetric design: a heavy encoder sees only the visible 25% of patches; a lightweight decoder reconstructs the rest. It is simple, scalable, and learns strong features.
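The masking step and the MAE-style loss can be sketched as follows. The encoder and decoder themselves are omitted; function names and shapes are assumptions made for illustration.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Split patch tokens into a visible subset and the indices to reconstruct."""
    rng = np.random.default_rng() if rng is None else rng
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return tokens[keep_idx], keep_idx, mask_idx

def masked_mse(pred, target, mask_idx):
    """Mean squared pixel error, computed on the masked patches only."""
    return float(((pred[mask_idx] - target[mask_idx]) ** 2).mean())

rng = np.random.default_rng(0)
patches = rng.random((196, 768))                       # e.g. a patchified 224 x 224 image
visible, keep_idx, mask_idx = random_mask(patches, rng=rng)
# The heavy encoder would see only `visible` (49 tokens); a light decoder would then
# predict all 196 patches, and the loss compares only the 147 masked ones.
```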

Hold onto this distinction, because it is the crux of Part 8. MAE reconstructs pixels. Yann LeCun’s critique — which motivates JEPA — is that predicting pixels wastes capacity modeling unpredictable, irrelevant detail (the exact texture of grass, the precise leaves on a tree). His alternative: predict in representation space, not pixel space. That is the seed of the world-model program.

Part 6 — Vision-Language Models (VLMs): giving an LLM eyes

Now the modern era most people have actually used: GPT-4V, Gemini, Claude with vision, Qwen-VL, LLaVA. These are vision-language models — large language models that can also see.

First, what is an LLM? (the one-paragraph version)

A large language model is a giant transformer (Part 3) trained to do one thing: predict the next token in a sequence of text. A token is a chunk of text — a common word, a word-piece, or a character — drawn from a vocabulary of ~50k–250k entries. Trained on trillions of tokens of internet text with the objective $\mathcal{L} = -\sum_t \log p_\theta(w_t \mid w_{1:t-1})$ (maximize the probability of the actual next token), a model with billions of parameters acquires, as a side effect of getting good at prediction, a startling amount of knowledge, reasoning, and instruction-following ability. At generation time it samples tokens one at a time, feeding each back in — this is autoregressive decoding. The 2020 finding that this scales predictably — bigger model + more data + more compute = reliably better, per the “scaling laws” — is why the field poured resources into ever-larger LLMs.
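Autoregressive decoding is a simple loop around the model. In this sketch `next_token_logits` is a hypothetical callable standing in for the transformer, and the toy "model" just counts upward so the output is easy to follow.

```python
import numpy as np

def generate(next_token_logits, prompt, max_new_tokens, eos_id, temperature=1.0, rng=None):
    """Sample one token at a time, feeding each sampled token back in as context."""
    rng = np.random.default_rng() if rng is None else rng
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = np.asarray(next_token_logits(tokens), dtype=float) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        nxt = int(rng.choice(len(probs), p=probs))
        tokens.append(nxt)
        if nxt == eos_id:  # stop once the end-of-sequence token is produced
            break
    return tokens

def toy_logits(tokens, vocab_size=5):
    """Toy stand-in for an LLM: puts all probability mass on (last token + 1)."""
    logits = np.full(vocab_size, -1000.0)
    logits[(tokens[-1] + 1) % vocab_size] = 0.0
    return logits

out = generate(toy_logits, prompt=[0], max_new_tokens=10, eos_id=4,
               rng=np.random.default_rng(0))
```

A real model's distribution is far less peaked, which is where `temperature` and other sampling choices start to matter.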

Reader question: how does a model that only predicts text learn to “reason” or “follow instructions”? Two stages after pretraining. Supervised fine-tuning (SFT) on curated instruction–response examples teaches the format of being a helpful assistant. Reinforcement learning from human feedback (RLHF) — or from verifiable rewards (RLVR) for math/code — then optimizes the model’s outputs against a reward signal that scores quality. The base model supplies the knowledge; these stages shape it into something useful and steerable.

The core VLM idea: graft vision onto a language model

A VLM combines three parts: (1) a vision encoder (usually a CLIP ViT) that turns an image into a set of patch-feature vectors; (2) a connector/projector that maps those vision features into the language model’s token-embedding space; (3) the LLM itself, which now receives a mix of text tokens and image tokens and generates text. The image is, in effect, “spoken” to the LLM as a handful of tokens it can attend to.
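Stripped of the two large networks, the connector step looks like this. It is a sketch with a single linear projector and small made-up dimensions; real systems differ in projector design and in how image tokens are interleaved with text.

```python
import numpy as np

def build_llm_inputs(patch_feats, W_proj, b_proj, text_embeds):
    """Project vision features into the LLM's embedding space and prepend them to the text.

    patch_feats: (P, d_vis) output of the vision encoder.
    text_embeds: (T, d_llm) embeddings of the prompt's text tokens.
    """
    image_tokens = patch_feats @ W_proj + b_proj  # (P, d_llm)
    return np.concatenate([image_tokens, text_embeds], axis=0)

rng = np.random.default_rng(0)
P, d_vis, T, d_llm = 16, 32, 5, 48  # hypothetical sizes
inputs = build_llm_inputs(
    rng.normal(size=(P, d_vis)),
    rng.normal(size=(d_vis, d_llm)),
    np.zeros(d_llm),
    rng.normal(size=(T, d_llm)),
)
# `inputs` has shape (P + T, d_llm); the LLM attends over it like any other token sequence.
```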

There are two schools of how to build the connector, and the difference is the main axis of variation in VLM research.

School A — frozen-LLM connectors (the efficient route)

Flamingo (Alayrac et al., 2022) from DeepMind was the trailblazer. It kept a powerful pretrained LLM frozen and inserted new gated cross-attention layers that let text tokens attend to visual features from a frozen vision encoder. A “Perceiver Resampler” compressed variable numbers of image features into a fixed small set of tokens. Crucially, Flamingo handled interleaved image-and-text sequences, which gave it in-context few-shot learning for vision: show it a few image→answer examples in the prompt and it generalizes to a new image. The 80B model set state-of-the-art on many VQA/captioning benchmarks with no task-specific fine-tuning.
BLIP-2 (Li et al., 2023) introduced the Q-Former (Querying Transformer): a small transformer with a fixed set of learnable query tokens that extract the most useful ~32 visual features from a frozen image encoder and feed them to a frozen LLM. By keeping both big models frozen and training only the lightweight Q-Former, BLIP-2 reached strong performance at a tiny fraction of the compute. It descends from BLIP (Li et al., 2022), which unified vision-language understanding and generation with a caption-and-filter data engine.
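What the Perceiver Resampler and the Q-Former have in common is a fixed set of learned queries that cross-attend to however many image features arrive. The sketch below reduces that to a single cross-attention layer; it is a simplification for illustration, not either paper's architecture, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_pool(image_feats, queries, W_K, W_V):
    """Cross-attention from M learned queries to N image features.

    image_feats: (N, d_img); queries: (M, d). Output is always (M, d).
    """
    K = image_feats @ W_K  # (N, d)
    V = image_feats @ W_V  # (N, d)
    weights = softmax(queries @ K.T / np.sqrt(queries.shape[1]), axis=-1)  # (M, N)
    return weights @ V

rng = np.random.default_rng(0)
d_img, d, M = 24, 16, 32
queries = rng.normal(size=(M, d))  # learned parameters in a real model
W_K = rng.normal(size=(d_img, d))
W_V = rng.normal(size=(d_img, d))

small = query_pool(rng.normal(size=(257, d_img)), queries, W_K, W_V)
large = query_pool(rng.normal(size=(577, d_img)), queries, W_K, W_V)
```

Whatever the image resolution, the LLM receives the same small number of visual tokens, which keeps its context cost fixed.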

School B — fine-tuned, simple-projector VLMs (the route everyone copied)

LLaVA (Liu et al., 2023) showed you might not need anything fancy. Its connector is a single linear projection (later a small MLP) from CLIP features into the LLM’s embedding space. The insight was about data, not architecture: they used GPT-4 (text-only) to generate visual instruction-tuning data — synthetic conversations about images — then fine-tuned the projector and the LLM (Vicuna) on it. LLaVA-1.5 (Liu et al., 2023) swapped in an MLP connector and better data and became a strong, cheap, reproducible baseline that the open-source world standardized on. InstructBLIP (Dai et al., 2023) made BLIP-2’s Q-Former instruction-aware in a similar spirit.

The two schools above bolt vision onto a finished LLM (“late fusion”). The frontier instead trains a single transformer on both modalities from early on (“early fusion”), so vision and language share parameters throughout.

Chameleon (Meta, 2024) tokenizes images into discrete tokens (via a learned image quantizer) and trains one transformer over a single interleaved stream of text and image tokens — a genuinely unified token space that can both read and generate images.
Fuyu (Adept, 2023) ditched the separate vision encoder entirely: it linearly projects raw image patches straight into the transformer, treating them like text tokens. Architecturally radical in its simplicity.
PaLI (Chen et al., 2022) scaled a unified image+text transformer to 17B and emphasized multilingual multimodality. PaLM-E (Driess et al., 2023) injected not just images but robot sensor data into a 562B LLM, making it an embodied multi-modal model that outputs robot action plans — an important bridge toward the world-model section.
The production VLM families — Qwen-VL (Bai et al., 2023) and Qwen2-VL (Wang et al., 2024), with its “naive dynamic resolution” letting one model handle arbitrary image sizes — plus the Llama 3 (Dubey et al., 2024) vision adapters and proprietary GPT-4V and Gemini, are where most real-world multi-modal use happens today. (For a deep dive on how one frontier lab does joint text–vision training, see the companion post on Kimi K2.5.)

How do we know any of these are good? Benchmarks.

Modern VLMs are scored on a battery of benchmarks: VQAv2 (open-ended visual QA), TextVQA (reading text in images), DocVQA (documents), ChartQA (charts), MMMU (college-level multi-discipline reasoning across text+images, the current hard benchmark — strong 2024–2025 models score in the 60s%, expert humans ~88%), MathVista (visual math), and MMBench. The trajectory is steep: a 2021 model could barely read a chart; 2025 frontier VLMs do competitive document understanding, multi-step visual reasoning, and GUI/agentic control. But all of these test perception and description. None of them test whether the model understands that if you push a glass off a table, it falls and breaks. That gap is the motivation for the final act.

One branch deserves mention because it sets up world models. Text-to-image and text-to-video generators learn to synthesize pixels from a description. DALL·E (Ramesh et al., 2021) did it autoregressively over discrete image tokens; Latent Diffusion / Stable Diffusion (Rombach et al., 2021) did it via diffusion — start from pure noise and iteratively denoise toward an image, guided by text — operating in a compressed latent space for efficiency. Scaled to video, Sora (OpenAI, 2024) produces minute-long coherent clips and was explicitly pitched by its authors as a step toward “world simulators.” The catch: a video generator that produces plausible pixels is not necessarily one that understands physics — it can hallucinate objects appearing and vanishing, because its objective rewards looking right, not being right. This pixel-prediction-versus-understanding tension is exactly what JEPA was designed to resolve.

Part 7 — World models: from perceiving to predicting

What is a world model, and why do we need one?

A world model is a learned internal simulator of an environment: given the current state (what the agent perceives) and a candidate action, it predicts the next state. Formally, it learns a transition function $\hat{s}_{t+1} = f_\theta(s_t, a_t)$ — often in a latent representation rather than raw observations. With such a model, an agent can plan by imagination: roll out “what would happen if I did this?” entirely inside its head, evaluate the outcomes, and pick the best action — without expensive or dangerous trial-and-error in the real world.
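Planning by imagination can be sketched with the simplest possible planner, random shooting: sample many action sequences, roll each one out inside the model, and keep the best. The toy transition function below is hand-written and stands in for a learned $f_\theta$; the 1-D environment, cost, and sample counts are all assumptions.

```python
import numpy as np

def f_theta(state, action):
    """Stand-in for a learned transition model: a 1-D point moved by the action."""
    return state + action

def rollout(state, actions):
    """Imagine the final state after applying a sequence of actions in the model."""
    for a in actions:
        state = f_theta(state, a)
    return state

def plan(state, goal, horizon=5, n_candidates=2000, seed=0):
    """Random-shooting planner: evaluate candidates purely inside the model."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    costs = [abs(rollout(state, seq) - goal) for seq in candidates]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

best_actions, best_cost = plan(state=0.0, goal=2.0)
```

No real-world step is taken until a plan is chosen. In practice an agent would execute only the first action, observe the new state, and replan.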

Reader question: how is this different from a VLM? A VLM is, at heart, reactive and descriptive — it perceives and produces text. A world model is predictive and causal — it models how the world evolves under actions. A VLM can tell you a ball is on a slope; a world model can tell you where the ball will be in two seconds, and how that changes if you nudge it. Many researchers (LeCun foremost) argue that this predictive, action-conditioned understanding — not next-token prediction over internet text — is the missing ingredient for human-like intelligence, because it is how animals and infants learn: by observing and predicting the consequences of interaction.

The reinforcement-learning lineage

The world-model idea is older than the current hype, born in reinforcement learning (RL), where an agent learns by maximizing reward.

“World Models” (Ha & Schmidhuber, 2018) was the paper that named the field. It compressed observations with a variational autoencoder (V), modeled their dynamics with an RNN (M), and used a tiny controller (C) that could even be trained entirely inside the model’s dream — the agent learned to play a racing game by practicing in its own imagined simulation.
PlaNet (Hafner et al., 2018) and the Dreamer line — Dreamer (2019), DreamerV2 (2020), DreamerV3 (2023) — built Recurrent State-Space Models that learn a compact latent dynamics model and train the agent’s policy purely on imagined latent rollouts. DreamerV3 was a landmark: a single set of hyperparameters mastered over 150 diverse tasks, and — famously — it was the first to collect diamonds in Minecraft from scratch with no human data, a long-horizon exploration problem that had defeated prior methods.
MuZero (Schrittwieser et al., 2019) from DeepMind learned a world model implicitly: it never reconstructs observations, only learns to predict the quantities that matter for planning (reward, value, policy), then plans with Monte-Carlo Tree Search. It mastered Go, chess, shogi, and Atari without being told the rules. This “predict only what matters for the task” philosophy directly anticipates JEPA’s “predict in representation space.”

Large-scale generative world models

Recently, “world model” has also come to mean large generative models of video conditioned on actions: interactive, controllable simulators.

GAIA-1 (Hu et al., 2023) from Wayve is a generative world model for autonomous driving: feed it video, text, and actions, and it generates realistic future driving scenarios for training and testing.
Genie (Bruce et al., 2024) from DeepMind learned, from unlabeled internet gameplay videos, to generate playable 2D worlds — it infers a latent action space with no action labels, so you can “press a button” and the generated world responds. Genie 2 (2024) and Genie 3 (2025) extended this to controllable, consistent 3D environments.
Cosmos (NVIDIA, 2025) is a “World Foundation Model” platform aimed at Physical AI — pretrained video world models you fine-tune for robotics and driving, generating action-conditioned futures to train embodied agents.
Large World Model (LWM) (Liu et al., 2024) pushed multi-modal context length to a million-plus tokens, so a single model can reason over a full hour-long video or a large codebase — a different sense of “world,” but part of the same impulse to model long, rich context.

These are powerful, but they share the pixel-prediction objective that LeCun argues is wasteful. The JEPA program is the counter-thesis.

Part 8 — JEPA: predicting in representation space

This is the destination. JEPA stands for Joint-Embedding Predictive Architecture, and it is the centerpiece of Yann LeCun’s research program, laid out in his 2022 position paper “A Path Towards Autonomous Machine Intelligence.” The thesis: the road to human-level AI is self-supervised, predictive world models that operate in abstract representation space, not generative models that predict pixels and not LLMs that predict text tokens.

The core idea, and the collapse problem

A JEPA takes two related inputs: a context $x$ (say, the visible part of an image or video) and a target $y$ (a masked-out or future part). It encodes each into a representation, $s_x$ and $s_y$, and trains a predictor $g_\phi$ to predict the target’s representation from the context’s representation:

$$\hat{s}_y = g_\phi(s_x), \qquad \mathcal{L} = \big\lVert\, \hat{s}_y - \mathrm{sg}(s_y)\,\big\rVert^2,$$

where $\mathrm{sg}$ is a stop-gradient: the target representation is treated as a constant, so no gradient flows back through the target branch. The point: the model predicts in the latent space of $s_y$, not the pixel space of $y$. Unpredictable details (exact texture, precise pixel values) can be discarded by the encoder, so the model spends capacity only on predictable, semantic structure. This is the formal statement of “predict, don’t reconstruct.”

Reader question: if you train an encoder and a predictor jointly to make $\hat{s}_y \approx s_y$, what stops the encoder from cheating by mapping everything to the same constant vector? This is the representation collapse problem, and it is the central difficulty of JEPA. If the encoder outputs a constant, the prediction loss is trivially zero and the representation is useless. Preventing collapse is the hardest engineering problem in every JEPA. The methods differ: a momentum/EMA target encoder plus stop-gradient (so the target is a stable, slowly-moving copy), asymmetric architecture, and explicit variance/covariance regularizers that force the embeddings to spread out and decorrelate. This anti-collapse machinery is exactly what LeWorldModel later simplifies away.
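
To make the collapse problem concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than any paper's implementation: the function names (`prediction_loss`, `variance_penalty`, `ema_update`), the hinge target of 1.0, and the momentum of 0.99 are made up for the example, and embeddings are plain lists of floats.

```python
import random
import statistics


def prediction_loss(pred, target):
    """Mean squared error between predicted and target embeddings (target held constant)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def variance_penalty(batch, min_std=1.0):
    """Hinge on the per-dimension std across a batch; zero only if every dimension is spread out."""
    dims = list(zip(*batch))
    return sum(max(0.0, min_std - statistics.pstdev(d)) for d in dims) / len(dims)


def ema_update(target_w, online_w, momentum=0.99):
    """Target-encoder weights trail the online weights as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target_w, online_w)]


rng = random.Random(0)
# Embeddings from a healthy encoder (assumed here to look like unit Gaussians).
spread = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]
# A collapsed encoder ignores its input and always returns the same vector.
collapsed = [[0.5, 0.5, 0.5, 0.5] for _ in spread]

# With an identity predictor, the collapsed encoder gets a perfect prediction loss...
collapse_loss = sum(prediction_loss(s, s) for s in collapsed) / len(collapsed)
# ...but a variance term tells the two encoders apart.
collapsed_penalty = variance_penalty(collapsed)
spread_penalty = variance_penalty(spread)
```

The prediction loss alone cannot distinguish the useless encoder from a good one; only the extra term (or an asymmetric target such as the one `ema_update` maintains) does.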

The JEPA family

I-JEPA (Image-JEPA) (Assran et al., 2023) is the first instantiation. Given a context block of an image, predict the representations of several target blocks elsewhere in the image. Critically, prediction happens in feature space (unlike MAE’s pixel reconstruction). I-JEPA learns strong features without hand-crafted augmentations and trains far more efficiently than pixel-reconstruction or heavy-augmentation methods, validating the core thesis on images.
V-JEPA (Video-JEPA) (Bardes et al., 2024) extends this to video: mask large spatio-temporal regions and predict their representations from the visible context, learning motion and interaction features from 2M videos. Evaluated with a frozen backbone, it is strong on action recognition (e.g. 72.2% on Something-Something-v2), purely from feature prediction: no pixel reconstruction, no text, no labels.
V-JEPA 2 (Assran et al., 2025) is the scaled, action-capable version, and the most important recent result. Pretrained on over 1 million hours of internet video, it reaches state-of-the-art motion understanding (77.3% top-1 on Something-Something-v2) and human-action anticipation (39.7 recall-at-5 on Epic-Kitchens-100). Aligned with an LLM, it achieves strong video-QA results (e.g. 84.0 on PerceptionTest) at the 8B scale. The headline is V-JEPA 2-AC (action-conditioned): post-trained on just 62 hours of unlabeled robot video, it was deployed zero-shot on real Franka robot arms in two different labs to pick and place objects by planning toward image goals, with no reward, no task-specific data, and no data collected in those labs. That is a world model used for real robot control, exactly as the program promised.
Two relatives in the same spirit: DINO-WM (Zhou et al., 2024) builds a world model on frozen DINOv2 features and plans zero-shot by optimizing action sequences toward a goal representation; and Navigation World Models (Bar et al., 2024) — LeCun among the authors — learns a controllable model that predicts future visual states from past observations and navigation actions, scoring candidate trajectories by how close their imagined endpoint lands to the goal.
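
Planning toward a goal representation, as these systems do, can be sketched as follows. This is a minimal illustration, not any paper's planner: `predict_next` is a hypothetical stand-in for a learned latent dynamics model (here just additive 2-D motion), and random shooting is swapped in as the simplest possible optimizer over action sequences. The actual systems may use different optimizers.

```python
import random


def predict_next(z, a):
    """Stand-in for a learned latent dynamics model on 2-D embeddings (assumed additive)."""
    return (z[0] + a[0], z[1] + a[1])


def rollout_endpoint(z0, actions):
    """Imagine the whole action sequence in latent space and return the final embedding."""
    z = z0
    for a in actions:
        z = predict_next(z, a)
    return z


def sq_dist(u, v):
    """Squared Euclidean distance between two 2-D embeddings."""
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2


def plan(z0, z_goal, horizon=5, n_candidates=256, seed=0):
    """Random shooting: sample action sequences, imagine each, keep the one ending nearest the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        cost = sq_dist(rollout_endpoint(z0, actions), z_goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


start, goal = (0.0, 0.0), (2.0, -1.0)
best_actions, best_cost = plan(start, goal)
```

Note that the cost is measured between embeddings, never between images: the goal image is encoded once, and every candidate is scored by where its imagined endpoint lands relative to it.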

LeWorldModel: the JEPA the field had been trying to build

The frustration with JEPAs was always the collapse-prevention machinery: stacks of regularizers, EMA teachers, careful schedules — fragile, hard to tune, hard to scale end-to-end from raw pixels. LeWorldModel (LeWM) (Maes et al., 2026), from a team including LeCun, Mila, and AMI Labs, is the first JEPA that trains stably end-to-end straight from pixels with essentially one regularizer, and it is a strikingly clean result to end on.

What makes it notable:

Two loss terms, not six. Prior JEPA objectives stacked up to five anti-collapse terms (variance, covariance, etc.) plus the prediction loss. LeWM uses just the next-embedding prediction loss plus a single regularizer, SIGReg, which pushes the distribution of latent embeddings toward an isotropic Gaussian. Embeddings distributed as an isotropic Gaussian have nonzero variance in every direction, so they cannot have collapsed to a point. SIGReg therefore dissolves the collapse problem with one principled term and cuts tunable loss hyperparameters from six to one.
Tiny and fast. At roughly 15 million parameters, LeWM trains on a single GPU in a few hours, yet plans up to 48× faster than world models built on large video foundation models, while staying competitive across diverse 2D and 3D control tasks. It is a direct rebuke to the assumption that world models must be enormous generative video models.
It learns real physics. Probing and “violation-of-expectation” tests (show the model a physically impossible event and check whether its predictions register surprise) confirm the latent space encodes genuine physical structure, not just visual statistics — the property a true world model is supposed to have, and exactly what pixel-prediction video models struggle to guarantee.
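
One way to see why pulling embeddings toward an isotropic Gaussian rules out collapse is the sketch below. It does not reproduce SIGReg. It swaps in a much cruder stand-in that projects embeddings onto random directions and penalizes each 1-D projection for having a mean away from 0 or a variance away from 1. The function names and the choice of 16 directions are assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gaussianity_penalty(embeddings, n_directions=16, seed=0):
    """Crude moment matching: each random 1-D projection should have mean 0 and variance 1."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_directions


data_rng = random.Random(1)
# Embeddings that already look like an isotropic Gaussian.
healthy = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(512)]
# Collapsed embeddings: every input maps to the same point.
collapsed = [[0.5] * 8 for _ in range(512)]

healthy_penalty = gaussianity_penalty(healthy)
collapsed_penalty = gaussianity_penalty(collapsed)
```

A collapsed batch has zero variance along every projection, so its penalty can never fall below about 1, while Gaussian-looking embeddings score near 0. A regularizer of this kind therefore pushes against collapse without needing an EMA teacher or a stack of separate variance and covariance terms.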

LeWorldModel is the cleanest demonstration so far of the entire thesis this post has been building toward: a small, stable, efficient model that learns to predict the consequences of actions in an abstract representation space, and can plan with that model on real control problems. It is the AlexNet-style “the simple right idea, finally made to work” moment for the world-model program.

Part 9 — Stepping back: the throughline and the open questions

Look at the whole arc and one pattern dominates: the field keeps replacing hand-designed structure with learned, general structure, and keeps pushing the prediction target from the concrete toward the abstract.

AlexNet: learn features instead of designing them.
Transformer / ViT: use one general sequence architecture instead of modality-specific ones.
CLIP: learn an aligned multi-modal space from raw web pairs instead of fixed labels.
VLMs: compose pretrained vision and language models into systems that perceive and converse.
World models / JEPA: predict the consequences of actions in representation space instead of describing or reconstructing the present.

Each step abstracted further from raw pixels and labels toward semantic, predictive, action-grounded understanding.

A few honest open questions, because the story is far from over:

Will world models scale like LLMs did? LLMs had clean scaling laws. It is not yet established that JEPA-style world models improve as predictably with scale — V-JEPA 2 is encouraging, LeWM suggests small can be enough for control, and the two findings are not obviously reconcilable yet.
Generative vs. non-generative. The Sora/Genie/Cosmos camp predicts pixels and bets that scale yields physical understanding as a byproduct; the JEPA camp predicts representations and bets that pixel prediction is a fundamental waste. The field has not settled which is right, and the answer may be task-dependent.
How do these merge with LLMs? The most capable systems will likely need both the broad knowledge and language of an LLM and the grounded, predictive physical understanding of a world model. How to fuse them — V-JEPA 2’s LLM alignment is one early hint, PaLM-E another — is wide open.
Evaluation. We have good benchmarks for perception (MMMU, VQA) but immature ones for understanding dynamics and physics. The violation-of-expectation tests used for LeWM and the physical-reasoning benchmarks used to evaluate V-JEPA 2 are early steps; the field still lacks an “ImageNet moment” for world models.

If the 2012–2021 decade was about teaching machines to see, and 2021–2025 about teaching them to see and talk, the bet now on the table — from Dreamer to V-JEPA 2 to LeWorldModel — is that the next decade is about teaching them to imagine: to carry a predictive model of the world in their heads and use it to plan. Whether that bet pays off, and which of the competing recipes wins, is the most interesting open question in the field.

Reference timeline

Year	Model	Contribution	Paper (venue or arXiv ID)
2012	AlexNet	Deep CNNs win ImageNet	NeurIPS 2012
2014	VGG / Inception	Depth and efficiency	1409.1556 / 1409.4842
2015	ResNet	Residual connections, 152 layers	1512.03385
2015	Show, Attend and Tell	Attention for captioning	1502.03044
2017	Transformer	Self-attention, the universal backbone	1706.03762
2020	ViT	Images as patch-token sequences	2010.11929
2021	CLIP	Contrastive image–text alignment	2103.00020
2021	MAE	Masked pixel reconstruction SSL	2111.06377
2022	Flamingo	Frozen-LLM few-shot VLM	2204.14198
2023	BLIP-2 / LLaVA	Q-Former / visual instruction tuning	2301.12597 / 2304.08485
2023	I-JEPA	Predict image representations, not pixels	2301.08243
2023	DreamerV3	General world-model RL, Minecraft diamonds	2301.04104
2024	V-JEPA / Genie	Video feature prediction / playable worlds	2404.08471 / 2402.15391
2025	V-JEPA 2	Action-conditioned world model, zero-shot robots	2506.09985
2026	LeWorldModel	Stable end-to-end JEPA from pixels, one regularizer	2603.19312

If you want to go deeper on the attention mechanism that underpins all of this, see the companion posts on sparse attention and attention residuals.