Test-Time Training

Yining Hong

[Figure: three nested timescales of test-time learning.
Level 1 — Episode (milliseconds to hours): Hippocampus ↔ Neocortex. The hippocampus is the fast learner (episodic trace, single-pass binding, high plasticity); the neocortex is the slow learner (statistical schema, world model, durable structure); fast encode → selective consolidation into the prior. Self-supervised signal: prediction error · sensorimotor contingency · temporal contiguity · spatial co-occurrence.
Level 2 — Individual Life (decades): continuous TTT across a lifetime; experience continuously integrates into the personal world model.
Level 3 — Evolution (millions of years; development and culture across generations): natural selection continuously shapes the learning algorithm; continuous adaptation integrates into the species' prior.]
Level I

The Episodic Loop

Hippocampus ↔ Neocortex  ·  Complementary Learning Systems  ·  Self-Supervised

At every moment, the world model updates itself continuously — with no external teacher and no labeled data. The hippocampus rapidly encodes each new experience as an episodic trace; the neocortex slowly consolidates statistical regularities across experiences into durable structure — much of it during sleep.

This is why you remember where you parked this morning (hippocampal episode) separately from how parking garages generally work (neocortical schema). Complementary Learning Systems theory formalizes exactly this dual-rate architecture.

The world model updates are self-supervised — the signal comes entirely from the structure of experience itself: prediction error (what did I expect vs. what happened?), sensorimotor contingency (how does the world respond to my actions?), temporal contiguity (what tends to follow what?), and spatial co-occurrence (what appears together?). What gets encoded: object permanence, causal structure, affordances, and regularities of other agents' behavior.

Each experience updates the world model — but selectively. The brain must continually decide which latent structure is worth consolidating into the slow neocortical weights, and which can be discarded after use.
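The dual-rate loop can be sketched in a few lines. Everything below — the linear model class, the learning rates, the surprise threshold — is invented for illustration: a fast learner binds every episode at high plasticity, while a slow learner consolidates only the episodes it still finds surprising, through interleaved replay.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: outcomes are a noisy linear function of a sensory cue.
w_true = np.array([1.5, -2.0])

def experience():
    x = rng.normal(size=2)                   # sensory cue
    y = w_true @ x + rng.normal(scale=0.1)   # observed outcome
    return x, y

w_fast = np.zeros(2)   # "hippocampus": high plasticity, updated every episode
w_slow = np.zeros(2)   # "neocortex": low plasticity, updated only via replay
LR_FAST, LR_SLOW = 0.3, 0.02
SURPRISE = 0.5         # consolidation gate on the slow system's error

episodic_buffer = []   # traces awaiting consolidation

for t in range(2000):
    x, y = experience()
    # Fast system: near one-shot binding of the current episode.
    w_fast += LR_FAST * (y - w_fast @ x) * x
    # Selective consolidation: keep only episodes the slow system
    # still finds surprising; the rest are discarded after use.
    if abs(y - w_slow @ x) > SURPRISE:
        episodic_buffer.append((x, y))
    # "Sleep": occasional interleaved replay distills the buffer
    # into the slow, durable weights.
    if episodic_buffer and t % 10 == 0:
        xr, yr = episodic_buffer[rng.integers(len(episodic_buffer))]
        w_slow += LR_SLOW * (yr - w_slow @ xr) * xr
```

Run long enough, w_slow drifts toward the world's regularities slowly and stably while w_fast tracks each episode: the same dual-rate split that Complementary Learning Systems theory describes.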

Level II

The Individual Life

Personalized continual learning  ·  no train–test boundary  ·  Self-Supervised

In machine learning, the train–test split is taught in the first lecture of any introductory course: "do not train on the test set." But consider your commute this morning. It is simultaneously testing — because you care about arriving right now — and training — because you are gaining experience for future commutes. There is no boundary. The world model updates itself continuously, whether or not you intend to learn.
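The no-boundary claim fits in a short sketch. Here an invented AR(1) stream stands in for the commute: at each step the model is scored on its prediction (the test), then immediately updates on the revealed outcome (the training) — the same event is both.

```python
import numpy as np

rng = np.random.default_rng(1)

phi_true = 0.9   # true autoregressive coefficient of the stream
phi_hat = 0.0    # the model's estimate, updated at "test time"
LR = 0.05

x = 0.0
test_errors = []
for t in range(3000):
    x_next = phi_true * x + rng.normal(scale=0.1)
    pred = phi_hat * x            # test: the prediction we care about now
    err = x_next - pred
    test_errors.append(err ** 2)
    phi_hat += LR * err * x       # train: the same event updates the model
    x = x_next

early = np.mean(test_errors[:100])
late = np.mean(test_errors[-100:])
# later prediction error is lower: the stream that tested the model
# is also the stream that trained it
```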

Every person develops a unique world model — shaped by their own continuous stream of lived experience. It updates itself through self-supervision for a lifetime, and no shared fine-tuning procedure can replicate it: your model of how your city works, how the people in your life behave, what your body can do.

What the world model encodes across a lifetime is overwhelmingly self-supervised: perceptual invariances (object constancy, face recognition), physics intuitions (gravity, solidity, continuity), language (grounded in embodied interaction, not labeled corpora), social cognition (theory of mind bootstrapped from contingent interaction), and motor programs (learned through self-generated movement and proprioceptive feedback). The curriculum is the world itself.

3D perception is itself an emergent ability at this level — not a veridical readout of geometry but a constructed approximation built up through months of reaching, crawling, and acting in space. Marr's 2.5D sketch, binocular disparity, shading and occlusion cues: these are heuristics learned from sensorimotor experience, not measurements. The hollow-face illusion and the Ames room expose the seams — the prior overrides the geometry. Under predictive coding (Helmholtz, Gregory, Clark), depth perception is Bayesian inference; under enactivism, 3D space is enacted through movement — remove the ability to act and depth collapses into ambiguity. There is no perfect 3D. There never was.
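The prior-overrides-geometry point can be made precise with one-dimensional Gaussian cue combination, the textbook form of Bayesian fusion. The numbers below are illustrative, not measured: a noisy disparity cue correctly reports "concave", but a lifetime of convex faces makes the prior far sharper.

```python
def fuse(mu_cue, var_cue, mu_prior, var_prior):
    """Precision-weighted fusion of two Gaussian beliefs."""
    w_cue, w_prior = 1.0 / var_cue, 1.0 / var_prior
    mu_post = (w_cue * mu_cue + w_prior * mu_prior) / (w_cue + w_prior)
    var_post = 1.0 / (w_cue + w_prior)
    return mu_post, var_post

# Depth sign: +1 = convex (nose toward you), -1 = concave (hollow mask).
mu_cue, var_cue = -1.0, 4.0       # disparity says "concave", but noisily
mu_prior, var_prior = +1.0, 0.1   # the convexity prior is very sharp

mu_post, var_post = fuse(mu_cue, var_cue, mu_prior, var_prior)
# mu_post is positive: the percept is convex, the prior wins
```

The posterior is also more confident than either input alone — which is exactly why the hollow-face illusion feels so stable.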

Level III

Evolution

Natural selection  ·  The Embodied Mind (Varela, Thompson, Rosch)  ·  Adaptation across generations

Evolution is the outermost layer of this process — but not a separate loop. What each organism learns across its lifetime integrates into the population's prior through reproduction and cultural transmission. Natural selection acts on that prior: it continuously filters which initial world model structures, which sensory architectures, which update rules are worth passing forward.

The enactivist framework from The Embodied Mind (Varela, Thompson, Rosch) gives this the right framing: the organism and environment co-constitute each other. The world model is not a passive representation — it is enacted through sensorimotor coupling. Evolution shapes the terms of that coupling: which signals count as errors, which regularities become innate, which hypotheses the organism is born already committed to.

The concrete result is visible in the human body: 3-channel color vision tuned to daylight (no thermal, no UV), hearing bounded to ~20 Hz–20 kHz, pain and proprioception as privileged self-supervised error signals, a social brain pre-wired for faces and gaze from birth, and eyes positioned frontally with the right interocular distance to make binocular disparity computationally useful — the hardware prerequisite for 3D perception to emerge at all. These are not neutral engineering choices — they are hypotheses baked into the prior by billions of years of selection pressure. The body is the inductive bias. Evolution optimizes not behavior, but the structure of the world model every new human is born with.
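As a cartoon of this outer loop (every number below is invented), evolution can be sketched as selection over initializations: fitness is measured only after a short inner-loop "lifetime" of learning, so what gets optimized is the prior a new individual is born with, not any particular behavior.

```python
import numpy as np

rng = np.random.default_rng(3)

def lifetime_fitness(w0, n_steps=5, lr=0.3):
    """Learn one randomly drawn task for a short 'lifetime'; fitness is
    negative post-learning error. Tasks share structure (targets cluster
    around a fixed point), so a good inherited prior starts near all of
    them and finishes learning within the lifetime."""
    target = np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=2)
    w = w0.copy()
    for _ in range(n_steps):
        w -= lr * (w - target)        # gradient of 0.5 * ||w - target||^2
    return -np.sum((w - target) ** 2)

# Outer loop: selection acts on the *initial* weights only.
population = [rng.normal(scale=3.0, size=2) for _ in range(30)]
for generation in range(40):
    scores = [np.mean([lifetime_fitness(w0) for _ in range(3)])
              for w0 in population]
    elite = [population[i] for i in np.argsort(scores)[-10:]]
    population = [e + rng.normal(scale=0.1, size=2)   # mutated offspring
                  for e in elite for _ in range(3)]

best = population[0]   # ends up near the task family's shared structure
```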

Related Work
My work
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, et al.
ICLR 2025 Spotlight  ·  arXiv:2410.23277
Dual-speed learning for long video generation: slow pre-training learns world dynamics; fast inference-time LoRA stores episodic memory. Directly instantiates the Level I hippocampus–neocortex architecture in a video model.
TTT foundations
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Yu Sun, Xinhao Li, Karan Dalal, et al.
2024  ·  arXiv:2407.04620
TTT layers whose hidden states are themselves machine learning models, updated by self-supervised learning on the test sequence. The hidden state update rule is a gradient step — making the world model self-updating at sequence-processing time.
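The core idea reads cleanly as code. This is a deliberately minimal sketch, not the paper's layer: the dimensions, learning rate, and plain reconstruction loss are simplifying assumptions (the actual method uses learned views of each token, mini-batched updates, and trained outer parameters).

```python
import numpy as np

def ttt_layer(tokens, dim, lr=0.05):
    """Hidden state = a linear model W; the state transition for each
    token is one gradient step on a self-supervised loss over that token."""
    W = np.zeros((dim, dim))
    outputs = []
    for x in tokens:
        # self-supervised loss on the current token: 0.5 * ||W x - x||^2
        grad = np.outer(W @ x - x, x)  # gradient of the loss w.r.t. W
        W = W - lr * grad              # state update = one gradient step
        outputs.append(W @ x)          # layer output for this token
    return np.stack(outputs), W

rng = np.random.default_rng(2)
seq = rng.normal(size=(64, 8))         # a toy "test sequence"
outs, W_final = ttt_layer(seq, dim=8)
# W has learned to reconstruct the sequence's statistics at test time
```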
One-Minute Video Generation with Test-Time Training
Karan Dalal*, Daniel Koceja*, Gashon Hussein*, Jiarui Xu*, …, Yu Sun, Xiaolong Wang
CVPR 2025  ·  arXiv:2504.05298
TTT layers inserted into a pretrained Diffusion Transformer enable coherent one-minute, multi-scene video generation — far beyond the context length of fixed-weight models. The world model extends itself self-supervisedly over the generated sequence.
Learning to Discover at Test Time (TTT-Discover)
Mert Yuksekgonul*, Daniel Koceja*, Xinhao Li*, Federico Bianchi*, …, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
2026  ·  arXiv:2601.16175
Reinforcement learning at test time on a single problem instance — mathematics, GPU kernels, algorithm design, biology. TTT as targeted specialization: the world model commits fully to one test instance rather than generalizing, achieving state-of-the-art on open scientific problems.
World models for action
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
2026  ·  arXiv:2603.16666
The value of video prediction in World Action Models comes from training-time supervision, not test-time imagination. At inference, Fast-WAM skips future video generation entirely — it processes the current observation once, and actions are generated directly from latent world representations in a single forward pass. The latents have already internalized the world model; re-executing it is unnecessary.
World Action Models are Zero-shot Policies (DreamZero)
Seonghyeon Ye*, Yunhao Ge*, …, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang
2026  ·  arXiv:2602.15922
A 14B World Action Model built on a pretrained video diffusion backbone, jointly predicting future frames and actions. By learning from heterogeneous, non-repetitive robot data, it achieves zero-shot generalization to novel tasks and environments — the world model's video-grounded latents transfer across embodiments without task-specific demonstrations.
Theoretical grounding
The Embodied Mind: Cognitive Science and Human Experience
Francisco J. Varela, Evan Thompson, Eleanor Rosch
MIT Press, 1991  ·  Revised edition 2016
The foundational text for enactivism: cognition is not representation retrieval but the ongoing enactment of a world through sensorimotor coupling. The organism and environment co-constitute each other. Grounds the Level III argument that the world model is not a passive internal map but an active, embodied process — and that evolution shapes the very terms of that coupling.