Yining Hong
At every moment, the world model updates itself continuously — with no external teacher and no labeled data. The hippocampus rapidly encodes each new experience as an episodic trace; the neocortex slowly consolidates statistical regularities across experiences into durable structure — much of it during sleep.
This is why you remember where you parked this morning (hippocampal episode) separately from how parking garages generally work (neocortical schema). Complementary Learning Systems theory formalizes exactly this dual-rate architecture.
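The dual-rate architecture can be caricatured in a few lines of code. This is a minimal sketch, not a model of CLS: the class name, the 1-D scalar "experience," and the learning rates are all illustrative assumptions. The fast system stores each episode verbatim; the slow system drifts toward the running statistics, and a "sleep" phase replays episodes into the slow weights before discarding them.

```python
import random

class DualRateLearner:
    """Toy complementary-learning-systems sketch: fast episodic store + slow schema."""

    def __init__(self, slow_lr=0.01):
        self.episodes = []    # hippocampus-like: exact one-shot traces
        self.schema = 0.0     # neocortex-like: slow statistical summary
        self.slow_lr = slow_lr

    def experience(self, x):
        self.episodes.append(x)                           # fast: store the episode verbatim
        self.schema += self.slow_lr * (x - self.schema)   # slow: small step toward the data

    def consolidate(self, replays=100):
        # "Sleep": replay stored episodes into the slow weights, then discard them.
        for _ in range(replays):
            x = random.choice(self.episodes)
            self.schema += self.slow_lr * (x - self.schema)
        self.episodes.clear()

learner = DualRateLearner()
for x in [9.8, 10.1, 10.0, 9.9, 10.2]:   # e.g. repeated parking-garage visits
    learner.experience(x)

print(learner.episodes[-1])   # the exact last episode: 10.2
learner.consolidate()
print(round(learner.schema, 2))   # the schema has drifted toward the regularity
```

The point of the caricature is the separation of timescales: the episode ("where I parked this morning") is available immediately and exactly, while the schema ("how parking garages generally work") only emerges after many experiences and replays.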
The world model updates are self-supervised — the signal comes entirely from the structure of experience itself: prediction error (what did I expect vs. what happened?), sensorimotor contingency (how does the world respond to my actions?), temporal contiguity (what tends to follow what?), and spatial co-occurrence (what appears together?). What gets encoded: object permanence, causal structure, affordances, and regularities of other agents' behavior.
Each experience updates the world model, but selectively. The brain continually solves a selection problem: what latent structure is worth consolidating into the slow neocortical weights, and what can be discarded after use?
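One plausible gate for that selection is prediction error itself. The sketch below assumes a toy scalar model and an arbitrary threshold, purely for illustration: experiences the current model already predicts well are discarded after use, while surprising ones are queued for consolidation.

```python
def select_for_consolidation(experiences, predict, threshold=1.0):
    """Keep only experiences whose prediction error exceeds `threshold`."""
    keep = []
    for x in experiences:
        error = abs(x - predict(x))   # what did I expect vs. what happened?
        if error > threshold:
            keep.append(x)            # surprising: worth consolidating
    return keep                        # the rest can be discarded after use

# The model expects values near 10 (a well-learned regularity).
predict = lambda x: 10.0
stream = [10.1, 9.9, 14.0, 10.2, 3.0]   # two surprising events in the stream
print(select_for_consolidation(stream, predict))   # → [14.0, 3.0]
```

Only the two anomalies survive the gate; the three near-miss observations carry no new structure and are dropped.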
In machine learning, the train–test split is taught in the first lecture of any introductory course: "do not train on the test set." But consider your commute this morning. It is simultaneously a test (you care about arriving right now) and training (you are gaining experience for future commutes). There is no boundary. The world model updates itself continuously, whether or not you intend to learn.
Every person develops a unique world model, shaped by their unique, continuous stream of lived experience. It updates itself through self-supervision, over a lifetime, and no shared fine-tuning procedure can replicate it: your model of how your city works, how the people in your life behave, what your body can do.
What the world model encodes across a lifetime is overwhelmingly self-supervised: perceptual invariances (object constancy, face recognition), physics intuitions (gravity, solidity, continuity), language (grounded in embodied interaction, not labeled corpora), social cognition (theory of mind bootstrapped from contingent interaction), and motor programs (learned through self-generated movement and proprioceptive feedback). The curriculum is the world itself.
3D perception is itself an emergent ability at this level — not a veridical readout of geometry but a constructed approximation built up through months of reaching, crawling, and acting in space. Marr's 2.5D sketch, binocular disparity, shading and occlusion cues: these are heuristics learned from sensorimotor experience, not measurements. The hollow-face illusion and the Ames room expose the seams — the prior overrides the geometry. Under predictive coding (Helmholtz, Gregory, Clark), depth perception is Bayesian inference; under enactivism, 3D space is enacted through movement — remove the ability to act and depth collapses into ambiguity. There is no perfect 3D. There never was.
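The predictive-coding claim has a compact Bayesian form: the percept is the precision-weighted fusion of a prior and a sensory cue. The sketch below uses an assumed one-dimensional encoding (+1 = convex, -1 = concave) and made-up variances; it is an illustration of the arithmetic, not measured psychophysics.

```python
def fuse(prior_mean, prior_var, cue_mean, cue_var):
    """Posterior mean of two Gaussians: the precision-weighted average."""
    w_prior, w_cue = 1.0 / prior_var, 1.0 / cue_var
    return (w_prior * prior_mean + w_cue * cue_mean) / (w_prior + w_cue)

# Hollow-face illusion: the geometry says "concave" (-1), but at viewing
# distance the depth cue is weak (high variance), while the lifelong
# "faces are convex" prior is strong (low variance).
percept = fuse(prior_mean=+1.0, prior_var=0.1, cue_mean=-1.0, cue_var=1.0)
print(percept)        # ≈ 0.82: firmly on the convex side
print(percept > 0)    # True: the prior overrides the geometry
```

With the numbers flipped (a strong cue, a weak prior) the same formula reports the true concavity, which is exactly what happens when you view a hollow face up close with both eyes.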
Evolution is the outermost layer of this process — but not a separate loop. What each organism learns across its lifetime integrates into the population's prior through reproduction and cultural transmission. Natural selection acts on that prior: it continuously filters which initial world model structures, which sensory architectures, which update rules are worth passing forward.
The enactivist framework from The Embodied Mind (Varela, Thompson, Rosch) gives this the right framing: the organism and environment co-constitute each other. The world model is not a passive representation — it is enacted through sensorimotor coupling. Evolution shapes the terms of that coupling: which signals count as errors, which regularities become innate, which hypotheses the organism is born already committed to.
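The two nested loops can be sketched as a toy bi-level optimization, under heavy simplifying assumptions: a "genome" is just a pair (initial world-model parameter, learning rate); a lifetime is a run of self-supervised updates toward one environmental regularity; and selection keeps the genomes whose lifetimes accumulate the least prediction error. All names and values are illustrative.

```python
import random

random.seed(0)
ENV = 10.0   # the regularity the world presents

def lifetime_error(init, lr, steps=50):
    """Inner loop: a lifetime of self-supervised learning; returns total surprise."""
    model, total = init, 0.0
    for _ in range(steps):
        total += abs(ENV - model)     # prediction error: the only teaching signal
        model += lr * (ENV - model)   # self-supervised update
    return total

# Outer loop: selection and mutation act on the prior, not on behavior directly.
population = [(random.uniform(-5, 5), random.uniform(0.05, 0.5)) for _ in range(20)]
for _ in range(30):
    survivors = sorted(population, key=lambda g: lifetime_error(*g))[:10]
    offspring = [(init + random.gauss(0, 0.5),
                  min(1.0, max(0.01, lr + random.gauss(0, 0.05))))
                 for init, lr in survivors]
    population = survivors + offspring

best_init, best_lr = min(population, key=lambda g: lifetime_error(*g))
print(round(best_init, 1))   # the evolved initial model sits near the regularity
```

Note what the outer loop optimizes: not behavior, but the initialization and the update rule — the structure each new "organism" is born with. That is the sense in which the body, and the prior it embodies, is the inductive bias.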
The concrete result is visible in the human body: 3-channel color vision tuned to daylight (no thermal, no UV), hearing bounded to ~20 Hz–20 kHz, pain and proprioception as privileged self-supervised error signals, a social brain pre-wired for faces and gaze from birth, and eyes positioned frontally with the right interocular distance to make binocular disparity computationally useful — the hardware prerequisite for 3D perception to emerge at all. These are not neutral engineering choices — they are hypotheses baked into the prior by billions of years of selection pressure. The body is the inductive bias. Evolution optimizes not behavior, but the structure of the world model every new human is born with.