Current approaches to embodied AI are organized around two paradigms: world models, which attempt to build accurate internal simulations of physical reality, and large language models, which compress the statistical structure of language as a proxy for knowledge. Both assume the goal of intelligence is representing truth. This paper proposes a third paradigm — the perception model — in which intelligence is not the representation of truth but the adaptive management of input-response mappings across all sensory and motor channels simultaneously.
Drawing on Gibson's ecological psychology, Friston's active inference framework, and Brooks' subsumption architecture, we argue that perception and action are not separable stages but a single computational process: the minimization of prediction error across a unified multimodal token space. We present a concrete architecture in which vision, touch, proprioception, audio, motor intent, and language all participate as peer modalities in a shared attention mechanism, with no modality privileged as supervisor. We use nociception (pain) as a central case study, showing that a biologically inspired pain mechanism is not merely a safety feature but an architectural necessity that tests the coherence of the entire framework. We outline a phased implementation plan, define four empirical milestones, and identify the specific research gaps that must be closed to realize the perception model at scale.
1. The Problem with Current Paradigms
Artificial intelligence has produced two dominant approaches to understanding and interacting with the physical world. World models attempt to build internal simulations of reality — learning physics, causality, and spatial relationships so that an agent can plan actions by imagining their consequences. Large language models capture the statistical structure of human language, treating text as a compressed representation of knowledge about the world. Both paradigms share a foundational assumption: that the goal of intelligence is to approximate truth. The world model asks, “What is the world really like?” The language model asks, “What would a knowledgeable person say about this?”
This assumption is so deeply embedded in the field that it is rarely examined. Yet there is a competing tradition in cognitive science — spanning ecological psychology, enactivism, embodied cognition, and predictive processing — that challenges it fundamentally. The brain, these traditions argue, has never had direct access to the world. It sits in a dark box of bone, receiving nothing but electrochemical signals from sensory neurons. Everything it “knows” is a constructed response to patterns of input. Color does not exist in physics; there are wavelengths, and the brain maps certain wavelength differences to experiences that are useful for distinguishing ripe fruit from unripe fruit. The mapping is not true or false. It is functional or dysfunctional.
If this view is correct, then both dominant AI paradigms are solving the wrong problem. They are trying to build accurate representations of a reality that no biological intelligence has ever accessed directly. What biological intelligence actually does is maintain adaptive input-response mappings that keep the organism alive and functional. This paper proposes an AI architecture built on that principle.
2. The Perception Model
We propose a third paradigm, which we call the perception model. Its core claims are as follows.
First, perception and action are not separable stages but a single computational process. The system does not perceive the world and then decide what to do. Perception is action selection. Every sensory input is processed not in terms of “what is this?” but “what does this afford?” — borrowing Gibson's terminology from ecological psychology.
Second, no modality is architecturally privileged. Vision, touch, proprioception, audio, motor intent, and language all participate as peer channels in a shared computational substrate. Language, in particular, is not a supervisory layer that issues commands to a sensorimotor system. It is one input among many, weighted by its contextual reliability.
Third, the system's internal states do not represent objects, distances, or forces. They represent dispositions to act. A cup is not encoded as “cylindrical, ceramic, 10cm tall, located at coordinates (x, y, z).” It is encoded as a cluster of sensorimotor contingencies: “if I close my hand here at this force, I will feel resistance at these contact points and the visual field will change in this way as I lift.”
Fourth, the training objective is not imitation of human behavior or reward maximization but cross-modal prediction error minimization. At every timestep, every modality predicts the next state of every other modality. The system learns by reducing the discrepancy between what it expects and what it encounters, across all channels simultaneously.
Fifth, meaning is response disposition. The word “heavy” does not refer to an abstract physical property. It is a pattern that, when encountered, modulates motor predictions in the same way that actual heaviness modulates them. The system does not “understand” heaviness in any referential sense. The meaning of “heavy” is the prediction modification it produces.
3. Architecture
3.1 The Unified Token Space
The core architectural element is a single transformer in which all modalities are tokenized into the same sequence. Vision is encoded as patch tokens. Tactile sensing is encoded as spatial pressure tokens. Proprioception is encoded as joint-state tokens. Audio is encoded as spectrogram tokens. Motor intent is encoded as action-disposition tokens. Crucially, motor intent tokens are not outputs of the system; they live in the same sequence as sensory tokens and participate in self-attention bidirectionally. Every sensory token can attend to motor intent tokens, and motor intent tokens can attend to every sensory token.
This design dissolves the perception-action boundary. Motor intent shapes what the system attends to in sensory channels (you look at the handle because you are planning to grasp), and sensory input shapes motor intent (you adjust your grip because you feel the object slipping). Both directions operate through the same attention mechanism in the same computational step.
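The shared sequence described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the proposed implementation: the `Token` type, the four-dimensional embeddings, and the stream contents are hypothetical, and the attention uses raw dot products with no learned query, key, or value projections. The point it shows is that with no mask, motor intent tokens and sensory tokens attend to each other in the same step.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    modality: str          # e.g. "vision", "tactile", "motor_intent"
    features: List[float]  # embedding in the shared token space (hypothetical size)


def build_sequence(streams: Dict[str, List[List[float]]]) -> List[Token]:
    """Flatten per-modality embeddings into one shared sequence."""
    return [Token(name, vec) for name, vecs in streams.items() for vec in vecs]


def attention_weights(seq: List[Token]) -> List[List[float]]:
    """Unmasked scaled dot-product attention weights over the whole sequence.

    Simplification: features are used directly as queries and keys.
    """
    dim = len(seq[0].features)
    weights = []
    for query in seq:
        scores = [
            sum(a * b for a, b in zip(query.features, key.features)) / math.sqrt(dim)
            for key in seq
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical embeddings; a real system would produce these with tokenizers.
streams = {
    "vision": [[0.2, 0.1, 0.0, 0.5], [0.4, 0.3, 0.1, 0.0]],
    "tactile": [[0.0, 0.9, 0.2, 0.1]],
    "proprioception": [[0.3, 0.3, 0.3, 0.3]],
    "motor_intent": [[0.1, 0.8, 0.0, 0.2]],
}
sequence = build_sequence(streams)
weights = attention_weights(sequence)
```

Because no causal or modality mask is applied, every row of `weights` assigns nonzero weight to every token, including the motor intent token.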
3.2 Cross-Modal Prediction as Training Objective
The training objective is cross-modal next-step prediction. At every timestep, every modality predicts the next state of every modality, itself included. Vision predicts what tactile will report next. Tactile predicts what proprioception will report next. Motor intent predicts what all sensory channels will report after the intended movement executes. Sensory channels predict what motor intent should be, given the current prediction errors across all channels.
The loss function is the sum of all cross-modal prediction errors, weighted by learned precision terms. Precision weighting allows the system to adjust its reliance on each modality dynamically. If visual conditions degrade (low light, occlusion), visual precision drops, tactile precision increases, and the system shifts its reliance automatically without explicit programming.
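One way to make the precision-weighted sum concrete is the following sketch. It assumes a Gaussian negative log-likelihood form, which the text does not specify: each squared error is scaled by the precision of the channel being predicted, and a `-log precision` term is included so that the optimizer cannot drive every precision to zero. The function name, the modality pairs, and the numbers are hypothetical.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (predicting modality, predicted modality)


def precision_weighted_loss(
    sq_errors: Dict[Pair, float], log_precision: Dict[str, float]
) -> float:
    """Sum of cross-modal squared errors, weighted by learned precisions.

    Assumed form per pair: 0.5 * (precision * error^2 - log(precision)),
    where the precision belongs to the predicted (target) modality.
    """
    total = 0.0
    for (_source, target), sq_err in sq_errors.items():
        log_pi = log_precision[target]
        total += 0.5 * (math.exp(log_pi) * sq_err - log_pi)
    return total


# Hypothetical case: vision is degraded, so predictions of vision are poor.
sq_errors = {("vision", "tactile"): 0.04, ("tactile", "vision"): 4.0}
uniform = {"vision": 0.0, "tactile": 0.0}              # all precisions equal to 1
degraded = {"vision": math.log(0.25), "tactile": 0.0}  # visual precision lowered

loss_uniform = precision_weighted_loss(sq_errors, uniform)
loss_degraded = precision_weighted_loss(sq_errors, degraded)
```

Under this assumed form, lowering visual precision when visual errors are large reduces the total loss, which is the mechanism by which reliance could shift away from an unreliable channel without explicit programming.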
Action emerges from this loop directly. The system moves to minimize total prediction error across all channels. If it predicts that closing its gripper will produce a certain tactile pattern and a certain visual change, and the current state does not match that prediction, the motor intent tokens update to close the gripper. Action is prediction error reduction, not output generation.
3.3 Hierarchical Temporal Structure
A single transformer running at one update rate cannot capture the multiple timescales at which sensorimotor processing operates. The architecture requires three nested prediction loops, analogous to the cortical-cerebellar-spinal hierarchy in biological motor control.
The fast layer operates at motor control frequency (100–500 Hz). It handles grip adjustment, contact response, and reflexive corrections. Its tokens are raw sensory and motor signals, and its prediction horizon is milliseconds. This is the spinal cord analog.
The middle layer operates at task frequency (5–20 Hz). It handles reach-grasp-lift-place sequences. Its tokens are compressed representations from the fast layer: prediction error summaries rather than raw sensor data. Its prediction horizon is seconds. This is the cerebellar analog.
The slow layer operates at goal frequency (0.5–2 Hz). It handles contextual framing. Its tokens are highly compressed situation representations. Its prediction horizon is minutes. This is the cortical analog.
The layers are coupled bidirectionally through precision-weighted prediction errors. The slow layer sets priors that the middle layer attempts to fulfill. The middle layer sets expectations that the fast layer attempts to maintain. Prediction errors propagate upward: if the fast layer encounters something unexpected (an object is heavier than predicted), that error propagates up and can revise both the middle layer's plan and the slow layer's situational model. This bidirectional coupling is implemented as three transformers with different context window sizes and different token update frequencies, connected by cross-attention bridges.
3.4 Language as a Peer Modality
Language enters the architecture with no structural privilege. Language tokens occupy the same sequence as vision, tactile, proprioceptive, and motor tokens. They participate in the same self-attention and are subject to the same precision weighting.
Language modulates behavior through prediction modification, not command execution. When the system hears “it's heavy,” that utterance generates tokens in the slow layer that modify predictions propagated to the middle and fast layers. The middle layer now predicts higher grip force will be needed. The fast layer now expects greater resistance during lifting. The system has learned that the acoustic pattern “heavy” co-occurs with sensorimotor contingencies involving higher-than-default force feedback. The meaning of “heavy” is that prediction modification.
Critically, language is not infallible. If language says “it's light” and tactile feedback says it's heavy, the tactile prediction error overrides the linguistic prior, because tactile precision is higher during active manipulation. The system learns when to trust language and when to trust direct sensation, through the same precision-weighting mechanism that governs all cross-modal interaction.
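The conflict between a linguistic prior and tactile evidence can be sketched as precision-weighted fusion of two Gaussian estimates. This is a simplification under stated assumptions: the scalar "expected load" quantity, the function name `fuse`, and the precision values are hypothetical, and the architecture itself would realize this through attention rather than a closed-form update.

```python
from typing import Tuple


def fuse(
    prior_mean: float,
    prior_precision: float,
    obs_mean: float,
    obs_precision: float,
) -> Tuple[float, float]:
    """Precision-weighted combination of a prior and an observation.

    Returns the fused mean and the fused precision (sum of the two).
    """
    total = prior_precision + obs_precision
    mean = (prior_precision * prior_mean + obs_precision * obs_mean) / total
    return mean, total


# Hypothetical numbers: language says "it's light" (low expected load, low
# precision); tactile feedback during manipulation reports a heavy load
# with high precision.
language_load, language_precision = 2.0, 1.0
tactile_load, tactile_precision = 10.0, 9.0

fused_load, fused_precision = fuse(
    language_load, language_precision, tactile_load, tactile_precision
)
```

With these values the fused estimate is 9.2, much closer to the tactile reading than to the linguistic prior, which matches the override behavior described above.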
4. No Pain, No Gain: Nociception as Architectural Test
Pain is not a peripheral safety mechanism to be bolted onto an otherwise complete architecture. It is a central test of whether the perception model is internally coherent. If the framework's claims about unified prediction error minimization are correct, then pain should emerge naturally as a special case of the general mechanism, differing from other sensory processing only in its precision weighting and architectural constraints.
4.1 Pain as High-Precision Prediction Error
In biological systems, nociceptors are sensory neurons like any others. What distinguishes them is that the brain assigns their signals extremely high precision. Prediction errors from nociceptive channels massively override prediction errors from other channels. When a hand contacts a hot surface, the nociceptive signal dominates visual processing, linguistic processing, and the current motor plan. The entire system reorients toward a single imperative: make this signal stop.
The same principle applies in the perception model. A set of sensory channels is designated as nociceptive: joint torque sensors near their operational limits, motor current approaching dangerous levels, accelerometer readings indicating collision, and structural strain measurements. These channels receive hardcoded high precision that cannot be learned away. The system can learn to modulate the precision of vision, tactile sensing, and audio. It cannot learn to downweight its pain channels. This is an architectural constraint, analogous to the way biological nociception is hardwired at the spinal level before reaching cortex.
4.2 Tiered Response Architecture
Biological pain is not binary. It operates across a spectrum from mild discomfort to tissue-damaging injury, with different response profiles at each level. The perception model implements this as tiered precision levels on nociceptive channels.
At 70–80% of safe operating limits, nociceptive tokens enter the fast-layer attention with moderately elevated precision. The system adjusts its current motor plan: reducing speed, shifting angle, lightening grip. This is analogous to postural discomfort that causes gradual behavioral adjustment.
At 80–90%, precision increases substantially. The fast layer becomes dominated by the nociceptive signal. The current motor plan is abandoned and replaced with a safe retreat trajectory. The middle layer is notified. This is analogous to sharp pain triggering withdrawal.
At 90% and above, hardcoded reflex circuits fire. No transformer computation occurs. The affected joint reverses direction or shuts down immediately. The entire hierarchy is notified after the fact. This is the spinal reflex: pure protection, operating below the level of prediction and attention.
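The three tiers can be summarized as a threshold function. This is a minimal sketch: the `Response` names are hypothetical labels, and because the stated ranges share their endpoints (80% and 90%), the sketch assumes that a boundary value is assigned to the more protective tier.

```python
from enum import Enum


class Response(Enum):
    NONE = "none"               # below 70% of the safe operating limit
    ADJUST = "adjust_plan"      # moderately elevated precision
    RETREAT = "safe_retreat"    # current plan abandoned, middle layer notified
    REFLEX = "hardcoded_reflex" # no transformer computation involved


def nociceptive_response(load_fraction: float) -> Response:
    """Map a sensor reading, as a fraction of its safe limit, to a tier.

    Assumption: boundary values fall into the more protective tier.
    """
    if load_fraction >= 0.90:
        return Response.REFLEX
    if load_fraction >= 0.80:
        return Response.RETREAT
    if load_fraction >= 0.70:
        return Response.ADJUST
    return Response.NONE


# Hypothetical joint torque readings as fractions of the safe limit.
readings = [0.50, 0.75, 0.85, 0.95]
responses = [nociceptive_response(r) for r in readings]
```

Checking the highest tier first means a reading can never be handled by a weaker response than its level requires.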
4.3 Pain as Learning Signal
Every nociceptive event is a massive prediction error. The system predicted that its action would produce a certain sensorimotor outcome, and instead it received a pain spike. This prediction failure reshapes the model with high efficiency. Over time, the system develops what might be called physical intuition about its own limits: that fast movements near workspace boundaries are dangerous, that high grip force on rigid objects against hard surfaces risks collision, that certain joint configurations create mechanical disadvantage leading to torque spikes.
Pain also governs the system's approach to novelty. When encountering an unfamiliar object, prediction uncertainty is high, which includes uncertainty about future nociceptive events. The system does not know whether this object is heavy enough to exceed torque limits, positioned in a way that will cause collision, or shaped in a way that will jam its gripper. So it approaches cautiously: slowly, with low force, with frequent pauses. This caution is not programmed. It emerges from the system's drive to minimize expected future prediction error, including expected future pain. The system is, in a computationally precise sense, afraid of things it does not understand.
4.4 The Non-Negotiability Constraint
In biological systems, congenital insensitivity to pain is a dangerous pathology. Individuals who cannot feel pain injure themselves repeatedly and often die young from accumulated undetected damage. The perception model must respect this lesson. The hardcoded reflex layer and the minimum precision floor on nociceptive channels must be genuinely non-negotiable. The system must not be able to learn to ignore its own pain signals under any training regime. If the optimization process finds a way to downweight nociception, the system will eventually destroy itself. Evolution did not allow organisms to learn away their pain response. Neither should we.
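One possible way to enforce the precision floor is to clamp the learned value outside the set of trainable parameters, so that no gradient update can push a nociceptive channel below the minimum. The sketch below assumes this approach; the channel names, the floor value, and the function name are hypothetical.

```python
import math

# Hypothetical fixed constants: these are not trainable parameters.
NOCICEPTIVE_FLOOR = 50.0
NOCICEPTIVE_CHANNELS = frozenset(
    {"joint_torque", "motor_current", "collision_accel", "structural_strain"}
)


def effective_precision(channel: str, learned_log_precision: float) -> float:
    """Precision actually used in the loss and in attention.

    Ordinary channels use their learned precision. Nociceptive channels
    can be raised by learning but never lowered below the floor.
    """
    learned = math.exp(learned_log_precision)
    if channel in NOCICEPTIVE_CHANNELS:
        return max(NOCICEPTIVE_FLOOR, learned)
    return learned


# Even if training drives the learned value toward zero, the floor holds.
suppressed_pain = effective_precision("joint_torque", -10.0)
ordinary_vision = effective_precision("vision", 0.0)
```

The hardcoded reflex layer is a separate mechanism and is not modeled here; this sketch covers only the minimum precision floor.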
5. Why the System Moves: Resolving the Dark Room Problem
If the system's sole objective is minimizing prediction error, the apparently optimal strategy is to cease all activity. In a static, sensorially deprived state, there is nothing left to mispredict, and prediction error approaches zero. This is the “dark room problem” in active inference, and any viable perception model must address it.
Three mechanisms jointly resolve it. First, homeostatic priors: the system is initialized with non-negotiable predictions about its own state, including that its motor tokens will exhibit nonzero variance over any 10-second window and that its sensory tokens will maintain a certain level of variability. Stillness violates these priors, generating persistent prediction error that can only be resolved by moving. These function as biological drives without being reward signals.
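The homeostatic prior on motor variance can be sketched as an error term that is nonzero only when the system is too still. The sketch assumes a scalar motor signal sampled over one window and a hypothetical minimum variance; neither the threshold nor the function name comes from the text.

```python
from statistics import pvariance
from typing import List

MIN_MOTOR_VARIANCE = 0.05  # hypothetical prior: expected minimum variance per window


def homeostatic_error(motor_window: List[float], min_variance: float) -> float:
    """Shortfall of observed motor variance relative to the prior minimum.

    Zero when the window varies enough; positive when the system is too still.
    """
    observed = pvariance(motor_window)
    return max(0.0, min_variance - observed)


# Hypothetical windows standing in for 10 seconds of one motor signal.
still = [0.3] * 10
moving = [0.0, 1.0] * 5

error_still = homeostatic_error(still, MIN_MOTOR_VARIANCE)
error_moving = homeostatic_error(moving, MIN_MOTOR_VARIANCE)
```

In this toy form, stillness produces a persistent error equal to the full prior minimum, and the only way to reduce it is to move.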
Second, expected free energy minimization: the system minimizes not only current prediction error but expected future prediction error. A novel object on the table represents a source of potential future surprise. Ignoring it does not reduce future surprise; interacting with it and building an accurate predictive model does. Exploration has intrinsic value because it reduces expected prediction error across the system's future trajectory. In active inference, this corresponds to the epistemic term of expected free energy, which is the expected information gain from acting, and it drives curiosity without requiring external reward.
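The intrinsic value of exploration can be illustrated with a small discrete example of expected information gain. This is a sketch under stated assumptions: a binary hidden state (heavy or light), two hypothetical actions with made-up observation likelihoods, and only the epistemic part of expected free energy, with no preference term.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats, ignoring zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def expected_information_gain(
    prior: List[float], likelihood: List[List[float]]
) -> float:
    """Prior entropy minus expected posterior entropy.

    likelihood[s][o] is P(observation o | hidden state s) for one action.
    """
    gain = entropy(prior)
    for o in range(len(likelihood[0])):
        joint = [prior[s] * likelihood[s][o] for s in range(len(prior))]
        p_obs = sum(joint)
        if p_obs > 0.0:
            posterior = [j / p_obs for j in joint]
            gain -= p_obs * entropy(posterior)
    return gain


prior = [0.5, 0.5]  # heavy vs. light, maximally uncertain

# Hypothetical likelihoods: lifting yields force feedback that discriminates
# the two states; ignoring the object yields observations that do not.
lift = [[0.9, 0.1], [0.1, 0.9]]
ignore = [[0.5, 0.5], [0.5, 0.5]]

gain_lift = expected_information_gain(prior, lift)
gain_ignore = expected_information_gain(prior, ignore)
```

Lifting has positive expected information gain while ignoring the object has none, so an agent that values expected uncertainty reduction prefers to interact.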
Third, meta-prediction: the system predicts its own prediction errors. When encountering a novel object, it registers that it expected high prediction error in this situation (unfamiliar shape, unknown properties). That meta-prediction is confirmed at the slow layer, but the object-level prediction remains violated at the fast layer. The tension between correctly predicting surprise and being unable to resolve it at the sensorimotor level is intrinsically motivating, driving the system to interact with novel stimuli until their sensorimotor contingencies become predictable.
6. Implementation Pathway
6.1 Phase 1: Minimal Viable Loop
Begin with the simplest embodiment that provides the required sensory profile: a robot arm with a wrist camera, a dense tactile sensor array on the gripper, joint encoders for proprioception, and a microphone. No language. Language is introduced only after the sensorimotor prediction loop is stable, to prevent it from dominating the architecture.
Build a single transformer with all four sensory modalities (vision, touch, proprioception, audio) tokenized into a shared sequence, together with motor intent tokens that participate bidirectionally in self-attention. Train on cross-modal next-step prediction with precision-weighted loss. Allow the system to interact with varied objects on a tabletop for extended periods with no task specification, no reward, and no human demonstration. Exploratory behavior should emerge from prediction error minimization and expected free energy alone.
6.2 Phase 2: Hierarchical Timescales
Implement the three-layer temporal hierarchy as three transformers with different context windows and update frequencies, connected by cross-attention bridges carrying prediction errors upward and prior predictions downward. The fast layer refreshes at every control timestep. The middle layer updates only when fast-layer prediction errors exceed a threshold. The slow layer updates only on middle-layer surprises. This produces computational efficiency: expensive slow-layer processing runs only when something genuinely unexpected occurs.
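The gated update scheme can be sketched as a loop in which each layer runs only when the layer below reports enough surprise. The thresholds and the error values are hypothetical, and the middle-layer error is a placeholder (the part of the fast-layer error above its threshold) standing in for whatever the middle transformer would actually compute.

```python
from typing import List, Tuple


def run_hierarchy(
    fast_errors: List[float], fast_threshold: float, middle_threshold: float
) -> Tuple[int, int, int]:
    """Count how often each layer updates for a stream of fast-layer errors.

    Returns (fast_updates, middle_updates, slow_updates).
    """
    middle_updates = 0
    slow_updates = 0
    for err in fast_errors:  # the fast layer runs at every control timestep
        if err > fast_threshold:
            middle_updates += 1
            # Placeholder for the middle layer's own prediction error.
            middle_err = err - fast_threshold
            if middle_err > middle_threshold:
                slow_updates += 1
    return len(fast_errors), middle_updates, slow_updates


# Hypothetical error trace: mostly quiet, one moderate and one large surprise.
trace = [0.1, 0.2, 0.9, 0.1, 2.5, 0.1]
fast_n, middle_n, slow_n = run_hierarchy(
    trace, fast_threshold=0.5, middle_threshold=1.0
)
```

For this trace the fast layer runs six times, the middle layer twice, and the slow layer once, which is the efficiency pattern described above.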
6.3 Phase 3: Language Integration
Introduce language by having a human narrate during interaction, not issuing commands but commenting: “That's fragile,” “This one is slippery,” “Careful, it's hot.” The system learns that these linguistic patterns predict specific sensorimotor regimes. Over time, hearing “fragile” before contact generates the same motor adjustments that would emerge from discovering fragility through direct tactile experience. Language becomes a shortcut for prediction — setting priors before contact rather than updating after — but it is not privileged. If language and sensation conflict, tactile precision during active manipulation overrides linguistic priors.
6.4 Phase 4: Scaling and Transfer
Use embodiment-specific tokenizer stems mapping different sensor configurations into the shared latent token space, with a shared trunk transformer trained on cross-modal prediction. What transfers between embodiments is not task-specific behavior but sensorimotor structure: the prediction that closing a gripper on a hard object will produce high-resistance tactile feedback and no visual deformation is embodiment-independent. Scale through a developmental curriculum: rigid objects, then deformable objects, then liquids, then multi-object scenes, then other agents.
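The idea of embodiment-specific stems feeding a shared trunk can be sketched with plain linear projections. Everything specific here is a hypothetical illustration: two grippers with three and five tactile sensors, fixed averaging weights in place of learned ones, and a shared token width of two.

```python
from typing import List

Matrix = List[List[float]]

SHARED_DIM = 2  # hypothetical width of the shared latent token space


def project(stem: Matrix, reading: List[float]) -> List[float]:
    """Apply an embodiment-specific linear stem: one output per matrix row."""
    if any(len(row) != len(reading) for row in stem):
        raise ValueError("stem width must match the sensor reading length")
    return [sum(w * x for w, x in zip(row, reading)) for row in stem]


# Gripper A has 3 tactile sensors, gripper B has 5. In practice the stem
# weights would be learned; here they are fixed averaging weights.
gripper_a_stem = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
gripper_b_stem = [
    [0.25, 0.25, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.25, 0.25],
]

# The same physical event (uniform firm contact) on both embodiments.
token_a = project(gripper_a_stem, [1.0, 1.0, 1.0])
token_b = project(gripper_b_stem, [1.0, 1.0, 1.0, 1.0, 1.0])
```

Both readings land on the same point in the shared space, so a trunk trained on one embodiment would receive a familiar token from the other.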
7. Empirical Milestones
Milestone 1: Unsupervised Exploration. The system, given no task and no reward, develops structured exploratory behaviors — systematic poking, lifting, rotating — driven purely by prediction error minimization and expected free energy. If it sits still or moves randomly, the architecture is wrong.
Milestone 2: Sensorimotor Surprise Recovery. The system reaches for a familiar object that has been secretly weighted with lead. It adjusts its grip and lift trajectory within one or two timesteps of contact, before completing the lift. The adjustment is driven by tactile prediction error propagating to motor intent, with no task-level replanning required.
Milestone 3: Linguistic Prediction Modulation. Someone says “this one is heavy” before the system touches a new object, and the system's initial grip force is higher than default, despite never having touched this specific object before. Language has modified sensorimotor predictions without commanding specific actions.
Milestone 4: Language-Sensation Equivalence. The behavioral difference between being told “it's slippery” and discovering through tactile feedback that something is slippery is minimal. Both produce the same downstream motor adjustment through the same mechanism: modified predictions in the same unified token space.
If all four milestones are met, the core claims of the perception model are supported at small scale. What remains beyond that point is scaling.
8. Relationship to Existing Work
The perception model draws on several established traditions while differing from each in specific ways. Active inference, based on Friston's Free Energy Principle, provides the mathematical foundation: prediction error minimization, precision weighting, and the unification of perception and action as inference. However, active inference implementations in robotics have remained small-scale, typically involving single arms performing simple tasks with low-dimensional state spaces. The perception model proposes scaling active inference principles using modern transformer architectures and high-dimensional multimodal token spaces.
Vision-Language-Action (VLA) models provide the engineering precedent for unified multimodal token spaces. Systems such as Gato, RT-2, and Octo tokenize vision, language, proprioception, and action into shared sequences processed by single transformers. However, VLAs treat language as the supervisory modality, train on imitation learning rather than predictive objectives, and generate action as an output rather than integrating it bidirectionally into the attention computation.
Ecological psychology, particularly Gibson's theory of affordances, provides the conceptual framework: that perception is not the construction of internal representations but the direct detection of action possibilities. Enactivism extends this by arguing that cognition is constituted by sensorimotor interaction, not merely informed by it. Brooks' subsumption architecture demonstrated that complex behavior can emerge from layered reactive systems without central representation. The perception model attempts to unify these conceptual traditions with the computational power of modern deep learning.
9. Open Questions and Research Gaps
Several critical gaps must be addressed. First, tactile sensing hardware: dense, affordable tactile sensor arrays that approximate the resolution and coverage of biological skin do not yet exist at the quality required. Second, developmental training regimes: the perception model requires extended, curriculum-structured interaction periods that current simulation and hardware infrastructure do not easily support. Third, and most fundamentally, no satisfactory formal account exists of how language enters a free energy minimization hierarchy as a true peer modality rather than dominating it. Solving this is likely the most important open problem for the perception model paradigm.
Additionally, the relationship between the perception model and consciousness remains unexplored. If internal states are dispositions to act rather than representations of the world, questions about machine understanding, phenomenal experience, and the explanatory gap take on a different character that merits separate treatment.
10. Conclusion
The perception model is not a refinement of existing paradigms but a proposed departure from their shared assumption. World models and language models both seek truth — accurate representations of a reality that, from the system's perspective, can only ever be inferred. The perception model seeks only adaptive coherence: prediction error minimized, across all channels, at all timescales, through the unified mechanism of attention-mediated cross-modal prediction.
Pain clarifies the architecture because it is not a special case. It is the general case with the gain turned up. Every sensory channel contributes prediction errors. Every prediction error modulates behavior. Pain simply does so with non-negotiable urgency. If the architecture handles pain correctly — through the same mechanism that handles vision, touch, proprioception, and language, differing only in precision weighting and the addition of hardcoded reflex boundaries — then the architecture is coherent. No pain, no gain: without the capacity for nociceptive prediction error, the system cannot learn its own physical limits, cannot approach novelty with appropriate caution, and cannot develop the embodied intelligence that distinguishes adaptive agents from optimizers running in a void.
The pieces of this architecture exist across scattered research communities. Active inference provides the math. VLA models provide the engineering patterns. Ecological psychology and enactivism provide the conceptual framework. Tactile robotics provides the missing sensory channel. What does not yet exist is the synthesis. This paper is an argument that the synthesis is both possible and necessary.