The Perception Revolution
Genuinely multimodal AI is pushing the frontier of machine intelligence
In the early 1920s, Soviet filmmaker Lev Kuleshov conducted a remarkable experiment. He juxtaposed identical footage of an actor's expressionless face with different images—a bowl of soup, a child's coffin, a woman lounging on a divan—and showed these sequences to audiences. Despite seeing the same neutral expression each time, viewers insisted they saw hunger, grief, or desire in the actor's face. This "Kuleshov effect" revealed something profound: human perception is inherently temporal and contextual. We do not simply perceive isolated moments but interpret them through their relationship to what came before and after.
The Kuleshov effect demonstrates what Martin Heidegger would later articulate in his phenomenology: understanding is fundamentally temporal in nature. For Heidegger, our being-in-the-world exists not as static snapshots but as a continuous unfolding through time and space—our comprehension of any present moment is inseparable from our memory of the past and anticipation of the future. Meaning emerges not from isolated perceptions but from their situatedness within our lived experience.
This temporal nature of understanding exposes a fundamental limitation in traditional artificial intelligence. For decades, AI systems operated in a detemporalized vacuum, processing static inputs without any sense of before or after. Text-based models analyzed sentences in isolation, image recognition systems classified individual pictures, and speech recognition transcribed audio without visual context.
Today's multimodal AI systems, which process multiple sensory streams simultaneously, begin to address this fundamental gap. By integrating vision, language, audio, and temporal understanding, they move closer to the contextual, time-embedded perception that Heidegger saw as essential to intelligence. This evolution presents a significant opportunity to push AI toward general-purpose systems that adapt to the world of meaning humans embody.
From Symbols to Senses: The Evolution of AI Perception
The earliest AI systems of the 1950s through 1980s inhabited a remarkably impoverished sensory world. These systems manipulated abstract symbols according to formal rules, operating entirely in the realm of text or logical propositions. They could play chess by manipulating board positions represented as data structures, or engage in simple text conversations through pattern matching, but they couldn't see, hear, or integrate different types of information. Even the concept of context was limited to whatever could be explicitly encoded in symbols.
This approach reflected what computer scientist Allen Newell and cognitive psychologist Herbert Simon called the "physical symbol system hypothesis"—the idea that intelligent behavior could be achieved solely through symbol manipulation. But as philosopher Hubert Dreyfus pointed out in his critique "What Computers Can't Do," this fundamentally misunderstood human intelligence, which emerges from our physical embodiment and direct engagement with the world.
Computers process symbols, but humans inhabit situations. Drawing on Heidegger and Merleau-Ponty, Dreyfus insisted that true intelligence requires presence in a world—something early AI systems conspicuously lacked. His critique seemed validated when early AI efforts repeatedly hit walls in basic tasks like vision and locomotion.
By the 1980s, researchers began to acknowledge these limitations. Hans Moravec articulated what became known as "Moravec's Paradox"—the observation that high-level reasoning required relatively little computation, while low-level sensorimotor skills that humans take for granted required enormous computational resources. Rodney Brooks at MIT, inspired by Dreyfus, responded with his "subsumption architecture," building robots that prioritized direct sensing and action over abstract reasoning. "The world is its own best model," Brooks argued, suggesting that intelligence should grow from interaction with the environment rather than symbolic manipulation.
The neural network revolution of the 1990s and 2000s accelerated this shift. Rather than programming explicit rules, AI systems began to learn patterns from data—including sensory data like images and sound. This approach produced more flexible systems capable of processing real-world inputs, but most still focused on single modalities: computer vision systems that couldn't hear, speech recognition systems that couldn't see.
The Transformer revolution that began in 2017 accelerated progress within individual modalities and opened pathways toward multimodality, with innovations like OpenAI's CLIP, a vision-text model released in 2021. Meta AI pushed further in 2023 with ImageBind, learning a joint embedding space for six modalities at once—images, text, audio, depth, thermal signals, and inertial measurements. Unlike previous systems that typically paired only two modalities (such as text-image or audio-image), ImageBind can relate modality pairs it never explicitly learned to connect during training: after training on text-image and audio-image pairs separately, it can establish audio-text relationships without ever seeing a direct audio-text pair, a form of cross-modal inference that parallels human associative cognition.
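A toy sketch can make this "emergent alignment" idea concrete. The snippet below does not reproduce ImageBind; it simply assumes three encoders (stand-ins invented here) that map into one shared space, where training has already pulled image-text and image-audio pairs together. Audio and text can then be compared directly by cosine similarity even though they were never paired:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from three separately trained encoders that all map
# into one shared 64-dimensional space. Training only aligned (image, text)
# and (image, audio) pairs, so text and audio were never compared directly.
rng = np.random.default_rng(0)
anchor_dog_image = rng.standard_normal(64)

# Paired training pulls each modality's "dog" embedding toward the image anchor.
emb_text_dog    = anchor_dog_image + 0.1 * rng.standard_normal(64)
emb_audio_bark  = anchor_dog_image + 0.1 * rng.standard_normal(64)
emb_text_violin = rng.standard_normal(64)  # an unrelated concept

# Emergent audio->text alignment: the "bark" audio sits closer to the text
# "dog" than to unrelated text, despite audio and text never being paired.
print(cosine(emb_audio_bark, emb_text_dog))     # high
print(cosine(emb_audio_bark, emb_text_violin))  # near zero
```

The design choice worth noticing is that images act as the binding modality: everything aligned to them becomes indirectly aligned to everything else.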
Today's leading systems, like TwelveLabs for video understanding or Google's Gemini, represent the latest phase of this evolution—AI systems that can process and integrate multiple sensory streams within unified models, producing responses that reflect an understanding of the relationships between what they see, hear, and read.
The Multimodal Mind: How Today's AI Perceives the World
The evolution of AI perception reflects a gradual recognition of what Heidegger understood: while language may be "the house of Being," that house exists in a world of sight, sound, and temporal flow. Many leading AI research labs continue to follow what might be called the "text is all you need" paradigm. Large language models like GPT are trained on billions of text documents, achieving remarkable linguistic capabilities while remaining blind, deaf, and temporally flat. These systems can generate eloquent paragraphs about sunsets without ever seeing light, discuss music without hearing a note, and describe actions without any sense of their duration or physical reality.
This linguistic focus was not without philosophical justification. Language does encode vast knowledge about the world, and Heidegger himself emphasized that "language speaks us" rather than the reverse. Yet this approach fundamentally misses Heidegger's broader insight: language is meaningful precisely because it emerges from our embodied, temporal existence in the world. Words gain their significance through their connection to lived experience across multiple sensory dimensions.
The transition to multimodal AI began with simple pairings—models that could match images with captions or transcribe speech to text. These early systems still processed each modality separately before combining their outputs. Now unified architectures can process multiple sensory streams within a single framework, creating shared representations across modalities.
What makes this possible is the transformer architecture, which treats all inputs as sequences of tokens. In multimodal applications, transformers create a unified computational framework where images, audio, and text—despite their fundamental differences—can be processed using the same mathematical operations. The self-attention mechanism calculates weighted relationships between all elements in a sequence, allowing the model to focus on relevant connections across modalities while maintaining their contextual relationships. This creates a form of artificial synesthesia, where the concept "cat" activates similar patterns regardless of whether the input is visual, textual, or auditory.
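A minimal sketch, assuming PyTorch and invented feature sizes, shows what "everything becomes a token" means in practice: image patches, audio frames, and text tokens are projected to one shared width, concatenated into a single sequence, and passed through one self-attention layer that can relate any token to any other, regardless of modality:

```python
import torch
import torch.nn as nn

D = 256  # shared model width (illustrative)

# Each modality gets its own projection into the shared token space.
# The feature sizes (768, 128, 32000) are invented, not from any specific model.
image_proj = nn.Linear(768, D)        # e.g. flattened image patches
audio_proj = nn.Linear(128, D)        # e.g. spectrogram frames
text_embed = nn.Embedding(32000, D)   # text token ids

# Toy inputs: 16 image patches, 20 audio frames, 8 text tokens.
img_tokens = image_proj(torch.randn(1, 16, 768))
aud_tokens = audio_proj(torch.randn(1, 20, 128))
txt_tokens = text_embed(torch.randint(0, 32000, (1, 8)))

# One sequence, one set of mathematical operations for every modality.
sequence = torch.cat([img_tokens, aud_tokens, txt_tokens], dim=1)  # (1, 44, D)

# Self-attention computes weighted relationships between all 44 tokens,
# so a text token can attend directly to an image patch or an audio frame.
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
fused, weights = attn(sequence, sequence, sequence)
print(fused.shape, weights.shape)  # (1, 44, 256) and (1, 44, 44)
```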
Perhaps most significant from a Heideggerian perspective are models that incorporate temporality. Video understanding systems like TwelveLabs don't just analyze static frames but track entities and relationships through time. They can answer queries like "When does the chef start kneading the dough?" by recognizing not just objects and actions but their temporal sequence—a primitive analog to the situated temporality that Heidegger saw as fundamental to human understanding.
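As an illustration of temporal grounding (not TwelveLabs' actual API), one simple recipe is to embed each frame with a hypothetical video encoder, score the frames against the embedded query, and return the first timestamp where the match appears, since "when does it start" is a question about order, not just about the best overall match:

```python
import numpy as np

def locate_moment(frame_embeddings, timestamps, query_embedding, threshold=0.6):
    """Return the first timestamp whose frame embedding matches the query.

    frame_embeddings : (num_frames, dim) array from a hypothetical video encoder
    timestamps       : (num_frames,) seconds into the video
    query_embedding  : (dim,) embedding of a text query such as
                       "the chef starts kneading the dough"
    """
    # Cosine similarity between each frame and the query.
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = f @ q

    # "Starts" is a temporal notion: we want the first frame where the
    # activity becomes present, not merely the single best-scoring frame.
    above = np.flatnonzero(scores > threshold)
    return timestamps[above[0]] if above.size else None

# Made-up example: 100 frames sampled at 1 fps; pretend frame 42 matches.
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 256))
query = frames[42] + 0.05 * rng.standard_normal(256)
print(locate_moment(frames, np.arange(100.0), query))  # -> 42.0
```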
The most ambitious systems connect perception to action in embodied platforms. Google's PaLM-E augments a language model with continuous sensor inputs from robot systems, allowing it to perceive its environment, interpret instructions, and generate physical actions. This represents a rudimentary form of "being-in-the-world"—the system doesn't just process symbols but interacts with physical reality through multiple sensory channels.
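The general recipe, sketched below with invented dimensions rather than PaLM-E's actual code, is to project continuous sensor readings into the language model's token embedding space and splice them into the prompt sequence, so the same decoder that handles words also conditions on the robot's state:

```python
import torch
import torch.nn as nn

D = 512  # language model embedding width (illustrative)

word_embeddings = nn.Embedding(32000, D)  # stand-in for the LM's vocabulary
sensor_proj = nn.Linear(7, D)             # e.g. joint angles plus gripper state

# "Given <robot state>, pick up the red block" becomes one embedding sequence.
prompt_ids  = torch.randint(0, 32000, (1, 6))  # placeholder text token ids
robot_state = torch.randn(1, 1, 7)             # one continuous sensor reading

prompt_embs = word_embeddings(prompt_ids)  # (1, 6, D)
state_embs  = sensor_proj(robot_state)     # (1, 1, D)

# A language model decoder consumes the interleaved sequence and can be
# trained to emit action tokens conditioned on both words and sensor state.
lm_input = torch.cat([state_embs, prompt_embs], dim=1)  # (1, 7, D)
print(lm_input.shape)
```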
These advances suggest that AI is moving beyond the "text is all you need" paradigm toward a more Heideggerian recognition that understanding emerges from embodied, multimodal, temporal engagement with the world. Yet significant limitations remain.
Time and Motion: AI's Growing Temporal Awareness
Martin Heidegger placed temporality at the center of understanding, arguing that our knowledge is inseparable from our existence in time—our memories of the past, engagement with the present, and anticipation of future possibilities. In his analysis, temporality is not just one aspect of intelligence but its very foundation. We understand entities not as timeless objects but as embedded in temporal contexts—the hammer is not merely an object with certain properties but a tool that is used in particular moments and spaces.
Modern multimodal AI systems, particularly those built for video understanding, explicitly incorporate temporal dimensions. They don't just analyze isolated frames but process sequences unfolding through time, developing representations of motion, causality, and narrative structure. This temporal awareness represents another step toward the kind of situated intelligence that Heidegger described.
This pathway toward intelligence, given focus and investment, can open genuinely novel capabilities for AI models. A video understanding model that sees and hears what is happening, notices details in the background, reads the text in each frame, and remembers what came before can predict what will happen next far more reliably than a system that attends to any one of those signals alone.
This temporal awareness enables projection—the way humans constantly envision possibilities of the future based on their understanding of the present. An AI that predicts a pedestrian will continue crossing the street isn't just classifying pixels; it's projecting that entity onto one of its possibilities. While far more limited than human projection, this represents a primitive analog to how humans navigate temporal reality.
Temporality also enables more sophisticated context understanding. A video model can disambiguate situations that would be unclear from static images alone. A person raising their hand might be asking a question, waving hello, or blocking their face—distinctions that become clear only by observing the sequence of movements and their context. Similarly, emotions become more legible when tracked across time—the progression from neutral expression to smile to laughter tells a different story than any single frame.
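A deliberately crude toy, with invented pose features and thresholds, shows why the sequence matters: the final frame alone ("hand is raised") is ambiguous, while the motion pattern across the window separates the interpretations:

```python
import numpy as np

# Invented per-frame features: (hand_height, lateral_hand_position).
# A single frame with a raised hand is ambiguous; the pattern of motion
# across frames is what separates the interpretations.
def classify_gesture(frames):
    frames = np.asarray(frames)
    hand_raised    = frames[-1, 0] > 0.8                     # hand high at the end
    lateral_motion = np.abs(np.diff(frames[:, 1])).mean()    # side-to-side movement

    if not hand_raised:
        return "no gesture"
    if lateral_motion > 0.2:
        return "waving hello"       # raised hand plus repeated lateral motion
    if frames[:, 0].std() < 0.05:
        return "asking a question"  # raised hand held steadily in place
    return "blocking the face"      # hand thrown up quickly, little lateral motion

wave     = [(0.9, 0.1), (0.9, 0.5), (0.9, 0.1), (0.9, 0.5)]
question = [(0.85, 0.0), (0.86, 0.0), (0.85, 0.0), (0.86, 0.0)]
print(classify_gesture(wave), "|", classify_gesture(question))
```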
AI has moved beyond the static, snapshot reasoning that characterized earlier approaches. The ability to track entities through time, recognize causal relationships between events, and maintain context across sequences represents a significant advance toward more human-like understanding. While not experiencing time in the Heideggerian sense, these systems at least represent time in their computations—a necessary foundation for any intelligence operating in our dynamic world.
The Road Ahead: Challenges and Possibilities
Despite remarkable progress, multimodal AI faces substantial technical challenges and philosophical limitations on the path toward more human-like intelligence. Current approaches often process different inputs through specialized encoders before combining them—an architecture that might miss subtle cross-modal relationships. Truly unified multimodal processing, where low-level features from different sensory streams influence each other from the beginning, remains elusive. Similarly, while models handle short temporal sequences well, maintaining coherence over longer durations—like a full movie or an extended conversation—taxes current memory mechanisms.
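The architectural distinction can be sketched as follows, with invented dimensions and PyTorch modules standing in for real encoders: in late fusion each modality is fully encoded on its own and the streams meet only at the end, while in early fusion both are projected into a shared token space immediately and every subsequent layer mixes them:

```python
import torch
import torch.nn as nn

D = 128  # shared width (illustrative)

# Late fusion: each modality is fully encoded on its own; the streams only
# meet at the final layer, so low-level cross-modal cues are never compared.
class LateFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(nn.Linear(512, D), nn.ReLU(), nn.Linear(D, D))
        self.audio  = nn.Sequential(nn.Linear(64, D), nn.ReLU(), nn.Linear(D, D))
        self.head   = nn.Linear(2 * D, D)

    def forward(self, img, aud):
        return self.head(torch.cat([self.vision(img), self.audio(aud)], dim=-1))

# Early fusion: both modalities are projected into a shared token space
# immediately, and every encoder layer mixes them via self-attention.
class EarlyFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_proj = nn.Linear(512, D)
        self.audio_proj  = nn.Linear(64, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_tokens, aud_tokens):
        seq = torch.cat([self.vision_proj(img_tokens),
                         self.audio_proj(aud_tokens)], dim=1)
        return self.encoder(seq)

img, aud = torch.randn(1, 512), torch.randn(1, 64)
print(LateFusion()(img, aud).shape)                                           # (1, 128)
print(EarlyFusion()(torch.randn(1, 16, 512), torch.randn(1, 20, 64)).shape)   # (1, 36, 128)
```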
Beyond these technical issues lie deeper philosophical questions about the nature of machine intelligence. The question of genuine embodiment remains central. Rodney Brooks argues that "the world grounds regress"—that is, physical embodiment in a real environment forces systems to deal with the complexity and unpredictability of reality rather than idealized abstractions. Some researchers envision tighter integration between multimodal AI and robotics, creating systems that learn through physical interaction rather than passive observation. Projects like Google's PaLM-E and DeepMind's Robotics Transformers represent early steps in this direction, connecting perception to action in embodied platforms. But whether that tight coupling between "body" and "mind" can be achieved in machines remains an open question.
Despite these challenges, the trajectory of multimodal AI suggests a future where artificial systems perceive and interact with the world in increasingly human-like ways. We might soon see personal AI assistants that can see, hear, and converse naturally across multiple contexts, or specialized systems that can analyze medical imaging while incorporating patient records and verbal symptoms. In creative domains, multimodal AI might generate integrated content spanning text, images, audio, and video based on high-level descriptions.
What seems clear is that multimodal integration represents not just a technical advance but a philosophical shift in how we approach artificial intelligence. By acknowledging the multi-sensory, contextual, and temporal nature of human understanding, AI research has moved beyond the limitations of purely symbolic approaches toward systems that engage with the world in richer, more situated ways. Whether this trajectory eventually leads to artificial general intelligence remains unknown, but it has already produced systems that perceive and reason in ways with ample room to broaden and deepen.