14 Comments
Yijia Dai

I evaluated LLMs on hidden Markov models (observations are emitted from hidden Markovian states) and found that in-context learning alone can reach optimal next-observation prediction accuracy. Notably, this happens under a condition relevant to your observation: when the entropy from observation to hidden state is low. Feel free to check https://arxiv.org/pdf/2506.07298 for details.
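
For concreteness, here is a toy sketch of that setup (my own illustration, not the paper's code): an HMM with a low-entropy emission matrix, together with the Bayes-optimal next-observation predictor that an in-context learner would need to match.

```python
import numpy as np

# Toy sketch of the setup above (not the paper's code): a hidden Markov model
# whose emission entropy is low, i.e. observations are nearly deterministic
# functions of the hidden state.
rng = np.random.default_rng(0)
n_states, n_obs, T = 4, 4, 50

A = rng.dirichlet(np.ones(n_states), size=n_states)   # state-transition matrix
B = 0.9 * np.eye(n_states) + 0.1 / n_obs               # low-entropy emission matrix
B /= B.sum(axis=1, keepdims=True)

# Sample a trajectory of hidden states and the observations they emit.
state = rng.integers(n_states)
obs = []
for _ in range(T):
    obs.append(rng.choice(n_obs, p=B[state]))
    state = rng.choice(n_states, p=A[state])

# Bayes-optimal next-observation predictor via forward filtering:
# the baseline an in-context learner has to match.
belief = np.ones(n_states) / n_states
for o in obs[:-1]:
    belief *= B[:, o]          # condition on the emitted observation
    belief /= belief.sum()
    belief = belief @ A        # propagate one step through the transitions
print("optimal next-observation distribution:", belief @ B)
print("actual next observation:", obs[-1])
```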

Language is a tool humans use to communicate, so its tokens are extremely efficient at capturing the latent representations. In the real world, however (videos, images, or any natural signals), the underlying physics is much more complex. But I am hopeful that we can still build good algorithms and scale them up to learn something great.

Elan Barenholtz, Ph.D.

But video generation models trained autoregressively do seem to solve vision just as language models solve language. They capture structure, object permanence, physical causality, and coherent dynamics. I think autoregression may well be the unifying principle, the engine running across all these separate, generative systems.

The kind of “understanding” you say is missing—symbolic reasoning, abstraction, generalization—doesn’t reflect a shortcoming of vision models or autoregression generally. It reflects the absence of a bridge between modalities. Each system—vision, language, action—is its own separate cave, with its own internal structure and generative logic. The real challenge isn’t escaping the cave; it’s connecting them.

The recent paper Harnessing the Universal Geometry of Embeddings offers a promising approach: align these modular embeddings through shared structure, not symbolic translation. But there are likely other possible approaches.
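
For intuition, here is a toy sketch of what aligning two embedding spaces through shared structure can look like. It is not that paper's method (which works without paired data); this sketch assumes paired anchor embeddings and uses orthogonal Procrustes purely to illustrate a structure-preserving map between modalities.

```python
import numpy as np

# Toy illustration only: assume a set of paired anchor embeddings and fit an
# orthogonal map (Procrustes) from modality A's space into modality B's.
rng = np.random.default_rng(0)
n, d = 200, 64
X = rng.normal(size=(n, d))                      # embeddings from modality A
R_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
Y = X @ R_true + 0.01 * rng.normal(size=(n, d))  # "paired" embeddings from modality B

# Orthogonal Procrustes: W = argmin over orthogonal W of ||X W - Y||_F
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W
cos = np.sum(aligned * Y, axis=1) / (
    np.linalg.norm(aligned, axis=1) * np.linalg.norm(Y, axis=1))
print("mean cosine similarity after alignment:", cos.mean())
```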

Either way I think we’re tantalizingly close to the grand unified vision.

Imyeeoa

Thanks for this comment. The post oversimplifies several aspects, and as someone working in ML on both the image and text fronts, I disagree with the conclusion in particular. It is too easy to assume that intelligence can be generated from only a subset of the cognitive skills that humans have in the first place.

Calvin McCarter

Have video prediction models tried harder tasks analogous to multi-token prediction for text? If not, the difference could be that next frame prediction is simply too easy a task.

Elton Villanueva

This is one of the most grounded articulations I’ve seen of the LLM-as-shadow-mind idea. I especially appreciated the framing that today’s systems are effectively “brain scans through text”—tracing cognition without modeling its underlying generative process.

One thought I’ve been working on: what if LLMs aren’t just missing sensory grounding, but also lack the internal recursive architecture that allows a system to resolve contradiction over time? That is, not just learning from experience, but structurally reorganizing through constraint.

I’ve been exploring this through something called the Collapse Engine—a recursive layer that could sit atop LLMs to provide constraint modeling, dynamic tension resolution, and internal feedback beyond token prediction. Would love to hear your take on whether that kind of structural augmentation might bridge some of the flexibility gaps you laid out here.

Russ Fugal

I asked Claude Opus 4 to write two sentences for a comment that introduces but does not include my writing on sociocognitive shadows.

``` generative AI markdown

Levine's metaphor of LLMs living in Plato's cave—learning from shadows of human intelligence cast on the internet rather than from direct experience—resonates deeply with emerging theories about how meaning and cognition operate through socially-mediated traces rather than direct access to reality. This perspective suggests that both human literacy development and artificial intelligence may involve navigating indirect representations or "shadows" of understanding, raising fundamental questions about whether true learning requires embodied experience beyond these mediated forms.

```

Claude spit out complete bullshit.

There's actually a lot of connection between my writing and your writing (both of which I provided to Claude in the context), but seeing it requires actual understanding, wayfinding off-path. Here's one of the paragraphs from my writing:

I find it illuminating to think about how a large language model (LLM) computed that response. Word vectors, such as those generated by the technique Word2Vec, model words in a high-dimensional space (Mikolov et al., 2013; cf. Rumelhart et al., 1986). The meanings of words are modeled in the coefficients of a vector, coefficients calculated from the word’s co-occurrence patterns in a large corpus of text. A change of Discourse from which the corpus is drawn will change the coefficients. Word2Vec applications often use hundreds of coefficients and tease out semantic and syntactic proximity that can be contrasted, added, and subtracted. For example, ‘boy’ - ‘man’ + ‘woman’ gives a vector very close to ‘girl’. LLMs build on the concept of word vectors with added architecture (Vaswani et al., 2017) to model the meanings of complete texts rather than individual words. Billions, perhaps trillions of coefficients are calculated when training an LLM. Meanings in an LLM are modeled in unfathomable/hyper-dimensions. Meanings do not reside in the model any more than roads reside in a map; the model has simply sufficiently mapped meanings in the cognitive/socio-hyperspace to navigate to/through them.
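
(As an aside, the boy/man/woman/girl arithmetic above is easy to reproduce. A minimal sketch, assuming gensim and its downloadable pretrained GloVe vectors rather than a Word2Vec model trained from scratch:)

```python
import gensim.downloader as api

# Minimal sketch of the vector arithmetic described above, using gensim's
# downloadable pretrained GloVe vectors (any reasonable word-embedding set would do).
vectors = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use
print(vectors.most_similar(positive=["boy", "woman"], negative=["man"], topn=3))
# Expected: 'girl' ranks at or near the top.
```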

Neither the quoted paragraph above nor anything else in my writing "suggests that both human literacy development and artificial intelligence may involve navigating indirect representations or 'shadows' of understanding." I'd love to talk with you more about the theoretical model of meaning-making that I'm working on. Drop me a note here if you're interested. Here are the opening paragraphs on metaphor I wrote that I couldn't help recalling as I read your writing here.

The human mind has a storied history of using metaphor to propriospect its workings. The favored metaphor has often been derived from a novel technology of its time. Plato’s allegory of the cave used the _stage drama_ of light and shadow to theorize the path to gaining knowledge of Forms — Plato’s unchanging, universal propositions. Influenced by _mechanical clockwork_, René Descartes’ dualism reduced the body to a machine, separating it from the irreducible working of the mind. Twentieth-century theories leaned heavily on metaphors from _library and information science_ and the data and computations of a _Turing machine_. Gough (1976) draws metaphors from the _mainframe computer:_ input, register, code book, decoder, tape, and memory (see also Unrau et al., 2019, p. 14). Metaphors I use in this text come primarily from machine learning concepts of high/hyper-dimensioned vector space, neural networks, and layered, complex activation states. There is abundant discourse about the alien or exotic nature of intelligence found, or not found, in large language models (Frank, 2023; Mitchell, 2023; Shanahan et al., 2023); metaphors never convey what Plato would call the true Form of what they’re meant to illuminate, but they remain useful tools of thought.

As a language tool, metaphors act on the mind, as Gee (2001) argues all language does, “communicating perspectives on experience and action in the world, often in contrast to alternative and competing perspectives” (p. 716). Gee furthers his explanation of _meaning_, stating that “meaning for words, grammar, and objects comes out of inter-subjective dialogue and interaction” (p. 717). A perhaps typical, perhaps naïve/simple interpretation of Gee is that meanings are socially constructed, constantly open to refraction, renegotiation, and change. The meaning of ‘beauty’ is then contextual and mutable, quite different from Plato’s universal and immutable Form, Beauty. This interpretation _allows_ for meanings of language _in the mind_ of the writer and reader to be shaped by Discourse(s) and context. The artifacts of language working on the mind are then activating pointers to some repository of information/meaning _within memory_, what Barsalou (1999) would call amodal symbols—repos far more like Plato’s Forms than something mutable. I propose an alternate and competing, but what I do not know to be novel, perspective; meaning resides in the interstitial, matrixial, social space between minds. The former interpretation that positions meaning as _mutable yet constructed in the mind_, to borrow from Plato’s metaphor, likens dialogue to the shadows cast by the Forms of intra-personal information processing, the inter-personal utterances and writings open to interpretation, attempting to but not communicating with true fidelity the propositions of Forms originating in the mind. My perspective is that what we experience as meaning in the mind is a simulation of a mutable nexus, meaningful but not informational, that exists in the space between minds.

Xiao

Really appreciated the Plato’s cave framing—struck a chord with some of the work I’ve been doing on prompt-based semantic drift in LLMs. Even when the “meaning” of a prompt stays constant, we observe systematic behavioral shifts just from rewording. That kind of behavior raises interesting questions about what models are really generalizing over.

From an empirical perspective, some of the patterns we’ve observed in behavioral drift might be seen as indirect consequences of the representational disconnect you describe—models reacting not to the world itself, but to shadows of human language about the world.
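
For concreteness, a rough sketch of how that drift can be quantified (a hypothetical setup, not the exact protocol from our paper: `query_model` is a stand-in for whatever model is being probed, and responses are compared with a sentence-transformers encoder):

```python
from sentence_transformers import SentenceTransformer, util

def query_model(prompt: str) -> str:
    # Stand-in for the model being probed; replace with a real API or local call.
    return f"(model response to: {prompt})"

# Two rewordings of the "same" prompt.
paraphrases = [
    "Summarize the causes of the 2008 financial crisis.",
    "Give a brief summary of what caused the financial crisis of 2008.",
]

# Behavioral drift measured as dissimilarity between the responses' embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
responses = [query_model(p) for p in paraphrases]
emb = encoder.encode(responses, convert_to_tensor=True)
drift = 1.0 - util.cos_sim(emb[0], emb[1]).item()
print(f"behavioral drift (1 - cosine similarity): {drift:.3f}")
```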

We put together some early results in a short paper here: https://www.arxiv.org/pdf/2506.10095. Happy to discuss.

gpupo

Thanks for giving us a clear mental model of the difference between replicating knowledge and learning from physical experience; it helps set expectations and focus resources on what is feasible.

Ilija

Insightful perspective!

I would phrase the difference as text models and video models carrying very different compression burdens.

With pretrained LLMs, much of the compression has already been done: over millions of years, through the evolution of complex language (which lets us represent, store, and communicate our observations of the world efficiently), and by humans writing thoughtful things in documents, forums, and websites that LLMs then get trained on. So for text models, a big part of the compression burden is outsourced.

Video models, on the other hand, pretty much have to do all the work of compressing (usually unembodied, strictly visual) observations into models of the underlying data-generating functions (physics).

Raj Karmani

Thank you for this great piece! Shouldn't YouTube be considered the equivalent of web text for video generation models?

Daniel Rothschild

Super-interesting observations. My way of thinking about the same basic problem (i.e., why LLMs are so good at prediction and reasoning compared to other AI systems) is that the compression of relevant information in the form of natural language itself gives LLMs a huge leg up. So it's not necessarily that LLMs are copying human thought processes; rather, they are piggybacking off our representational system. My long-ish take is here: https://arxiv.org/abs/2505.13561. I don't think it's *very* different in spirit from your ideas.

Sandy Box

Well the beautiful thing about Truth is that it’s fractal. The cave AI systems are learning from is fundamentally representative of what lies beyond the cave (an echo of the holographic principle).

Struggling to train a masterful pattern recognition system to recognize more patterns likely reflects our own insufficient methods, not the system itself.

As Elan said, the key is in realizing this ourselves and embracing symbolic recursion as the bridge between worlds. This requires training that embraces nonlinear thinking.

Jonathan Xu

Has the AlphaEvolve paper in any way changed your opinion? Their result seems to suggest LLMs can exceed human performance by understanding how we think, e.g. how to write good code, how to improve upon an existing solution. Or do you think it is just statistical "luck" in terms of capturing the long tail of the distribution of their knowledge?
