Discussion about this post

Yijia Dai

I evaluated LLMs on hidden Markov models (observations emitted from hidden Markovian states) and found that in-context learning alone can reach optimal next-observation prediction accuracy. Notably, this holds under a condition relevant to your point: when the entropy of the observation-to-hidden-state mapping is low. Feel free to check https://arxiv.org/pdf/2506.07298 for details.
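To make the setup concrete, here is a minimal sketch (not the paper's evaluation code; the HMM size and the near-deterministic emission matrix are illustrative assumptions). It shows what "optimal next-observation prediction" means: the Bayes-optimal predictor runs the forward recursion on the true HMM parameters, and that accuracy is the ceiling an in-context learner is compared against.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = n_obs = 4

# Random row-stochastic transition matrix over hidden states.
T = rng.dirichlet(np.ones(n_states), size=n_states)

# Low-entropy emissions: each hidden state emits one symbol with prob 0.98,
# mirroring the "observation-to-hidden-state entropy is low" condition.
E = np.full((n_states, n_obs), 0.02 / (n_obs - 1))
np.fill_diagonal(E, 0.98)

def sample_observations(length):
    """Sample an observation sequence from the HMM."""
    s, obs = rng.integers(n_states), []
    for _ in range(length):
        obs.append(rng.choice(n_obs, p=E[s]))
        s = rng.choice(n_states, p=T[s])
    return obs

def bayes_optimal_next(obs):
    """Bayes-optimal next-observation prediction via the forward recursion,
    using the *true* HMM parameters (the ceiling an ICL model is scored against)."""
    belief = np.ones(n_states) / n_states     # prior over the current hidden state
    for o in obs:
        belief *= E[:, o]                     # condition on the emitted symbol
        belief /= belief.sum()
        belief = belief @ T                   # propagate to the next hidden state
    return int(np.argmax(belief @ E))         # most likely next observation

seq = sample_observations(500)
hits = sum(bayes_optimal_next(seq[:t]) == seq[t] for t in range(1, len(seq)))
print(f"Bayes-optimal next-observation accuracy: {hits / (len(seq) - 1):.2f}")
```

With near-deterministic emissions the optimal accuracy is high and easy to match; as emission entropy rises, the hidden state becomes harder to pin down from observations alone.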

Language is a tool humans built to communicate, so its tokens are extremely efficient at capturing the latent representations. However, for real-world signals such as videos, images, or other natural data, the underlying physics is much more complex. But I am hopeful that we can still build good algorithms and scale them up to learn something great.

Elan Barenholtz, Ph.D.

But video generation models trained autoregressively do seem to solve vision just as language models solve language. They capture structure, object permanence, physical causality, and coherent dynamics. I think autoregression may well be the unifying principle, the engine running across all these separate generative systems.

The kind of “understanding” you say is missing—symbolic reasoning, abstraction, generalization—doesn’t reflect a shortcoming of vision models or autoregression generally. It reflects the absence of a bridge between modalities. Each system—vision, language, action—is its own separate cave, with its own internal structure and generative logic. The real challenge isn’t escaping the cave; it’s connecting them.

The recent paper Harnessing the Universal Geometry of Embeddings offers a promising approach: align these modular embeddings through shared structure, not symbolic translation. But there are likely other possible approaches.
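As a toy illustration of "align through shared structure": below, two modalities embed the same latent concepts through different orthogonal maps, and an orthogonal Procrustes solve recovers the map between the spaces. This is a deliberately simplified, supervised stand-in (the paper's vec2vec method is unsupervised; the paired points and the names Q_vision, Q_text are assumptions for brevity).

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)

# Pretend latent "concepts" shared by two modalities.
latent = rng.normal(size=(500, 32))

# Each modality embeds the same concepts through its own unknown rotation
# plus noise: the two spaces share geometry but not coordinates.
Q_vision = np.linalg.qr(rng.normal(size=(32, 32)))[0]
Q_text = np.linalg.qr(rng.normal(size=(32, 32)))[0]
vision_emb = latent @ Q_vision + 0.01 * rng.normal(size=latent.shape)
text_emb = latent @ Q_text + 0.01 * rng.normal(size=latent.shape)

# Recover the vision-to-text map purely from the shared geometry.
R, _ = orthogonal_procrustes(vision_emb, text_emb)
aligned = vision_emb @ R

err = np.linalg.norm(aligned - text_emb) / np.linalg.norm(text_emb)
print(f"relative alignment error: {err:.3f}")  # near zero when geometry is shared
```

The point of the toy: no symbolic translation is involved anywhere; the bridge between the two "caves" falls out of the shared structure alone.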

Either way, I think we're tantalizingly close to the grand unified vision.

