Discussion about this post

Yijia Dai:

I evaluated LLMs on hidden Markov models (observations emitted from hidden Markovian states) and found that in-context learning alone can reach optimal next-observation prediction accuracy. Notably, this holds under a condition relevant to your observation: when the entropy of the observation given the hidden state is low. Feel free to check https://arxiv.org/pdf/2506.07298 for details.
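
To make the setting concrete, here is a minimal sketch (my own illustrative code, not the paper's) of such an HMM with low emission entropy, together with the forward-algorithm predictor that gives the Bayes-optimal next-observation distribution an in-context learner would need to match. All names and parameters are illustrative assumptions.

```python
# Illustrative HMM with near-deterministic emissions (low entropy of
# observation given hidden state), plus the forward algorithm, which
# yields the optimal next-observation predictor for this setting.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_obs = 4, 4
# Random row-stochastic transition matrix over hidden states.
A = rng.dirichlet(np.ones(n_states), size=n_states)
# Near-deterministic emissions: state i emits symbol i with prob 0.95.
eps = 0.05
B = np.full((n_states, n_obs), eps / (n_obs - 1))
np.fill_diagonal(B, 1.0 - eps)

def sample_sequence(T):
    """Sample T observations from the HMM."""
    s = rng.integers(n_states)
    obs = []
    for _ in range(T):
        obs.append(rng.choice(n_obs, p=B[s]))
        s = rng.choice(n_states, p=A[s])
    return obs

def next_obs_distribution(obs):
    """Forward algorithm: p(o_{t+1} | o_1..o_t), the Bayes-optimal predictor."""
    belief = np.full(n_states, 1.0 / n_states)  # prior over s_1
    for o in obs:
        belief = belief * B[:, o]               # condition on observing o_t
        belief /= belief.sum()
        belief = belief @ A                     # advance to s_{t+1}
    return belief @ B                           # marginalize to o_{t+1}

obs = sample_sequence(50)
p_next = next_obs_distribution(obs)
print("optimal next-observation distribution:", np.round(p_next, 3))
print("entropy (nats):", -(p_next * np.log(p_next)).sum())
```

With emissions this sharp, the posterior over hidden states concentrates quickly, so the optimal next-observation distribution is nearly deterministic, which is the regime where in-context prediction can match it.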

Language is a tool for humans to communicate, so tokens are extremely efficient at capturing the latent representations. In the real world, however, with videos, images, or other natural signals, the underlying physics is much more complex. But I am hopeful that we can still build good algorithms and scale them up to learn something great.

Calvin McCarter:

Have video prediction models tried harder tasks analogous to multi-token prediction for text? If not, the difference could be that next-frame prediction is simply too easy a task.
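
For context on the analogy, here is a minimal sketch of a multi-token prediction objective for text: instead of a single next-token head, k output heads predict the tokens at offsets 1..k from each position, and their losses are averaged. Names, shapes, and the toy usage are illustrative assumptions, not any specific codebase; the video analogue would predict k future frames rather than only the next one.

```python
# Sketch of a multi-token prediction loss: k linear heads on a shared
# trunk, head i trained to predict the token i steps ahead.
import torch
import torch.nn.functional as F

def multi_token_loss(hidden, heads, tokens):
    """hidden: (B, T, d) trunk states; heads: list of k nn.Linear(d, V)
    output heads; tokens: (B, T) target ids. Averages cross-entropy for
    predicting the token at offset i from each position, i = 1..k."""
    T = hidden.size(1)
    losses = []
    for i, head in enumerate(heads, start=1):
        logits = head(hidden[:, : T - i])   # position t predicts token t+i
        target = tokens[:, i:]
        losses.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), target.reshape(-1)))
    return torch.stack(losses).mean()

# Toy usage with random tensors.
d, V, k = 64, 100, 4
heads = [torch.nn.Linear(d, V) for _ in range(k)]
hidden = torch.randn(2, 16, d)
tokens = torch.randint(V, (2, 16))
print(multi_token_loss(hidden, heads, tokens))
```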

