- Video generation lacks causal world understanding - current models like Sora produce impressive visuals but fail to grasp 3D physics, object permanence, or the consequences of specific actions over long time scales.
  "The reality is that although the visuals do look fantastic, those visuals actually aren't accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are, and that's what's really needed for spatial intelligence."
- Symbolic structure provides a massive efficiency shortcut - while raw pixel data is abundant, integrating a semantic abstraction layer allows models to learn world rules with up to five orders of magnitude less data than brute-force scaling.
  "If there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress a lot more quickly. And that's the bet here."
- Interactive data is the critical bottleneck for robotics - standard observational video lacks the action-consequence loops necessary for embodied intelligence, creating a massive demand for simulated worlds where agents can learn through trial and error.
  "On our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data. The demand for those types of data is growing exponentially."
