- **Action-conditioned models are necessary for spatial intelligence**
  "The reality is that although the visuals do look fantastic, those visuals actually aren't accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are, and that's what's really needed for spatial intelligence. A term we sometimes use is that you need action-conditioned world models, that you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it."
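The distinction in the quote can be made concrete with a toy sketch. Everything below is illustrative (the names `State`, `unconditioned_step`, and `action_conditioned_step` are invented for this example, not from the source): a plain video-style predictor maps state to next state, while an action-conditioned world model also takes the agent's action, so different actions yield different predicted futures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    # Toy 1-D world: an object's position and velocity.
    position: float
    velocity: float

def unconditioned_step(state: State) -> State:
    # A pure "next-frame" predictor: extrapolates motion, ignores the agent.
    return State(state.position + state.velocity, state.velocity)

def action_conditioned_step(state: State, push: float) -> State:
    # A world model in the quoted sense: the prediction depends on
    # which action is taken.
    new_velocity = state.velocity + push
    return State(state.position + new_velocity, new_velocity)

s = State(position=0.0, velocity=1.0)
# Same state, different actions -> different predicted consequences.
braked = action_conditioned_step(s, push=-1.0)
pushed = action_conditioned_step(s, push=+1.0)
```

The point is the signature: only the model that consumes an action can answer "what changes in the world because of it".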
- **Prioritize structural abstraction over raw pixel scaling**
  "I think it's fair to say that vision understanding sort of stalled out; you got to object recognition and then progress just wasn't being made. There's really an interesting research question as to why that is, and at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection between a more symbolic layer of abstracted understanding of visual domains, which aren't in the mainstream vision models, which are still trying to operate on the surface level of pixels."
- **Synthetic data matches real-world utility for multimodal training**
  "When I was actually working with Nvidia on the Synthetic Data Foundation Model Training Project, we were generating a lot of this synthetic data and showing that it is actually as useful as real-world data when it comes to multimodal pre-training. But then, there's a lot of dollars being paid out to external vendors or other folks to manually curate these types of data."
- **Models should mimic human task-directed semantic abstractions**
  "All of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed. You're doing fairly fine-grained processing of exactly what you're focusing on, but as soon as it's away from that, you're sort of only processing, top-down, this very abstracted semantic description of the world around you. Human beings are working with semantic abstractions."
- **True world models require long-horizon consequence prediction**
  "If you're simply trying to predict the next video frame, that's not so difficult. But what you actually want to do is understand the likely consequences of actions minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world."
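A minimal sketch of why long horizons are harder than next-frame prediction: predicting far ahead means chaining a one-step model over many steps, so any per-step error compounds. All names below (`step`, `rollout`, the toy dynamics) are hypothetical illustrations, not anything from the source.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]  # (position, velocity) in a toy 1-D world

def step(state: State, push: float) -> State:
    # One-step action-conditioned prediction with toy dynamics.
    pos, vel = state
    vel += push
    return (pos + vel, vel)

def rollout(model: Callable[[State, float], State],
            state: State, actions: List[float]) -> List[State]:
    # Chain one-step predictions through a planned action sequence to
    # estimate consequences far into the future. Errors compound with
    # each step, which is one argument for predicting over abstract
    # state rather than raw pixels.
    trajectory = [state]
    for a in actions:
        state = model(state, a)
        trajectory.append(state)
    return trajectory

# Ten steps of "push once, then coast":
traj = rollout(step, (0.0, 0.0), [1.0] + [0.0] * 9)
```

Here a single push sets velocity to 1, and coasting carries the object forward one unit per step, so the final predicted state is position 10 with velocity 1.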
