Robots across labs are more similar than different
"I think for me, the understanding was that people used to think all robots are so different, that all of their data is so different, and that every lab invests in just a couple of embodiments. Post-RT-X, people moved in the direction of thinking that all robots are actually kind of similar. They're only as different as English and Chinese: the concepts are the same, it's just the manner of expression that differs."
A robot constitution governs autonomous robot behavior
"Well, one of the aspects is, as you mentioned, rules are subject to interpretation, and even with the same language there are multiple ways to interpret it. Here's an example. We said, don't interact with anything that's harmful. And there was something in the dataset that was a cigarette, and the robot decided, well, I'm not going to pick up a cigarette because it's going to be harmful. Currently, our robots' problems don't come from the fact that they are too smart and work around the rules. It's that they are still too incapable of doing zero-shot things in the real world."
Internet-scale models blur perception and control boundaries
"One of the most exciting takeaways for me, at least, was that the boundary between perception problems, such as open-world object recognition, and robot control starts to blur. We do not have a pipelined system where you first take care of perception, solve that, and then solve control afterwards. We're literally treating both of these problems as a single VQA kind of instantiation."
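A minimal sketch of what "a single VQA instantiation" can mean in practice, in the spirit of RT-2-style vision-language-action models: continuous actions are discretized into bins and serialized as text tokens, so one model can answer a question with either words or an action string through the same interface. The bin count and token format below are illustrative assumptions, not the actual scheme used by any particular model.

```python
# Hedged sketch: robot control expressed in the same token interface as VQA.
# Action dimensions are discretized into bins and serialized as text tokens,
# so a VLM head can emit either an answer string or an action string.

NUM_BINS = 256  # assumed discretization resolution

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map each continuous action dimension to an integer bin token."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                       # clip to range
        bin_id = int(round((a - low) / (high - low) * (NUM_BINS - 1)))
        tokens.append(str(bin_id))
    return " ".join(tokens)

def tokens_to_action(text, low=-1.0, high=1.0):
    """Invert the mapping: token string back to bin-center actions."""
    return [low + int(t) / (NUM_BINS - 1) * (high - low) for t in text.split()]

# One "VQA-style" interface: the prompt is (image, instruction); the answer
# decoded by the model is just another token string.
prompt = {"image": "<camera frame>", "question": "pick up the coke can"}
answer = action_to_tokens([0.1, -0.5, 0.9])   # what a VLA head would decode
recovered = tokens_to_action(answer)
```

The point of the sketch is only the interface: perception and control share one token vocabulary, so there is no separate perception stage handing structured detections to a controller.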
General-purpose robots are still a few breakthroughs away
"I 100% agree that we are a few breakthroughs away from general-purpose robotics; that's the dream we are working so hard for. If you want something commercially viable, something that will make money or help some people in the world, a lot of those ingredients are already ready to have a larger impact than even just a few short months or years ago. But for the true full vision of embodied AGI, I do think there are still fundamentally a few open research challenges left."
Vision language models contain surprising physical intelligence
"Perhaps recently, for example with this work Pivot, maybe the answer is that there is actually a very good amount of physical intelligence already contained in these internet-trained models by themselves, without any robot-data pre-training or fine-tuning. I also don't think that internet data alone, just watching Reddit threads and Wikipedia, is enough to solve contact-rich robotics. But I do think that we've so far only seen the tip of the iceberg of the knowledge already contained in these large VLMs."
Generalist policies can outperform specialist robot models
"And I would even emphasize that such a result, where the generalist outperforms specialists on the very niche domains the specialists have essentially been overfit to, was actually quite shocking to me. There have been so many examples over the past years where people have tried to scale single-task methods to multi-task methods. You definitely gain a lot: maybe you learn faster, or you learn a more robust policy that's less brittle to small perturbations. But oftentimes you have to give up raw performance. In a lot of cases, the only way to max out performance on the one narrow regime you care about is to train a specialist and overfit to that domain. So it was really exciting here to see positive transfer, where the generalist outperforms even these presumably very tuned baselines from the individual labs on their own setups."
"My one-sentence explanation is that with the era of internet-scale foundation models, things that used to work maybe 20 to 30 percent of the time are now working 60 to 70 percent of the time. And robotics is a very complicated, dynamic, engineered system with many pieces. In the past, if every small component of your entire system only worked 30 percent of the time, it would take many, many iterations to get the whole system performing at scale. But now, when every single part of the stack works that much better, from the research iteration process to the engineering scaling process to the data collection engines, you can really see the pace increase: you just have many more successes and a much higher hit rate as you go about scaling up your research."
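The compounding argument above can be made concrete with a little arithmetic: if a system chains several components in series, end-to-end success is roughly the product of per-component success rates, so a jump from ~30% to ~70% per piece transforms the whole pipeline. The five-stage pipeline and the independence assumption here are illustrative, not a claim about any specific robot stack.

```python
# Hedged sketch: per-component reliability compounds multiplicatively
# across a serial pipeline of (assumed independent) stages.

N_STAGES = 5  # assumed number of chained components

def end_to_end(per_component_rate, stages=N_STAGES):
    """Success probability when every stage must succeed in sequence."""
    return per_component_rate ** stages

before = end_to_end(0.3)   # roughly 0.002: almost never works end to end
after = end_to_end(0.7)    # roughly 0.17: a workable hit rate for iteration
```

Under these toy numbers the end-to-end success rate improves by nearly two orders of magnitude, which is the "pace increase" the quote describes.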
"To put this a bit more concretely: if you have your robot in some given initial condition and you try something with RT-1 or RT-2 and it doesn't work, you're kind of out of luck. You can try the same thing over and over again, or slightly rewrite the language instruction: instead of 'pick up the coke can' you can write 'lift the coke can'. But you don't really have the granularity you need to say, actually, you are two centimeters too low, you missed the table because it's at a new height and obscured by shadows, so be more gentle and approach more from the left. There's no real way to do that with the language interfaces we trained RT-1 and RT-2 on. But with RT-Trajectory, the idea is that if you have a line sketch of a coarse trajectory of how the robot should do the task, you could, under the same initial conditions, just change the prompt a little, do some prompt engineering, and actually see qualitatively different behavior from the robot."
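A minimal sketch of the kind of interface the quote describes: the prompt carries a coarse 2D trajectory drawn over the camera frame, so "approach more from the left" is expressed by editing the sketch rather than the words. The rasterization, grid size, and prompt dictionary below are illustrative assumptions, not RT-Trajectory's actual input format.

```python
# Hedged sketch: condition a policy on a coarse trajectory sketch instead
# of (or in addition to) a language string.

def rasterize_sketch(waypoints, height=64, width=64):
    """Burn a coarse polyline of (row, col) waypoints into a binary grid
    that can be stacked with the image as an extra conditioning channel."""
    grid = [[0] * width for _ in range(height)]
    for (r0, c0), (r1, c1) in zip(waypoints, waypoints[1:]):
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for t in range(steps + 1):          # simple line interpolation
            r = round(r0 + (r1 - r0) * t / steps)
            c = round(c0 + (c1 - c0) * t / steps)
            grid[r][c] = 1
    return grid

# Same initial condition, two different prompts: "approach from the left"
# versus "approach from the right" is a change to the sketch, not the words.
approach_left = rasterize_sketch([(60, 5), (30, 20), (10, 32)])
approach_right = rasterize_sketch([(60, 58), (30, 44), (10, 32)])
prompt = {"image": "<camera frame>", "sketch": approach_left}
```

The design point is granularity: a sketch can encode "two centimeters higher" or "from the left" directly in image space, which a short language instruction cannot.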
"It's very intuitive. If you try to learn a new sport, say surfing or skiing, on the day you start it's really hard. But if you go surfing or skiing for two days, it's initially really hard, then you sleep overnight, you come back, and you're immediately much better. And I like that the learning-to-learn-faster paper has in some way mapped this into, as Ted said, day cycles and night cycles: the day cycle is in-context learning, where you collect more examples but keep them in context, and the night cycle is where you go retrain or fine-tune, where you actually change the weights of the model."
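The day/night loop above can be sketched as a tiny two-store learner: during the "day", new experience accumulates as in-context examples that influence behavior immediately without retraining; at "night", those examples are consolidated into the weights and the context is cleared. The toy key-value "model" is purely illustrative and stands in for prompting versus fine-tuning.

```python
# Hedged sketch of a day/night learning cycle: context = in-context
# examples (day), weights = consolidated knowledge after retraining (night).

class DayNightLearner:
    def __init__(self):
        self.weights = {}   # consolidated knowledge (night-cycle product)
        self.context = []   # fresh examples (day-cycle buffer)

    def day_step(self, task, outcome):
        """In-context learning: remember the example without retraining."""
        self.context.append((task, outcome))

    def predict(self, task):
        """Fresh context takes precedence over weights, like a prompt would."""
        for t, o in reversed(self.context):
            if t == task:
                return o
        return self.weights.get(task)

    def night_cycle(self):
        """'Retraining': fold the day's examples into the weights."""
        for task, outcome in self.context:
            self.weights[task] = outcome
        self.context.clear()

learner = DayNightLearner()
learner.day_step("open fridge", "pull handle gently")  # day: in-context only
learner.night_cycle()                                  # night: update weights
```

The surfing analogy maps directly: `day_step` is the awkward first day of practice, and `night_cycle` is the overnight consolidation that makes the second day immediately better.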
Humanoids may win because the world is human-shaped
"The main arguments still stand for humanoids. One is that our world is designed for humans. So one hypothesis is that if you design policies for, say, a single-arm mobile manipulator, then once you solve a lot of tasks in that environment, you see that it's limiting, because many tasks in our world are like opening a bottle, or opening a fridge and taking something from it, where you have to keep the door open. Or even, some people say, well, you don't need legs, wheels are enough; but then what if you solve a lot of tasks on a wheeled platform and there's a little curb on the floor or by a street side, and the robot is just stopped there. So I do think that ultimately, if you want to do a lot of tasks and be useful in environments where humans operate, you need to go to a human embodiment, or as close to one as possible."