Video generation models are currently used to create short clips and meme videos, but their implications may go much further.
VC Marc Andreessen has said that the increasing capabilities of video generation models like OpenAI’s Sora could help in the creation of robotic worlds like Westworld. Westworld is a TV show in which human guests enter game areas populated entirely by robots that are indistinguishable from humans.
“If you look at these videos coming out from Sora, if you look at them carefully, basically what you see is like multiple sources of lighting in different parts in 3D space coming together. You see reflections coming off of reflective surfaces that are actually correct. You see translucency that’s correct,” Andreessen said in an interview.
“And then you get combinations of these factors. For example, if you (make) something like a man walking through a puddle at night, you’ll get the splashing effects of the water. The water has to splash in a way that is physically realistic. The light has to come through, refract through the water droplets in the correct way. The water droplets have to reflect the image of the man’s shoe in the right way. The AI term for (all) this is world model,” Andreessen explained.
“This is not only a model for video. This is a model that actually understands the real world. It understands 3D reality, light, surfaces and textures and materials and motion and gravity,” Andreessen said.
“The implication of that is that we may have basically just solved the fundamental challenge of robotics. The fundamental challenge is: how do you get a physical robot to navigate the real world without screwing everything up? So how do you get a robot waiter to navigate through a busy restaurant without stepping on anybody’s foot, without tripping over anything, without spilling water on the table? (It has) to understand everything that’s happening in real time and to be able to adapt,” Andreessen went on.
“And it turns out one of the things you need to do to do that is you need a world model — the robot needs to have a comprehensive understanding of physical reality so that it can understand what’s happening. The robot is seeing primarily visually, right? So you have to map the visual into an internal representation of the 3D world. Up until now, building a world model like that has been difficult or impossible. And it now appears that that’s actually starting to work. So Westworld, (it’s happening) 2028-2030,” Andreessen smiled.
Westworld might not actually become a reality in three years, but there’s little doubt that video generation models are becoming increasingly sophisticated. Just over a year ago, AI-generated video was glitchy, with awkward movements, but the latest models produce outputs that might not be very long, yet are increasingly indistinguishable from reality. And if this progress continues, the ability to generate realistic worlds could end up ushering in a new era in robotics development.