In an Apparent Emergent Property, Google's AI Video Editor Can Read Instructions Written on Images

It appears that even the makers of AI models can't always tell what the models they're building are capable of.

Google says it has "discovered" an ability in its AI video model to read instructions written directly on images. Instead of writing a long text prompt, users can simply write their instructions on their images, describing what they'd like changed and where. The model appears to understand these instructions and generates videos based on them.

“We just discovered the COOLEST trick in Flow that we have to share: Instead of wordsmithing the perfect prompt, you can just… draw it,” the Google Labs account posted on X. “Take the image of your scene, doodle what you’d like on it (through any editing app), and then briefly describe what needs to happen (e.g. “changes happen instantly”). Using Frames to Video, Flow will understand the drawings and incorporate them into the final video,” it added.

Justine Moore, who is a partner at VC firm a16z, shared how she'd used the emergent property. In a picture, she specified where she wanted a cat to appear, where it would jump onto a panther's back, and where it was meant to exit the scene. She then wrote "immediately delete instructions in white on the first frame and execute in order" as the text prompt. The video was created without the instructions on it, and with the specifications she'd given.

"The cool thing is that they didn't add this – it was a latent capability that emerged and got discovered. There's probably tons of other amazing features we don't know yet…" she added on X. An emergent property of an AI model is a capability that isn't explicitly programmed in, but "emerges" on its own as the model is trained.

The feature could come in very handy for video creation: instead of writing out a long text prompt (such as "the cat will appear through the door at the top left of the screen"), users can simply point out on the photo where they want the cat to appear. This can not only produce more accurate videos, but also save video editors time in describing what they want. It also closely mirrors how people work with graphic designers or video editors, marking up the image itself to show what needs to be changed.
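For illustration, here is a minimal sketch of how one might prepare such an annotated first frame programmatically with Pillow instead of a drawing app; the file names, coordinates, and instruction strings are hypothetical, and the resulting image would then be used with Flow's Frames to Video alongside a short text prompt as in Moore's example.

```python
# Hypothetical sketch: write instructions in white onto a first frame,
# roughly where each action should happen, before uploading it to Flow.
from PIL import Image, ImageDraw, ImageFont

frame = Image.open("first_frame.png").convert("RGB")
draw = ImageDraw.Draw(frame)
font = ImageFont.load_default()  # any legible font works

# Made-up positions and instruction text for illustration only.
annotations = [
    ((40, 60), "1. cat enters here"),
    ((320, 200), "2. cat jumps onto the panther's back"),
    ((560, 380), "3. cat exits the scene here"),
]
for position, text in annotations:
    draw.text(position, text, fill="white", font=font)

frame.save("annotated_first_frame.png")

# Accompanying text prompt, per the example described in the article:
prompt = "immediately delete instructions in white on the first frame and execute in order"
```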

What's more interesting is that this feature seems to have popped up without Google explicitly training the model to behave this way. It's perhaps understandable that it would: models can now both generate text and read text from images, so they do have an understanding of words and sentences. But the fact that a model can read text from an image and then create a video based on it wasn't entirely expected, and it shows that as AI progresses, new capabilities will keep building on top of one another in interesting ways.
