Thus far, generating video from text has been a bit clunky. It was hard to maintain character consistency, and it was hard to put different scenes together to feel like a coherent movie. But a new technique from US-based researchers aims to change all that.
A team of researchers from NVIDIA, Stanford University, UCSD, UC Berkeley, and UT Austin has published a paper describing a method that can generate one-minute videos from storyboards. Titled “One-Minute Video Generation with Test-Time Training”, the paper presents several one-minute Tom and Jerry clips as a demonstration of the technique's abilities.

“TTT (Test-Time Training) layers enable a pre-trained Diffusion Transformer to generate one-minute videos from text storyboards,” the paper says. “We use Tom and Jerry cartoons as a proof of concept. The videos tell complex stories with coherent scenes composed of dynamic motion. Every video is produced directly by the model in a single shot, without editing, stitching, or post-processing. Every story is newly created,” the paper adds.
The results are pretty incredible. As shared on X, the videos look like legitimate Tom and Jerry cartoons from yesteryear. There are no glitches, and the scenes seem to transition seamlessly into one another. The prompts are extremely detailed and require several paragraphs for a few seconds of video. Here, for instance, is a prompt:
<start_scene>The living room features pale walls, a large brown armchair positioned comfortably at the center, and a soft cream-colored rug. Tom, the blue-gray cat, sits relaxed in the armchair, happily eating a cookie. Jerry, the brown mouse, sits beside Tom on the armrest, cheerfully nibbling a small block of yellow cheese. The camera steadily captures their joyful expressions as they enjoy their snacks side by side. The living room has soft pale walls, a comfortable brown armchair at the center, and a cream-colored rug. Tom sets his cookie aside and smiles warmly at Jerry. Jerry puts down his cheese and stands on the armrest, facing Tom with a happy grin. Tom raises his paw toward Jerry, and they share a playful and friendly high-five, celebrating their renewed friendship. The camera slowly pulls back, capturing their joyful moment, and gradually fades to black.<end_scene>
Several such scenes, which describe both the on-screen action and the required camera movements, are strung together to create each one-minute video.
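To make the format concrete, here is a tiny, hypothetical Python helper (not from the paper) that assembles per-scene descriptions into a single storyboard string using the <start_scene>/<end_scene> tags seen in the prompt above:

```python
# Hypothetical helper (for illustration only): join scene descriptions
# into one storyboard string using the paper's scene delimiter tags.
def build_storyboard(scenes: list[str]) -> str:
    return "\n".join(f"<start_scene>{s}<end_scene>" for s in scenes)

storyboard = build_storyboard([
    "Tom chases Jerry across the kitchen counter...",   # scene 1
    "Jerry escapes into a mouse hole; Tom crashes...",  # scene 2
])
print(storyboard)
```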
The paper uses Test-Time Training (TTT) layers, whose hidden states are themselves neural networks and can therefore hold more information than a fixed-size state. “Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards,” the paper says.
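To give a feel for the mechanism, here is a minimal, hedged sketch of the idea in PyTorch. This is not the paper's implementation: the hidden state here is a single weight matrix W updated by one gradient step per token on a simple self-supervised reconstruction loss, whereas the paper's layers can use a small MLP as the state and learned projections for the inner loss.

```python
# Minimal sketch of the test-time-training idea, NOT the paper's implementation.
# The hidden "state" is the weight matrix W of a tiny linear model, updated by
# one gradient step per token on a self-supervised reconstruction loss.
import torch

def ttt_layer(tokens, lr=0.1):
    """tokens: (seq_len, dim). Returns a (seq_len, dim) tensor of outputs."""
    seq_len, dim = tokens.shape
    W = torch.zeros(dim, dim)             # hidden state = parameters of f(x) = x @ W
    outputs = []
    for x in tokens:                      # process the sequence token by token
        # Inner self-supervised loss: reconstruct the token from itself.
        # (Toy identity target here; the paper uses learned projections.)
        pred = x @ W
        grad = torch.outer(x, pred - x)   # gradient of 0.5 * ||x @ W - x||^2 w.r.t. W
        W = W - lr * grad                 # one "training" step at test time
        outputs.append(x @ W)             # output uses the freshly updated state
    return torch.stack(outputs)

# Example: a toy 8-token sequence of 16-dimensional embeddings.
out = ttt_layer(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 16])
```

The key point is that the “memory” carried across tokens is the parameters of a small model, updated by gradient descent at test time, rather than a fixed-size vector, which is what makes the state more expressive over long contexts.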
It seems like a pretty interesting approach to generating videos. It helped that the researchers had access to plenty of Tom and Jerry cartoons and were able to train the model to pick up the characters and the animation style. This approach could come in handy for long-running franchises: if there's sufficient training data, it appears that entire scenes and shots can now be generated from just text prompts, and it's likely this could be used to create sequels to popular movies and shows. Andrej Karpathy famously said that the hottest new programming language is English. It appears that the hottest new language of filmmaking could be English as well.