How Text-to-Video Models From Veo to Sora Compare as Hollywood Production Tools


In this article

  • How video generation model development is expanding, with a table examining how leading AI models compare
  • Main criteria for evaluating the quality of outputs from video generation models
  • Present limitations and potential uses of video generation models in Hollywood film and TV productions

Video generation research and development is advancing rapidly and expanding with new entrants. Powerful new text-to-video (T2V) models have emerged in recent months, following the earliest models to market from Runway and Pika Labs and the seismic introduction of OpenAI’s Sora in February.

The latest of the newcomers is Veo, Google DeepMind’s most powerful video generation model, presented last week at the Google I/O developer conference. More such models are coming from developers including Irreverent Labs.

With Veo, Google took an approach similar to OpenAI’s with Sora, initially making the model available in research preview to select artists, including Donald Glover and his creative agency Gilga, to gather early feedback on its capabilities and limitations before extending release to other groups.

Both companies are reportedly engaging the creative community in Hollywood. In March, OpenAI CEO Sam Altman met with execs at major studios, including Paramount, Warner Bros. and Universal, and several A-list directors were among Sora’s initial testers.

Commercial release strategies for Veo and Sora are unknown but expected to proceed in 2024. Given their massive compute requirements, to say nothing of disinformation risk, it’s possible OpenAI and Google will defer public releases in favor of offering the models as enterprise-licensed software tailored to creative industries.

Video generation models have been analogized to a new camera technology. More accurately, they represent the next phase of computer-generated imagery (CGI). Judging by its product demo videos and responses from testers and the larger AI research community, Sora achieved an impressive leap forward in video quality. Veo’s demo videos display equally strong capabilities across similar dimensions commonly used to evaluate the quality of outputs from video models.

Yet as impressive as they already are, video generation models (sometimes called “world models”) are still rudimentary in how they approximate the real world, a fact that will affect their immediate usability on Hollywood productions. In their current state, video models aren’t yet adequate to wholly replace production methods with physical cameras, virtual production or traditional VFX. The copyrightability and commercial distribution of AI video is also in question, with AI-generated outputs at this point tantamount to public domain works.

Two main limitations of video generation models will constrain their use on major productions:

  • Quality: Despite enormous improvements in image resolution and consistency, AI video often still contains visual inaccuracies, such as spontaneous artifacts, occlusions, morphing and incorrect anatomy. AI developers tend to agree such artifacts will be substantially minimized but may never be completely resolved to the point that a model could perfectly simulate how the 3D world behaves. That’s because a model’s “understanding” comes from large quantities of video training data, and video itself is only a 2D representation of 3D space, carrying no working knowledge of the physical laws that govern how matter acts and interacts.
  • Controllability: These tools don’t yet offer the fine-grained control that artists and directors in Hollywood demand. Repeatedly, visual and VFX artists and AI leaders alike confided to VIP+ that the primary challenge and pitfall of today’s video generation models is their unpredictability and lack of fine-grained control over what a model outputs in response to a text prompt. For many mainstream projects, it’s likely to be more effective to shoot with a camera than to make hopeful requests of a model that doesn’t fundamentally understand the mechanics of the 3D world, or to generate hundreds of iterations that don’t fit the exact need.

Though this is a developing area, there is currently little way to maintain consistency of characters, objects, settings or aesthetics from one prompt generation to the next. Until model outputs can be controlled at the point they’re generated, they can instead be modified after the fact with AI editing features.

More control may also emerge as video generation models progress toward prompt agnosticism, whereby new videos can be generated by conditioning models on images or videos as well as text. Developers expect this to be a hard challenge but see it as a way around a key weakness of text prompts: the inadequacy of language to precisely describe the exact contents of a video.
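To make that idea concrete, here is a minimal, purely illustrative PyTorch sketch of a denoiser that can be conditioned on a text embedding, an image embedding or both. It is not any vendor’s actual model or API; every class, dimension and parameter name is hypothetical, and the only point is to show how a signal beyond text could steer generation.

```python
# Conceptual sketch only: a toy denoiser that fuses optional text and image
# conditioning signals, illustrating "prompt agnosticism" (generation guided
# by more than a text prompt). All names and sizes are hypothetical.
import torch
import torch.nn as nn


class ToyConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, text_dim=32, image_dim=32):
        super().__init__()
        # Project each conditioning signal into the latent space, then fuse.
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.image_proj = nn.Linear(image_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2, 128),
            nn.SiLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, noisy_latent, text_emb=None, image_emb=None):
        # Either conditioning signal may be absent; missing ones contribute zeros.
        cond = torch.zeros_like(noisy_latent)
        if text_emb is not None:
            cond = cond + self.text_proj(text_emb)
        if image_emb is not None:
            cond = cond + self.image_proj(image_emb)
        # Predict the update to the noisy latent given the fused conditioning.
        return self.net(torch.cat([noisy_latent, cond], dim=-1))


denoiser = ToyConditionedDenoiser()
latent = torch.randn(1, 64)   # stand-in for one noisy video latent
text = torch.randn(1, 32)     # stand-in for a text-prompt embedding
image = torch.randn(1, 32)    # stand-in for a reference-image embedding
print(denoiser(latent, text_emb=text).shape)                    # text-only conditioning
print(denoiser(latent, text_emb=text, image_emb=image).shape)   # text + image conditioning
```

In a real video model the conditioning would feed a full diffusion or DiT backbone rather than the toy network above, but the same principle applies: a reference image or clip can pin down details that a text prompt alone cannot describe.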

“For live-action video, there’s still a lot of work to be done to make it controllable in the specific way you want,” said Cristóbal Valenzuela, co-founder and CEO at Runway. “We’re working closely with creatives and artists to get more feedback on how to improve that control. For us, it’s one of the top priorities we’re working on right now.”

“Lack of granular control is one of the limiting factors for video generation seeing adoption in Hollywood,” said Matt Panousis, co-founder and COO at MARZ. “The quality will also need to lift slightly. It’s certainly not at VFX standards yet, but it’s very good.”

Nuanced actor performances wouldn’t easily or reliably emerge from a video model. Models would also likely struggle with shot-reverse-shot sequences, where shooting with actors on set would still be more direct and effective. Some models will also have restrictive parameters to prevent outputs depicting violence, sex, real people’s likenesses or protected IP. Video outputs are also soundless, though developers are beginning to integrate models that let users add speech and contextual sound effects to video.

Nevertheless, these models are already powerful, have improved significantly in a short time and will continue to advance. With existing diffusion and diffusion transformer (DiT) architectures, model performance results from scaling data, scaling compute and optimizing architecture. With breakthroughs in novel, more efficient architectures, developers expect the next generations of video models to be equally or more performant while requiring substantially less compute and data.

As they advance, video generation models could become increasingly viable and flexible tools in film and TV production. Even with present limitations, video models might allow productions to cheaply and easily create B-roll or insert shots in place of stock footage, establishing shots as an alternative to drone footage, or pickup shots as short extensions of existing footage after a shoot.

Video models could further enable “footage” otherwise unachievable with cameras or traditional VFX, such as highly complex or outlandish concepts and camera moves that would be physically impossible with camera systems or hard to imagine. For example, through his efforts to put Sora through a series of stress tests, filmmaker Paul Trillo discovered a technique he described as “infinite zoom,” with the model producing “extremely fast camera moves but done with what looks like 8mm or 16mm film.”

Where traditional production methods can’t go, video models would simply provide another creative tool in the arsenal. For filmmakers contemplating these tools now, the question underlying any future experiments and validated uses of video generation models might become: What can these models do that cameras cannot, and vice versa?
