Authors:
(1) Dan Kondratyuk, Google Research (Equal contribution);
(2) Lijun Yu, Google Research and Carnegie Mellon University (Equal contribution);
(3) Xiuye Gu, Google Research (Equal contribution);
(4) Jose Lezama, Google Research (Equal contribution);
(5) Jonathan Huang, Google Research (Equal contribution);
(6) Grant Schindler, Google Research;
(7) Rachel Hornung, Google Research;
(8) Vighnesh Birodkar, Google Research;
(9) Jimmy Yan, Google Research;
(10) Krishna Somandepalli, Google Research;
(11) Hassan Akbari, Google Research;
(12) Yair Alon, Google Research;
(13) Yong Cheng, Google DeepMind;
(14) Josh Dillon, Google Research;
(15) Agrim Gupta, Google Research;
(16) Meera Hahn, Google Research;
(17) Anja Hauth, Google Research;
(18) David Hendon, Google Research;
(19) Alonso Martinez, Google Research;
(20) David Minnen, Google Research;
(21) Mikhail Sirotenko, Google Research;
(22) Kihyuk Sohn, Google Research;
(23) Xuan Yang, Google Research;
(24) Hartwig Adam, Google Research;
(25) Ming-Hsuan Yang, Google Research;
(26) Irfan Essa, Google Research;
(27) Huisheng Wang, Google Research;
(28) David A. Ross, Google Research;
(29) Bryan Seybold, Google Research (Equal contribution);
(30) Lu Jiang, Google Research (Equal contribution).
3. Model Overview and 3.1. Tokenization
3.2. Language Model Backbone and 3.3. Super-Resolution
4. LLM Pretraining for Generation
5. Experiments
5.2. Pretraining Task Analysis
5.3. Comparison with the State-of-the-Art
5.4. LLM’s Diverse Capabilities in Video Generation and 5.5. Limitations
6. Conclusion, Acknowledgements, and References
Text-to-Video (T2V). Table 2 shows zero-shot text-to-video evaluation results on the common MSR-VTT (Xu et al., 2016) and UCF-101 (Soomro et al., 2012) datasets. Our model performs favorably in terms of CLIP similarity and FVD scores on MSR-VTT and UCF-101. The pretrained foundation model already achieves competitive performance on all metrics. After fine-tuning on a high-quality subset of text-video pairs, VideoPoet achieves an even better CLIPSIM score on MSR-VTT. More details on the evaluation settings can be found in Appendix A.5.4.
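To make the CLIP-similarity metric concrete, the following is a minimal sketch that averages per-frame image-text cosine similarity using a Hugging Face `transformers` CLIP checkpoint. The checkpoint name and the frame-averaging scheme are assumptions for illustration; the paper's exact CLIPSIM protocol may differ.

```python
# Minimal CLIP-similarity (CLIPSIM) sketch: average per-frame image-text
# cosine similarity over a generated clip. The checkpoint name and the
# frame-averaging scheme are assumptions, not the paper's exact protocol.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(frames: list[Image.Image], prompt: str) -> float:
    """Mean cosine similarity between the text prompt and each video frame."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()  # average over frames
```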
Human Evaluations with Text-to-Video (T2V). We evaluate VideoPoet with human raters and compare it against other recent models: Show-1 (Zhang et al., 2023a),
VideoCrafter (Chen et al., 2023a), Phenaki (Villegas et al., 2022), Pika (Pika, 2023), Gen2 (Runway, 2023), and Lumiere (Bar-Tal et al., 2024). Show-1, VideoCrafter, Pika, Gen2, and Lumiere are video diffusion models, while Phenaki is a token-based model using masked token modeling (Chang et al., 2022). We ran the most up-to-date model versions available as of January 2024.
We first develop a unified evaluation prompt bank consisting of ∼250 selected prompts from a variety of categories and styles. Our prompts are sourced from published prompt sets (e.g., Show-1, Video LDM (Blattmann et al., 2023b)). We select the prompts prior to generating videos and fix these choices after the initial selection. We also preferentially select prompts that contain an explicit mention of motion, so that the evaluation is not biased toward models that generate high-quality videos that are almost still (e.g., “person jumping off of a chair” over “person standing on a chair”). Note that due to time constraints, our experiments for Pika and Gen2 were run on a subset of 50 prompts, since these had to be submitted manually via their web interfaces. These 50 prompts were pre-selected (before any evaluations were run) to be representative of the entire set.
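As an illustration of pre-registering a representative subset of prompts, here is a small sketch that draws a seeded, category-stratified sample from a larger prompt bank. The category labels, data layout, and seed are hypothetical; the paper's 50-prompt subset was chosen by the authors before any evaluations were run.

```python
# Sketch: pre-register a representative subset of evaluation prompts by
# stratified sampling across prompt categories. The (category, prompt)
# layout and the seed are hypothetical, not the authors' actual procedure.
import random
from collections import defaultdict

def stratified_subset(prompts: list[tuple[str, str]],
                      k: int = 50, seed: int = 0) -> list[str]:
    """prompts is a list of (category, prompt_text) pairs."""
    by_cat = defaultdict(list)
    for category, text in prompts:
        by_cat[category].append(text)
    rng = random.Random(seed)  # fixed seed -> same subset every run
    subset = []
    for category, texts in sorted(by_cat.items()):
        # Allocate slots proportionally to each category's share of the bank.
        n = max(1, round(k * len(texts) / len(prompts)))
        subset.extend(rng.sample(texts, min(n, len(texts))))
    return subset[:k]
```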
For this user study we use the fine-tuned version of VideoPoet as discussed in Section 4.2 and compare it against alternative models side by side for each prompt. Raters are shown videos generated by two models at a time (in randomized order so as not to bias raters). Not all methods generate videos at the same size or aspect ratio, so we resize each video to a fixed area while maintaining its original aspect ratio. Raters are then asked to compare the videos along five dimensions and, for each dimension, to report which video is better. The five dimensions are: (1) text fidelity (which video follows the text prompt most faithfully), (2) video quality, (3) motion “interestingness”, (4) motion realism, and (5) temporal consistency. Raters are required to go through a collection of training examples for each of these five dimensions before they begin.
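The two presentation details above, rescaling to a fixed pixel area while preserving aspect ratio and randomizing left/right placement, are sketched below. The target area and the Pillow-based resizing are assumptions about the rater setup, not the exact rating-interface implementation.

```python
# Sketch: resize frames to a fixed pixel area while keeping the aspect
# ratio, and randomize left/right placement for a side-by-side rating.
# The target area and Pillow resizing are assumed, not the actual UI code.
import math
import random
from PIL import Image

TARGET_AREA = 512 * 512  # fixed display area in pixels (assumed value)

def resize_to_area(frame: Image.Image, area: int = TARGET_AREA) -> Image.Image:
    """Scale a frame so its area equals `area`, preserving aspect ratio."""
    w, h = frame.size
    scale = math.sqrt(area / (w * h))  # uniform scale -> target area
    return frame.resize((round(w * scale), round(h * scale)))

def randomized_pair(video_a: str, video_b: str,
                    rng: random.Random) -> tuple[str, str]:
    """Return (left, right) in random order so raters cannot infer the model."""
    return (video_a, video_b) if rng.random() < 0.5 else (video_b, video_a)
```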
Our findings are summarized in Fig. 4, where green and pink bars represent the proportion of trials in which VideoPoet was preferred or less preferred than an alternative, respectively. We observe that VideoPoet outperforms all baseline models along almost all of the dimensions. More specifically, VideoPoet achieves significant wins in the motion categories (motion interestingness, motion realism, and temporal consistency). Lumiere (Bar-Tal et al., 2024), which is diffusion-based and concurrent with our work, is the only model that outperforms VideoPoet on video quality.
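For completeness, a minimal sketch of how per-trial ratings could be aggregated into the preference proportions shown in Fig. 4. The +1/0/-1 encoding of each trial (VideoPoet preferred / tie / baseline preferred) is an assumption, not the paper's recorded data format.

```python
# Sketch: aggregate side-by-side ratings into win/loss proportions per
# dimension, as visualized in Fig. 4. The +1/0/-1 vote encoding per trial
# (VideoPoet preferred / tie / baseline preferred) is an assumption.
from collections import defaultdict

DIMENSIONS = ("text_fidelity", "video_quality", "motion_interestingness",
              "motion_realism", "temporal_consistency")

def preference_rates(trials: list[dict]) -> dict[str, tuple[float, float]]:
    """Each trial maps dimension -> +1, 0, or -1; returns (win, loss) rates."""
    wins, losses, counts = defaultdict(int), defaultdict(int), defaultdict(int)
    for trial in trials:
        for dim in DIMENSIONS:
            vote = trial[dim]
            counts[dim] += 1
            wins[dim] += vote > 0    # VideoPoet preferred
            losses[dim] += vote < 0  # baseline preferred
    return {d: (wins[d] / counts[d], losses[d] / counts[d]) for d in DIMENSIONS}
```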
This paper is available on arxiv under CC BY 4.0 DEED license.