Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to videos of only 10-20 seconds.
We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch (see details below).
Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15\(\times\) and latency by up to 11.5\(\times\). Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation.
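The scaling argument above can be illustrated with a small back-of-the-envelope calculation. This is only a sketch: the function names, patch size, temporal stride, frame rate, and resolution are all hypothetical choices made for illustration, and the numbers model asymptotic growth rather than the measured FLOPs or latency reported above.

```python
def token_count(seconds, fps, height, width, patch=16, temporal_stride=4):
    """Number of tokens for a clip. All compression factors are hypothetical."""
    frames = seconds * fps // temporal_stride
    return frames * (height // patch) * (width // patch)


def relative_cost(n_tokens, n_ref, exponent):
    """Cost relative to a reference clip when cost grows as n_tokens ** exponent."""
    return (n_tokens / n_ref) ** exponent


# Compare a 10-second clip with a 60-second clip at the same resolution.
ref_clip = token_count(seconds=10, fps=16, height=512, width=512)
long_clip = token_count(seconds=60, fps=16, height=512, width=512)

# exponent=2 mimics a self-attention-like block, exponent=1 a linear-complexity block.
quadratic_growth = relative_cost(long_clip, ref_clip, exponent=2)
linear_growth = relative_cost(long_clip, ref_clip, exponent=1)

print(f"tokens: {ref_clip} -> {long_clip}")
print(f"quadratic block cost grows {quadratic_growth:.0f}x")
print(f"linear block cost grows {linear_growth:.0f}x")
```

Under these assumptions, a clip six times longer has six times as many tokens, so a quadratic block becomes 36 times more expensive while a linear block becomes only 6 times more expensive.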
LinGen replaces self-attention layers with a MATE block, which inherits linear complexity from its two branches: the MA-branch and the TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch introduces a novel Temporal Swin Attention block designed to capture correlations between spatially adjacent tokens and temporally medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos.
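The text names Rotary Major Scan as a token rearrangement method but does not specify it here. The following is a minimal sketch of one plausible reading, assuming that the 3D token grid (time, height, width) is flattened along a different axis order at each layer, so that tokens adjacent along different axes end up adjacent in the 1D sequence a Mamba-style block consumes. The set of scan orders, the function names, and the per-layer rotation rule are all assumptions for illustration, not the authors' implementation.

```python
from itertools import product

# Hypothetical set of axis orders; the source does not list the scans it uses.
# Each tuple lists axes from slowest-varying to fastest-varying.
SCAN_ORDERS = [
    ("t", "h", "w"),  # frame by frame, row by row
    ("t", "w", "h"),  # frame by frame, column by column
    ("h", "w", "t"),  # row-major over space, time varies fastest
    ("w", "h", "t"),  # column-major over space, time varies fastest
]


def scan_permutation(shape, order):
    """Flat indices (into the default t, h, w row-major layout) in scan order."""
    T, H, W = shape
    sizes = {"t": T, "h": H, "w": W}
    perm = []
    for coords in product(*(range(sizes[axis]) for axis in order)):
        pos = dict(zip(order, coords))
        perm.append((pos["t"] * H + pos["h"]) * W + pos["w"])
    return perm


def rearrange(tokens, shape, layer_idx):
    """Reorder a flat token list with the scan assigned to this layer."""
    order = SCAN_ORDERS[layer_idx % len(SCAN_ORDERS)]
    perm = scan_permutation(shape, order)
    return [tokens[i] for i in perm], perm


def restore(reordered, perm):
    """Undo rearrange so that later blocks see the original layout."""
    out = [None] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        out[old_pos] = reordered[new_pos]
    return out


tokens = list(range(12))  # a 2 x 2 x 3 grid of token ids
reordered, perm = rearrange(tokens, shape=(2, 2, 3), layer_idx=2)
assert restore(reordered, perm) == tokens
```

Because each rearrangement is a fixed permutation, it adds only linear-time index shuffling, which is consistent with the linear-complexity claim for the MA-branch.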
[Video comparisons: LumaLabs, Runway Gen3, Kling 1.5, and LinGen (Ours); T2V-Turbo vs. LinGen (Ours); CogVideoX-5B vs. LinGen (Ours); OpenSora V1.2 vs. LinGen (Ours); Loong vs. LinGen (Ours)*]
*LinGen supports multiple aspect ratios; here we follow the baseline's setup and generate square videos.
[Video comparison: PA-VDM vs. LinGen (Ours)]
[Ablation videos: LinGen w/o TESA and RMS, LinGen w/o RMS, and LinGen; LinGen w/o vs. w/ review tokens; LinGen w/o vs. w/ Hybrid Training; LinGen w/o vs. w/ Quality-Tuning]
@article{wang2024lingen,
author = {Wang, Hongjie and Ma, Chih-Yao and Liu, Yen-Cheng and Hou, Ji and Xu, Tao and Wang, Jialiang and Juefei-Xu, Felix and Luo, Yaqiao and Zhang, Peizhao and Hou, Tingbo and Vajda, Peter and Jha, Niraj K. and Dai, Xiaoliang},
title = {LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity},
journal = {arXiv preprint arXiv:2412.09856},
year = {2024},
}