LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Demo Videos

Minute-Length Videos

17-second Videos

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos only 10-20 seconds long.

We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces self-attention, the computationally dominant, quadratic-complexity block, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch (see details below).

Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15× and latency by up to 11.5×. Furthermore, both automatic metrics and human evaluation demonstrate that LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation.

Framework Overview

LinGen replaces self-attention layers with a MATE block, which inherits linear complexity from its two branches: MA-branch and TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch introduces a novel Temporal Swin Attention block designed to capture correlations between spatially adjacent tokens and temporally medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly.
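To make the idea of a token rearrangement method concrete, here is a minimal sketch of rotating the flattening order of a (frames, height, width) token grid from layer to layer, so that a 1D sequence model sees different neighbors adjacent in different layers. This page does not specify which scan orders Rotary Major Scan uses, so the four axis orders in `SCAN_ORDERS` and the helper names `scan_permutation`, `rearrange`, and `restore` are hypothetical and serve only as an illustration.

```python
from itertools import product

# Hypothetical axis orders (an assumption, not taken from the source).
# The first axis varies slowest ("major"); the last axis varies fastest.
SCAN_ORDERS = [
    ("t", "h", "w"),
    ("t", "w", "h"),
    ("h", "w", "t"),
    ("w", "h", "t"),
]


def scan_permutation(shape, layer_idx):
    """Return perm where perm[k] is the row-major (t, h, w) flat index
    of the k-th token visited under this layer's scan order."""
    _, height, width = shape
    sizes = dict(zip("thw", shape))
    order = SCAN_ORDERS[layer_idx % len(SCAN_ORDERS)]
    perm = []
    for coords in product(*(range(sizes[axis]) for axis in order)):
        pos = dict(zip(order, coords))
        perm.append((pos["t"] * height + pos["h"]) * width + pos["w"])
    return perm


def rearrange(tokens, perm):
    """Reorder a row-major token list into the scan order given by perm."""
    return [tokens[i] for i in perm]


def restore(scanned, perm):
    """Undo rearrange, returning tokens to row-major (t, h, w) order."""
    out = [None] * len(perm)
    for k, i in enumerate(perm):
        out[i] = scanned[k]
    return out


if __name__ == "__main__":
    shape = (2, 2, 3)  # toy grid: 2 frames, 2 rows, 3 columns
    tokens = list(range(2 * 2 * 3))
    for layer in range(4):
        perm = scan_permutation(shape, layer)
        scanned = rearrange(tokens, perm)
        assert restore(scanned, perm) == tokens
        print(layer, scanned)
```

Because each scan is a permutation, building it and undoing it both cost time linear in the number of tokens, so a rearrangement of this kind does not change the overall complexity class of the block it wraps.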

Exceptional Efficiency with Linear Computational Complexity
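The difference between quadratic and linear scaling can be illustrated with a rough operation count. The sketch below compares full self-attention, whose cost grows with the square of the token count, against a generic block in which each token interacts with a fixed number of other tokens. It is not LinGen's actual cost model: the function names, the hidden size, the window size, and the latent grid dimensions are all assumptions chosen for illustration.

```python
def token_count(frames: int, height: int, width: int) -> int:
    """Number of tokens in a (frames, height, width) latent grid."""
    return frames * height * width


def full_attention_cost(num_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for full self-attention:
    one N x N score matrix plus one N x N weighted sum, each over dim channels."""
    return 2 * num_tokens * num_tokens * dim


def fixed_window_cost(num_tokens: int, dim: int, window: int) -> int:
    """Rough cost when every token interacts with only `window` tokens.
    This is a generic linear-cost stand-in, not the MATE block itself."""
    return 2 * num_tokens * window * dim


if __name__ == "__main__":
    dim, window = 1024, 512  # hypothetical hidden size and window size
    short = token_count(frames=16, height=32, width=32)  # hypothetical short clip
    long = token_count(frames=64, height=32, width=32)   # 4x more frames
    for name, n in (("short", short), ("long", long)):
        ratio = full_attention_cost(n, dim) / fixed_window_cost(n, dim, window)
        print(f"{name}: {n} tokens, quadratic/linear cost ratio = {ratio:.0f}")
```

In this toy model, quadrupling the number of tokens multiplies the full-attention cost by 16 but the fixed-window cost by only 4, which is why the gap between the two widens as videos get longer or higher in resolution.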

Comparisons with Existing Works

State-of-the-Art Commercial Models

Prompt: A fish swimming into a coffee shop and trying to order.

LumaLabs

Runway Gen3

Kling 1.5

LinGen (Ours)

Prompt: Camera zoom in. A chef chopping vegetables with speed.

LumaLabs

Runway Gen3

Kling 1.5

LinGen (Ours)

Typical Open-Source Models

Prompt: A dog wearing VR goggles on a boat.

T2V-Turbo

LinGen (Ours)

Prompt: Elderly artist painting by the sea.

CogVideoX-5B

LinGen (Ours)

Prompt: Noir street: neon, shadows, solitary walker.

OpenSora V1.2

LinGen (Ours)

Minute-Length Trials

Prompt: Aerial view of Santorini during the blue hour.

Loong

LinGen (Ours)*

*LinGen supports multiple aspect ratios; here we follow the baseline's setup and generate square videos.

PA-VDM does not provide its prompts, so we show a similar video generated by LinGen.

PA-VDM

LinGen (Ours)

Ablation Experiments

After 30K Pre-Training Steps at the 256p Resolution and the 17-Second Length

LinGen w/o TESA and RMS

LinGen w/o RMS

LinGen

After 2K Pre-Training Steps at the 512p Resolution and the 68-Second Length

LinGen w/o review tokens

LinGen w/ review tokens

Showing a Failure Case in which Consistency is Abnormally Bad at 256p Resolution

LinGen w/o Hybrid Training

LinGen w/ Hybrid Training

Showing a Failure Case in which Quality is Abnormally Bad at 512p Resolution

LinGen w/o Quality-Tuning

LinGen w/ Quality-Tuning

BibTeX

@inproceedings{wang2025lingen,
  title={{LinGen}: Towards high-resolution minute-length text-to-video generation with linear computational complexity},
  author={Wang, Hongjie and Ma, Chih-Yao and Liu, Yen-Cheng and Hou, Ji and Xu, Tao and Wang, Jialiang and Juefei-Xu, Felix and Luo, Yaqiao and Zhang, Peizhao and Hou, Tingbo and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={2578--2588},
  year={2025}
}