AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
4. Zhongguancun Laboratory, 5. University of Science and Technology of China, 6. KAUST.

Abstract
Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to incoherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures with respect to the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascaded Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos from a reference image and a given audio. 1) Firstly, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance the hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as a bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.
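To make the cascaded design concrete, below is a minimal, hypothetical sketch (not the authors' released code) of the two-stage pipeline described above: a Holistic Human DiT that cross-attends video tokens to audio tokens to drive full-body motion, followed by a Regional Refinement DiT applied to face/hand regions. All module names, tensor shapes, and the region-cropping stand-in (which the paper obtains via regional 3D fitting) are illustrative assumptions.

import torch
import torch.nn as nn


class AudioConditionedDiT(nn.Module):
    """Placeholder DiT block: video tokens cross-attend to audio tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # Audio conditioning via cross-attention, then a feed-forward update.
        attended, _ = self.cross_attn(video_tokens, audio_tokens, audio_tokens)
        x = video_tokens + attended
        return x + self.mlp(x)


def cascade_generate(reference_tokens, audio_tokens, holistic_dit, refine_dit, crop_regions):
    """Stage 1 drives holistic body motion from audio; stage 2 refines regional crops."""
    holistic = holistic_dit(reference_tokens, audio_tokens)   # full-body tokens
    region_tokens = crop_regions(holistic)                    # face/hand regions (3D-fitting proxy)
    refined_regions = refine_dit(region_tokens, audio_tokens) # regional refinement
    return holistic, refined_regions


if __name__ == "__main__":
    holistic_dit = AudioConditionedDiT()
    refine_dit = AudioConditionedDiT()
    ref = torch.randn(1, 64, 256)   # reference-image tokens (hypothetical shape)
    aud = torch.randn(1, 32, 256)   # audio tokens (hypothetical shape)
    crop = lambda x: x[:, :16]      # stand-in for region cropping via 3D fitting
    full, regions = cascade_generate(ref, aud, holistic_dit, refine_dit, crop)
    print(full.shape, regions.shape)

This sketch only illustrates how the two stages would compose; in the actual framework each stage is a full diffusion transformer with its own denoising process rather than a single forward pass.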
Results
Comparisons with S2G
Comparisons with Audio2Gesture2Video Baselines
Comparisons with Vlogger's Demo Videos
Comparisons with Cyberhost's Demo Videos
Materials
Citation
@inproceedings{guan2025audcast,
  title     = {AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers},
  author    = {Guan, Jiazhi and Wang, Kaisiyuan and Xu, Zhiliang and Yang, Quanwei and Sun, Yasheng and He, Shengyi and Liang, Borong and Cao, Yukang and Li, Yingying and Feng, Haocheng and Ding, Errui and Wang, Jingdong and Zhao, Youjian and Zhou, Hang and Liu, Ziwei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}