AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Jiazhi Guan1,2   Kaisiyuan Wang2   Zhiliang Xu2   Quanwei Yang5   Yasheng Sun6   Shengyi He2  
Borong Liang2   Yukang Cao3   Yingying Li2   Haocheng Feng2   Errui Ding2   Jingdong Wang2  
Youjian Zhao1,4†   Hang Zhou2†   Ziwei Liu3†  
1. DCST, Tsinghua University,   2. Baidu Inc.,   3. S-Lab, Nanyang Technological University,
4. Zhongguancun Laboratory,   5. University of Science and Technology of China,   6. KAUST.
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Abstract


Despite recent progress in audio-driven video generation, existing methods mostly focus on driving facial movements, which leads to incoherent head and body dynamics. It is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures with respect to the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework that adopts a cascaded Diffusion Transformer (DiT) paradigm and synthesizes holistic human videos from a reference image and a given audio clip. 1) First, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details, which are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as a bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.
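The two-stage cascade described in the abstract can be pictured as a simple data flow: a first stage produces a coarse holistic video from the reference image and audio, a fitting step derives regional signals from that coarse result, and a second stage uses those signals to refine it. The Python sketch below illustrates only that ordering. Every name in it (`run_cascade`, `toy_holistic`, `toy_fit`, `toy_refine`, `CascadeOutput`) is hypothetical, and the toy stages are arithmetic placeholders, not the Holistic Human DiT, the regional 3D fitting, or the Regional Refinement DiT themselves.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # toy stand-in for one video frame (or its latent)


@dataclass
class CascadeOutput:
    coarse: List[Frame]   # result of the first (holistic) stage
    refined: List[Frame]  # result of the second (refinement) stage


def run_cascade(
    reference: Frame,
    audio: List[float],
    holistic_stage: Callable[[Frame, List[float]], List[Frame]],
    fit_regions: Callable[[List[Frame]], List[float]],
    refine_stage: Callable[[List[Frame], List[float]], List[Frame]],
) -> CascadeOutput:
    """Chain the stages in the order the abstract describes (assumed interfaces)."""
    coarse = holistic_stage(reference, audio)   # stage 1: audio-driven frames
    regional = fit_regions(coarse)              # bridge: one signal per frame
    refined = refine_stage(coarse, regional)    # stage 2: refine using signals
    return CascadeOutput(coarse=coarse, refined=refined)


# Toy placeholders so the sketch runs; they carry no modeling meaning.
def toy_holistic(reference: Frame, audio: List[float]) -> List[Frame]:
    # One frame per audio feature: shift the reference by that feature.
    return [[p + a for p in reference] for a in audio]


def toy_fit(frames: List[Frame]) -> List[float]:
    # Placeholder "regional signal": the mean value of each frame.
    return [sum(f) / len(f) for f in frames]


def toy_refine(frames: List[Frame], signals: List[float]) -> List[Frame]:
    # Placeholder refinement: average each value with its frame's signal.
    return [[0.5 * (v + s) for v in f] for f, s in zip(frames, signals)]


out = run_cascade([0.0, 1.0], [0.1, 0.2, 0.3], toy_holistic, toy_fit, toy_refine)
```

Passing the stages in as callables keeps the sketch independent of any particular model: the only thing it asserts is that the refinement stage consumes both the coarse output and the signals fitted from it.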

Results

Comparisons with S2G


Comparisons with Audio2Gesture2Video Baselines


Comparisons with Vlogger's Demo Videos


Comparisons with Cyberhost's Demo Videos

Materials

Citation

@inproceedings{guan2024resyncer,
  title = {AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers},
  author = {Guan, Jiazhi and Wang, Kaisiyuan and Xu, Zhiliang and Yang, Quanwei and Sun, Yasheng and He, Shengyi and Liang, Borong and Cao, Yukang and Li, Yingying and Feng, Haocheng and Ding, Errui and Wang, Jingdong and Zhao, Youjian and Zhou, Hang and Liu, Ziwei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2025}
}