ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan1,2*   Zhiliang Xu2*   Hang Zhou2†   Kaisiyuan Wang2   Shengyi He2   Zhanwang Zhang2  
Borong Liang2   Haocheng Feng2   Errui Ding2   Jingtuo Liu2   Jingdong Wang2   Youjian Zhao1,3†   Ziwei Liu4  
1. BNRist, DCST, Tsinghua University,   2. Baidu Inc.,
3. Zhongguancun Laboratory,   4. S-Lab, Nanyang Technological University.
European Conference on Computer Vision (ECCV) 2024

Abstract


Lip-syncing videos to given audio is the foundation of various applications, including the creation of virtual presenters and performers. While recent studies have explored high-fidelity lip-sync with different techniques, their task-oriented models either require long videos for clip-specific training or retain visible artifacts. In this paper, we propose ReSyncer, a unified and effective framework that synchronizes generalized audio-visual facial information. The key design is to revisit and rewire the Style-based generator so that it efficiently adopts the 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style spaces, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
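
To make the rewiring concrete, below is a minimal PyTorch sketch, assuming a StyleGAN2-style backbone. It is an illustration written from the description above, not the actual ReSyncer code: the names MotionStyleTransformer and RewiredSynthesisBlock, and all dimensions, are hypothetical. The sketch routes a Transformer-predicted motion code through the style space (weight modulation), while reference appearance features are injected through the slot StyleGAN2 normally reserves for per-pixel noise.

# Minimal sketch of the "rewired" generator idea (an illustration written
# from the abstract, NOT the released ReSyncer implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionStyleTransformer(nn.Module):
    # Hypothetical stand-in for the style-injected Transformer that maps a
    # sequence of audio features to a style (motion) code.
    def __init__(self, audio_dim=256, style_dim=512, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, style_dim)
        layer = nn.TransformerEncoderLayer(d_model=style_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_style = nn.Linear(style_dim, style_dim)

    def forward(self, audio_feats):               # (B, T, audio_dim)
        h = self.encoder(self.proj(audio_feats))  # (B, T, style_dim)
        return self.to_style(h.mean(dim=1))       # (B, style_dim)

class RewiredSynthesisBlock(nn.Module):
    # StyleGAN2-style modulated 3x3 conv, with appearance features injected
    # where random per-pixel noise would normally be added.
    def __init__(self, in_ch, out_ch, style_dim=512):
        super().__init__()
        self.affine = nn.Linear(style_dim, in_ch)    # style -> channel scales
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        self.app_proj = nn.Conv2d(in_ch, out_ch, 1)  # appearance -> "noise" slot
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, style, appearance):         # appearance: (B, in_ch, H, W)
        B, C, H, W = x.shape
        s = self.affine(style).view(B, 1, C, 1, 1)
        w = self.weight.unsqueeze(0) * s              # modulate per sample
        demod = torch.rsqrt((w ** 2).sum(dim=[2, 3, 4]) + 1e-8)
        w = w * demod.view(B, -1, 1, 1, 1)            # demodulate
        # Grouped-conv trick: one modulated weight set per batch element.
        x = F.conv2d(x.reshape(1, B * C, H, W),
                     w.reshape(-1, C, 3, 3), padding=1, groups=B)
        x = x.reshape(B, -1, H, W)
        x = x + self.app_proj(appearance)             # rewired noise injection
        return F.leaky_relu(x + self.bias.view(1, -1, 1, 1), 0.2)

# Shape check with dummy tensors:
style = MotionStyleTransformer()(torch.randn(2, 50, 256))
block = RewiredSynthesisBlock(64, 128)
out = block(torch.randn(2, 64, 32, 32), style, torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 128, 32, 32])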

Demo Video


Our Results

Feel free to download our results for use in your comparisons. If you would like to include our method in your comparisons using your own video-audio pairs, you can send me a request via email (guanjz20 [at] mails.tsinghua.edu.cn). Please ensure:
  1. Template videos are from open-source datasets and at 25 FPS.
  2. Driving audio is from open-source datasets and in .wav format (see the sample conversion commands after this list).
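
For convenience, the snippet below shows one way to prepare such files with ffmpeg called from Python. It is an illustration rather than an official preprocessing script; the file names and the 16 kHz mono setting for the .wav output are assumptions.

import subprocess

# Re-encode the template video at 25 FPS (hypothetical file names).
subprocess.run(["ffmpeg", "-y", "-i", "template.mp4",
                "-r", "25", "template_25fps.mp4"], check=True)

# Convert the driving audio to .wav (16 kHz mono is an assumption;
# match whatever your audio frontend expects).
subprocess.run(["ffmpeg", "-y", "-i", "driving_audio.mp3",
                "-ar", "16000", "-ac", "1", "driving_audio.wav"], check=True)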

Cross-Driven Results (Generic Model)

[Video grid: four cross-driven examples. Columns per example: Template Video, Driving Video (Audio), StyleSync, ReSyncer; the fourth example additionally includes ReSyncer (Style-Trans).]

Self-Driven Results w/ Swapped Identity (Generic Model)

[Video grid: three self-driven examples with swapped identity. Columns per example: Template Video, Identity Source, StyleSwap, ReSyncer.]

Cross-Driven Results w/ Swapped Identity (Generic Model)

[Video grid: two cross-driven examples with swapped identity. Columns per example: Template Video, Identity & Driving Source, StyleSwap+StyleSync, ReSyncer.]

Cross-Driven Results (Personalized Model)

[Video grid: four cross-driven examples with the personalized model. Columns per example: Template Video, Driving Video (Audio), StyleSync, ReSyncer.]

Materials

Citation

@inproceedings{guan2024resyncer,
  title = {ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer},
  author = {Guan, Jiazhi and Xu, Zhiliang and Zhou, Hang and Wang, Kaisiyuan and He, Shengyi and Zhang, Zhanwang and Liang, Borong and Feng, Haocheng and Ding, Errui and Liu, Jingtuo and Wang, Jingdong and Zhao, Youjian and Liu, Ziwei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year = {2024}
}

@inproceedings{guan2023stylesync,
  title = {StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator},
  author = {Guan, Jiazhi and Zhang, Zhanwang and Zhou, Hang and Hu, Tianshu and Wang, Kaisiyuan and He, Dongliang and Feng, Haocheng and Liu, Jingtuo and Ding, Errui and Liu, Ziwei and Wang, Jingdong},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2023}
}