TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model

Jiazhi Guan1   Quanwei Yang4   Kaisiyuan Wang2   Hang Zhou2†   Shengyi He2   Zhiliang Xu2  
Haocheng Feng2   Errui Ding2   Jingdong Wang2   Hongtao Xie4   Youjian Zhao1,3†   Ziwei Liu5  
1. DCST, BNRist, Tsinghua University,   2. Baidu Inc.,   3. Zhongguancun Laboratory,
4. University of Science and Technology of China,   5. S-Lab, Nanyang Technological University.
SIGGRAPH Asia Conference Proceedings (SIGGRAPH Asia), 2024

Abstract


Recently, 2D speaking avatars have become increasingly common in everyday scenarios thanks to the rapid development of facial animation techniques. However, most existing works neglect explicit control of the human body. In this paper, we propose to drive not only the face but also the torso and gesture movements of a speaking figure. Inspired by recent advances in diffusion models, we propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which enables high-fidelity avatar reenactment from only short footage of monocular video. Our key idea is to enhance textural awareness with explicit motion guidance in diffusion modeling. Specifically, we carefully construct 2D and 3D structural information as intermediate guidance. While recent diffusion models adopt a side network to inject control information, they fail to synthesize temporally stable results even with person-specific fine-tuning. We propose a Motion-Enhanced Textural Alignment module to strengthen the correspondence between the driving and target signals. Moreover, we build a Memory-based Hand-Recovering module to address the difficulty of preserving hand shapes. After pre-training, our model achieves high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data. Extensive experiments demonstrate the effectiveness and superiority of our proposed framework.
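
To make the conditioning idea concrete, the sketch below illustrates one plausible shape such a block could take: structural (pose) guidance is injected through a side-branch projection, while reference-texture features are attended to via cross-attention. This is a minimal illustration under our own assumptions about module names and tensor shapes, not the authors' released implementation of the Motion-Enhanced Textural Alignment module.

# Illustrative sketch only (assumed design, not the paper's code): fuse denoiser features
# with (a) pose guidance added residually from a side branch and (b) reference-texture
# tokens aligned via cross-attention.
import torch
import torch.nn as nn


class TexturalAlignmentBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.pose_proj = nn.Conv2d(dim, dim, kernel_size=1)   # side-branch projection of pose features
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, pose_feat, ref_feat):
        # x, pose_feat: (B, C, H, W) denoiser / pose-guidance feature maps
        # ref_feat:     (B, N, C)   flattened reference-texture tokens
        x = x + self.pose_proj(pose_feat)                      # explicit motion guidance
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        attended, _ = self.cross_attn(self.norm(tokens), ref_feat, ref_feat)
        tokens = tokens + attended                             # texture-aware residual
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = TexturalAlignmentBlock(dim=64)
    x = torch.randn(1, 64, 32, 32)
    pose = torch.randn(1, 64, 32, 32)
    ref = torch.randn(1, 256, 64)
    print(block(x, pose, ref).shape)  # torch.Size([1, 64, 32, 32])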

Demo Video


Our Results

Feel free to download our results for use in your comparisons.

Cross-Driven Results

[Video gallery: each example shows a driving video (providing pose), a reference frame of the target person, and the resulting driven video.]

Self-Driven Results

[Video gallery: each example shows a driving video, a reference frame, and the resulting driven video.]

Materials

Citation

@inproceedings{guan2024talkact,
  title = {TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model},
  author = {Guan, Jiazhi and Yang, Quanwei and Wang, Kaisiyuan and Zhou, Hang and He, Shengyi and Xu, Zhiliang and Feng, Haocheng and Ding, Errui and Wang, Jingdong and Xie, Hongtao and Zhao, Youjian and Liu, Ziwei},
  booktitle = {SIGGRAPH Asia 2024 Conference Papers},
  year = {2024}
}