UniFLG: Unified Facial Landmark Generator from Text or Speech

Preprint: arXiv:2302.14337
Accepted to INTERSPEECH 2023
Kentaro Mitsui, Yukiya Hono, Kei Sawada
rinna Co., Ltd.

Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG). The proposed system exploits end-to-end text-to-speech not only for synthesizing speech but also for extracting a series of latent representations that are common to text and speech, and feeds it to a landmark decoder to generate facial landmarks. We demonstrate that our system achieves higher naturalness in both speech synthesis and facial landmark generation compared to the state-of-the-art text-driven method. We further demonstrate that our system can generate facial landmarks from speech of speakers without facial video data or even speech data.
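To make the pipeline described in the abstract more concrete, below is a minimal, hypothetical sketch in PyTorch: a latent sequence is obtained either from text (a stand-in for the end-to-end TTS encoder) or from speech, and a shared landmark decoder maps that sequence to per-frame facial landmarks. All module names, layer choices, and dimensions here are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of the unified text/speech -> facial landmark idea.
# The specific layers (embeddings, GRUs, linear projections) and sizes are
# assumptions for illustration only.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Token IDs -> latent sequence (stand-in for the TTS-side encoder)."""
    def __init__(self, vocab_size=64, latent_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, latent_dim)
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, token_ids):
        latents, _ = self.rnn(self.embed(token_ids))
        return latents


class SpeechEncoder(nn.Module):
    """Mel-spectrogram frames -> the same latent space (speech-driven path)."""
    def __init__(self, n_mels=80, latent_dim=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, latent_dim)
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, mels):
        latents, _ = self.rnn(self.proj(mels))
        return latents


class LandmarkDecoder(nn.Module):
    """Shared decoder: latent sequence -> 68 two-dimensional landmarks per frame."""
    def __init__(self, latent_dim=128, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, n_landmarks * 2)

    def forward(self, latents):
        h, _ = self.rnn(latents)
        coords = self.out(h)
        return coords.view(coords.size(0), coords.size(1), -1, 2)


if __name__ == "__main__":
    decoder = LandmarkDecoder()

    # Text-driven path: tokens -> latents -> landmarks.
    text_latents = TextEncoder()(torch.randint(0, 64, (1, 20)))
    print(decoder(text_latents).shape)    # (1, 20, 68, 2)

    # Speech-driven path: mel frames -> latents -> same decoder.
    speech_latents = SpeechEncoder()(torch.randn(1, 120, 80))
    print(decoder(speech_latents).shape)  # (1, 120, 68, 2)
```

The point of the sketch is only the shared decoder: both input modalities are mapped into one latent space, so a single landmark decoder serves the text-driven and speech-driven cases.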


1. Generated speech and facial landmark quality

Description: Speech and facial landmark samples generated by the state-of-the-art text-driven method (AVTacotron2) and the proposed UniFLG variants (UniFLG-P, UniFLG-T, UniFLG-S) for three emotions (Normal, Happy, Sad).

Samples of a speaker with paired speech and talking face video data (Paired dataset)

(Audio/video samples: AVTacotron2, UniFLG-P, UniFLG-T, and UniFLG-S, each for the Normal, Happy, and Sad emotions.)

Samples of a speaker with speech-only data (Unpaired dataset)

(Audio/video samples: AVTacotron2, UniFLG-P, UniFLG-T, and UniFLG-S, each for the Normal, Happy, and Sad emotions.)

2. Facial landmark generation for unseen speakers

Description: Facial landmark samples generated from the speech of speakers unseen during training, comparing STL-D and UniFLG-AS-S.

Female speaker samples

(Samples: STL-D and UniFLG-AS-S, each for the Normal, Happy, and Sad emotions.)

Male speaker samples

(Samples: STL-D and UniFLG-AS-S, each for the Normal, Happy, and Sad emotions.)