Talking face generation has been extensively investigated owing to its wide applicability.
The two primary frameworks for talking face generation are a text-driven framework,
which generates synchronized speech and talking faces from text, and a speech-driven framework,
which generates talking faces from speech.
To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG).
The proposed system exploits end-to-end text-to-speech not only to synthesize speech but also
to extract a series of latent representations that are common to text and speech, and feeds
them to a landmark decoder to generate facial landmarks.
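The key idea above is that one frame-level latent sequence, shared between text and speech, drives both a speech decoder and a landmark decoder. A minimal NumPy sketch of that data flow is below; all dimensions, function names, and the linear stand-ins for the trained networks are hypothetical, chosen only to illustrate the two inference paths, not the actual UniFLG architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
D_LATENT = 8         # shared text/speech latent dimension
D_MEL = 80           # mel-spectrogram bins
D_LANDMARK = 68 * 2  # 68 (x, y) facial landmarks per frame

# Stand-ins for trained networks: single random linear maps.
W_speech = rng.standard_normal((D_LATENT, D_MEL))
W_landmark = rng.standard_normal((D_LATENT, D_LANDMARK))

def text_encoder(num_frames: int) -> np.ndarray:
    """Stand-in for the TTS encoder: a frame-level latent sequence from text."""
    return rng.standard_normal((num_frames, D_LATENT))

def speech_encoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for projecting speech into the same latent space."""
    return mel @ W_speech.T / D_MEL

def speech_decoder(z: np.ndarray) -> np.ndarray:
    return z @ W_speech          # latent sequence -> mel-spectrogram

def landmark_decoder(z: np.ndarray) -> np.ndarray:
    return z @ W_landmark        # latent sequence -> landmark coordinates

# Text-driven inference: one latent sequence drives both decoders,
# so speech and landmarks are frame-synchronous by construction.
z = text_encoder(num_frames=100)
mel = speech_decoder(z)
landmarks = landmark_decoder(z)

# Speech-driven inference: the latents come from speech instead,
# and the same landmark decoder is reused unchanged.
z_from_speech = speech_encoder(mel)
landmarks_s = landmark_decoder(z_from_speech)
```

Because the landmark decoder only ever sees the shared latent space, it does not care whether the latents originated from text or from speech, which is what allows the two inference modes to coexist in one system.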
We demonstrate that our system achieves higher naturalness in both speech synthesis and facial
landmark generation compared to the state-of-the-art text-driven method.
We further demonstrate that our system can generate facial landmarks from speech of speakers
without facial video data or even speech data.
1. Generated speech and facial landmark quality
AVTacotron2 [Abdelaziz et al., 2021] is a
conventional autoregressive audiovisual speech synthesizer that generates speech and facial landmarks
jointly from text.
UniFLG-P uses the proposed system but first synthesizes speech from text and then
generates facial landmarks in a speech-driven manner.
UniFLG-T is the text-driven inference of the proposed system, which simultaneously
generates speech and facial landmarks from text.
UniFLG-S is the speech-driven inference of the proposed system, which generates facial
landmarks from speech (here, the ground-truth speech).
Samples of a speaker with paired speech and talking face video data (Paired dataset)
Samples of a speaker with speech-only data (Unpaired dataset)
2. Facial landmark generation for unseen speakers
STL-D directly feeds the amplitude spectrogram to the landmark decoder.
UniFLG-AS-S uses the proposed system for speech-driven facial landmark generation.
Unlike in UniFLG, speakers are represented by utterance-level latent variables, and emotions
by one-hot encodings.
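The two conditioning schemes above can be sketched briefly: a speaker is summarized by a single utterance-level latent vector (e.g. pooled over frames), an emotion by a one-hot code, and both are broadcast to every frame of the decoder input. All names and dimensions here are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_EMOTIONS = 4  # hypothetical emotion inventory
D_SPK = 16        # hypothetical speaker-latent dimension
D_LATENT = 8      # hypothetical frame-level latent dimension
NUM_FRAMES = 120

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Utterance-level speaker latent: one vector per utterance,
# e.g. obtained by averaging frame-level features.
frame_feats = rng.standard_normal((NUM_FRAMES, D_SPK))
speaker_latent = frame_feats.mean(axis=0)

# Emotion as a one-hot encoding (index 2 chosen arbitrarily).
emotion = one_hot(2, NUM_EMOTIONS)

# Broadcast the utterance-level condition to every frame and
# concatenate it with the frame-level latent sequence.
z = rng.standard_normal((NUM_FRAMES, D_LATENT))
cond = np.concatenate([speaker_latent, emotion])
decoder_input = np.concatenate(
    [z, np.tile(cond, (NUM_FRAMES, 1))], axis=1
)
```

A continuous utterance-level latent lets the system place unseen speakers in a shared space at inference time, whereas a one-hot code fixes the set of emotions at training time.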