End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
preprint: arXiv:2206.12040
accepted for INTERSPEECH 2022
Kentaro Mitsui*, Tianyu Zhao*, Kei Sawada*, Yukiya Hono**, Yoshihiko Nankaku**, Keiichi Tokuda**
*rinna Co., Ltd., **Nagoya Institute of Technology
Recent text-to-speech (TTS) systems have achieved quality comparable to that of human speech; however, their application to spoken dialogue has not been widely studied.
This study aims to realize a TTS system that closely resembles human dialogue.
First, we record and transcribe actual spontaneous dialogues.
Then, the proposed dialogue TTS is trained in two stages.
In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; these models introduce an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model.
A style encoder that extracts a latent speaking style representation from speech is trained jointly with the TTS model.
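As a rough illustration of such an utterance-level style encoder, the following is a minimal PyTorch sketch, not the authors' architecture: it pools a mel-spectrogram over time and outputs the parameters of a Gaussian posterior over a single style vector, as in a VAE. All layer sizes and the names StyleEncoder, n_mels, and style_dim are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps a variable-length mel-spectrogram to a Gaussian posterior
    over one utterance-level style vector (illustrative sketch)."""

    def __init__(self, n_mels: int = 80, hidden_dim: int = 256, style_dim: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.proj_mean = nn.Linear(hidden_dim, style_dim)
        self.proj_logvar = nn.Linear(hidden_dim, style_dim)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, n_mels, frames)
        h = self.conv(mel)                # (batch, hidden_dim, frames)
        h = h.mean(dim=2)                 # average over time -> utterance level
        mean = self.proj_mean(h)
        logvar = self.proj_logvar(h)
        # Reparameterization trick: sample the style latent during training.
        style = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return style, mean, logvar
```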
In the second stage, a style predictor is trained to predict, from the dialogue history, the speaking style to be synthesized.
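A minimal sketch of such a style predictor is shown below, assuming the dialogue history has already been converted into a sequence of per-utterance feature vectors (e.g., text embeddings concatenated with the style vectors extracted by the style encoder). The GRU-based design and all dimensions are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn

class StylePredictor(nn.Module):
    """Predicts the style vector of the next utterance from a sequence
    of per-utterance history features (illustrative sketch)."""

    def __init__(self, history_dim: int = 784, hidden_dim: int = 256, style_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRU(history_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, style_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, n_past_utterances, history_dim)
        _, last_hidden = self.rnn(history)       # (1, batch, hidden_dim)
        return self.out(last_hidden.squeeze(0))  # (batch, style_dim)
```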
During inference, the speaking style representation predicted by the style predictor is passed to VAE/GMVAE-VITS, so that speech can be synthesized in a style appropriate to the dialogue context.
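The inference step could then be combined as in the following sketch; tts_model and its infer method are hypothetical placeholders standing in for a VITS-style synthesizer conditioned on the predicted style, and the feature preparation is assumed to have been done beforehand.

```python
import torch

@torch.no_grad()
def synthesize_reply(tts_model, style_predictor, history_feats, phoneme_ids):
    """history_feats: (1, n_past_utterances, history_dim) tensor of per-utterance
    dialogue-history features; phoneme_ids: (1, n_phonemes) tensor for the reply text.
    tts_model.infer is a hypothetical API for a VITS-style end-to-end synthesizer."""
    style = style_predictor(history_feats)                 # context-appropriate style vector
    waveform = tts_model.infer(phoneme_ids, style=style)   # condition synthesis on the style
    return waveform
```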
Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.