End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
accepted for INTERSPEECH 2022 Kentaro Mitsui*, Tianyu Zhao*, Kei Sawada*, Yukiya Hono**, Yoshihiko Nankaku**, Keiichi Tokuda**
*rinna Co., Ltd., **Nagoya Institute of Technology
Recent text-to-speech (TTS) systems have achieved quality comparable to that of human speech; however, their application to spoken dialogue has not been widely studied.
This study aims to realize a TTS system that closely resembles natural human dialogue.
First, we record and transcribe actual spontaneous dialogues.
Then, the proposed dialogue TTS model is trained in two stages.
In the first stage, a variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS model is trained; these models introduce an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model.
A style encoder that extracts a latent speaking-style representation from speech is trained jointly with the TTS model.
In the second stage, a style predictor is trained to predict, from the dialogue history, the speaking style of the utterance to be synthesized.
During inference, the speaking-style representation predicted by the style predictor is passed to VAE/GMVAE-VITS, so that speech can be synthesized in a style appropriate to the context of the dialogue.
Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
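The two-stage inference flow described above can be sketched as follows. This is an illustrative outline only: the function names and the dummy `style_predictor`/`tts` stand-ins are placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the proposed two-stage dialogue TTS inference.
# `style_predictor` and `tts` below are placeholders, not the authors' code.

def synthesize_next_turn(dialogue_history, text, style_predictor, tts):
    """Predict a context-appropriate speaking style, then synthesize speech."""
    style = style_predictor(dialogue_history)  # stage 2: history -> style z
    return tts(text, style)                    # stage 1: VAE/GMVAE-VITS

# Dummy stand-ins so the flow runs end to end.
style_predictor = lambda history: [0.0] * 16          # fixed 16-dim style vector
tts = lambda text, style: [0.0] * (80 * len(text))    # silent "waveform"

wav = synthesize_next_turn(["Hi!", "Hello there."], "How are you?",
                           style_predictor, tts)
```

The key design point is that the TTS model (stage 1) and the style predictor (stage 2) are decoupled: the same trained TTS model can consume either an oracle style extracted from recorded speech or a style predicted from dialogue history.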
VAE-oracle: Speech synthesized by the proposed variational autoencoder (VAE)-VITS, where the style representation is extracted from the recorded speech
GMVAE-oracle: Speech synthesized by the proposed Gaussian mixture variational autoencoder (GMVAE)-VITS, where the style representation is extracted from the recorded speech
VAE-predicted: Speech synthesized by the proposed VAE-VITS, where the style representation is predicted by the style predictor
GMVAE-predicted: Speech synthesized by the proposed GMVAE-VITS, where the style representation is predicted by the style predictor
1. Utterance-level speech samples
Speaker 1: 「あ、なんかそういう打楽器って専門が決まってるわけじゃないんだ。」 ("Oh, so with percussion like that, you don't actually specialize in one particular instrument.")
Speaker 2: 「うふふふふ！きな粉の量多すぎない？あれ。」 ("Hehehe! Isn't that way too much kinako? That one.")
2. Dialogue-level speech samples
To generate dialogue-level speech samples, we first computed the text-to-speech alignment over the recorded speech using the monotonic alignment search (MAS) algorithm [Kim et al., 2020].
Then, we used the resulting alignment during synthesis to align the start and end times of the synthesized speech with those of the recorded speech.
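As a rough illustration, the MAS dynamic program can be sketched as below. This is a simplified reimplementation based on the Glow-TTS formulation, not the code used in this work; the likelihood model producing `log_probs` is omitted, and it assumes the number of text tokens does not exceed the number of mel frames.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Find the monotonic text-to-frame alignment maximizing total
    log-likelihood, via dynamic programming (Glow-TTS-style MAS).

    log_probs: (T_text, T_mel) array; log_probs[i, j] is the
    log-likelihood of mel frame j under text token i (T_text <= T_mel).
    Returns an int array of length T_mel mapping each frame to a token.
    """
    T_text, T_mel = log_probs.shape
    # Q[i, j]: best total log-likelihood of a monotonic alignment
    # covering frames 0..j and ending at token i.
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                               # repeat token i
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move to token i
            Q[i, j] = max(stay, advance) + log_probs[i, j]
    # Backtrack from the last token at the last frame.
    alignment = np.zeros(T_mel, dtype=int)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        alignment[j] = i
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment
```

Applying the alignment found on the recorded speech at synthesis time fixes each token's duration, which is what keeps the synthesized turns time-aligned with the original dialogue.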
3. Sampling style representations from GMVAE-VITS prior
We randomly sampled one style representation from each latent class of the GMVAE-VITS prior (figure (a) above).
We then synthesized the same texts with each of these style representations.
Note that some latent classes (in these examples, classes 1, 3, 6, and 7) produced unnatural speech for the first text.
This is probably because these classes were used to model training data in which the text and speech do not match exactly.
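Sampling from the mixture prior amounts to picking a latent class and drawing from that class's Gaussian. The sketch below illustrates this; the class count, style dimensionality, and the toy class parameters are placeholders, whereas in GMVAE-VITS they are learned during training.

```python
import numpy as np

def sample_style_from_gmvae_prior(means, log_vars, k=None, rng=None):
    """Draw one style representation z from a GMVAE mixture prior.

    means, log_vars: (K, D) per-class Gaussian parameters
    (toy placeholders here; learned by GMVAE-VITS in the paper).
    k: latent class index; if None, chosen uniformly at random.
    """
    rng = rng or np.random.default_rng()
    K, D = means.shape
    if k is None:
        k = int(rng.integers(K))
    eps = rng.standard_normal(D)
    z = means[k] + np.exp(0.5 * log_vars[k]) * eps  # reparameterization
    return k, z

# Toy prior with K=8 classes and D=16-dimensional styles (illustrative).
means = np.random.default_rng(0).standard_normal((8, 16))
log_vars = np.zeros((8, 16))
for k in range(8):
    _, z = sample_style_from_gmvae_prior(means, log_vars, k=k)
    # Each z would be passed to VITS as the utterance-level style.
```

Drawing one sample per class, as done for these demos, makes it easy to hear what kind of speaking style each latent class has captured.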