End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

Preprint: arXiv:2206.12040
Accepted for INTERSPEECH 2022
Kentaro Mitsui*, Tianyu Zhao*, Kei Sawada*, Yukiya Hono**, Yoshihiko Nankaku**, Keiichi Tokuda**
*rinna Co., Ltd., **Nagoya Institute of Technology

Recent text-to-speech (TTS) systems have achieved quality comparable to that of human speech; however, their application to spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; these models introduce an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking-style representation from speech is trained jointly with the TTS model. In the second stage, a style predictor is trained to predict, from the dialogue history, the speaking style to be synthesized. During inference, the speaking-style representation predicted by the style predictor is passed to VAE/GMVAE-VITS, so that speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
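Below is a minimal PyTorch sketch of the two modules introduced by the two-stage training described above: an utterance-level style encoder (first stage) and a style predictor conditioned on dialogue history (second stage). All module, dimension, and variable names are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    # Stage 1 (sketch): maps a reference mel-spectrogram to an utterance-level
    # style latent via a Gaussian posterior; the latent conditions the TTS
    # model during joint training.
    def __init__(self, n_mels=80, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.proj = nn.Linear(128, 2 * latent_dim)

    def forward(self, mel):                        # mel: [B, T, n_mels]
        _, h = self.rnn(mel)                       # h: [1, B, 128]
        mean, logvar = self.proj(h[-1]).chunk(2, dim=-1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        return z, mean, logvar                     # (mean, logvar) feed the KL term

class StylePredictor(nn.Module):
    # Stage 2 (sketch): predicts the next utterance's style latent from an
    # embedding of the dialogue history; at inference time its output replaces
    # the style encoder's latent.
    def __init__(self, history_dim=768, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, history_embedding):          # [B, history_dim]
        return self.net(history_embedding)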


Description

1. Utterance-level speech samples

Speaker 1: 「あ、なんかそういう打楽器って専門が決まってるわけじゃないんだ。」 ("Oh, so with percussion like that, it's not like you have one fixed specialty.")

[Audio samples: Reference / VITS / VAE-oracle / GMVAE-oracle / VAE-predicted / GMVAE-predicted]

Speaker 2: 「うふふふふ!きな粉の量多すぎない?あれ。」 ("Hehehe! Isn't that way too much kinako (roasted soybean flour)? That one.")

[Audio samples: Reference / VITS / VAE-oracle / GMVAE-oracle / VAE-predicted / GMVAE-predicted]

2. Dialogue-level speech samples

To generate dialogue-level speech samples, we first computed the text-to-speech alignment of the recorded speech using the monotonic alignment search (MAS) algorithm [Kim et al., 2020].
We then used this alignment during synthesis so that the start and end times of each synthesized utterance match those of the recorded speech.
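For reference, the following is a minimal NumPy sketch of the MAS dynamic program of [Kim et al., 2020]: given the log-likelihood of every mel frame under every text token's latent distribution, it returns the best monotonic, surjective hard alignment. The function name and tensor layout are assumptions for illustration, not the exact implementation used here.

import numpy as np

def monotonic_alignment_search(log_prob):
    # log_prob: [T_text, T_mel] log-likelihood of each mel frame under each
    # text token's distribution; returns a 0/1 alignment of the same shape.
    T_text, T_mel = log_prob.shape
    assert T_text <= T_mel, "each text token needs at least one frame"
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_prob[0, 0]
    # Forward pass: frame i is assigned to token j either by staying on
    # token j or by advancing from token j-1.
    for i in range(1, T_mel):
        for j in range(max(0, T_text - T_mel + i), min(T_text, i + 1)):
            stay = Q[j, i - 1]
            advance = Q[j - 1, i - 1] if j > 0 else -np.inf
            Q[j, i] = log_prob[j, i] + max(stay, advance)
    # Backtracking: recover the best monotonic path.
    path = np.zeros((T_text, T_mel), dtype=np.int64)
    j = T_text - 1
    for i in range(T_mel - 1, -1, -1):
        path[j, i] = 1
        if i > 0 and j > 0 and Q[j - 1, i - 1] >= Q[j, i - 1]:
            j -= 1
    return path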

Speaker 1: 「なんか、ハマってる沼とかありますかー?」 ("So, do you have any 'rabbit holes' you're hooked on?")
Speaker 2: 「うん。」 ("Mm.")
Speaker 2: 「沼かー。もうでも沼っていうほどつかってる、」 ("A rabbit hole, huh. Well, nothing I'm so deep into that I'd call it that,")
Speaker 1: 「うん。」 ("Mm.")
Speaker 1: 「つかってる。」 ("...deep into.")
Speaker 2: 「あの分野は今はないかも。」 ("Maybe there's no such area for me right now.")

[Audio samples: Reference / VITS / VAE-oracle / GMVAE-oracle / VAE-predicted / GMVAE-predicted]

Speaker 1: 「あははははは!そうだよねえ。」 ("Ahahahaha! That's so true.")
Speaker 2: 「そう知らない情報もやっぱ2倍聞けるしおんなじ時間でも。」 ("Right, and you get to hear twice as much information you didn't know, even in the same amount of time.")
Speaker 1: 「うんうんうんうん!」 ("Yeah, yeah, yeah, yeah!")
Speaker 1: 「そっかあ。」 ("I see.")
Speaker 2: 「そうそれがねなんかね良かったことだなあ、最近だと。」 ("Yeah, that's been a good thing for me, you know, lately.")
Speaker 1: 「そうだよなんか、こんなに楽しいっけみたいな。なるよね?」 ("Right, like, 'was it always this much fun?' You get that feeling, don't you?")

[Audio samples: Reference / VITS / VAE-oracle / GMVAE-oracle / VAE-predicted / GMVAE-predicted]

3. Sampling style representations from GMVAE-VITS prior

We randomly sampled one style representation from each latent class of the GMVAE-VITS prior (figure (a) above).
We then synthesized the same text with each of these style representations.

Note that some latent classes (in these examples, classes 1, 3, 6, and 7) produced unnatural speech for the first text.
This is probably because these classes were used to model training data in which the text and speech do not match exactly.
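A minimal sketch of how such samples can be drawn, assuming the GMVAE prior is stored as per-class means and log-variances; the tensor names and the infer call are hypothetical placeholders, not the model's actual interface.

import torch

def sample_style_from_prior(prior_means, prior_logvars, class_id):
    # prior_means / prior_logvars: [num_classes, latent_dim] tensors
    # (hypothetical names); class_id selects one Gaussian component.
    mean = prior_means[class_id]
    std = torch.exp(0.5 * prior_logvars[class_id])
    return mean + std * torch.randn_like(std)

# Hypothetical usage: synthesize the same text once per latent class.
# for k in range(num_classes):
#     style = sample_style_from_prior(prior_means, prior_logvars, k)
#     wav = gmvae_vits.infer(text_ids, speaker_id, style=style)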

Normal text: 「やー、良かったことだ本当にじゃあ。うーん。」 ("Well, that really was a good thing, then. Hmm.")

Speaker 1: [audio samples, one per latent class 1-10]
Speaker 2: [audio samples, one per latent class 1-10]


Aizuchi (backchannel): 「うんうーん。」 ("Uh-huh, hmm.")

Speaker 1: [audio samples, one per latent class 1-10]
Speaker 2: [audio samples, one per latent class 1-10]


Laughter: 「あっはははは!」 ("Ahahaha!")

Speaker 1: [audio samples, one per latent class 1-10]
Speaker 2: [audio samples, one per latent class 1-10]