Towards human-like spoken dialogue generation between AI agents from written dialogue

Preprint: arXiv:2310.01088
Kentaro Mitsui, Yukiya Hono, Kei Sawada
rinna Co., Ltd.

The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues have several unique characteristics: they frequently include backchannels and laughter, and the smoothness of turn-taking significantly influences the fluidity of conversation. This study proposes CHATS (CHatty Agents Text-to-Speech), a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap. Experimental evaluations indicate that CHATS outperforms the text-to-speech baseline, producing spoken dialogues that are more interactive and fluid while retaining clarity and intelligibility.
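To make the two turn-taking cases concrete, the sketch below places utterances on a shared timeline, where a positive gap stands for silence before the next utterance and a negative gap stands for overlap. It is only an illustration of the timing relationship, not the CHATS model itself: the `Utterance` class, the `gap_after_s` field, and the `schedule` function are hypothetical names, and the gap values are supplied by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    duration_s: float
    # Hypothetical field: > 0 means silence before the next utterance,
    # < 0 means the next utterance starts before this one ends (overlap).
    gap_after_s: float = 0.0


def schedule(utterances):
    """Return (speaker, start_s, end_s) for each utterance on one timeline."""
    timeline = []
    cursor = 0.0  # start time of the next utterance
    for utt in utterances:
        start = max(cursor, 0.0)  # never start before the beginning of the audio
        end = start + utt.duration_s
        timeline.append((utt.speaker, start, end))
        cursor = end + utt.gap_after_s
    return timeline


dialogue = [
    Utterance("A", 2.0, gap_after_s=0.5),    # 0.5 s of silence, then B speaks
    Utterance("B", 1.5, gap_after_s=-0.25),  # A comes in 0.25 s before B finishes
    Utterance("A", 1.0),
]
for speaker, start, end in schedule(dialogue):
    print(f"{speaker}: {start:.2f} s to {end:.2f} s")
```

In this toy example, the third utterance starts at 3.75 s while the second ends at 4.00 s, so the two speakers overlap for 0.25 s.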

1. Test-set samples

We extracted written dialogues consisting of 10 turns each from the test set and generated the corresponding spoken dialogue segments using the proposed system. Below, we present two samples of the generated audio, along with the corresponding written dialogues (in Japanese) and their automatic translations (in English). For comparison, we also present the recorded audio (Ground Truth), resynthesized audio (Resynthesized), and audio files generated by the dialogue generative spoken language model (dGSLM) [Nguyen et al., 2023], as well as a baseline system that operates two text-to-speech models in an alternating manner (Baseline). For dGSLM, we used 30 s segments of the recorded speech preceding these dialogues as prompts, and generated 30 s continuations for each one.
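Selecting the dGSLM prompt comes down to simple index arithmetic on the recording. The following is a minimal sketch under stated assumptions: the function name `prompt_window`, the 16 kHz sample rate, and the clamping at the start of the recording are illustrative choices, not details taken from the paper.

```python
def prompt_window(dialogue_start_s, prompt_s=30.0, sample_rate=16000):
    """Return the sample range [begin, end) of the audio just before a dialogue.

    dialogue_start_s: time (in seconds) at which the test dialogue begins.
    prompt_s: prompt length in seconds (30 s in the text above).
    sample_rate: assumed sampling rate in Hz (an illustrative default).
    """
    end = int(round(dialogue_start_s * sample_rate))
    # Clamp at zero when fewer than prompt_s seconds precede the dialogue
    # (an assumption; the source does not say how this case is handled).
    begin = max(0, end - int(round(prompt_s * sample_rate)))
    return begin, end


begin, end = prompt_window(95.0)
print(begin, end)  # sample indices covering 65 s to 95 s of the recording
```

A waveform stored as a sequence could then be sliced as `waveform[begin:end]` to obtain the prompt.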

Sample 1

Input written dialogue

Original (Japanese):
A: 見たりしますね
B: え、すごい、実写かぁ、えっ、エフェクトつける
B: ってことはあれだよねー、あのー、編集して、実際の
B: 動きは人間がやって、
B: なんかやってみた感
A: もうなんかこう、光をこう、ラケットとボールが当たる瞬間にこう入れてみたりとか
A: なんかそのー、ボールがそのー、えー、コートに着地した時に、その着地したところが崩れるエフェクトがあって、なんか穴がコートに開くみたいな
B: うわっ
B: そこまでやっちゃうんだ
A: そうなんですよ

Translated (English):
A: I do watch it.
B: Oh, that's cool, it's live-action, huh, with effects.
B: So that means, um, editing it, the actual
B: movements are done by humans,
B: kind of giving it a try.
A: I just, like, tried adding light, like, at the moment the racket hits the ball,
A: like, when the ball, um, lands on the court, there's an effect where the landing spot crumbles, like a hole opens up in the court.
B: Woah
B: You go that far.
A: Yes, that's right.

Generated spoken dialogue

Audio samples: Ground Truth, Resynthesized, dGSLM, Baseline, Proposed

Sample 2

Input written dialogue

* (LAU) indicates laughter

Original (Japanese):
B: なかなかないよね
A: ふーん、自分で行く、よね、それこそファーストフード
A: くらい、よ
B: (LAU)
A: お安い、回転寿司の方が落ち着くし
A: ねー、いっぱい食べれるしね(LAU)、そうなのよ、結局ね、結局そうなんですよ、結局、そうなる、そこに行くんです
A: やっぱりすごいです
B: うん、チェーン店は、偉大ということで
B: はい、一旦これで、おわりでいい?
A: はい、いいですかね

Translated (English):
B: It's pretty rare, isn't it?
A: Hmm, you'd go there yourself, right, especially for fast food.
A: At least, right.
B: (LAU)
A: It's cheaper, and I feel more at ease at conveyor belt sushi places.
A: Right? You can eat a lot (LAU), exactly, in the end, that's what it comes down to, eventually, that's where we go.
A: It's really amazing.
B: Yeah, chain stores are, in a sense, remarkable.
B: Alright, can we conclude this for now?
A: Yes, is that okay?

Generated spoken dialogue

Audio samples: Ground Truth, Resynthesized, dGSLM, Baseline, Proposed

2. LLM samples

Using the written dialogues from the test set as prompts, we generated continuations of those dialogues, consisting of 6 turns, using GPT-4. Then, we split each turn into separate utterances at sentence-final punctuation marks ("。" and "!"). Finally, we generated the corresponding spoken dialogue segments using the proposed system. Below, we present two samples of the generated audio, along with the corresponding written dialogues (in Japanese) and their automatic translations (in English).
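The punctuation-based splitting step can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `split_turn` is hypothetical, and treating the full-width "!" as equivalent to "!" is an added assumption.

```python
import re

# Sentence-final marks named in the text ("。" and "!"); the full-width "!"
# is included as an assumption, since Japanese text often uses it.
_SENTENCE = re.compile(r"[^。!!]+[。!!]*")


def split_turn(speaker, text):
    """Split one written turn into utterances, keeping punctuation attached."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(text))
    return [(speaker, piece) for piece in pieces if piece]


turn = "確かに。自分が子供の頃、サンタさんからのプレゼントを待ってた気持ちを思い出して、それを次の世代にも伝えたい。"
for speaker, utterance in split_turn("B", turn):
    print(f"{speaker}: {utterance}")
```

Applied to this turn from Sample 1, the sketch yields two utterances for speaker B, matching the way the written dialogues are listed below. Commas ("、") are deliberately not treated as split points.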

Sample 1

Original (Japanese):
B: あはは、サンタさんのシステムがよくできてるよね。
B: 子供たちが信じてるうちは来るっていう。
A: 本当に、信じる心が大事だよね。
A: 大人になってもその気持ちを持ち続けたい。
B: そうそう、でも実は大人になると、サンタさんの役割を果たすこともあるんだよね。
A: うん、それがまた一つの魔法のようなものだよね。
A: 子供たちの夢を守るための。
B: 確かに。
B: 自分が子供の頃、サンタさんからのプレゼントを待ってた気持ちを思い出して、それを次の世代にも伝えたい。
A: それが一番のクリスマスの魔法だね。

Translated (English):
B: Haha, the system of Santa is well thought out, isn’t it?
B: He comes as long as the children believe.
A: Really, having a believing heart is important.
A: I want to keep that feeling even as an adult.
B: Exactly, but actually, when we become adults, sometimes we play the role of Santa.
A: Yeah, that’s like a kind of magic, isn’t it?
A: For protecting the dreams of the children.
B: Definitely.
B: I remember the feeling of waiting for presents from Santa when I was a child, and I want to pass that on to the next generation.
A: That’s the real magic of Christmas, isn’t it?

Audio sample: Generated spoken dialogue (Proposed)

Sample 1 with different speaker pairs

Original (Japanese):
B: あはは、サンタさんのシステムがよくできてるよね。
B: 子供たちが信じてるうちは来るっていう。
A: 本当に、信じる心が大事だよね。
A: 大人になってもその気持ちを持ち続けたい。
B: そうそう、でも実は大人になると、サンタさんの役割を果たすこともあるんだよね。
A: うん、それがまた一つの魔法のようなものだよね。
A: 子供たちの夢を守るための。
B: 確かに。
B: 自分が子供の頃、サンタさんからのプレゼントを待ってた気持ちを思い出して、それを次の世代にも伝えたい。
A: それが一番のクリスマスの魔法だね。

Translated (English):
B: Haha, the system of Santa is well thought out, isn’t it?
B: He comes as long as the children believe.
A: Really, having a believing heart is important.
A: I want to keep that feeling even as an adult.
B: Exactly, but actually, when we become adults, sometimes we play the role of Santa.
A: Yeah, that’s like a kind of magic, isn’t it?
A: For protecting the dreams of the children.
B: Definitely.
B: I remember the feeling of waiting for presents from Santa when I was a child, and I want to pass that on to the next generation.
A: That’s the real magic of Christmas, isn’t it?

Audio sample: Generated spoken dialogue (Proposed)

Sample 2

Original (Japanese):
A: お菓子とかお茶とか買って自分へのご褒美してるの?
B: そうそう、そういう感じ!
B: 特に気に入ってるスイーツとか、新しい飲み物が出てたら試してみたり。
A: それいいね。
A: 自分へのリワードって大切だと思う。
A: 私も試験とか勉強がんばった時に、好きな映画を家で見たりしてる。
B: ああ、それもいいね。
B: やっぱり自分を褒めてあげることって大事。
B: 自分の努力を認めて、次へのモチベーションにもつながるから。
A: 本当に。
A: それに、自分で自分を褒めることで、他人の評価に左右されなくなる気がするよね。
B: うん、それもそうだね。
B: 自分の価値を自分で見つけることができれば、心も安定するし、自信も持てるようになる。

Translated (English):
A: Are you treating yourself with sweets or tea or something?
B: Yeah, exactly, that kind of feeling!
B: Especially if there are sweets I like, or if there's a new drink, I like to try it.
A: That's nice.
A: I think rewarding oneself is important.
A: When I work hard on exams or studying, I watch my favorite movies at home.
B: Oh, that's also nice.
B: It really is important to praise oneself.
B: Recognizing one's own effort also connects to motivation for the next task.
A: Absolutely.
A: Also, by praising oneself, I feel like we become less affected by other people's evaluations.
B: Yeah, that's true.
B: If you can find your own value, you become more stable mentally, and you can also gain confidence.

Audio sample: Generated spoken dialogue (Proposed)

Sample 2 with different speaker pairs

Original (Japanese):
A: お菓子とかお茶とか買って自分へのご褒美してるの?
B: そうそう、そういう感じ!
B: 特に気に入ってるスイーツとか、新しい飲み物が出てたら試してみたり。
A: それいいね。
A: 自分へのリワードって大切だと思う。
A: 私も試験とか勉強がんばった時に、好きな映画を家で見たりしてる。
B: ああ、それもいいね。
B: やっぱり自分を褒めてあげることって大事。
B: 自分の努力を認めて、次へのモチベーションにもつながるから。
A: 本当に。
A: それに、自分で自分を褒めることで、他人の評価に左右されなくなる気がするよね。
B: うん、それもそうだね。
B: 自分の価値を自分で見つけることができれば、心も安定するし、自信も持てるようになる。

Translated (English):
A: Are you treating yourself with sweets or tea or something?
B: Yeah, exactly, that kind of feeling!
B: Especially if there are sweets I like, or if there's a new drink, I like to try it.
A: That's nice.
A: I think rewarding oneself is important.
A: When I work hard on exams or studying, I watch my favorite movies at home.
B: Oh, that's also nice.
B: It really is important to praise oneself.
B: Recognizing one's own effort also connects to motivation for the next task.
A: Absolutely.
A: Also, by praising oneself, I feel like we become less affected by other people's evaluations.
B: Yeah, that's true.
B: If you can find your own value, you become more stable mentally, and you can also gain confidence.

Audio sample: Generated spoken dialogue (Proposed)