MSR-NV: Neural Vocoder Using Multiple Sampling Rates

preprint: arXiv:2109.13714
Kentaro Mitsui, Kei Sawada
rinna Co., Ltd.

The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to handle multiple sampling rates in a single NV, called the MSR-NV. By generating waveforms step-by-step starting from a low sampling rate, MSR-NV can efficiently learn the characteristics of each frequency band and synthesize high-quality speech at multiple sampling rates. It can be regarded as an extension of the previously proposed NVs, and in this study, we extend the structure of Parallel WaveGAN (PWG). Experimental evaluation results demonstrate that the proposed method achieves remarkably higher subjective quality than the original PWG trained separately at 16, 24, and 48 kHz, without increasing the inference time. We also show that MSR-NV can leverage speech with lower sampling rates to further improve the quality of the synthetic speech.


0. Speech waveforms at different sampling rates

The waveforms generated using the proposed method (left) and target waveforms (right) at multiple sampling rates are listed below.
* Some browsers cannot play 1 kHz or 2 kHz samples, so those samples were upsampled to 4 kHz using SoX.
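
The upsampling mentioned in the note can be sketched as follows. This is only an illustration: the page does not give the exact SoX invocation the authors used, and the helper names and file names here are hypothetical. It assumes the common SoX form in which `-r RATE` placed before the output file sets the output sampling rate.

```python
import shutil
import subprocess


def build_sox_upsample_cmd(src, dst, target_rate_hz=4000):
    """Return the argv list for resampling `src` to `target_rate_hz` with SoX.

    `-r RATE` placed directly before the output file sets the output
    sampling rate; SoX then resamples the audio to that rate.
    """
    return ["sox", str(src), "-r", str(target_rate_hz), str(dst)]


def upsample_with_sox(src, dst, target_rate_hz=4000):
    """Run SoX if it is installed; raise a clear error otherwise."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    subprocess.run(build_sox_upsample_cmd(src, dst, target_rate_hz), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_sox_upsample_cmd("sample_1k.wav", "sample_1k_as_4k.wav")))
```

Upsampling in this way only changes the container rate so that a browser can play the file. It adds no content above the original Nyquist frequency.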

Sampling rate | Generated waveform | Target waveform
1 kHz
2 kHz
4 kHz
8 kHz
16 kHz
24 kHz
48 kHz
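
To make the step-by-step idea concrete, the sketch below computes the ratio between consecutive sampling rates in the list and the frequency band each step can newly represent. It relies only on the Nyquist limit (a signal sampled at rate r can represent frequencies up to r/2). It is an assumption that the band between consecutive Nyquist limits is what each step adds; the page does not describe the model's internals at this level, and the function names are hypothetical.

```python
from fractions import Fraction

# Sampling rates in Hz, as listed on this page.
RATES_HZ = [1000, 2000, 4000, 8000, 16000, 24000, 48000]


def consecutive_ratios(rates):
    """Return the exact ratio between each rate and the one before it."""
    return [Fraction(hi, lo) for lo, hi in zip(rates, rates[1:])]


def nyquist_bands(rates):
    """Return (low_hz, high_hz) for the band newly representable at each rate.

    The first band starts at 0 Hz; every later band starts at the
    Nyquist frequency of the previous rate.
    """
    bands = []
    previous_nyquist = 0.0
    for rate in rates:
        nyquist = rate / 2
        bands.append((previous_nyquist, nyquist))
        previous_nyquist = nyquist
    return bands


if __name__ == "__main__":
    for rate, band in zip(RATES_HZ, nyquist_bands(RATES_HZ)):
        print(f"{rate} Hz: band {band[0]:.0f} to {band[1]:.0f} Hz")
    print([str(r) for r in consecutive_ratios(RATES_HZ)])
```

Most steps double the rate, while the step from 16 kHz to 24 kHz has a ratio of 3/2.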

1. Comparison to baseline in terms of quality

Description

Single-speaker speech samples

A total of 14,375 utterances (approximately 8 h) spoken in a normal speaking style by a female Japanese speaker were used.

Sampling rate | Ref | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

Multi-speaker speech samples

A total of 31,936 utterances (approximately 25 h) by 26 Japanese speakers (18 female and 8 male), covering three speaking styles (normal, happy, and sad), were used.
Reference speech is not included here for rights reasons.

Female speaker, normal style

Sampling rate | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

Female speaker, happy style

Sampling rate | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

Female speaker, sad style

Sampling rate | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

Male speaker, normal style

Sampling rate | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

Male speaker, happy style

Sampling rate | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

Male speaker, sad style

Sampling rate | PWG | MSR-PWG
16 kHz
24 kHz
48 kHz

2. Training data amount and synthesis quality

We varied the amount of training data from 1 minute to 8 hours in the single-speaker experiment and compared the synthesis quality of MSR-PWG.

Training data amount | Audio sample
1 min
3 min
5 min
10 min
30 min
8 h (full data)

3. Use of speech data with low sampling rates

We trained MSR-PWG with the following three training sets in the single-speaker experiment:
Training set | Audio sample
1 min
3 × 1 min
3 min