Calib-StyleSpeech: A Zero-shot Approach In Voice Cloning Of High Adaptive Text To Speech System With Imbalanced Dataset

Authors: Nguyen Tan Dat, Lam Quang Tuong, Nguyen Duc Dung


Paper link: Calib-StyleSpeech: A Zero-Shot Approach in Voice Cloning of High Adaptive Text to Speech System with Imbalanced Dataset.

Official open-sourced implementation: "signofthefour/calibstylespeech".


Note: HiFi-GAN is used as the vocoder. We trained this HiFi-GAN on our dataset of 2 standard TTS voices for 425,000 steps. Also, listen to the audio samples with headphones for a better experience.


Abstract

Voiced by our Calib-StyleSpeech model using the HiFi-GAN vocoder:



In this article, we propose Calib-StyleSpeech, a novel end-to-end Text-to-Speech (TTS) system that achieves zero-shot cloning of a new voice without using a meta-learning method. Our model uses a dedicated block to extract a style vector from the reference mel-spectrogram and embeds this vector into the acoustic model to condition the hidden features; the model is then trained on a multi-speaker dataset that still contains noisy and resonant utterances. Alongside the style vector, we also extract a content vector from the same mel-spectrogram. This content vector imitates the hidden vector that the acoustic model's encoder extracts from the phoneme sequence, and it is responsible for calibrating the style vector through a Mutual Information (MI) constraint that eliminates the dependency between the content and style representation spaces. Calib-StyleSpeech can synthesize a high-fidelity voice from only 36 utterances (about 3 minutes) of training data. Furthermore, the experimental results show that the model can learn to speak in a new voice from a single short reference recording (about 5 seconds) of the target voice without any fine-tuning stage, achieving a similarity score competitive with other state-of-the-art models while maintaining naturalness and intelligibility.
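The MI constraint described above can be approximated with a learned upper bound on the mutual information between the style and content embeddings. Below is a minimal, hypothetical sketch in the spirit of the CLUB estimator; the class and dimension names are illustrative and not taken from the official repository.

```python
# Hypothetical sketch: CLUB-style upper bound on I(style; content),
# minimized alongside the TTS losses to disentangle the two spaces.
# Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MIUpperBound(nn.Module):
    """Variational approximation q(content | style) used to bound
    the mutual information between style and content vectors."""
    def __init__(self, style_dim, content_dim, hidden=128):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, content_dim))
        self.logvar = nn.Sequential(
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, content_dim))

    def forward(self, style, content):
        mu, logvar = self.mu(style), self.logvar(style)
        # log-likelihood of matched (positive) style/content pairs
        pos = -((content - mu) ** 2) / logvar.exp() - logvar
        # log-likelihood of shuffled (negative) pairs from the batch
        shuffled = content[torch.randperm(content.size(0))]
        neg = -((shuffled - mu) ** 2) / logvar.exp() - logvar
        # CLUB bound: E[log q(positive)] - E[log q(negative)]
        return (pos - neg).mean()

style = torch.randn(8, 64)     # batch of style vectors
content = torch.randn(8, 32)   # batch of content vectors
mi_loss = MIUpperBound(64, 32)(style, content)  # scalar loss term
```

Minimizing this term pushes the style encoder to discard information that the content vector already carries, which is the calibration role the abstract assigns to the content vector.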



Illustration of mel-spectrograms and audio for a seen speaker from the training phase:

Note: In this section we show results from our proposed Calib-StyleSpeech model after 400k training steps on seen voices.


Text: Here are the match lineups for the Colombia Haiti match.



Seen voice during training

Note: The female voice appears in the training set, but the utterance itself is unseen.

Text | Mel-spectrogram | Groundtruth audio | Calib-StyleSpeech + HiFi-GAN
Ông cho rằng sự chuyển động của các vật thể trên mặt đất và các vật thể trên bầu trời bị chi phối bởi các định luật tự nhiên giống nhau. (He believed that the motion of objects on the ground and of objects in the sky is governed by the same natural laws.)
Một số người cho rằng có thể Ma Kết quá phô trương khi bày tỏ tình cảm rầm rộ (Some people think Capricorn may be too ostentatious when expressing affection lavishly)
Sau khi học nghề người lao động phải sống được bằng nghề đã được học (After vocational training, workers must be able to make a living from the trade they have learned)
Đang được điều trị tích cực (Currently receiving intensive treatment)
Hai thí sinh đình chỉ một trăm hai mươi ba thí sinh (Two candidates suspended, one hundred and twenty-three candidates)

Custom voice

Note that each of the following reference voices is unseen during the training phase. In this section, for each voice, we compare results with StyleSpeech and Meta-StyleSpeech on the one-shot task; that is, there is no fine-tuning step when synthesizing the required voice.

Note: Because the dataset is imbalanced, the demos for male and elderly voices are of lower quality.


Model | Groundtruth | StyleSpeech | Meta-StyleSpeech | Calib-StyleSpeech
Female 1
Female 1
Female 2
Female 2

Speed Controllable

Note: The smaller the speed ratio, the faster the utterance.

Text: Chúng tôi cũng đề xuất một mô đun có khả năng căn chỉnh thông tin giọng nói nhờ véc tơ nội dung này Chỉ với một đoạn âm thanh khoảng 5 giây của giọng nói mong muốn Giọng nói này có điểm tương đồng với giọng nói thật khá cao (We also propose a module capable of calibrating voice information using this content vector. With only an audio clip of about 5 seconds of the desired voice, the synthesized voice achieves fairly high similarity to the real voice.)
Speed ratio = 1.1
Speed ratio = 0.7
Speed ratio = 1.5
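Speed control of this kind is commonly implemented in duration-based acoustic models by scaling the predicted phoneme durations before length regulation. The sketch below is a hypothetical toy illustration of that mechanism, not code from the official repository; it follows the demo's convention that a smaller speed ratio yields a faster utterance.

```python
# Hypothetical sketch of FastSpeech-style speed control: each
# phoneme's predicted frame count is scaled by the speed ratio
# before the hidden states are expanded to frame level.
def length_regulate(hidden_states, durations, speed_ratio=1.0):
    """Expand per-phoneme hidden states by scaled durations.
    A smaller speed_ratio shortens durations, so the resulting
    utterance is faster (matching the demo's convention)."""
    expanded = []
    for h, d in zip(hidden_states, durations):
        frames = max(1, round(d * speed_ratio))  # at least 1 frame
        expanded.extend([h] * frames)
    return expanded

# Toy example: 3 phonemes with predicted frame counts 4, 2, 6
states = ["ph1", "ph2", "ph3"]
durs = [4, 2, 6]
print(len(length_regulate(states, durs, 1.0)))  # 12 frames
print(len(length_regulate(states, durs, 0.7)))  # fewer frames -> faster
print(len(length_regulate(states, durs, 1.5)))  # more frames -> slower
```

With ratio 1.0 the original 12 frames are kept; ratios below 1.0 compress the timeline and ratios above 1.0 stretch it, which is exactly the behavior demonstrated by the three audio samples above.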