The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge

Click here to flash read.

arXiv:2404.16619v1 Announce Type: cross
Abstract: This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by THU-HCSI team for LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. For further improving speaker similarity and speech quality, we introduce speaker-aware text encoder and flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix up them with pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains considerable naturalness MOS of 3.97.

Click here to read this post out

ID: 822825; Unique Viewers: 0

Unique Voters: 0

Total Votes: 0

Votes:

Latest Change: April 26, 2024, 7:32 a.m. Changes:

/u/anonymous

Dictionaries:

Words:

Spaces:

CC:
No creative common's license

Comments: