Accent transfer aims to transfer an accent from a source speaker to synthetic
speech in the target speaker's voice. The main challenge is how to effectively
disentangle speaker timbre and accent, which are entangled in speech. This paper
presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based
on the main structure of VITS, Accent-VITS makes substantial improvements to
enable effective and stable accent transfer. We leverage a hierarchical CVAE
structure to model accent pronunciation information and acoustic features,
respectively, using bottleneck features and mel spectrums as
constraints. Moreover, the text-to-wave mapping in VITS is decomposed into
text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the
disentanglement of accent and speaker timbre becomes more stable and
effective. Experiments on multi-accent and Mandarin datasets show that
Accent-VITS achieves higher speaker similarity, accent similarity and speech
naturalness than a strong baseline.