Accent-VITS: accent transfer for end-to-end TTS. (arXiv:2312.16850v1 [cs.SD])

Accent transfer aims to transfer an accent from a source speaker to synthetic
speech in the target speaker's voice. The main challenge is to effectively
disentangle speaker timbre and accent, which are entangled in speech. This paper
presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based
on the main structure of VITS, Accent-VITS makes substantial improvements to
enable effective and stable accent transfer. We leverage a hierarchical CVAE
structure to model accent pronunciation information and acoustic features,
respectively, using bottleneck features and mel spectrograms as
constraints. Moreover, the text-to-wave mapping in VITS is decomposed into
text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the
disentanglement of accent and speaker timbre becomes more stable and
effective. Experiments on multi-accent and Mandarin datasets show that
Accent-VITS achieves higher speaker similarity, accent similarity, and speech
naturalness than a strong baseline.
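
As a rough illustration of the hierarchical CVAE idea in the abstract, the sketch below sums two KL penalties, one for an accent-level latent and one for an acoustic-level latent. It is not the paper's implementation: the function names (`gaussian_kl`, `hierarchical_kl`), the diagonal-Gaussian assumption, and the toy numbers are all assumptions made for illustration, and the reconstruction terms and bottleneck/mel constraints are left out.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Each argument is a list of floats of equal length; variances are given
    as log-variances, as is common in VAE code.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def hierarchical_kl(accent_post, accent_prior, acoustic_post, acoustic_prior):
    """Sum of KL terms for a two-level latent hierarchy (hypothetical layout).

    Each argument is a (mu, logvar) pair. In this sketch the accent-level
    prior stands for a text-conditioned distribution (text-to-accent) and the
    acoustic-level prior for an accent-conditioned one (accent-to-wave).
    """
    kl_accent = gaussian_kl(*accent_post, *accent_prior)
    kl_acoustic = gaussian_kl(*acoustic_post, *acoustic_prior)
    return kl_accent + kl_acoustic


if __name__ == "__main__":
    # Toy one-dimensional latents with unit variance (log-variance 0).
    accent_post = ([1.0], [0.0])
    accent_prior = ([0.0], [0.0])
    acoustic_post = ([0.0], [0.0])
    acoustic_prior = ([1.0], [0.0])
    print(hierarchical_kl(accent_post, accent_prior, acoustic_post, acoustic_prior))
```

In a full model each KL term would be weighted and combined with reconstruction and adversarial losses; how Accent-VITS does this is not stated in the abstract.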

Latest Change: Dec. 31, 2023, 7:32 a.m.