The speech-to-singing (STS) voice conversion task aims to generate singing
samples corresponding to speech recordings while facing a major challenge: the
alignment between the target (singing) pitch contour and the source (speech)
content is difficult to learn in a text-free setting. This paper proposes
AlignSTS, an STS model based on explicit cross-modal alignment, which views
speech variance information such as pitch and content as different modalities.
Inspired by the mechanism of how humans sing lyrics to a melody, AlignSTS: 1)
adopts a novel rhythm adaptor to predict the target rhythm representation to
bridge the modality gap between content and pitch, where the rhythm
representation is computed in a simple yet effective way and is quantized into
a discrete space; and 2) uses the predicted rhythm representation to re-align
the content based on cross-attention and conducts a cross-modal fusion for
re-synthesis. Extensive experiments show that AlignSTS achieves superior
performance in terms of both objective and subjective metrics. Audio samples
are available at https://alignsts.github.io.
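
To make the alignment mechanism concrete, below is a minimal PyTorch sketch of
the idea described in the abstract, not the authors' implementation: a discrete
(quantized) rhythm representation serves as the query in cross-attention over
the source content features, and the re-aligned content is then fused with the
target pitch embedding for re-synthesis. All module names, dimensions, and the
concatenation-based fusion are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CrossModalAligner(nn.Module):
        """Sketch: re-align content with a predicted rhythm representation
        via cross-attention, then fuse with the target pitch embedding."""

        def __init__(self, d_model=256, n_heads=4, n_rhythm_codes=64):
            super().__init__()
            # Discrete rhythm space: quantized rhythm tokens -> embeddings
            self.rhythm_emb = nn.Embedding(n_rhythm_codes, d_model)
            # Rhythm queries attend over the source content features
            self.cross_attn = nn.MultiheadAttention(
                d_model, n_heads, batch_first=True)
            # Fuse re-aligned content with pitch (assumed: concat + linear)
            self.fuse = nn.Linear(2 * d_model, d_model)

        def forward(self, rhythm_tokens, content, pitch_emb):
            # rhythm_tokens: (B, T_tgt)          discrete rhythm indices
            # content:       (B, T_src, d_model) source content features
            # pitch_emb:     (B, T_tgt, d_model) target pitch embedding
            q = self.rhythm_emb(rhythm_tokens)              # (B, T_tgt, d)
            # Cross-attention: rhythm queries, content keys/values,
            # yielding content re-aligned to the target time axis
            aligned, _ = self.cross_attn(q, content, content)
            # Cross-modal fusion of re-aligned content and pitch
            return self.fuse(torch.cat([aligned, pitch_emb], dim=-1))

The design choice this sketch highlights is that rhythm acts as the bridge
modality: it lives on the target (singing) time axis, so using it as the
attention query lets frame-level content be redistributed to match the target
pitch contour without any text supervision.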