Training speaker-discriminative and robust speaker verification systems
without speaker labels remains challenging and worth exploring. Previous
studies have noted a substantial performance disparity between self-supervised
and fully supervised approaches. In this paper, we propose an effective
Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate
self-supervised speaker representation learning. A range of experiments
conducted on the VoxCeleb datasets demonstrate the superiority of the SDEP
framework in speaker verification. SDEP achieves a new SOTA on Voxceleb1
speaker verification evaluation benchmark ( i.e., equal error rate 1.94\%,
1.99\%, and 3.77\% for trial Vox1-O, Vox1-E and Vox1-H , respectively),
discarding any speaker labels in the training phase. Code will be publicly
available at https://github.com/alibaba-damo-academy/3D-Speaker.