Training speaker-discriminative and robust speaker verification systems
without speaker labels remains challenging and worth exploring. Previous
studies have noted a substantial performance disparity between self-supervised
and fully supervised approaches. In this paper, we propose an effective
Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate
self-supervised speaker representation learning. A range of experiments
conducted on the VoxCeleb datasets demonstrate the superiority of the SDEP
framework in speaker verification. SDEP achieves a new SOTA on Voxceleb1
speaker verification evaluation benchmark ( i.e., equal error rate 1.94\%,
1.99\%, and 3.77\% for trial Vox1-O, Vox1-E and Vox1-H , respectively),
discarding any speaker labels in the training phase. Code will be publicly
available at https://github.com/alibaba-damo-academy/3D-Speaker.