USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models. (arXiv:2309.08023v1 [eess.AS])

Click here to flash read.

We introduce a multilingual speaker change detection model (USM-SCD) that can
simultaneously detect speaker turns and perform ASR for 96 languages. This
model is adapted from a speech foundation model trained on a large quantity of
supervised and unsupervised data, demonstrating the utility of fine-tuning from
a large generic foundation model for a downstream task. We analyze the
performance of this multilingual speaker change detection model through a
series of ablation studies. We show that the USM-SCD model can achieve more
than 75% average speaker change detection F1 score across a test set that
consists of data from 96 languages. On American English, the USM-SCD model can
achieve an 85.8% speaker change detection F1 score across various public and
internal test sets, beating the previous monolingual baseline model by 21%
relative. We also show that we only need to fine-tune one-quarter of the
trainable model parameters to achieve the best model performance. The USM-SCD
model exhibits state-of-the-art ASR quality compared with a strong public ASR
baseline, making it suitable to handle both tasks with negligible additional
computational cost.

Click here to read this post out

ID: 408042; Unique Viewers: 0

Unique Voters: 0

Total Votes: 0

Votes:

Latest Change: Sept. 18, 2023, 7:31 a.m. Changes:

/u/anonymous

Dictionaries:

Words:

Spaces:

CC:
No creative common's license

Comments: