The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022. (arXiv:2307.15400v1 [cs.SD])

This paper describes the FlySpeech speaker diarization system submitted to
the second Multimodal Information Based Speech Processing (MISP) Challenge,
held at ICASSP 2022. We develop an end-to-end audio-visual speaker diarization
(AVSD) system that consists of a lip encoder, a speaker encoder, and an
audio-visual decoder. Specifically, to mitigate the degradation of diarization
performance caused by separate training, we jointly train the speaker encoder
and the audio-visual decoder. In addition, we initialize the speaker encoder
from a speaker extractor pretrained on a large amount of data.
