Speech recognition systems are a key intermediary in voice-driven
human-computer interaction. Although speech recognition works well for pristine
monologic audio, real-life use cases in open-ended interactive settings still
present many challenges. We argue that timing is mission-critical for dialogue
systems, and evaluate 5 major commercial ASR systems for their conversational
and multilingual support. We find that word error rates for natural
conversational data in 6 languages remain abysmal, and that overlap remains a
key challenge (study 1). This especially affects the recognition of
conversational words (study 2), and in turn has dire consequences for
downstream intent recognition (study 3). Our findings help to evaluate the
current state of conversational ASR, contribute towards multidimensional error
analysis and evaluation, and identify the phenomena that need the most attention
on the way to building robust interactive speech technologies.