Dialect identification is a critical task in speech processing and language
technology, benefiting applications such as speech recognition and speaker
verification. While most research studies have been dedicated
to dialect identification in widely spoken languages, limited attention has
been given to dialect identification in low-resource languages, such as
Romanian. To address this research gap, we introduce RoDia, the first dataset
for Romanian dialect identification from speech. The RoDia dataset includes a
varied compilation of speech samples from five distinct regions of Romania,
covering both urban and rural environments, totaling 2 hours of manually
annotated speech data. Along with our dataset, we introduce a set of
competitive models to be used as baselines for future research. The top-scoring
model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%,
indicating that the task is challenging. We thus believe that RoDia is a
valuable resource that will stimulate research aiming to address the challenges
of Romanian dialect identification. We publicly release our dataset and code at
https://github.com/codrut2/RoDia.
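
To make the two reported metrics concrete, the following is a minimal sketch of how macro and micro F1 can be computed for a single-label, multi-class task such as dialect identification. It is not taken from the RoDia code release; the function name `f1_scores` and the toy labels `A`, `B`, `C` are hypothetical and stand in for dialect regions. Macro F1 averages the per-class F1 scores, so every class counts equally, while micro F1 pools the counts over all classes and, in this single-label setting, equals plain accuracy.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives and false negatives per class.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 computed on counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels (not RoDia data): three classes, imbalanced.
y_true = ["A", "A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "A", "B", "B", "C", "C"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

When the macro score is lower than the micro score, as in the reported results, at least one class scores below the pooled average, which typically points to the less frequent or harder classes.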
No Creative Commons license