Recent studies focus on developing efficient systems for acoustic scene
classification (ASC) using convolutional neural networks (CNNs), which
typically consist of consecutive kernels. This paper highlights the benefits of
using separate kernels as a more powerful and efficient design approach in ASC
tasks. Inspired by the time-frequency nature of audio signals, we propose
TF-SepNet, a CNN architecture that separates the feature processing along the
time and frequency dimensions. Features resulting from the separate paths are
then merged by channels and forwarded directly to the classifier. Instead of
the conventional two-dimensional (2D) kernel, TF-SepNet uses one-dimensional
(1D) kernels to reduce the computational cost. Experiments have
been conducted using the TAU Urban Acoustic Scene 2022 Mobile development
dataset. The results show that TF-SepNet outperforms similar state-of-the-art models
that use consecutive kernels. A further investigation reveals that the separate
kernels lead to a larger effective receptive field (ERF), which enables
TF-SepNet to capture more time-frequency features.
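
As a rough illustration of why 1D kernels can reduce cost, the sketch below counts the weights in a single 2D convolution and in a pair of separate 1D convolutions, one along frequency and one along time, whose outputs are concatenated by channels. The channel counts, the kernel size, the even split of output channels between the two paths, the use of concatenation as the merge, and the function names are all assumptions made for illustration; they are not taken from the TF-SepNet design.

```python
def conv2d_params(c_in, c_out, k_freq, k_time):
    """Number of weights in a convolution with a (k_freq x k_time) kernel, no bias."""
    return c_in * c_out * k_freq * k_time


def separate_1d_params(c_in, c_out, k):
    """Weights in two parallel 1D paths whose outputs are concatenated by channels.

    Assumption (hypothetical layout): each path produces half of the output
    channels, so the concatenated result has c_out channels in total.
    """
    half = c_out // 2
    freq_path = conv2d_params(c_in, half, k, 1)  # (k x 1) kernel along frequency
    time_path = conv2d_params(c_in, half, 1, k)  # (1 x k) kernel along time
    return freq_path + time_path


if __name__ == "__main__":
    # Illustrative sizes only: 64 input channels, 64 output channels, kernel size 3.
    full_2d = conv2d_params(64, 64, 3, 3)
    separate = separate_1d_params(64, 64, 3)
    print("2D kernel weights:        ", full_2d)
    print("separate 1D kernel weights:", separate)
    print("reduction factor:          ", full_2d / separate)
```

Under these assumed sizes the separate 1D layout needs one third of the weights of the 2D kernel. The count covers weights only; it says nothing about accuracy or the effective receptive field, which the abstract attributes to experiments on the full network.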