Integrating Uncertainty into Neural Network-based Speech Enhancement. (arXiv:2305.08744v1 [eess.AS])
Supervised masking approaches in the time-frequency domain aim to employ deep
neural networks to estimate a multiplicative mask to extract clean speech. This
leads to a single estimate for each input without any guarantees or measures of
reliability. In this paper, we study the benefits of modeling uncertainty in
clean speech estimation. Prediction uncertainty is typically categorized into
aleatoric uncertainty and epistemic uncertainty. The former refers to inherent
randomness in data, while the latter describes uncertainty in the model
parameters. In this work, we propose a framework to jointly model aleatoric and
epistemic uncertainties in neural network-based speech enhancement. The
proposed approach captures aleatoric uncertainty by estimating the statistical
moments of the speech posterior distribution and explicitly incorporates the
uncertainty estimate to further improve clean speech estimation. For epistemic
uncertainty, we investigate two Bayesian deep learning approaches, Monte Carlo
dropout and deep ensembles, to quantify the uncertainty of the neural network
parameters. Our analyses show that the proposed framework yields practical and
reliable uncertainty estimates, while combining different sources of
uncertainty produces more reliable predictive uncertainty estimates.
Furthermore, we demonstrate the benefits of modeling uncertainty on speech
enhancement performance by evaluating the framework on different datasets,
exhibiting notable improvement over comparable models that fail to account for
uncertainty.
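To make the two uncertainty types concrete, here is a minimal, hypothetical sketch of the general recipe the abstract describes: a mask-estimation network with a heteroscedastic output head (predicting a mean mask and a log-variance, i.e. the first two moments of the speech posterior, capturing aleatoric uncertainty) combined with Monte Carlo dropout at inference time (capturing epistemic uncertainty). All network names, layer sizes, and dimensions below are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class UncertainMaskNet(nn.Module):
    """Toy mask estimator with a heteroscedastic head: it predicts a
    mean mask and a log-variance (aleatoric uncertainty) per bin."""

    def __init__(self, n_freq=257, hidden=128, p_drop=0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_freq, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),  # kept active at test time for MC dropout
        )
        self.mean_head = nn.Linear(hidden, n_freq)    # mask estimate in [0, 1]
        self.logvar_head = nn.Linear(hidden, n_freq)  # aleatoric log-variance

    def forward(self, x):
        h = self.body(x)
        return torch.sigmoid(self.mean_head(h)), self.logvar_head(h)

def mc_dropout_predict(model, x, n_samples=20):
    """Epistemic uncertainty via Monte Carlo dropout: keep dropout
    enabled and aggregate over stochastic forward passes."""
    model.train()  # leaves dropout on during inference
    with torch.no_grad():
        means, variances = [], []
        for _ in range(n_samples):
            m, logvar = model(x)
            means.append(m)
            variances.append(logvar.exp())
        means = torch.stack(means)        # (n_samples, batch, n_freq)
        variances = torch.stack(variances)
    predictive_mean = means.mean(0)
    aleatoric = variances.mean(0)  # average predicted data noise
    epistemic = means.var(0)       # spread of mean estimates across passes
    return predictive_mean, aleatoric, epistemic

# Illustrative usage on a batch of noisy log-spectral frames.
x = torch.randn(4, 257)
net = UncertainMaskNet()
mask, aleatoric, epistemic = mc_dropout_predict(net, x)
```

The predictive uncertainty decomposes as the sum of the two terms: the averaged aleatoric variance (data noise) plus the variance of the mean masks across dropout samples (model uncertainty); a deep ensemble would replace the dropout samples with independently trained networks but aggregate the same way.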