A popular approach to unveiling the black box of neural NLP models is to
leverage saliency methods, which assign scalar importance scores to each input
component. A common practice for evaluating whether an interpretability method
is faithful has been evaluation-by-agreement: if multiple methods
agree on an explanation, its credibility increases. However, recent work has
found that saliency methods exhibit weak rank correlations even when applied to
the same model instance, and has advocated the use of alternative diagnostic
methods. In our work, we demonstrate that rank correlation is not a good fit
for evaluating agreement and argue that Pearson-$r$ is a better-suited
alternative. We further show that regularization techniques that increase the
faithfulness of attention explanations also increase agreement between saliency
methods. By connecting our findings to instance categories based on training
dynamics, we show that the agreement of saliency method explanations is very
low for easy-to-learn instances. Finally, we connect the improvement in
agreement across instance categories to local representation space statistics
of instances, paving the way for work on analyzing which intrinsic model
properties make models more amenable to interpretability methods.
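The sketch below is not from the paper; it is a minimal illustration, with hypothetical saliency scores, of why rank correlation and Pearson-$r$ can disagree as agreement measures: reshuffled ranks among near-zero tokens depress Spearman correlation, while Pearson-$r$ is driven by the tokens both methods mark as important.

```python
# Minimal sketch (hypothetical scores, not the paper's data or setup):
# compare agreement between two saliency methods under Spearman rank
# correlation vs. Pearson-r.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical importance scores from two saliency methods over the same
# 8-token input; both single out tokens 5 and 6, but rank the remaining
# low-importance tokens differently.
saliency_a = np.array([0.01, 0.02, 0.03, 0.05, 0.04, 0.90, 0.85, 0.02])
saliency_b = np.array([0.03, 0.01, 0.04, 0.02, 0.05, 0.95, 0.80, 0.01])

rho, _ = spearmanr(saliency_a, saliency_b)  # sensitive to ordering of near-zero scores
r, _ = pearsonr(saliency_a, saliency_b)     # dominated by the shared high-importance tokens
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```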