Video-grounded dialogue systems aim to integrate video understanding and
dialogue understanding to generate responses that are relevant to both the
dialogue and video context. Most existing approaches employ deep learning
models and have achieved remarkable performance, given the relatively small
datasets available. However, these results are achieved partly by exploiting
dataset biases rather than by developing multimodal reasoning, which limits
generalization. In this paper, we propose a novel approach of
Compositional Counterfactual Contrastive Learning ($C^3$) to develop
contrastive training between factual and counterfactual samples in
video-grounded dialogues. Specifically, we design factual/counterfactual
sampling based on the temporal steps in videos and tokens in dialogues and
propose contrastive loss functions that exploit object-level or action-level
variance. Different from prior approaches, we focus on contrastive hidden state
representations among compositional output tokens to optimize the
representation space in a generation setting. We achieved promising performance
gains on the Audio Visual Scene-Aware Dialog (AVSD) benchmark and showed the
benefits of our approach in grounding video and dialogue context.
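As a rough illustration, a contrastive objective between factual and counterfactual hidden-state representations can be sketched as an InfoNCE-style loss: the anchor representation is pulled toward its factual counterpart and pushed away from counterfactual ones. The function name, cosine similarity, temperature value, and NumPy setup below are illustrative assumptions, not the paper's actual loss formulation.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over hidden-state vectors.

    anchor, positive: (d,) vectors (e.g., factual hidden states);
    negatives: (k, d) matrix (e.g., counterfactual hidden states).
    This is a generic sketch, not the C^3 object-/action-level losses.
    """
    def cos(a, b):
        # Cosine similarity with a small epsilon for numerical safety.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    # Lower loss when the anchor is closer to the positive than to negatives.
    return -np.log(pos / (pos + neg))
```

For example, an anchor compared against an identical positive yields a lower loss than the same anchor compared against a dissimilar one, which is the pressure that shapes the representation space during training.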