Click here to flash read.
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP,
establish the correlation between texts and images, achieving remarkable
success on various downstream tasks with fine-tuning. In existing fine-tuning
methods, the class-specific text description is matched against the whole
image. We recognize that this whole image matching is not effective since
images from the same class often contain a set of different semantic objects,
and an object further consists of a set of semantic parts or concepts.
Individual semantic parts or concepts may appear in image samples from
different classes. To address this issue, in this paper, we develop a new
method called cross-model concept learning and inference (CCLI). Using the
powerful text-image correlation capability of CLIP, our method automatically
learns a large set of distinctive visual concepts from images using a set of
semantic text concepts. Based on these visual concepts, we construct a
discriminative representation of images and learn a concept inference network
to perform downstream image classification tasks, such as few-shot learning and
domain generalization. Extensive experimental results demonstrate that our CCLI
method is able to improve the performance upon the current state-of-the-art
methods by large margins, for example, by up to 8.0% improvement on few-shot
learning and by up to 1.3% for domain generalization.