Interpreting CLIP's Image Representation via Text-Based Decomposition. (arXiv:2310.05916v2 [cs.CV] UPDATED)

Click here to flash read.

We investigate the CLIP image encoder by analyzing how individual model
components affect the final representation. We decompose the image
representation as a sum across individual image patches, model layers, and
attention heads, and use CLIP's text representation to interpret the summands.
Interpreting the attention heads, we characterize each head's role by
automatically finding text representations that span its output space, which
reveals property-specific roles for many heads (e.g. location or shape). Next,
interpreting the image patches, we uncover an emergent spatial localization
within CLIP. Finally, we use this understanding to remove spurious features
from CLIP and to create a strong zero-shot image segmenter. Our results
indicate that a scalable understanding of transformer models is attainable and
can be used to repair and improve models.

Click here to read this post out

ID: 463708; Unique Viewers: 0

Unique Voters: 0

Total Votes: 0

Votes:

Latest Change: Oct. 11, 2023, 7:31 a.m. Changes:

/u/anonymous

Dictionaries:

Words:

Spaces:

CC:
No creative common's license

Comments: