CLIP Exhibits Improved Compositional Generalization Through Representation Disentanglement
Published in conference, 2023
Vision-language models (VLMs), such as CLIP, have shown promising Out-of-Distribution (OoD) generalization under various types of distribution shift. Recent studies have attempted to investigate the leading cause of this property. In this work, we target the same goal, but focus on a specific type of distribution shift in which test images contain unseen compositions of attribute-object pairs, although the objects and attributes are individually seen during training. The models are expected to classify these images into composition classes, i.e., attribute-object pairs, and also into object classes by ignoring attributes. We carefully designed an authentic image test dataset consisting of attributes for objects that are unlikely to be encountered in the CLIP training data. We found that the compositional diversity of the training data, as measured by the normalized mutual information between objects and attributes, has a significant effect on the compositional generalization of CLIP models. We also found that disentanglement of the image/text representations with respect to the composition constituents plays a key role in the improved generalization of these models. We note that larger training datasets could potentially trigger the emergence of such disentanglement, as compositions are typically more diverse in such datasets. We validate this hypothesis through different representation disentanglement metrics, including the Z-Diff and explicitness scores, for various CLIP models.
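As a minimal sketch of the diversity measure mentioned above, the snippet below estimates the normalized mutual information (NMI) between attribute and object labels with scikit-learn. It assumes one (attribute, object) pair has already been extracted per training caption; the example pairs are hypothetical and this is illustrative, not the paper's implementation. NMI near 0 means attributes and objects pair almost independently (many distinct compositions), while NMI near 1 means each attribute is tied to a particular object (low compositional diversity).

```python
# Illustrative sketch (not the paper's exact protocol): estimating the
# compositional diversity of a captioned training set via the normalized
# mutual information (NMI) between attribute and object labels.
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical data: one (attribute, object) pair per training caption.
pairs = [
    ("red", "car"), ("red", "apple"), ("green", "apple"),
    ("green", "car"), ("blue", "car"), ("blue", "ball"),
    ("red", "ball"), ("green", "ball"),
]
attributes = [a for a, _ in pairs]
objects = [o for _, o in pairs]

# NMI is near 0 when attributes and objects co-occur freely (diverse
# compositions) and near 1 when each attribute determines its object.
nmi = normalized_mutual_info_score(attributes, objects)
print(f"attribute-object NMI: {nmi:.3f}")  # lower value -> higher diversity
```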
Download here