Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Vision and AI Lab, Indian Institute of Science
(* indicates equal contribution)

TL;DR: VLMs learn highly specialized image representations and highly generalized text representations that capture the core concept of a class. We propose to distill both the image and text representations of a VLM while using its zero-shot classifier for effective generalization.


Abstract

Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. However, in several cases, their expensive training and data collection/curation costs do not justify the end application. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.



Robustness of CLIP Embeddings

Embedding used for computing similarity OH TI VLCS PACS Avg.
(T.E.: text embedding, I.E.: image embedding; OH: Office-Home, TI: TerraIncognita)
E1: T.E. for "A photo of a {class}" 82.36 34.19 82.08 96.10 73.68
E2: Avg. T.E. for "A {domain} of a {class}" across all train domains 83.70 35.55 82.28 96.21 74.44
E3: Avg. I.E. of each class (Source) 71.37 33.99 48.21 79.03 58.15
E4: Avg. I.E. of each class (Target) 78.21 38.69 69.31 93.08 69.82
E5: Avg. I.E. of 10 images per class closest to test image (Source) 76.42 39.33 76.42 92.15 71.08
E6: Avg. I.E. of 10 images per class closest to test image (Target) 84.86 85.38 87.88 98.32 89.11

  • In these experiments, we demonstrate the characteristics of CLIP's image and text embeddings by using various embeddings (E1-E6) in the cosine similarity computation for zero-shot prediction.
  • The CLIP zero-shot results (E1) demonstrate the robustness of the text embeddings, which is further enhanced by enforcing domain invariance, i.e., averaging the class-wise text embeddings across the training domains (E2).
  • However, the same level of robustness is not observed with an average of the class-wise image embeddings across the source domains (E3). While the text embeddings for “A {domain} of a {class}” are obtained by training over a large dataset (400 million images), the image embeddings are an average over only ∼0.01 million images from the downstream dataset. This result improves when a similar average is computed over the target-domain images (E4), assuming they are accessible.
  • The zero-shot performance improves further with class-wise average embeddings of a few (10) images closest to the test image in the source domains (E5) and the target domain (E6). However, in a DG setting where the target domain is inaccessible, the generic text embeddings provide the best robustness across distribution shifts (a minimal prediction sketch follows this list).
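
The following minimal sketch illustrates this zero-shot protocol for variant E2 (prompt embeddings averaged over train domains), assuming the OpenAI `clip` package; the class names, domain names, and image path are hypothetical placeholders.

```python
# Minimal sketch of CLIP zero-shot prediction with domain-averaged prompts (E2).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["dog", "elephant", "giraffe"]           # hypothetical class names
domains = ["photo", "sketch", "painting", "cartoon"]   # hypothetical train domains

with torch.no_grad():
    # E2: average the text embeddings of "A {domain} of a {class}" across train domains
    text_emb = []
    for c in class_names:
        tokens = clip.tokenize([f"A {d} of a {c}" for d in domains]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        text_emb.append(emb.mean(dim=0))
    text_emb = torch.stack(text_emb)                   # (num_classes, d)

    # Encode the test image and predict via cosine similarity (E1 uses a single prompt instead)
    image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    pred = (img_emb @ text_emb.T).argmax(dim=-1)       # predicted class index
```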


Proposed Method

  • Vision-Language-to-Vision - Self-Distillation (VL2V-SD): A self-distillation approach for the white-box setting, where the invariances of the generic text embeddings are distilled into the image encoder using the downstream dataset. The following cosine similarity loss is minimized: $$\mathcal{L}_{\mathrm{SD}} = - \frac{1}{2n} \sum_{i=1}^n \big\{\cos(\textbf{I}^s_{x_i}, \textbf{T}_{y_i}) + \cos(\textbf{I}^s_{x_i}, \textbf{I}^t_{x_i})\big\}$$
  • Vision-Language-to-Vision - Align, Distill, Predict (VL2V-ADiP): A black-box distillation approach that first aligns the student representations with the image and text embeddings of the VLM, and then distills the VLM's aligned representations to the student. The following cosine similarity loss is minimized (see the sketch after the notation list below): $$\mathcal{L}_{\mathrm{ADiP}} = - \frac{1}{2n} \sum_{i=1}^n \big\{\cos(\textbf{PF}^s_{x_i}, \textbf{T}_{y_i}) + \cos(\textbf{PF}^s_{x_i}, \textbf{I}^t_{x_i})\big\}$$
  • $\textbf{I}^s_{x_i}$: Student embedding for image $x_i$ in the white-box setting
  • $\textbf{PF}^s_{x_i}$: Student embedding for image $x_i$ from the projection layer in the black-box setting
  • $\textbf{I}^t_{x_i}$: VLM teacher's image embedding for input $x_i$
  • $\textbf{T}_{y_i}$: VLM teacher's text embedding for "A photo of a {class}", where {class} is the class name corresponding to the label $y_i$
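
A minimal PyTorch-style sketch of this objective is given below. The tensor names are hypothetical and the teacher embeddings are assumed to be precomputed and frozen; the same function covers VL2V-SD when the student's image embedding $\textbf{I}^s_{x_i}$ is passed in place of the projected feature $\textbf{PF}^s_{x_i}$.

```python
# Minimal sketch of the VL2V-SD / VL2V-ADiP cosine-similarity distillation loss.
# All tensor names are hypothetical; teacher embeddings are assumed precomputed and frozen.
import torch
import torch.nn.functional as F

def vl2v_distill_loss(student_emb: torch.Tensor,      # I^s or PF^s, shape (n, d)
                      teacher_img_emb: torch.Tensor,  # I^t,         shape (n, d)
                      teacher_txt_emb: torch.Tensor   # T_{y_i},     shape (n, d)
                      ) -> torch.Tensor:
    """Computes -(1/2n) * sum_i [cos(student_i, T_{y_i}) + cos(student_i, I^t_i)]."""
    cos_txt = F.cosine_similarity(student_emb, teacher_txt_emb, dim=-1)  # (n,)
    cos_img = F.cosine_similarity(student_emb, teacher_img_emb, dim=-1)  # (n,)
    return -0.5 * (cos_txt.mean() + cos_img.mean())
```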

Main Results

1. White-Box setting (CLIP initialization): Performance (%) of the proposed approach VL2V-SD, compared to existing methods. The ViT-B/16 architecture is used. (S) denotes that SWAD is used.
Method Office-Home TerraInc VLCS PACS DomainNet Avg-ID Avg-OOD
Zero-shot 82.40 34.10 82.30 96.50 57.70 - 70.60
SWAD (NeurIPS'21) 81.01 42.92 79.13 91.35 57.92 89.05 70.47
MIRO (S) (ECCV'22) 84.80 59.30 82.30 96.44 60.47 91.00 76.66
DART (S) (CVPR'23) 80.93 51.24 80.38 93.43 59.32 89.25 73.06
SAGM (S) (CVPR'23) 83.40 58.64 82.05 94.31 59.05 89.74 75.49
LP-FT (S) (ICLR'22) 81.17 47.26 80.88 92.92 57.04 88.97 71.85
FLYP (S) (CVPR'23) 82.76 33.25 66.64 78.53 57.41 78.94 63.72
CLIPood (S) (ICML'23) 83.31 46.28 77.19 93.16 57.78 69.90 71.55
RISE (S) (ICCV'23) 78.39 49.61 80.62 93.25 55.37 87.91 71.45
WiSE-FT (CVPR'22) 86.32 54.50 82.88 97.29 58.01 88.35 75.80
VL2V-SD (Ours) 87.38 58.54 83.25 96.68 62.79 89.99 77.73

2. SOTA comparison with ImageNet initialization: Performance (%) of the proposed approach VL2V-ADiP, compared to existing KD and DG methods (with SWAD). ViT-B/16 with ImageNet-1K initialization is used as the student.
Method Office-Home TerraInc VLCS PACS DomainNet Avg-ID Avg-OOD
ERM Linear Probe 71.48 31.35 77.52 67.02 36.65 73.99 56.81
ERM Full Fine-tuning 83.22 50.05 80.33 90.28 56.10 89.31 72.00
LP-FT (ICLR'22) 81.55 51.61 80.17 91.20 56.03 90.03 72.11
SimKD (CVPR'22) 66.76 28.24 81.01 83.92 49.42 68.24 61.87
KD 82.73 48.40 80.48 91.46 56.11 89.20 71.84
MIRO (ECCV'22) 80.09 50.29 81.10 89.50 55.75 88.71 71.35
DART (CVPR'23) 83.75 49.68 77.29 90.55 58.05 88.54 71.86
SAGM (CVPR'23) 82.22 53.24 79.60 90.02 55.66 89.22 72.15
Text2Concept (ICML'23) 70.24 26.46 64.77 79.03 23.26 53.15 52.75
RISE (ICCV'23) 83.48 52.55 83.70 93.54 56.58 88.91 73.97
VL2V-ADiP (Ours) 85.74 55.43 81.90 94.94 59.38 89.02 75.48

3. Distillation using various VLMs: Performance (%) of the proposed approach VL2V-ADiP (denoted as Ours) on 4 DG datasets, when distilling from FLAVA, BLIP, CLIP and the data-efficient variants DeCLIP and DeFILIP. The student architecture is ViT-B/16 in all cases; the teacher's pre-training dataset is listed alongside each teacher.
Teacher (pre-training data) Method OH VLCS PACS TI Avg.

FLAVA ViT-B/16 (PMD Corpus, 70M)
  Zero-shot 69.99 79.21 91.34 28.85 67.35
  KD (S)    82.50 80.41 90.71 50.86 76.12
  Ours (S)  84.16 82.94 93.22 54.56 78.72

BLIP ViT-B/16 (CapFilt, 129M)
  Zero-shot 84.83 71.60 92.23 29.75 69.60
  KD (S)    82.45 80.31 87.73 48.03 74.63
  Ours (S)  85.86 81.60 94.10 52.07 78.41

CLIP ViT-B/16 (CLIP-400M)
  Zero-shot 81.57 82.55 95.99 31.15 72.81
  KD (S)    82.73 80.48 91.49 48.33 75.76
  Ours (S)  85.74 81.89 94.13 55.43 79.30

DeCLIP ViT-B/32 (YFCC-15M)
  Zero-shot 43.46 77.79 83.69 27.70 58.16
  KD (S)    81.84 79.95 89.96 49.49 75.31
  Ours (S)  82.85 81.40 92.16 50.50 76.73

DeFILIP ViT-B/32 (YFCC-15M)
  Zero-shot 46.97 74.08 82.02 16.34 54.85
  KD (S)    82.14 79.53 90.68 50.96 75.83
  Ours (S)  83.11 81.43 92.03 51.69 77.06

BibTeX

@inproceedings{addepalli2024leveraging,
  author    = {Addepalli, Sravanti and Asokan, Ashish Ramayee and Sharma, Lakshay and Babu, R Venkatesh},
  title     = {Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}