
Concept Attribution and Dual Explainability in Vision-Language Models

EasyChair Preprint 16013

8 pages · Date: March 23, 2026

Abstract

In this paper, we investigate methods for applying explainability to Vision-Language Models (VLMs). The central problem is the absence of a common representation between text and image, which makes it necessary to align the two information streams. Our research focuses on the Contrastive Language-Image Pretraining (CLIP) model, whose architecture consists of two independent Transformer-based encoders. We propose a Concept Attribution method that addresses the problem of fusing visual encoder signals with textual gradients. The method is further optimized by hybridizing the model with a Large Language Model (LLM), which supplies richer linguistic context and thereby improves the precision of explanations for multimodal reasoning. The fusion process is controlled through parameters that balance the textual and visual influence on the explanation. Furthermore, for an objective validation of the results, we integrate Faithfulness Metrics, which both analyze the visual response and refine its visual representation through heatmaps. Finally, we demonstrate the utility of the method by applying it in a Dual Explainability system that validates the coherence between the prompt given by the user and the model's response.
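The abstract describes fusing visual encoder signals with textual gradients under parameters that balance the two modalities. A minimal sketch of such a fusion is given below; it assumes both signals have already been projected onto the same image-patch grid, and the parameter name `alpha` and the convex-combination form are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_attributions(visual_saliency, text_grad_weights, alpha=0.5):
    """Blend a visual saliency map with text-gradient concept weights.

    Both inputs are assumed to be arrays of the same shape (e.g. one
    value per image patch). `alpha` is a hypothetical balancing
    parameter: alpha=1.0 keeps only the textual signal, alpha=0.0
    only the visual one.
    """
    # Min-max normalize each signal to [0, 1] so they are comparable;
    # the small epsilon guards against a constant (zero-range) map.
    v = (visual_saliency - visual_saliency.min()) / (np.ptp(visual_saliency) + 1e-8)
    t = (text_grad_weights - text_grad_weights.min()) / (np.ptp(text_grad_weights) + 1e-8)
    # Convex combination of the two normalized attribution maps.
    return alpha * t + (1.0 - alpha) * v
```

The fused map can then be rendered as a heatmap over the input image; sweeping `alpha` makes the trade-off between textual and visual influence on the explanation directly visible.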

Keyphrases: CLIP, Explainability, MLLM, VLM, transformers

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:16013,
  author    = {Laura-Luisa Voicu and Sebastian-Antonio Toma and Vlad Andrei Negru and Camelia Lemnaru and Rodica Potolea},
  title     = {Concept Attribution and Dual Explainability in Vision-Language Models},
  howpublished = {EasyChair Preprint 16013},
  year      = {EasyChair, 2026}}