Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models

NeurIPS 2024 (Datasets & Benchmarks Track)

¹University of Oxford, ²University of Illinois Urbana-Champaign
*Indicates Equal Contribution

Overview of IllusionBench. For each of the three datasets in IllusionBench, we show an example image alongside the scene prompt and the shape-conditioning image used to generate it.

Abstract

Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems.

Can vision-language models (VLMs) recognize these shapes? The IllusionBench dataset contains images in which scene elements are arranged to represent abstract shapes.

Results

Zero-Shot Leaderboard

Model             Illusion-IN  Illusion-Logo  Illusion-Icon  Average
Gemini-Flash            31.54          33.99          27.35    30.96
GPT-4O                  35.10          21.96          21.37    26.14
Llava1.5-7b             26.56          25.86          11.82    21.41
MoE-Phi2                23.34          24.65          13.43    20.47
Llava1.5-13b            23.44          26.94          11.02    20.47
InstructBlip-13b        26.01          25.05           7.76    19.60
CogVLM                  18.17          21.49          12.27    17.31
InstructBlip-T5         17.13          22.18          10.37    16.56
Llava1.6-7b             18.69          15.33          13.72    15.91
MoE-StableLM            10.83          20.60          15.63    15.69
InstructBlip-7b         19.93          25.66           1.09    15.56
Qwen                    16.70          21.65           6.03    14.79
MoE-Qwen                11.73          21.35           9.80    14.29
Blipv2-t5               13.08          19.91           4.68    12.56

Zero-shot shape recall (%) for each model on the three IllusionBench datasets, sorted by average recall.
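
The Average column in the leaderboard appears to be the unweighted mean of the three per-dataset recall values. This is inferred from the reported numbers, not stated on the page. The minimal Python sketch below recomputes it for three rows copied from the table and ranks the models by the result. The names `rows` and `unweighted_mean` are illustrative only and are not part of any IllusionBench code.

```python
# Per-dataset recall (%) copied from three leaderboard rows, in the order
# (Illusion-IN, Illusion-Logo, Illusion-Icon).
rows = {
    "Gemini-Flash": (31.54, 33.99, 27.35),
    "GPT-4O": (35.10, 21.96, 21.37),
    "Llava1.5-7b": (26.56, 25.86, 11.82),
}


def unweighted_mean(scores):
    """Plain arithmetic mean; assumes every dataset counts equally."""
    return sum(scores) / len(scores)


# Recompute the Average column, rounded to two decimals like the table.
averages = {model: round(unweighted_mean(s), 2) for model, s in rows.items()}

# Rank models from highest to lowest average recall.
ranking = sorted(averages, key=averages.get, reverse=True)

for model in ranking:
    print(f"{model:<14}{averages[model]:6.2f}")
```

For these three rows the recomputed values match the reported Average column (30.96, 26.14, 21.41). A mean weighted by dataset size would generally give different numbers, so the equal-weight assumption is worth keeping in mind when comparing models.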

BibTeX

@inproceedings{hemmat2024hidden,
  title={Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models},
  author={Arshia Hemmat and Adam Davies and Tom A. Lamb and Jianhao Yuan and Philip Torr and Ashkan Khakzar and Francesco Pinto},
  booktitle={Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024}
}