Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision‑Language Models

Published in NeurIPS 2024 (Datasets & Benchmarks Track)

This paper presents IllusionBench, a collection of three datasets for evaluating whether vision‑language models (VLMs) perceive abstract shapes that are formed by the arrangement of objects within a scene. The authors condition diffusion models on binary masks to hide letters, faces and animals in generated scenes, so that recognizing the hidden shape requires gestalt perception. Zero‑shot and few‑shot experiments on GPT‑4o, Gemini, LLaVA and other VLMs show that humans find these tasks trivial while even the best models struggle, which underscores the need for better shape recognition and motivates future work on robust multi‑modal models.
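As a rough illustration of how zero‑shot shape recognition could be scored, the sketch below maps a model's free‑text answer to one of a fixed set of candidate shape labels and computes accuracy. It is not the paper's evaluation protocol: the function names, the candidate list and the rule that an answer counts only when it mentions exactly one candidate are all assumptions made for this example.

```python
import re
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def predicted_label(response: str, candidates: List[str]) -> Optional[str]:
    """Return the single candidate mentioned in the response.

    Hypothetical rule: if the response mentions zero or several
    candidates, it is treated as no prediction (None).
    """
    padded = f" {normalize(response)} "
    hits = [c for c in candidates if f" {normalize(c)} " in padded]
    return hits[0] if len(hits) == 1 else None


def accuracy(responses: List[str], labels: List[str], candidates: List[str]) -> float:
    """Fraction of responses whose predicted label equals the ground truth."""
    correct = sum(
        predicted_label(r, candidates) == y for r, y in zip(responses, labels)
    )
    return correct / len(labels)


if __name__ == "__main__":
    # Illustrative data only; not taken from IllusionBench.
    candidates = ["dog", "cat", "guitar"]
    responses = [
        "I see a dog hidden in the forest.",
        "This is a medieval village.",
        "It could be a cat or a dog.",
    ]
    labels = ["dog", "guitar", "cat"]
    print(accuracy(responses, labels, candidates))
```

Requiring exactly one matching candidate is a deliberately strict choice: it stops a model from scoring by listing every option, at the cost of rejecting hedged but partly correct answers.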

Project Page | OpenReview