MEENA (PersianMMMU): Multimodal‑Multilingual Educational Exams for N‑level Assessment
Under review at COLM 2025
MEENA (PersianMMMU) is a comprehensive multimodal, multilingual benchmark designed to assess the reasoning and problem‑solving skills of vision‑language models across educational levels. The benchmark comprises 7,500 Persian and 3,000 English multiple‑choice questions covering subjects from elementary through high school. Each question is annotated with metadata such as difficulty level, an explanation of the correct answer, and historical student success rates, enabling fine‑grained evaluation. The paper reports results for GPT‑4, Gemini, and other leading models in zero‑shot, few‑shot, and hallucination‑detection settings, illustrating both the potential and the current limitations of these systems in educational applications.
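The per‑question metadata lends itself to breaking model accuracy down by subgroup. The following is a minimal Python sketch of that idea, not the paper's evaluation code: the `Question` record, its field names, and the `accuracy_by_difficulty` helper are all hypothetical, since the actual MEENA schema is not given here, and the records are toy values rather than benchmark data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical fields; the real MEENA schema may differ.
    language: str           # e.g. "fa" (Persian) or "en" (English)
    difficulty: str         # e.g. "easy" or "hard"
    answer: int             # index of the correct option
    student_success: float  # historical fraction of students answering correctly


def accuracy_by_difficulty(questions, predictions):
    """Return {difficulty: fraction of questions the model answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q.difficulty] += 1
        correct[q.difficulty] += int(pred == q.answer)
    return {level: correct[level] / total[level] for level in total}


# Toy records for illustration only, not drawn from the benchmark.
questions = [
    Question("fa", "easy", 2, 0.90),
    Question("fa", "hard", 0, 0.25),
    Question("en", "easy", 1, 0.75),
    Question("en", "hard", 3, 0.50),
]
predictions = [2, 1, 1, 3]  # hypothetical model outputs (option indices)

scores = accuracy_by_difficulty(questions, predictions)
print(scores)
```

The same grouping could be keyed on `language` or on binned `student_success` to compare model accuracy against historical student performance.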