Arshia Hemmat

MEENA (PersianMMMU): Multimodal‑Multilingual Educational Exams for N‑level Assessment

COLM 2025 – Under review

MEENA (PersianMMMU) introduces the first large-scale Persian multimodal benchmark designed to evaluate the reasoning and problem-solving abilities of vision-language models across educational levels. The benchmark contains 7,500 Persian and 3,000 English multiple-choice questions covering subjects from elementary through high school. Each question carries rich metadata (difficulty level, a detailed answer explanation, and the historical student success rate), which makes fine-grained analysis possible, for example breaking model accuracy down by difficulty.

The benchmark challenges models to answer questions that require both visual and textual understanding, ranging from simple geometry to more abstract scientific reasoning. The paper reports results on GPT‑4, Gemini and other leading models in zero‑shot, few‑shot and hallucination detection settings. The findings highlight current limitations in multimodal models and provide a foundation for future research in educational AI.
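
As a rough illustration of the fine-grained analysis that the per-question metadata allows, the sketch below groups model accuracy by difficulty level. It is only a minimal example under stated assumptions: the `Question` record, its field names, and the sample values are hypothetical and are not the benchmark's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record; the benchmark's real field names may differ.
    language: str                # e.g. "fa" or "en"
    difficulty: str              # e.g. "easy", "medium", "hard"
    correct_choice: int          # index of the correct option
    student_success_rate: float  # fraction of students who answered correctly


def accuracy_by_difficulty(questions, predictions):
    """Return {difficulty: fraction of predictions that match the answer key}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.difficulty] += 1
        if predicted == question.correct_choice:
            correct[question.difficulty] += 1
    return {level: correct[level] / total[level] for level in total}


# Made-up toy data, for illustration only.
sample_questions = [
    Question("fa", "easy", 0, 0.9),
    Question("fa", "easy", 1, 0.8),
    Question("en", "hard", 2, 0.3),
]
sample_predictions = [0, 2, 2]
scores = accuracy_by_difficulty(sample_questions, sample_predictions)
```

The same grouping could be keyed on language or on bins of the student success rate to compare model behaviour against human performance.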
