AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Abstract

The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general- purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM’ visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-BENCH, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs)—foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-BENCH pinpoints exactly where a VFM excels or falters. Applying AVA-BENCH to leading VFMs thus reveals distinctive “ability fingerprints,” turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8×, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-BENCH lays the foundation for the next generation of VFMs.

Contributions

Systematic Benchmark of Visual Abilities: We identify critical blind spots in existing evaluation protocols and introduce AVA-BENCH, a systematic, diagnostic, and comprehensive VFM evaluation benchmark covering 14 atomic visual abilities (AVAs), with carefully curated 218K samples with 26 datasets.
Actionable VFM Analysis: We conduct a detailed evaluation and insightful analysis of diverse leading VFMs, deriving actionable guidance for VFM selection in downstream applications such as customized MLLMs.
Lightweight, Open Evaluation Protocol: We release a resource-efficient evaluation protocol along with an open-source codebase to facilitate the development of the next generation of accountable and versatile VFMs.

14 AVA Samples Description — Overall statistics of AVA-BENCH

Highlights of Insights

14 AVA Samples — Performance comparison of VFMs across all AVAs. (Left) Language-Supervised VFMs with DINOv2 as a reference. (Right) Other VFMs with the SigLIP-2 as a reference.

Language-supervised VFMs excel broadly across AVAs.
All VFMs perform well in certain AVAs (e.g., relative depth, object recognition).
Non-language-aligned VFMs consistently fail OCR tasks.
Even lower-performing VFMs excel in at least one AVA.
VFMs perform similarly for large objects. Poor performance in some VFMs is mainly due to struggles in small objects.
Composite task failures typically stem from specific AVA deficiencies rather than general visual incompetence. With Bounding box, all VFMs perform perfectly in spatial AVA; without them, models with weaker localization perform worse

BibTeX


@article{mai2025ava,
  title={AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models},
  author={Mai, Zheda and Chowdhury, Arpita and Wang, Zihe and Jeon, Sooyoung and Wang, Lemeng and Hou, Jiacheng and Kil, Jihyung and Chao, Wei-Lun},
  journal={arXiv preprint arXiv:2506.09082},
  year={2025}
}