CVPR 2026

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai*¹, Arpita Chowdhury*¹, Zihe Wang*¹, Sooyoung Jeon¹, Lemeng Wang¹,
Jiacheng Hou¹, Jihyung Kil², Wei-Lun Chao³

* Equal contribution

¹The Ohio State University

²Adobe Research

³Boston University

TL;DR: AVA-Bench is a diagnostic benchmark for Vision Foundation Models (VFMs) that breaks visual understanding into 14 Atomic Visual Abilities, such as localization, counting, depth, OCR, and spatial reasoning. Instead of asking which VFM is best overall, AVA-Bench reveals where each model excels or fails, enabling principled VFM selection for downstream applications.

AVA-Bench overview with models and atomic visual abilities

Diagnosis

14 AVAs pinpoint visual strengths and weaknesses.

Guided Selection

Ability fingerprints reveal VFM fits for applications.

Efficient & Open

Efficient evaluation and an open-source benchmark.

AVAs are fundamental visual abilities (e.g., counting, localization, depth estimation), enabling complex visual reasoning tasks.

VFM Evaluation

Traditional evaluation uses task-specific heads, such as linear probing or full fine-tuning, for each downstream task.

LLM-based evaluation uses visual instruction tuning with LLMs, then tests VFMs on diverse VQA benchmarks.

Two Blind Spots in LLM-based Evaluation

Atomic Visual Abilities

AVA-Bench is the first systematic evaluation explicitly disentangling 14 Atomic Visual Abilities (AVAs) for VFMs.

AVAs are fundamental perceptual capabilities that can be combined to address more complex visual reasoning tasks.

Examples of 14 atomic visual abilities in AVA-Bench

Designed to Isolate One Ability at a Time

Each AVA comes with train-test-matched data.

Given an AVA, image-question pairs are carefully designed to test only that AVA.

This eliminates the two blind spots and lets AVA-Bench pinpoint exactly where a VFM excels or falters.

Absolute depth example with provided car bounding box — Providing the car's bounding box removes localization from the question, isolating depth estimation.

Scale, Coverage, and Quality Control

AVA-Bench statistics: 14 AVAs, 26 datasets, 218K image-question pairs

AVA-Bench dataset composition sunburst chart

AVA-Bench carefully controls dataset balance, object visibility, and annotation biases.

Our evaluation follows an LLaVA-style interface, and the VFM remains frozen.

Following LLaVA's two-stage training (connector pretraining and instruction tuning), AVA-Bench adds a third stage: for each ability, train only the connector and a small LoRA module in the LLM, then evaluate on that ability-specific test set.

Three-stage AVA-Bench evaluation pipeline
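
As a rough illustration of the third stage only, the sketch below records which components would be updated for a given ability. The component names (`vfm`, `connector`, `llm_base`, `llm_lora`) and the `stage3_plan` helper are hypothetical labels chosen for this example, not identifiers from the AVA-Bench code.

```python
# Component names are illustrative labels, not identifiers from the AVA-Bench code.
COMPONENTS = ("vfm", "connector", "llm_base", "llm_lora")

# Per the description above: in the ability-specific stage only the connector
# and the LoRA module are updated; the VFM (and the base LLM weights) stay frozen.
STAGE3_TRAINABLE = frozenset({"connector", "llm_lora"})


def stage3_plan(ava: str) -> dict:
    """Return a per-component trainable/frozen plan for one AVA."""
    return {
        "ava": ava,
        "trainable": {name: name in STAGE3_TRAINABLE for name in COMPONENTS},
    }


plan = stage3_plan("counting")
for name, trainable in plan["trainable"].items():
    print(f"{name:10s} {'train' if trainable else 'frozen'}")
```

In a real training framework, the same plan would typically be applied by toggling gradient updates per parameter group.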

A heavyweight LLM may NOT be required for reliable comparative evaluations.

0.5B Qwen2 LLM

~8x lower GPU hours

Preserves relative VFM rankings similar to those obtained with a 7B Vicuna.

0.5B versus 7B evaluator ranking consistency scatter plots — Qwen2-0.5B and Vicuna-1.5-7B preserve similar relative VFM rankings across task groups.
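
One common way to quantify how well two evaluators agree on a ranking is Spearman's rank correlation. The page does not state which agreement measure was used, so the sketch below is only an assumption. The model names and scores are made-up placeholders, not results from AVA-Bench.

```python
def ranks(scores: dict) -> dict:
    """Map each model to its rank (1 = best). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def spearman(scores_a: dict, scores_b: dict) -> float:
    """Spearman rank correlation for two score tables over the same models."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both evaluators must score the same models")
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Placeholder scores for four hypothetical VFMs under two evaluators.
small_llm = {"vfm_a": 0.61, "vfm_b": 0.55, "vfm_c": 0.70, "vfm_d": 0.40}
large_llm = {"vfm_a": 0.72, "vfm_b": 0.74, "vfm_c": 0.80, "vfm_d": 0.50}

rho = spearman(small_llm, large_llm)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the two evaluators order the VFMs almost identically, even when their absolute scores differ.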

Overall Findings

Heatmap of VFM ranks over AVAs with highlighted best model-ability matches — Per-AVA ranking exposes each model's strengths, weaknesses, and standout abilities.

Each VFM has an AVA fingerprint.

Language-supervised VFMs excel broadly across AVAs.

Even weaker VFMs perform well in at least one AVA.
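
A minimal sketch of how a per-AVA rank table (an "AVA fingerprint") could be derived from raw scores. The model names, the three abilities shown, and all numbers are hypothetical placeholders, not AVA-Bench results.

```python
def per_ava_ranks(scores: dict) -> dict:
    """scores[model][ava] -> table[model][ava] = rank (1 = best). Assumes no ties."""
    avas = list(next(iter(scores.values())))
    table = {model: {} for model in scores}
    for ava in avas:
        ordered = sorted(scores, key=lambda m: scores[m][ava], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            table[model][ava] = rank
    return table


# Placeholder scores for three hypothetical VFMs on three abilities.
scores = {
    "vfm_a": {"counting": 0.8, "depth": 0.6, "ocr": 0.3},
    "vfm_b": {"counting": 0.7, "depth": 0.9, "ocr": 0.5},
    "vfm_c": {"counting": 0.4, "depth": 0.5, "ocr": 0.6},
}

fingerprints = per_ava_ranks(scores)
for model, fp in fingerprints.items():
    best = min(fp, key=fp.get)
    print(model, fp, "best at:", best)
```

In this toy table, `vfm_c` ranks last on two abilities but first on OCR, which mirrors the observation that a weaker model can still stand out on one AVA.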

Diagnostics Reveal Failure Source

Composite task failures typically stem from specific AVA deficiencies rather than general visual incompetence.

Dog and plane image with red dog box and blue plane box

Where is the dog (annotated by the red box) located with respect to the plane (annotated by the blue box)?

A. Left above B. Left below C. Right above D. Right below.

Spatial task scores when bounding boxes are provided — With boxes provided, the task isolates spatial reasoning and performance remains high.

Dog and plane image without bounding boxes

Where is the dog located with respect to the plane?

A. Left above B. Left below C. Right above D. Right below.

Spatial and localization scores after removing bounding boxes — Without boxes, the same composite question becomes bottlenecked by localization.
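
To make the distinction concrete: once both boxes are given, the four answer options reduce to a comparison of box positions. The sketch below compares box centers. This is an assumption made for illustration, not a description of how AVA-Bench derives its labels. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` in pixel coordinates with `y` increasing downward.

```python
def center(box):
    """Center point of an (x_min, y_min, x_max, y_max) box."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2


def relation(subject_box, reference_box) -> str:
    """Position of the subject relative to the reference, by box centers."""
    sx, sy = center(subject_box)
    rx, ry = center(reference_box)
    horizontal = "Left" if sx < rx else "Right"
    vertical = "above" if sy < ry else "below"  # smaller y is higher in the image
    return f"{horizontal} {vertical}"


# Hypothetical boxes: a "dog" near the top-left, a "plane" further right and lower.
dog_box = (10, 10, 50, 50)
plane_box = (200, 150, 300, 250)
print(relation(dog_box, plane_box))
```

Without the boxes, a model must first find both objects before any such comparison is possible, which is where localization becomes the bottleneck.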

Subgroup Analyses Can Reveal Hidden Patterns

Localization performance by object size subgroup — 0.1 means the bounding box size is 10% of the image size.

Large objects: similar performance across VFMs.

Small objects: primary performance bottleneck.
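
A minimal sketch of such a subgroup analysis, assuming "size" means the ratio of box area to image area. The bucket edges (`0.1`, `0.3`) and the per-sample scores are hypothetical choices for illustration, not the bins or results used in AVA-Bench.

```python
from collections import defaultdict


def relative_box_size(box, image_w, image_h) -> float:
    """Box area divided by image area, for an (x_min, y_min, x_max, y_max) box."""
    x0, y0, x1, y1 = box
    return ((x1 - x0) * (y1 - y0)) / (image_w * image_h)


def size_bucket(ratio, edges=(0.1, 0.3)) -> str:
    """Assign a size subgroup; the edges are illustrative assumptions."""
    if ratio < edges[0]:
        return "small"
    if ratio < edges[1]:
        return "medium"
    return "large"


def subgroup_means(samples) -> dict:
    """samples: iterable of (size_ratio, per-sample score) -> mean score per bucket."""
    groups = defaultdict(list)
    for ratio, score in samples:
        groups[size_bucket(ratio)].append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in groups.items()}


# A 100x100 box in a 400x250 image covers 10% of the image.
print(relative_box_size((0, 0, 100, 100), 400, 250))

# Placeholder (size_ratio, score) pairs for one hypothetical VFM.
print(subgroup_means([(0.05, 0.25), (0.08, 0.75), (0.5, 0.9)]))
```

Reporting the mean per bucket, rather than a single overall score, is what exposes a gap that is concentrated in one subgroup.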

Insights and Practical Selection

While multimodal large language models (MLLMs) have demonstrated remarkable versatility, they are not universally effective, especially in specialized domains, so there is a growing need for specialized MLLMs. Currently, selecting appropriate VFMs for such customized MLLMs remains largely heuristic. Our work provides actionable insights that turn this selection process from heuristic guesswork into principled engineering: by identifying AVA-specific strengths and weaknesses, practitioners can systematically choose VFMs that match the visual demands of their target downstream tasks. Moreover, AVA-Bench is a step toward next-generation VFMs, providing a systematic, diagnostic, and comprehensive evaluation framework. It enables VFM developers to pinpoint specific deficiencies and implement targeted improvements, fostering more robust, versatile, and well-rounded VFMs.

From AVA demands to principled VFM selection — From heuristic VFM choice to principled AVA-driven selection.
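
One simple way to turn AVA demands into a model choice is to weight each ability by how much the application depends on it and rank VFMs by the weighted sum of their per-AVA scores. The page does not prescribe a scoring formula, so the weighted sum here is an assumption. The model names, scores, and weights are all hypothetical.

```python
def rank_vfms(scores: dict, demands: dict) -> list:
    """Rank models by the demand-weighted sum of their per-AVA scores.

    scores[model][ava] is a per-ability score; demands[ava] is a non-negative
    weight expressing how much the target application relies on that ability.
    """
    totals = {
        model: sum(weight * per_ava[ava] for ava, weight in demands.items())
        for model, per_ava in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder per-AVA scores for three hypothetical VFMs.
scores = {
    "vfm_a": {"counting": 0.8, "depth": 0.6, "ocr": 0.3},
    "vfm_b": {"counting": 0.7, "depth": 0.9, "ocr": 0.5},
    "vfm_c": {"counting": 0.4, "depth": 0.5, "ocr": 0.6},
}

# A hypothetical document-understanding application: mostly OCR, some counting.
demands = {"ocr": 0.7, "counting": 0.3}

ranking = rank_vfms(scores, demands)
for model, total in ranking:
    print(f"{model}: {total:.2f}")
```

Changing the demand weights changes the ranking, which is the point: the preferred VFM depends on which abilities the application needs.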

Final Takeaway

AVA-Bench makes VFM choice diagnosable, actionable, and efficient by isolating what each model can truly do.