CVPR 2026

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai*1, Arpita Chowdhury*1, Zihe Wang*1, Sooyoung Jeon1, Lemeng Wang1,
Jiacheng Hou1, Jihyung Kil2, Wei-Lun Chao3

* Equal contribution

1 The Ohio State University
2 Adobe Research
3 Boston University

TL;DR: AVA-Bench is a diagnostic benchmark for Vision Foundation Models (VFMs) that breaks visual understanding into 14 Atomic Visual Abilities, such as localization, counting, depth, OCR, and spatial reasoning. Instead of asking which VFM is best overall, AVA-Bench reveals where each model excels or fails, enabling principled VFM selection for downstream applications.


1

Highlight

AVA-Bench overview with models and atomic visual abilities
Diagnosis

14 AVAs pinpoint visual strengths and weaknesses.

Guided Selection

Ability fingerprints reveal VFM fits for applications.

Efficient & Open

Efficient evaluation and an open-source benchmark.

AVAs are fundamental visual abilities (e.g., counting, localization, depth estimation), enabling complex visual reasoning tasks.

2

Motivation

VFM Evaluation

Traditional and LLM-based VFM evaluation pipelines

Traditional evaluation uses task-specific heads, such as linear probing or full fine-tuning, for each downstream task.

LLM-based evaluation uses visual instruction tuning with LLMs, then tests VFMs on diverse VQA benchmarks.

Two Blind Spots in LLM-based Evaluation

Train-test mismatch in LLM-based VFM evaluation

A wrong prediction may arise from train-test mismatch rather than genuine visual deficiencies in a VFM.

VQA question requiring multiple atomic visual abilities

VQA questions often require multiple abilities simultaneously, making it hard to tell whether a failure stems from a lack of all the required abilities or from a single critical one.

3

AVA-Bench

Atomic Visual Abilities

AVA-Bench is the first systematic evaluation explicitly disentangling 14 Atomic Visual Abilities (AVAs) for VFMs.

AVAs are fundamental perceptual capabilities that can be combined to address more complex visual reasoning tasks.

Examples of 14 atomic visual abilities in AVA-Bench

Designed to Isolate One Ability at a Time

Each AVA comes with train-test-matched data.

Given an AVA, image-question pairs are carefully designed to test only that AVA.

This eliminates the two blind spots and lets AVA-Bench pinpoint exactly where a VFM excels or falters.

Absolute depth example with provided car bounding box
Providing the car's bounding box removes localization from the question, isolating depth estimation.
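
As a rough illustration of this isolation idea, the sketch below builds the same depth question with and without a provided box. The question wording, the `Box` type, and the `depth_question` helper are hypothetical and are not taken from the AVA-Bench data; they only show how supplying the location removes localization from the task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Pixel-coordinate bounding box (assumed format: x_min, y_min, x_max, y_max)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def depth_question(object_name: str, box: Optional[Box] = None) -> str:
    """Build a depth question; passing a box removes the need to localize the object."""
    if box is None:
        # Composite form: the model must first find the object, then estimate depth.
        return f"How far is the {object_name} from the camera?"
    # Isolated form: the location is given, so only depth estimation is tested.
    coords = f"[{box.x_min}, {box.y_min}, {box.x_max}, {box.y_max}]"
    return f"How far is the {object_name} (annotated by the box {coords}) from the camera?"


isolated = depth_question("car", Box(120, 80, 360, 240))
composite = depth_question("car")
print(isolated)
print(composite)
```

A failure on the composite form is ambiguous, while a failure on the isolated form can be attributed to depth estimation alone.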

Scale, Coverage, and Quality Control

AVA-Bench statistics: 14 AVAs, 26 datasets, 218K image-question pairs
AVA-Bench dataset composition sunburst chart

AVA-Bench carefully controls dataset balance, object visibility, and annotation biases.

4

Efficient Evaluation Pipeline

Our evaluation follows an LLaVA-style interface, and the VFM remains frozen.

Following LLaVA's two-stage training (connector pretraining and instruction tuning), AVA-Bench adds a third stage: for each ability, train only the connector and a small LoRA module in the LLM, then evaluate on that ability-specific test set.

Three-stage AVA-Bench evaluation pipeline
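
The three stages can be summarized as a small configuration sketch that records which components are updated at each stage. The component names and the `Stage` type are hypothetical. The trainable set for stage 2 is an assumption that follows the original LLaVA recipe, since the text above only specifies stage 3 and the frozen VFM.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical component names for a LLaVA-style model.
COMPONENTS = ("vfm", "connector", "llm", "llm_lora")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: FrozenSet[str]
    per_ability: bool = False  # True when the stage is repeated once per AVA


STAGES: Tuple[Stage, ...] = (
    Stage("connector_pretraining", frozenset({"connector"})),
    # Assumption: as in the original LLaVA recipe, connector and LLM are updated here.
    Stage("instruction_tuning", frozenset({"connector", "llm"})),
    # From the text: only the connector and a small LoRA module in the LLM are trained.
    Stage("ability_specific_tuning", frozenset({"connector", "llm_lora"}), per_ability=True),
)


def frozen_components(stage: Stage) -> List[str]:
    """Return the components that receive no gradient updates in this stage."""
    return sorted(set(COMPONENTS) - stage.trainable)


for stage in STAGES:
    print(stage.name, "-> frozen:", frozen_components(stage))
```

In every stage the VFM appears in the frozen list, which is what lets the final per-ability scores be attributed to the visual features rather than to VFM fine-tuning.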

A heavyweight LLM may NOT be required for reliable comparative evaluations.

0.5B Qwen2 LLM
~8x lower GPU hours

Preserves similar relative VFM rankings to a 7B Vicuna.

0.5B versus 7B evaluator ranking consistency scatter plots
Qwen2-0.5B and Vicuna-1.5-7B preserve similar relative VFM rankings across task groups.
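
One common way to quantify how well two evaluators agree on relative rankings is Spearman's rank correlation. The sketch below is a minimal pure-Python version. The source does not state which agreement measure was used, and the score lists are made-up placeholders, not results from the benchmark.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Placeholder per-VFM scores under a small and a large evaluator (not real data).
small_llm_scores = [0.42, 0.55, 0.61, 0.70]
large_llm_scores = [0.50, 0.63, 0.66, 0.81]
rho = spearman(small_llm_scores, large_llm_scores)
print(rho)
```

A value near 1 means the smaller evaluator orders the VFMs the same way as the larger one, even if absolute scores differ.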
5

Key Findings

Overall Findings

Heatmap of VFM ranks over AVAs with highlighted best model-ability matches
Per-AVA ranking exposes each model's strengths, weaknesses, and standout abilities.

Each VFM has an AVA fingerprint.

Language-supervised VFMs excel broadly across AVAs.

Even weaker VFMs perform well in at least one AVA.
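
A fingerprint of this kind can be derived by ranking models separately on each AVA. The following is a minimal sketch. The model names, AVA keys, and scores are hypothetical placeholders, and ties are broken by input order rather than by any rule from the benchmark.

```python
from typing import Dict


def rank_per_ava(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, int]]:
    """Return {model: {ava: rank}}, where rank 1 is the best score on that AVA."""
    avas = list(next(iter(scores.values())).keys())
    fingerprints: Dict[str, Dict[str, int]] = {model: {} for model in scores}
    for ava in avas:
        # Higher score is better; sorted() is stable, so ties keep input order.
        ordered = sorted(scores, key=lambda m: scores[m][ava], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            fingerprints[model][ava] = rank
    return fingerprints


# Placeholder scores (not real results).
scores = {
    "vfm_a": {"counting": 0.71, "ocr": 0.40, "depth": 0.55},
    "vfm_b": {"counting": 0.64, "ocr": 0.82, "depth": 0.60},
    "vfm_c": {"counting": 0.50, "ocr": 0.35, "depth": 0.77},
}
fingerprints = rank_per_ava(scores)
print(fingerprints["vfm_c"])
```

In this toy input, `vfm_c` ranks last on two abilities but first on depth, mirroring the observation that a weaker model can still lead on one AVA.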

Diagnostics Reveal Failure Source

Composite task failures typically stem from specific AVA deficiencies rather than general visual incompetence.

Dog and plane image with red dog box and blue plane box

Where is the dog (annotated by the red box) located with respect to the plane (annotated by the blue box)?

A. Left above B. Left below C. Right above D. Right below.

Spatial task scores when bounding boxes are provided
With boxes provided, the task isolates spatial reasoning and performance remains high.
Dog and plane image without bounding boxes

Where is the dog located with respect to the plane?

A. Left above B. Left below C. Right above D. Right below.

Spatial and localization scores after removing bounding boxes
Without boxes, the same composite question becomes bottlenecked by localization.

Subgroup Analyses Can Reveal Hidden Patterns

Localization performance by object size subgroup
0.1 means the bounding box size is 10% of the image size.

Large objects: similar performance across VFMs.

Small objects: primary performance bottleneck.
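
A subgroup analysis like this can be sketched by bucketing test samples by relative object size and computing accuracy per bucket. The bucket edges, the helper names, and the reading of "size" as box area over image area are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

EDGES = [0.1, 0.2, 0.3]  # hypothetical bucket upper bounds (fraction of the image)


def size_ratio(box_w: float, box_h: float, img_w: float, img_h: float) -> float:
    """Box area divided by image area (assumed definition of relative size)."""
    return (box_w * box_h) / (img_w * img_h)


def bucket_label(ratio: float, edges: Sequence[float] = EDGES) -> str:
    """Map a ratio to the first bucket whose upper bound is >= ratio."""
    idx = bisect_left(edges, ratio)
    return f"<={edges[idx]}" if idx < len(edges) else f">{edges[-1]}"


def accuracy_by_size(samples: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """samples: (size ratio, prediction was correct) pairs -> accuracy per bucket."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ratio, correct in samples:
        label = bucket_label(ratio)
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Placeholder samples (not real results).
samples = [(0.05, False), (0.08, True), (0.15, True), (0.5, True)]
by_size = accuracy_by_size(samples)
print(by_size)
```

Reporting accuracy per bucket instead of a single average is what exposes a gap that is concentrated in small objects.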

Insights and Practical Selection

While MLLMs have demonstrated remarkable versatility, they are not universally effective, especially in specialized domains, so there is a growing need for specialized MLLMs. Currently, selecting appropriate VFMs for such customized MLLMs remains largely heuristic. Our work provides actionable insights that turn this selection process from heuristic guesswork into principled engineering. By clearly identifying AVA-specific strengths and weaknesses, practitioners can systematically choose VFMs that address the particular visual demands of their target downstream tasks. Moreover, AVA-Bench represents a critical step towards developing next-generation VFMs by providing a systematic, diagnostic, and comprehensive evaluation framework. The benchmark enables VFM developers to pinpoint specific deficiencies and implement targeted improvements, fostering more robust, versatile, and well-rounded VFMs.

From AVA demands to principled VFM selection
From heuristic VFM choice to principled AVA-driven selection.
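
One simple way to act on AVA-specific scores is to weight each ability by how much a target application depends on it and pick the VFM with the highest weighted score. The sketch below is an illustration of that idea only. The weighting scheme, model names, and numbers are hypothetical and are not a selection procedure described in the source.

```python
from typing import Dict


def select_vfm(scores: Dict[str, Dict[str, float]], demands: Dict[str, float]) -> str:
    """Return the model with the highest demand-weighted average AVA score."""
    total = sum(demands.values())

    def weighted(model: str) -> float:
        return sum(w * scores[model][ava] for ava, w in demands.items()) / total

    return max(scores, key=weighted)


# Placeholder per-AVA scores (not real results).
scores = {
    "vfm_a": {"counting": 0.71, "ocr": 0.40, "depth": 0.55},
    "vfm_b": {"counting": 0.64, "ocr": 0.82, "depth": 0.60},
    "vfm_c": {"counting": 0.50, "ocr": 0.35, "depth": 0.77},
}

# A document-reading application might weight OCR heavily (assumed weights).
document_app = select_vfm(scores, {"ocr": 0.7, "counting": 0.3})
# A depth-centric application might depend on depth alone.
depth_app = select_vfm(scores, {"depth": 1.0})
print(document_app, depth_app)
```

Different demand profiles select different models from the same score table, which is the practical payoff of reporting per-ability results rather than one aggregate number.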

Final Takeaway

AVA-Bench makes VFM choice diagnosable, actionable, and efficient by isolating what each model can truly do.