FindIt: A Format-Informed Visual Detection Benchmark

4 Task Families · 13 Datasets · 8 BBox Formats · 9 MLLMs Evaluated

Abstract

Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision–language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale.

In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks.

Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models’ ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.

Benchmark Design

FindIt is structured as a grid over three axes: task and data, bounding-box representation, and output format. Each model is reported at the combination of these axes that maximizes its average F1@0.5.
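To make the F1@0.5 metric concrete, the following is a minimal sketch of how it could be computed for a single image. It assumes boxes in `(x1, y1, x2, y2)` form and a greedy one-to-one matching between predictions and ground truth. The function names and the matching rule are illustrative assumptions, not the benchmark's published protocol.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """F1 in [0, 1]; a prediction is a true positive if it matches an
    unmatched ground-truth box with IoU >= thr (greedy, assumed rule)."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    denom = 2 * tp + fp + fn
    # Assumed convention: no predictions and no ground truth counts as perfect.
    return 2 * tp / denom if denom else 1.0
```

The page reports F1 on a 0 to 100 scale, so the value returned here would be multiplied by 100 for comparison.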

Task Families

Object Detection

Single- and multi-label detection using class names as queries.

Pascal VOC · OpenImages V7 · iGround

Referring Expression Detection

Localize objects described by free-form natural language.

RefCOCO/+/g · RefL4 · D3 · PhraseCut · Flickr30k Entities · SVG

Instance Detection

Localize a specific instance given a visual support image.

HR-InsDet (easy & hard) · RoboTools

Video Object Detection

Object and instance localization extended to multi-frame inputs.

iGround (2/8 frames) · RoboTools (2/8 frames)

Format Axes

We vary the bounding-box representation across seven types spanning corner-based, center-with-size, and four-corner formats, plus an unconstrained condition. We evaluate both plain text and JSON output modes, with variations covering single-label, multi-label, and multi-frame inputs as well as different JSON key choices.

Overview of the seven bounding-box representations and text/JSON output format variants.
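As an illustration of how the named representations relate to each other, the sketch below converts four numbers into a common corner format. The semantics assumed for each name (for example, that `xywh` starts at the top-left corner and that `yxhw` is its axis-swapped counterpart) are conventions chosen for this example. The four-corner and unconstrained conditions are not covered.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # canonical (x1, y1, x2, y2)


def to_xyxy(values: Sequence[float], fmt: str) -> Box:
    """Convert four numbers in the given representation to (x1, y1, x2, y2).

    Assumed semantics (not taken from the benchmark definition):
      xyxy   -> x1, y1, x2, y2
      yxyx   -> y1, x1, y2, x2
      xywh   -> top-left x, top-left y, width, height
      yxhw   -> top-left y, top-left x, height, width
      cxcywh -> center x, center y, width, height
    """
    a, b, c, d = values
    if fmt == "xyxy":
        return (a, b, c, d)
    if fmt == "yxyx":
        return (b, a, d, c)
    if fmt == "xywh":
        return (a, b, a + c, b + d)
    if fmt == "yxhw":
        return (b, a, b + d, a + c)
    if fmt == "cxcywh":
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unsupported bbox format: {fmt}")
```

Because every representation is four numbers, a syntactically valid output says nothing about which convention the model actually used, which is why the representation has to be fixed before scoring.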

Results

We report cross-task performance averaged over all tasks, using the best output-format configuration found per model. GLM-4.6V achieves the highest overall scores, driven primarily by its strong performance on instance detection tasks.

Cross-task F1@0.5 performance of nine MLLMs grouped by task family.
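Reporting each model at its best configuration amounts to an argmax over the format grid. A minimal sketch, assuming per-task F1 scores are stored per configuration in a dictionary, is given below. The data layout, the function name, and all numbers are placeholders for illustration and are not results from the benchmark.

```python
from statistics import mean
from typing import Dict, Tuple

Config = Tuple[str, str]  # assumed key: (bbox representation, output mode)


def best_configuration(scores: Dict[Config, Dict[str, float]]) -> Config:
    """Return the configuration with the highest F1 averaged over tasks."""
    return max(scores, key=lambda cfg: mean(scores[cfg].values()))


# Placeholder values only, not measurements from FindIt.
placeholder_scores = {
    ("xyxy", "json"): {"task_a": 40.0, "task_b": 60.0},
    ("xywh", "text"): {"task_a": 45.0, "task_b": 30.0},
}
print(best_configuration(placeholder_scores))
```

Averaging before taking the maximum means a single configuration is chosen per model, rather than a different one for each task.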

Highlights by Task

Object Detection

Best model: Qwen3-VL

Open-source models significantly outperform proprietary ones, especially on single-object detection.

Referring Expressions

Best model: GLM-4.6V

Models achieve even higher absolute performance here than on object detection.

Instance Detection

Best model: GLM-4.6V

This task relies less on language and more on visual capabilities, an area where most models seem to struggle.

Video Detection

Best model: GLM-4.6V

Models struggle with longer visual inputs, showing drastic performance drops from 2 to 8 frames.

Open-source vs. proprietary. Open-source models significantly outperform proprietary models on object detection, referring expressions, and instance detection. The gap is widest on instance detection, where no proprietary model approaches the top four open-source models. Video is the exception: GPT-5.4 outperforms the second-best open-source model, though this is driven by its stronger instance detection performance.

Format Sensitivity

A central finding of FindIt is that a model’s score depends as much on output format as on grounding ability. We examine two axes: bounding-box representation and structured output format (text vs. JSON).

Bounding-Box Representation

1. Models specialize in one preferred format. The preferred bbox format is xyxy for most models, yxyx for Gemma and Gemini 2.5 Flash, and xywh for GPT-5.4 and Sonnet 4.5 in JSON. Switching from the best to the second-best format collapses both mIoU and F1 on open-source models. The cxcywh, all, and all-labelled conditions fail for open-source models, with most F1 scores below 5.

2. Format instructions do not override internalized conventions. When prompted for cxcywh, every open-source model scores near-zero F1 if its output is parsed as cxcywh, but parsing the same outputs as the model's preferred corner format recovers 50–75 F1. This indicates that most models emit their preferred format regardless of the prompt instructions.

3. Format adherence does not imply localization quality. While models can produce parseable output in the requested syntax, the coordinates they emit may not conform to that format. As a result, many outputs reach a format adherence near 100% with a very low F1 score, showing that high format adherence usually does not translate into a good F1@0.5 score.

4. Only GPT-5.4 generalizes across formats. At 100% adherence, GPT-5.4's F1 stays within a 10-point band, between 32 and 42, across the four standard bbox formats (xyxy, xywh, yxyx, and yxhw). Its best result is still below Qwen3-VL and Gemini 2.5 Flash at their preferred formats.
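The re-parsing analysis behind the second finding can be sketched as follows: the same four numbers are scored under several candidate interpretations, and the interpretation with the highest mean IoU indicates which convention the model actually followed. This is a simplified illustration that assumes one predicted box per ground-truth box and assumed semantics for each format name. The helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # canonical (x1, y1, x2, y2)


def to_xyxy(v: Sequence[float], fmt: str) -> Box:
    """Interpret four numbers under an assumed convention for `fmt`."""
    a, b, c, d = v
    if fmt == "xyxy":
        return (a, b, c, d)
    if fmt == "yxyx":
        return (b, a, d, c)
    if fmt == "xywh":
        return (a, b, a + c, b + d)
    if fmt == "cxcywh":
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unsupported bbox format: {fmt}")


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rank_interpretations(
    raw: List[Sequence[float]],
    gts: List[Box],
    formats: Sequence[str] = ("xyxy", "yxyx", "xywh", "cxcywh"),
) -> List[Tuple[str, float]]:
    """Mean IoU of raw[i] against gts[i] under each candidate format,
    sorted from best to worst."""
    results = []
    for fmt in formats:
        ious = [iou(to_xyxy(r, fmt), g) for r, g in zip(raw, gts)]
        results.append((fmt, sum(ious) / len(ious) if ious else 0.0))
    return sorted(results, key=lambda t: t[1], reverse=True)
```

For example, if a model is prompted for cxcywh but emits corner coordinates, the `cxcywh` interpretation scores poorly while `xyxy` ranks first, mirroring the gap described above.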

Output Format: Text vs. JSON

5. JSON usually outperforms plain text. On object detection, JSON usually yields a higher F1 score than text. The only exception is GLM-4.6V, which scores higher on text; text is also the preferred output format when no format instructions are given.

6. Preferred JSON key varies across models. All models can handle different JSON keys, but the key that works best differs from model to model, sometimes strongly, as in the case of Gemini.
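A minimal sketch of how text and JSON outputs could be parsed, and how format adherence could be measured as the fraction of parseable outputs, is shown below. The bracketed text pattern, the JSON layout (a list of objects with a configurable box key), and the treatment of empty results as non-adherent are all assumptions made for this example, not the benchmark's actual parser.

```python
import json
import re
from typing import Callable, List, Optional

_NUM = r"-?\d+(?:\.\d+)?"
_BOX_RE = re.compile(
    rf"\[\s*({_NUM})\s*,\s*({_NUM})\s*,\s*({_NUM})\s*,\s*({_NUM})\s*\]"
)

Boxes = List[List[float]]


def parse_text_boxes(output: str) -> Optional[Boxes]:
    """Extract every bracketed group of four numbers from plain text."""
    boxes = [[float(v) for v in m.groups()] for m in _BOX_RE.finditer(output)]
    return boxes or None


def parse_json_boxes(output: str, key: str) -> Optional[Boxes]:
    """Parse a JSON object or list of objects holding a 4-number list
    under `key`; return None if anything deviates from that layout."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    items = data if isinstance(data, list) else [data]
    boxes = []
    for item in items:
        if not isinstance(item, dict) or key not in item:
            return None
        box = item[key]
        if not (isinstance(box, list) and len(box) == 4):
            return None
        if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in box):
            return None
        boxes.append([float(v) for v in box])
    return boxes or None


def adherence(outputs: List[str], parser: Callable[[str], Optional[Boxes]]) -> float:
    """Fraction of outputs the parser accepts (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(parser(o) is not None for o in outputs) / len(outputs)
```

Note that adherence measured this way only checks syntax. An output can pass the parser and still carry coordinates in a different convention than requested, which is the situation described in the third finding.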

BibTeX

@article{khandelwal2026findit,
  title   = {FindIt: A Format-Informed Visual Detection Benchmark
             for Generalist Multimodal {LLMs}},
  author  = {Khandelwal, Eshika and Pan, Jingjing and Zhang, Mingfang
             and Kong, Quan and Garattoni, Lorenzo and Kuehne, Hilde},
  journal = {arXiv},
  year    = {2026}
}
