FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

¹Tübingen AI Center, University of Tübingen · ²MIT-IBM Watson AI Lab · ³Woven by Toyota, Inc., Tokyo, Japan · ⁴Toyota Motor Europe, Brussels, Belgium
FindIt benchmark overview: four task families — object detection, referring expressions, instance detection, video detection.

FindIt is the first benchmark to systematically evaluate promptable bounding-box localization in generalist MLLMs. It covers four task families, nine models, and a grid of bounding-box representations and output formats. The results show that current models are highly sensitive to format choice: a model’s score can depend as much on output format as on its actual grounding ability.

4 Task Families · 13 Datasets · 8 BBox Formats · 9 MLLMs Evaluated

Abstract

Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision–language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale.

In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks.

Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models’ ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.

Benchmark Design

FindIt is structured as a grid over three axes: task and data, bounding-box representation, and output format. Each model is reported at the combination that maximizes its average F1@0.5.
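
The selection rule above can be sketched as follows. This is a minimal illustration, not the benchmark's code: the function name `select_best_configuration`, the dictionary layout, and all scores are hypothetical placeholders, and it assumes the average is an unweighted mean over per-task F1@0.5 values.

```python
from statistics import mean
from typing import Dict, Tuple

Config = Tuple[str, str]  # (bbox representation, output format)


def select_best_configuration(
    f1_by_config: Dict[Config, Dict[str, float]]
) -> Tuple[Config, float]:
    """Return the configuration with the highest mean F1@0.5 and that mean.

    Assumption: every task contributes equally to the average.
    """
    averages = {
        config: mean(per_task.values())
        for config, per_task in f1_by_config.items()
    }
    best = max(averages, key=averages.get)
    return best, averages[best]


# Placeholder numbers for illustration only; they are NOT results from FindIt.
example_scores = {
    ("xyxy", "json"): {"object": 60.0, "referring": 70.0},
    ("xyxy", "text"): {"object": 50.0, "referring": 66.0},
    ("cxcywh", "json"): {"object": 2.0, "referring": 4.0},
}

best_config, best_average = select_best_configuration(example_scores)
```

Reporting each model at its best cell keeps format preference from being mistaken for a lack of grounding ability, at the cost of hiding how steeply scores fall in the other cells.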

Task Families

Object Detection

Single- and multi-label detection using class names as queries.

Pascal VOC · OpenImages V7 · iGround

Referring Expression Detection

Localize objects described by free-form natural language.

RefCOCO/+/g · RefL4 · D3 · PhraseCut · Flickr30k Entities · SVG

Instance Detection

Localize a specific instance given a visual support image.

HR-InsDet (easy & hard) · RoboTools

Video Object Detection

Object and instance localization extended to multi-frame inputs.

iGround (2/8 frames) · RoboTools (2/8 frames)

Format Axes

We vary the bounding-box representation across seven types spanning corner-based, center-with-size, and four-corner formats, plus an unconstrained condition. We evaluate both plain text and JSON output modes, with variations covering single-label, multi-label, and multi-frame inputs as well as different JSON key choices.

Overview of the seven bounding-box representations and text/JSON output format variants.
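
To make the representations concrete, the sketch below converts four numbers from several of the named formats into a canonical xyxy box. The coordinate orderings are assumptions based on common usage (xywh and yxhw anchored at the top-left corner, cxcywh at the box center); the page does not spell them out, and the four-corner variants are omitted because their layout is not described here. The helper `to_xyxy` is hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # canonical (x1, y1, x2, y2)


def to_xyxy(values: Sequence[float], fmt: str) -> Box:
    """Interpret four numbers under `fmt` and return corner coordinates.

    Assumed conventions (not confirmed by the source):
      xyxy   -> x1, y1, x2, y2
      yxyx   -> y1, x1, y2, x2
      xywh   -> top-left x, top-left y, width, height
      yxhw   -> top-left y, top-left x, height, width
      cxcywh -> center x, center y, width, height
    """
    a, b, c, d = values
    if fmt == "xyxy":
        return (a, b, c, d)
    if fmt == "yxyx":
        return (b, a, d, c)
    if fmt == "xywh":
        return (a, b, a + c, b + d)
    if fmt == "yxhw":
        return (b, a, b + d, a + c)
    if fmt == "cxcywh":
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unsupported bbox format: {fmt}")
```

All five calls below describe the same box, which is why an evaluator has to know which convention a model actually used:

```python
to_xyxy((10, 20, 50, 80), "xyxy")
to_xyxy((20, 10, 80, 50), "yxyx")
to_xyxy((10, 20, 40, 60), "xywh")
to_xyxy((20, 10, 60, 40), "yxhw")
to_xyxy((30, 50, 40, 60), "cxcywh")
```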

Results

We report cross-task performance averaged over all tasks, using the best output-format configuration found per model. GLM-4.6V achieves the highest overall scores, driven primarily by its strong performance on instance detection tasks.

Cross-task F1@0.5 performance of nine MLLMs grouped by task family.
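
For readers unfamiliar with the metric, here is one plausible reading of F1@0.5. It is a sketch under stated assumptions: a prediction counts as a true positive if it can be matched one-to-one to a ground-truth box with IoU of at least 0.5, and matching is greedy in prediction order. The benchmark's exact matching protocol may differ, and the names `iou` and `f1_at_iou` are hypothetical. The function returns a value in [0, 1], whereas the scores quoted on this page are on a 0 to 100 scale.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], threshold: float = 0.5) -> float:
    """F1 with greedy one-to-one matching at the given IoU threshold."""
    remaining = list(gts)
    true_pos = 0
    for pred in preds:
        scored = [(iou(pred, gt), idx) for idx, gt in enumerate(remaining)]
        if not scored:
            break  # no ground truth left; further predictions are false positives
        best_iou, best_idx = max(scored)
        if best_iou >= threshold:
            true_pos += 1
            remaining.pop(best_idx)
    false_pos = len(preds) - true_pos
    false_neg = len(gts) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    # Convention chosen here: no predictions and no ground truth counts as perfect.
    return 2 * true_pos / denom if denom else 1.0
```

With one correct box, one spurious box, and one missed object, precision and recall are both 0.5, so `f1_at_iou([(0, 0, 10, 10), (20, 20, 30, 30)], [(0, 0, 10, 10), (50, 50, 60, 60)])` evaluates to 0.5.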

Highlights by Task

Object Detection

Best model: Qwen3-VL

Open-source models significantly outperform proprietary models, especially on single-object detection.

Referring Expressions

Best model: GLM-4.6V

Models achieve even higher absolute performance here than on object detection.

Instance Detection

Best model: GLM-4.6V

This task relies less on language and more on visual capabilities, an area where most models seem to struggle.

Video Detection

Best model: GLM-4.6V

Models struggle with longer visual inputs, showing drastic performance drops from 2 to 8 frames.

Open-source vs. Proprietary

Open-source models significantly outperform proprietary models on object detection, referring expressions, and instance detection. The gap is widest on instance detection, where no proprietary model approaches the top four open-source models. Video is the exception: GPT-5.4 outperforms the second-best open-source model, though this is driven by its stronger instance detection performance.

Format Sensitivity

A central finding of FindIt is that a model’s score depends as much on output format as on grounding ability. We examine two axes: bounding-box representation and structured output format (text vs. JSON).

Bounding-Box Representation

  1. Models specialize in one preferred format. Most models prefer xyxy; Gemma and Gemini 2.5 Flash prefer yxyx, and GPT-5.4 and Sonnet 4.5 prefer xywh in JSON. For open-source models, switching from the best to the second-best format collapses both mIoU and F1. The cxcywh, all, and all-labelled formats fail for open-source models, with most F1 scores below 5.
  2. Format instructions do not override internalized conventions. When prompted for cxcywh, every open-source model scores near-zero F1 if its outputs are parsed as cxcywh, but parsing the same outputs in the model’s preferred corner format recovers 50–75 F1. This indicates that most models emit their preferred format regardless of the prompt instructions.
  3. Format adherence does not imply localization quality. Models can produce parseable output in the requested syntax while filling it with coordinates that do not conform to that format. As a result, many outputs reach format adherence near 100 % with very low F1, so high format adherence usually does not correlate with a high F1@0.5 score.
  4. Only GPT-5.4 generalizes across formats. At 100 % adherence, GPT-5.4’s F1 stays within a 10-point band, between 32 and 42, across the four standard bbox formats (xyxy, xywh, yxyx, and yxhw). Its best result is still below those of Qwen3-VL and Gemini 2.5 Flash at their preferred formats.
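
The reinterpretation effect described in the findings above can be reproduced with a toy example. The sketch assumes cxcywh means center x, center y, width, height, and uses made-up coordinates; the helpers `to_xyxy`, `iou`, and `iou_under` are hypothetical and not taken from the benchmark.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def to_xyxy(values: Sequence[float], fmt: str) -> Box:
    """Interpret four numbers as xyxy or (assumed) center-based cxcywh."""
    a, b, c, d = values
    if fmt == "xyxy":
        return (a, b, c, d)
    if fmt == "cxcywh":
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unsupported bbox format: {fmt}")


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_under(values: Sequence[float], gt_xyxy: Box, fmt: str) -> float:
    """Score the same raw numbers under a chosen interpretation."""
    return iou(to_xyxy(values, fmt), gt_xyxy)


# Made-up example: the prompt asked for cxcywh, but the model emitted corners.
ground_truth = (100.0, 100.0, 300.0, 200.0)
raw_output = (100.0, 100.0, 300.0, 200.0)

as_prompted = iou_under(raw_output, ground_truth, "cxcywh")  # below the 0.5 threshold
as_preferred = iou_under(raw_output, ground_truth, "xyxy")   # exact match
```

The output consists of four well-formed numbers, so it passes a syntactic adherence check, yet it only counts as a hit at IoU 0.5 when read in the format the model actually used.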

Output Format: Text vs. JSON

  5. JSON usually outperforms plain text. On object detection, JSON usually yields a higher F1 score than text; GLM-4.6V is the only model that scores higher with text. Text is also the preferred output format when no format instructions are given.
  6. Preferred JSON key varies across models. All models can handle different JSON keys, but the preferred key varies from model to model, sometimes with large variations, as in the case of Gemini.
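
A parser that tolerates both output modes and several key names might look like the sketch below. The key names in `CANDIDATE_KEYS` are hypothetical examples, since the page does not list the keys that were tested, and the plain-text fallback assumes boxes appear as bracketed groups of four numbers. The function `parse_boxes` is an illustration, not the benchmark's parser.

```python
import json
import re
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]

# Hypothetical key names; the keys actually evaluated are not listed in the source.
CANDIDATE_KEYS = ("bbox", "box", "bbox_2d")

_NUM = r"-?\d+(?:\.\d+)?"
_TEXT_BOX = re.compile(
    rf"[\[(]\s*({_NUM})\s*,\s*({_NUM})\s*,\s*({_NUM})\s*,\s*({_NUM})\s*[\])]"
)


def parse_boxes(raw: str, keys: Sequence[str] = CANDIDATE_KEYS) -> List[Box]:
    """Extract four-number boxes from a JSON or plain-text model response."""
    boxes: List[Box] = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None

    # JSON mode: accept a single object or a list of objects.
    items = data if isinstance(data, list) else [data]
    for item in items:
        if not isinstance(item, dict):
            continue
        for key in keys:
            value = item.get(key)
            if (
                isinstance(value, list)
                and len(value) == 4
                and all(isinstance(v, (int, float)) for v in value)
            ):
                boxes.append(tuple(float(v) for v in value))
                break  # first matching key wins for this object

    # Text mode fallback: bracketed groups of four numbers.
    if not boxes:
        for match in _TEXT_BOX.finditer(raw):
            boxes.append(tuple(float(g) for g in match.groups()))
    return boxes
```

For example, `parse_boxes('{"bbox": [10, 20, 50, 80]}')` and `parse_boxes('cat: [10, 20, 50, 80]')` both return `[(10.0, 20.0, 50.0, 80.0)]`. A lenient parser like this separates parsing failures from localization failures, but it cannot tell which coordinate convention the numbers follow.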

BibTeX

@article{khandelwal2026findit,
  title   = {FindIt: A Format-Informed Visual Detection Benchmark
             for Generalist Multimodal {LLMs}},
  author  = {Khandelwal, Eshika and Pan, Jingjing and Zhang, Mingfang
             and Kong, Quan and Garattoni, Lorenzo and Kuehne, Hilde},
  journal = {arXiv},
  year    = {2026}
}