Traditional Composed Image Retrieval (CIR) enables flexible multimodal search but inherently prioritizes broad semantic matching, often failing to retrieve a user-specified instance across contexts. To bridge this gap, we introduce Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency.
To advance research on this challenging task, we construct OACIRR, the first large-scale, multi-domain benchmark comprising over 160K quadruples. We further propose AdaFocal, an efficient framework featuring a Context-Aware Attention Modulator (CAAM) that dynamically intensifies attention on the anchored instance region.
The paradigm of image retrieval has progressively evolved toward more flexible and user-oriented forms of interaction. While traditional single-modal methods often struggle to express complex user intentions, Composed Image Retrieval (CIR) has emerged as a powerful paradigm to address this limitation. By combining a reference image with modification text, CIR leverages the synergy between visual and textual modalities to retrieve semantically aligned target images.
Despite its flexibility, the fundamental design of CIR prioritizes semantic matching over instance-level fidelity. The reference image in a conventional CIR query serves only as a coarse-grained visual anchor, which makes retrieval of a specific instance unreliable, particularly in the presence of visually similar distractors. In many practical applications, preserving the exact instance matters more than achieving broad semantic alignment.
In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained image retrieval task that mandates strict instance-level consistency. OACIR extends the conventional compositional query by incorporating an anchored instance. The objective is to retrieve a target image that semantically satisfies the textual modification while strictly preserving the identical anchored instance.
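As a rough illustration of what such a query might look like in code, the sketch below models one sample as a small record. The field names, the file names, and the `(x_min, y_min, x_max, y_max)` pixel convention for the box are all assumptions made for illustration; the text only states that the instance is anchored in the reference image and that the target must contain the same instance.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical record layout: field names and box convention are assumptions,
# not taken from the released dataset.
@dataclass(frozen=True)
class OACIRQuery:
    reference_image: str                            # path or ID of the reference image
    anchor_box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), assumed pixels
    modification_text: str                          # textual description of the desired change
    target_image: str                               # target showing the same anchored instance

    def __post_init__(self) -> None:
        # Reject degenerate boxes so the anchored region is always non-empty.
        x_min, y_min, x_max, y_max = self.anchor_box
        if not (x_min < x_max and y_min < y_max):
            raise ValueError("anchor_box must satisfy x_min < x_max and y_min < y_max")


# Example query with made-up file names and coordinates.
query = OACIRQuery(
    reference_image="ref_0001.jpg",
    anchor_box=(40.0, 60.0, 220.0, 310.0),
    modification_text="the same car parked on a beach at sunset",
    target_image="tgt_0001.jpg",
)
```

Keeping the box inside the query record, rather than cropping the image up front, leaves the global scene available to the model alongside the anchored region.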
OACIRR (Object-Anchored Composed Image Retrieval on Real-world images) is the first large-scale, multi-domain benchmark tailored for the OACIR task.
Unlike traditional Composed Image Retrieval (CIR), which inherently prioritizes broad semantic matching, OACIRR mandates strict instance-level fidelity. By anchoring a specific object via a bounding box in the reference image, it requires models to retrieve a target image that semantically satisfies the textual modification while strictly preserving the identical anchored instance.
OACIRR comprises a unified training set of 127K quadruples covering 2,647 instances, along with an extensive evaluation benchmark containing 33.4K queries across 1,238 instances from four diverse domains: Fashion, Car, Product, and Landmark. The benchmark is enriched with over 26.6K curated distractor instances to form challenging galleries.
Collectively, OACIRR encompasses 160K+ quadruples, providing both a high-quality foundational dataset and a rigorous, comprehensive benchmark for the OACIR task.
To address the core challenges of the OACIR task, we propose AdaFocal, an effective framework that dynamically modulates visual attention for precise, instance-level retrieval. Our approach augments a multimodal fusion backbone with a lightweight Context-Aware Attention Modulator (CAAM), enabling a nuanced balance between instance fidelity and compositional reasoning.
Specifically, AdaFocal employs a two-stage reasoning process: Contextual Perception and Adaptive Focus. It first perceives the query's compositional context to predict a modulation scalar (β). This learned signal then drives an Attention Activation Mechanism, which explicitly and adaptively intensifies the visual focus on the user-specified instance region (provided via bounding box) during multimodal feature fusion.
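The exact form of the Attention Activation Mechanism is not given here, so the following is only a minimal sketch of one plausible reading: the bounding box is converted into a mask over image patches, and β is added to the attention logits of the in-box patches before the softmax. The additive-bias form, the patch-centre rule, and the function names `box_to_patch_mask` and `modulated_attention` are assumptions; β, which AdaFocal predicts from the query context, is passed in as a plain number.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def box_to_patch_mask(
    box: Tuple[float, float, float, float],
    image_size: Tuple[float, float],
    grid: int,
) -> List[float]:
    """Return 1.0 for patches whose centre lies inside the box, else 0.0 (row-major)."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    mask = []
    for row in range(grid):
        for col in range(grid):
            cx = (col + 0.5) * width / grid
            cy = (row + 0.5) * height / grid
            inside = x_min <= cx <= x_max and y_min <= cy <= y_max
            mask.append(1.0 if inside else 0.0)
    return mask


def modulated_attention(
    logits: Sequence[float], mask: Sequence[float], beta: float
) -> List[float]:
    """Assumed modulation: add beta to in-box logits, then renormalise."""
    boosted = [logit + beta * m for logit, m in zip(logits, mask)]
    return softmax(boosted)


# Toy case: a 2x2 patch grid over a 100x100 image, box covering the top-left patch.
mask = box_to_patch_mask((0, 0, 50, 50), (100, 100), grid=2)
uniform_logits = [0.0, 0.0, 0.0, 0.0]
plain = modulated_attention(uniform_logits, mask, beta=0.0)
focused = modulated_attention(uniform_logits, mask, beta=math.log(3))
```

With β = 0 the attention is unchanged, and larger β shifts more probability mass onto the anchored region. In this toy case the in-box patch goes from a quarter of the attention to half of it.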
By dynamically re-weighting the attention distribution, AdaFocal seamlessly synthesizes the anchored instance, the global visual scene, and the textual modification into a coherent representation, establishing a robust and flexible baseline for identity-preserving retrieval.
Our evaluation shows that the OACIR task is difficult for existing models. Current Universal Multimodal Retrieval (UMR) and Composed Image Retrieval (CIR) models struggle with instance-level fidelity: the strongest UMR model, GME (7B), averages 62.53 across all metrics, and SPRC trained on CIRR averages 37.30. Our proposed AdaFocal reaches an average of 79.00, establishing a robust and effective baseline.
| Domain | Method | Pretraining Data | Fashion RID@1 | Fashion R@1 | Fashion R@5 | Car RID@1 | Car R@1 | Car R@5 | Product RID@1 | Product R@1 | Product R@5 | Landmark RID@1 | Landmark R@1 | Landmark R@5 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UMR | UniIR-CLIPSF | M-BEIR | 17.33 | 12.26 | 24.76 | 32.67 | 16.95 | 41.89 | 33.71 | 18.22 | 40.10 | 29.47 | 15.51 | 43.24 | 27.18 |
| UMR | UniIR-BLIPFF | M-BEIR | 28.53 | 22.41 | 39.63 | 37.21 | 19.97 | 46.51 | 37.76 | 20.98 | 43.19 | 31.71 | 17.14 | 52.12 | 33.10 |
| UMR | LamRA-Ret | M-BEIR+NLI | 27.45 | 21.63 | 37.10 | 61.03 | 35.44 | 74.51 | 69.45 | 39.53 | 70.25 | 58.64 | 32.58 | 68.74 | 49.70 |
| UMR | MM-Embed | M-BEIR+MTEB | 41.38 | 34.55 | 52.50 | 53.21 | 30.06 | 62.80 | 71.03 | 41.47 | 71.15 | 78.85 | 38.88 | 79.32 | 54.60 |
| UMR | GME (2B) | UMRB | 38.13 | 32.14 | 51.50 | 58.84 | 31.60 | 66.03 | 76.89 | 44.11 | 74.20 | 73.86 | 38.99 | 75.61 | 55.16 |
| UMR | GME (7B) | UMRB | 44.98 | 39.24 | 60.18 | 63.11 | 38.34 | 75.38 | 83.44 | 54.60 | 84.15 | 77.11 | 47.09 | 82.69 | 62.53 |
| UMR | Qwen3-VL-Embedding (2B) | - | 47.95 | 37.74 | 55.16 | 65.98 | 44.76 | 80.76 | 80.20 | 29.01 | 57.94 | 66.71 | 32.11 | 66.51 | 55.40 |
| UMR | Qwen3-VL-Embedding (8B) | - | 56.21 | 46.48 | 64.39 | 75.77 | 46.63 | 81.62 | 81.60 | 35.69 | 66.25 | 70.01 | 44.18 | 76.76 | 62.13 |
| CIR | SPRC (ViT-G) | CIRR | 28.62 | 25.79 | 44.48 | 25.13 | 15.92 | 37.06 | 54.39 | 34.85 | 62.31 | 40.41 | 26.29 | 52.39 | 37.30 |
| CIR | SPRC (ViT-G) | OACIRR (Ours) | 65.25 | 58.51 | 80.89 | 72.87 | 49.82 | 89.57 | 86.05 | 70.61 | 93.68 | 76.32 | 56.04 | 89.00 | 74.05 |
| OACIR | Baseline (ViT-G) | OACIRR (Ours) | 69.07 | 58.76 | 81.44 | 74.59 | 49.78 | 89.46 | 87.48 | 69.53 | 93.66 | 79.80 | 55.49 | 89.87 | 74.91 |
| OACIR | AdaFocal (ViT-G) | OACIRR (Ours) | 77.15 | 65.31 | 86.88 | 78.42 | 53.63 | 92.22 | 91.86 | 74.11 | 95.39 | 82.92 | 58.47 | 91.63 | 79.00 |
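To make the table easier to read, here is a small sketch of the standard Recall@K hit test for one query, together with a check that the Avg. column matches the plain mean of the twelve per-domain numbers in a row. The helper names `recall_at_k` and `mean_metric` are illustrative, and RID@1 is not implemented because its definition is not given in this section.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[str], target_id: str, k: int) -> float:
    """Return 1.0 if the target appears among the top-k retrieved IDs, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_metric(values: Sequence[float]) -> float:
    """Arithmetic mean rounded to two decimals, as in the Avg. column."""
    return round(sum(values) / len(values), 2)


# Rows copied from the table: RID@1, R@1, R@5 for Fashion, Car, Product, Landmark.
uniir_clip_sf = [17.33, 12.26, 24.76, 32.67, 16.95, 41.89,
                 33.71, 18.22, 40.10, 29.47, 15.51, 43.24]
adafocal = [77.15, 65.31, 86.88, 78.42, 53.63, 92.22,
            91.86, 74.11, 95.39, 82.92, 58.47, 91.63]

print(mean_metric(uniir_clip_sf))  # 27.18, matching the reported Avg.
print(mean_metric(adafocal))       # 79.0, matching the reported 79.00

# Toy ranking: the target is ranked second, so it misses at K=1 and hits at K=5.
ranking = ["img_7", "img_3", "img_9"]
print(recall_at_k(ranking, "img_3", k=1), recall_at_k(ranking, "img_3", k=5))
```

Averaging the per-query hit values over a query set and multiplying by 100 gives percentage scores on the same scale as the table.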
Visual comparisons demonstrating AdaFocal's superior ability to maintain instance-level fidelity while accurately reflecting complex contextual modifications.
@inproceedings{yang2026beyond,
title={Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval},
author={Yang, Yuxin and Zhou, Yinan and Chen, Yuxin and Zhang, Ziqi and Ma, Zongyang and Yuan, Chunfeng and Li, Bing and Gao, Jun and Hu, Weiming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026},
note={arXiv preprint arXiv:2604.05393}
}