OACIR: Object-Anchored Composed Image Retrieval

What Did We Do?

Traditional Composed Image Retrieval (CIR) enables flexible multimodal search but inherently prioritizes broad semantic matching, often failing to retrieve a user-specified instance across contexts. To bridge this gap, we introduce Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency.

To advance research on this challenging task, we construct OACIRR, the first large-scale, multi-domain benchmark comprising over 160K quadruples. We further propose AdaFocal, an efficient framework featuring a Context-Aware Attention Modulator (CAAM) that dynamically intensifies attention on the anchored instance region.

OACIR Task

Overview of the Object-Anchored Composed Image Retrieval (OACIR) task and our OACIRR dataset.

Image retrieval has progressively evolved toward more flexible, user-oriented forms of interaction. Traditional single-modal methods often struggle to express complex user intentions, and Composed Image Retrieval (CIR) has emerged to address this limitation. By combining a reference image with modification text, CIR leverages the synergy between visual and textual modalities to retrieve semantically aligned target images.

Despite its flexibility, CIR by design prioritizes semantic matching over instance-level fidelity. The reference image in a conventional CIR query often serves as a coarse-grained visual anchor, which makes retrieving a specific instance unreliable, particularly in the presence of visually similar distractors. In many practical applications, instance fidelity is more critical than broad semantic alignment.

In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained image retrieval task that mandates strict instance-level consistency. OACIR extends the conventional compositional query with an anchored instance, specified by a bounding box in the reference image. The objective is to retrieve a target image that semantically satisfies the textual modification while preserving the same anchored instance.

OACIRR Dataset

OACIRR (Object-Anchored Composed Image Retrieval on Real-world images) is the first large-scale, multi-domain benchmark tailored for the OACIR task.

Unlike traditional Composed Image Retrieval (CIR), which inherently prioritizes broad semantic matching, OACIRR mandates strict instance-level fidelity. By anchoring a specific object via a bounding box in the reference image, it requires models to retrieve a target image that semantically satisfies the textual modification while strictly preserving the identical anchored instance.
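As a rough illustration of what one OACIRR quadruple might look like in code, the sketch below assumes the four elements are the reference image, the anchoring bounding box, the modification text, and the target image. The class name, field names, file names, and the (x_min, y_min, x_max, y_max) pixel layout of the box are all assumptions made for this example, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OACIRQuadruple:
    """One hypothetical OACIR sample; every field name here is an assumption."""
    reference_image: str                            # path or ID of the reference image
    anchor_box: Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max) in pixels
    modification_text: str                          # textual description of the desired change
    target_image: str                               # path or ID of the ground-truth target image

    def __post_init__(self) -> None:
        # Reject degenerate boxes so the anchored region always has positive area.
        x_min, y_min, x_max, y_max = self.anchor_box
        if not (x_min < x_max and y_min < y_max):
            raise ValueError("anchor_box must satisfy x_min < x_max and y_min < y_max")


# Illustrative values only; these are not taken from the dataset.
sample = OACIRQuadruple(
    reference_image="ref_0001.jpg",
    anchor_box=(40.0, 60.0, 220.0, 310.0),
    modification_text="the same car parked on a snowy street",
    target_image="tgt_0001.jpg",
)
```

Keeping the box alongside the image and text in a single record makes it explicit that the anchor is part of the query rather than an optional annotation.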

OACIRR comprises a unified training set of 127K quadruples covering 2,647 instances, along with an extensive evaluation benchmark containing 33.4K queries across 1,238 instances from four diverse domains: Fashion, Car, Product, and Landmark. The benchmark is enriched with over 26.6K curated distractor instances to form challenging galleries.

Collectively, OACIRR encompasses 160K+ quadruples, providing both a high-quality foundational dataset and a rigorous, comprehensive benchmark for the OACIR task.

Dataset construction pipeline, instance statistics, and instance examples.

AdaFocal Framework

Overall architecture of our proposed AdaFocal framework.

To address the core challenges of the OACIR task, we propose AdaFocal, an effective framework that dynamically modulates visual attention for precise, instance-level retrieval. Our approach augments a multimodal fusion backbone with a lightweight Context-Aware Attention Modulator (CAAM), enabling a nuanced balance between instance fidelity and compositional reasoning.

Specifically, AdaFocal employs a two-stage reasoning process: Contextual Perception and Adaptive Focus. It first perceives the query's compositional context to predict a modulation scalar (β). This learned signal then drives an Attention Activation Mechanism, which explicitly and adaptively intensifies the visual focus on the user-specified instance region (provided via bounding box) during multimodal feature fusion.
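The idea of intensifying attention on the anchored region can be sketched in a few lines. This section does not give the exact form of the Attention Activation Mechanism, so the sketch assumes one simple possibility: the scalar β is added to the pre-softmax attention logits of the visual patches that overlap the bounding box. The function names, the patch-grid layout, and the additive form are assumptions for illustration; in AdaFocal β is predicted from the query, whereas here it is passed in as a plain number.

```python
import math
from typing import List, Sequence, Tuple


def box_to_patch_mask(box: Tuple[float, float, float, float],
                      image_size: Tuple[int, int],
                      grid: Tuple[int, int]) -> List[bool]:
    """Flag patches overlapping the box.

    box is (x_min, y_min, x_max, y_max) in pixels, image_size is (width, height),
    grid is (cols, rows). Patches are listed row by row.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cols, rows = grid
    cell_w, cell_h = width / cols, height / rows
    mask = []
    for r in range(rows):
        for c in range(cols):
            overlaps_x = c * cell_w < x_max and (c + 1) * cell_w > x_min
            overlaps_y = r * cell_h < y_max and (r + 1) * cell_h > y_min
            mask.append(overlaps_x and overlaps_y)
    return mask


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(logits: Sequence[float],
                        mask: Sequence[bool],
                        beta: float) -> List[float]:
    """Assumed mechanism: add beta to the logits of anchored patches, then renormalize."""
    biased = [v + beta if inside else v for v, inside in zip(logits, mask)]
    return softmax(biased)


# Toy example: a 2x2 patch grid on a 100x100 image, box covering the top-left patch.
mask = box_to_patch_mask((0, 0, 50, 50), image_size=(100, 100), grid=(2, 2))
uniform = modulated_attention([0.0, 0.0, 0.0, 0.0], mask, beta=0.0)
focused = modulated_attention([0.0, 0.0, 0.0, 0.0], mask, beta=math.log(3))
```

With β = 0 the attention distribution is unchanged, and a larger β shifts more weight onto the anchored patches while the weights still sum to one. That is one way a single scalar could trade instance fidelity against attention to the wider scene.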

By dynamically re-weighting the attention distribution, AdaFocal seamlessly synthesizes the anchored instance, the global visual scene, and the textual modification into a coherent representation, establishing a robust and flexible baseline for identity-preserving retrieval.

Benchmark Results

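Our extensive evaluation demonstrates that the OACIR task poses a substantial challenge to existing models. Current Universal Multimodal Retrieval (UMR) and Composed Image Retrieval (CIR) models struggle with instance-level fidelity: the strongest UMR model, GME (7B), averages 62.53, and SPRC trained on CIRR averages 37.30. Our proposed AdaFocal reaches an average of 79.00, up from 74.91 for the baseline trained on the same OACIRR data, establishing a robust and effective baseline.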

Domain	Method	Pretraining Data	Fashion R_ID@1	Fashion R@1	Fashion R@5	Car R_ID@1	Car R@1	Car R@5	Product R_ID@1	Product R@1	Product R@5	Landmark R_ID@1	Landmark R@1	Landmark R@5	Avg.
UMR	UniIR-CLIP_SF	M-BEIR	17.33	12.26	24.76	32.67	16.95	41.89	33.71	18.22	40.10	29.47	15.51	43.24	27.18
UMR	UniIR-BLIP_FF	M-BEIR	28.53	22.41	39.63	37.21	19.97	46.51	37.76	20.98	43.19	31.71	17.14	52.12	33.10
UMR	LamRA-Ret	M-BEIR+NLI	27.45	21.63	37.10	61.03	35.44	74.51	69.45	39.53	70.25	58.64	32.58	68.74	49.70
UMR	MM-Embed	M-BEIR+MTEB	41.38	34.55	52.50	53.21	30.06	62.80	71.03	41.47	71.15	78.85	38.88	79.32	54.60
UMR	GME (2B)	UMRB	38.13	32.14	51.50	58.84	31.60	66.03	76.89	44.11	74.20	73.86	38.99	75.61	55.16
UMR	GME (7B)	UMRB	44.98	39.24	60.18	63.11	38.34	75.38	83.44	54.60	84.15	77.11	47.09	82.69	62.53
UMR	Qwen3-VL-Embedding (2B)	-	47.95	37.74	55.16	65.98	44.76	80.76	80.20	29.01	57.94	66.71	32.11	66.51	55.40
UMR	Qwen3-VL-Embedding (8B)	-	56.21	46.48	64.39	75.77	46.63	81.62	81.60	35.69	66.25	70.01	44.18	76.76	62.13
CIR	SPRC (ViT-G)	CIRR	28.62	25.79	44.48	25.13	15.92	37.06	54.39	34.85	62.31	40.41	26.29	52.39	37.30
CIR	SPRC (ViT-G)	OACIRR (Ours)	65.25	58.51	80.89	72.87	49.82	89.57	86.05	70.61	93.68	76.32	56.04	89.00	74.05
OACIR	Baseline (ViT-G)	OACIRR (Ours)	69.07	58.76	81.44	74.59	49.78	89.46	87.48	69.53	93.66	79.80	55.49	89.87	74.91
OACIR	AdaFocal (ViT-G)	OACIRR (Ours)	77.15	65.31	86.88	78.42	53.63	92.22	91.86	74.11	95.39	82.92	58.47	91.63	79.00
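For readers who want to reproduce numbers of this kind, the following sketch shows how the table's columns could be computed. R@K is taken to be standard recall at rank K. R_ID@1 is not defined in this section, so the sketch assumes it counts a hit when the top-ranked image shows the anchored instance, whether or not the modification is satisfied. The Avg. column is consistent with the arithmetic mean of the twelve per-domain metrics in each row. The function name, gallery IDs, and rankings are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return 1.0 if any of the top-k retrieved IDs is relevant, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


# Hypothetical ranking for one query over a toy gallery.
ranked = ["g7", "g2", "g9"]
target = {"g2"}                 # the single ground-truth target image
same_instance = {"g7", "g2"}    # assumed: all gallery images showing the anchored instance

r_at_1 = recall_at_k(ranked, target, k=1)            # target is ranked second, so a miss
r_at_5 = recall_at_k(ranked, target, k=5)            # target is within the top 5
r_id_at_1 = recall_at_k(ranked, same_instance, k=1)  # top result shows the right instance

# Avg. check on the AdaFocal (ViT-G) row: mean of the twelve metrics.
adafocal_row = [77.15, 65.31, 86.88, 78.42, 53.63, 92.22,
                91.86, 74.11, 95.39, 82.92, 58.47, 91.63]
row_avg = round(sum(adafocal_row) / len(adafocal_row), 2)
```

Here `row_avg` comes out to 79.0, matching the reported 79.00. In a full evaluation the per-query values would be averaged over all queries in a domain and reported as percentages.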

Qualitative Performance

Visual comparisons demonstrating AdaFocal's superior ability to maintain instance-level fidelity while accurately reflecting complex contextual modifications.

Citation

@inproceedings{yang2026beyond,
    title={Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval},
    author={Yang, Yuxin and Zhou, Yinan and Chen, Yuxin and Zhang, Ziqi and Ma, Zongyang and Yuan, Chunfeng and Li, Bing and Gao, Jun and Hu, Weiming},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={31155--31165},
    year={2026}
}