TL;DR: We introduce SpatialMosaic, a large-scale multi-view benchmark across indoor and outdoor scenes. It is designed to evaluate spatial reasoning under partial visibility, occlusion, and low-overlap conditions from fragmented visual observations.
The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling MLLMs to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored.
To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under complex and diverse scenarios, consisting of 1M QA pairs across 6 tasks. Our proposed dataset spans both indoor and outdoor scenes, enabling comprehensive evaluation in diverse real-world scenarios.
In addition, we introduce a new baseline for multi-view settings, SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and challenging QAs.
The recent progress of MLLMs has raised the possibility of endowing them with human-level 3D spatial understanding. However, existing benchmarks largely rely on fully visible scenes or sequential inputs, failing to reflect realistic conditions where observations are sparse and incomplete. In real-world multi-view settings, models must reason from fragmented visual cues across viewpoints, where objects may be partially visible, occluded, or observed under minimal overlap. While humans can integrate such incomplete observations to form a coherent 3D understanding, current MLLMs often struggle under these conditions.
To address this gap, we define three under-explored spatial reasoning constraints that frequently arise in multi-view settings: (1) partial visibility, where an object appears only partially in any single view; (2) occlusion, where an object is blocked by other scene elements; and (3) low overlap, where viewpoints share minimal visual content and cues must be integrated across views.
Built on a scalable data generation pipeline, we construct SpatialMosaic, a comprehensive multi-view instruction-tuning dataset containing 2M QA pairs that capture challenging, frequently occurring real-world scenarios.
In addition, we introduce SpatialMosaic-Bench, a large-scale benchmark consisting of 1M QA pairs across 6 tasks, designed to evaluate spatial reasoning under realistic and challenging multi-view scenarios. Unlike prior multi-view spatial datasets, which focus exclusively on either indoor or outdoor layouts, our dataset spans both domains, enabling more comprehensive training and evaluation across diverse real-world scenes.
SpatialMosaic is constructed using a scalable multi-view data generation pipeline designed to capture realistic spatial reasoning scenarios under partial visibility, occlusion, and low-overlap conditions. Given multi-view images and 3D point clouds, we first compute occlusion-aware spatial annotations and sample sparse viewpoints to encourage reasoning from fragmented observations. We then filter object instances based on visibility constraints and derive 3D spatial relations using geometric cues. Finally, task-specific templates are used to generate diverse and geometrically grounded QA pairs, resulting in 2M training QA pairs and an additional 1M evaluation QA pairs in SpatialMosaic-Bench, spanning both indoor and outdoor environments.
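The final, template-driven QA stage can be illustrated with a minimal sketch. Everything below is a hypothetical simplification for one relative-direction task: the template string, the `max_occ` visibility threshold, and the centroid-based left/right rule are our own placeholders, not the paper's actual templates or geometric derivation.

```python
def relation_from_centroids(c_a, c_b):
    """Derive a coarse left/right relation from 3D centroids in camera
    coordinates (x axis pointing right). A stand-in for the paper's
    geometric-cue step, shown only for illustration."""
    return "left" if c_a[0] < c_b[0] else "right"

# Hypothetical task template; the real pipeline uses task-specific templates.
TEMPLATE = "From these views, is the {a} to the left or right of the {b}?"

def make_qa(name_a, c_a, occ_a, name_b, c_b, occ_b, max_occ=0.7):
    """Generate one QA pair, applying a visibility filter first.

    occ_a / occ_b are occlusion ratios in [0, 1]; pairs involving a
    heavily occluded instance are dropped, mirroring the filtering step
    described above (threshold value assumed)."""
    if occ_a > max_occ or occ_b > max_occ:
        return None  # instance fails the visibility constraint
    question = TEMPLATE.format(a=name_a, b=name_b)
    relation = relation_from_centroids(c_a, c_b)
    answer = f"The {name_a} is to the {relation} of the {name_b}."
    return {"question": question, "answer": answer}
```

In the actual pipeline this step would run over all filtered instance pairs and all six task templates, but the filter-then-template structure is the same.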
To quantify occlusion under realistic multi-view conditions, we introduce an occlusion ratio that captures both inter-object obstruction and field-of-view truncation. Specifically, we render per-instance and full-scene depth maps from multi-view images and compare depth values to determine whether each point is visible or occluded. Based on this, we compute the object occlusion ratio, which measures the proportion of points occluded by other objects, and the field-of-view occlusion ratio, which captures truncation caused by image boundaries. These complementary measures provide a unified representation of occlusion, enabling fine-grained control over visibility conditions during data generation and supporting the construction of spatial reasoning tasks under challenging, partially observable scenarios.
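The two measures can be sketched in NumPy as follows. This is a hedged illustration, not the paper's implementation: the function name, the `eps` depth tolerance, and the input conventions (a per-instance depth map rendered with `inf` where the object does not project, plus the object's projected pixel coordinates) are all our assumptions.

```python
import numpy as np

def occlusion_ratios(inst_depth, scene_depth, points_uv, img_w, img_h, eps=1e-3):
    """Compute (object occlusion ratio, field-of-view occlusion ratio).

    inst_depth:  (H, W) depth map rendered from the single instance,
                 np.inf where the instance does not project (assumed).
    scene_depth: (H, W) depth map rendered from the full scene.
    points_uv:   (N, 2) pixel coordinates of the instance's 3D points
                 projected into this view; may fall outside the image.
    """
    # Pixels where the instance projects at all.
    inst_mask = np.isfinite(inst_depth)

    # A projected pixel is occluded when other geometry lies in front:
    # the full-scene depth is strictly smaller than the instance depth.
    occluded = inst_mask & (scene_depth < inst_depth - eps)
    obj_occlusion = occluded.sum() / max(inst_mask.sum(), 1)

    # Field-of-view occlusion: fraction of projected points truncated
    # by the image boundaries.
    u, v = points_uv[:, 0], points_uv[:, 1]
    outside = (u < 0) | (u >= img_w) | (v < 0) | (v >= img_h)
    fov_occlusion = outside.mean() if len(points_uv) else 0.0

    return obj_occlusion, fov_occlusion
```

Together the two ratios give the unified occlusion representation described above, and thresholding them is one plausible way to control visibility conditions during sampling.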
@article{lee2025spatialmosaic,
title={SpatialMosaic: A Multiview VLM Dataset for Partial Visibility},
author={Lee, Kanghee and Lee, Injae and Kwak, Minseok and Ryu, Kwonyoung and Hong, Jungi and Park, Jaesik},
journal={arXiv preprint arXiv:2512.23365},
year={2025}
}