GASE: Gaussian Splatting–Based Automated System for Reconstructing Embodied-Simulation Training Scenarios

Abstract

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We therefore propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

Method Overview

Pipeline of our GASE system. Given scene images, together with the camera poses and point cloud derived from them, we combine the pose mapping relationships with the tracking capability of SAM2 to localize target objects across all frames from user-provided text or click prompts, and separate them from the background. We then reconstruct the scene with 3DGS and the objects with TRELLIS.
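The pose mapping idea can be illustrated with a minimal sketch. It is not the GASE implementation: it assumes a simple pinhole camera with a world-to-camera rotation and translation, whereas a panoramic rig would need its own projection model, and it ignores occlusion. The function names `project_point` and `prompts_for_frames`, and the camera dictionary layout, are hypothetical. The sketch shows how a 3D point belonging to a target object could be projected into every frame to obtain candidate 2D point prompts for a tracker such as SAM2.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Pixel = Tuple[float, float]


def project_point(
    point_world: Vec3,
    rotation: Mat3,      # world-to-camera rotation (assumed convention)
    translation: Vec3,   # world-to-camera translation (assumed convention)
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> Optional[Pixel]:
    """Project a 3D world point to a pixel; return None if it is not in view."""
    # Move the point into the camera frame: p_cam = R * p_world + t.
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    # Points at or behind the image plane cannot be seen.
    if z <= 0:
        return None
    # Pinhole projection with focal lengths (fx, fy) and principal point (cx, cy).
    u = fx * x / z + cx
    v = fy * y / z + cy
    # Keep only pixels that fall inside the image bounds.
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def prompts_for_frames(point_world: Vec3, cameras: List[Dict]) -> List[Optional[Pixel]]:
    """Return one candidate 2D prompt per frame (None where the point is out of view)."""
    return [project_point(point_world, **cam) for cam in cameras]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0, width=640, height=480)

cameras = [
    dict(rotation=IDENTITY, translation=(0.0, 0.0, 0.0), **INTRINSICS),
    dict(rotation=IDENTITY, translation=(-1.0, 0.0, 0.0), **INTRINSICS),
]
prompts = prompts_for_frames((1.0, 0.0, 2.0), cameras)
```

In this toy setup the same 3D point lands at a different pixel in each frame, and a point that projects outside the image or lies behind the camera yields `None`, so that frame would receive no prompt from this point.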

3D Gaussian Scene Reconstruction

Video comparison for a selected scene and object: the full scene (left) and the object-removed scene (right).

Object Generation

Interactive 3D models generated by GASE: Kettle, Cup, Bag, Bottle, Box, Coke.

BibTeX

@misc{zhang2026gasegaussiansplattingbasedautomated,
  title={GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments},
  author={Jiawei Zhang and Yiming Yan and Chao Liang and Nuo Xu and Seson Sun and Qichen Zhang and Yuhao Xu and Yantai Yang and Yingqiao Wang and Qin Jin and Zhipeng Zhang},
  year={2026},
  eprint={2606.17520},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.17520},
}