GASE: Gaussian Splatting–Based Automated System for Reconstructing Embodied-Simulation Training Scenarios

Jiawei Zhang1,2,3, Yiming Yan1,3, Chao Liang3,*, Nuo Xu3,
Seson Sun1,3, Qichen Zhang3, Yuhao Xu1, Yantai Yang1, Yingqiao Wang1,
Qin Jin2,†, Zhipeng Zhang1,†
1 AutoLab, SAI, Shanghai Jiao Tong University
    2 AIM3 Lab, School of Information, Renmin University of China
    3 Research Lab, Anyverse Dynamics
*Project Leader      †Corresponding Authors

Abstract

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and poor foreground object extraction. We therefore propose GASE, a highly automated system for simulation scene construction. GASE uses multi-view video streams from panoramic camera arrays to scan environments rapidly. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, in real-robot deployments across manipulation and navigation tasks, policies trained with GASE maintain a performance gap of less than 10% relative to policies trained purely on real-world data. These results indicate that GASE provides an efficient and effective solution for bridging the sim-to-real gap. Code will be released.

Introduction Video

Method Overview

GASE pipeline overview

Pipeline of the GASE system. Given scene images, together with the camera poses and point cloud derived from them, we combine the pose mapping relationships with the tracking capability of SAM2 to localize target objects across all frames from user-provided text or click prompts and to separate them from the background. We then reconstruct the scene with 3DGS and the objects with TRELLIS.
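As a rough illustration of how camera poses can carry an object location from one frame to the others, the sketch below projects a single 3D point (for example, a point on the target object taken from the reconstructed point cloud) into every frame with a standard pinhole model. The resulting pixel could then serve as a click prompt for a 2D segmenter. This is a minimal sketch under stated assumptions, not the GASE implementation: the `Camera` container, the function names, and the world-to-camera pose convention are all hypothetical, and no SAM2 API is called.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


@dataclass
class Camera:
    """Hypothetical pinhole camera with a world-to-camera pose."""
    rotation: Sequence[Sequence[float]]  # 3x3 world-to-camera rotation (assumed convention)
    translation: Vec3                    # world-to-camera translation
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def project_point(cam: Camera, point_world: Vec3) -> Optional[Pixel]:
    """Return the pixel (u, v) of a world point, or None if it is not visible."""
    # Move the point into the camera frame: p_cam = R * p_world + t.
    x, y, z = (
        sum(cam.rotation[i][j] * point_world[j] for j in range(3)) + cam.translation[i]
        for i in range(3)
    )
    if z <= 1e-6:
        return None  # point lies behind (or on) the camera plane
    # Perspective division followed by the intrinsic mapping.
    u = cam.fx * x / z + cam.cx
    v = cam.fy * y / z + cam.cy
    if not (0 <= u < cam.width and 0 <= v < cam.height):
        return None  # point falls outside the image bounds
    return (u, v)


def prompts_for_frames(cams: List[Camera], point_world: Vec3) -> List[Optional[Pixel]]:
    """One candidate click prompt per frame; None where the point is not visible."""
    return [project_point(cam, point_world) for cam in cams]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
FLIPPED = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]  # 180 degrees about y

example_cams = [
    Camera(IDENTITY, (0.0, 0.0, 0.0), 500.0, 500.0, 320.0, 240.0, 640, 480),
    Camera(IDENTITY, (0.5, 0.0, 0.0), 500.0, 500.0, 320.0, 240.0, 640, 480),
    Camera(FLIPPED, (0.0, 0.0, 0.0), 500.0, 500.0, 320.0, 240.0, 640, 480),
]
example_prompts = prompts_for_frames(example_cams, (0.0, 0.0, 2.0))
```

In this toy setup the first two cameras see the point and yield pixel prompts, while the third faces away from it and yields `None`. A practical system would also need occlusion checks and would typically project many points per object rather than one.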

3D Gaussian Scene Reconstruction

Video comparison for each scene and object: the full scene (left) and the object-removed scene (right).

Full Scene

Object Removed

Object Generation

Interactive 3D models generated by GASE.

Kettle

Cup

Bag

Bottle

Box

Coke

Real-Robot Experiments

Select a transfer method and a task to watch the demonstration.

BibTeX

@misc{zhang2026gasegaussiansplattingbasedautomated,
  title={GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments},
  author={Jiawei Zhang and Yiming Yan and Chao Liang and Nuo Xu and Seson Sun and Qichen Zhang and Yuhao Xu and Yantai Yang and Yingqiao Wang and Qin Jin and Zhipeng Zhang},
  year={2026},
  eprint={2606.17520},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.17520},
}