Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We therefore propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, in real-robot deployments across manipulation and navigation tasks, the performance gap relative to policies trained purely on real-world data remains below 10%. These results indicate that GASE provides an efficient and effective solution for bridging the sim-to-real gap. Code will be released.
Pipeline of the GASE system. Given scene images, together with the camera poses and point cloud derived from them, we combine the pose mapping relationships with the tracking capability of SAM2 to localize target objects across all frames from user-provided text or click prompts, and separate them from the background. We then reconstruct the scene with 3DGS and the objects with TRELLIS.
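The caption does not spell out how the pose mapping is computed. One plausible reading, sketched below under that assumption, is that a 3D point on the target object (for example, a point from the reconstructed point cloud selected via the user's click) is projected into every frame with a standard pinhole camera model, and the resulting pixel is used as a per-frame point prompt for a 2D segmenter such as SAM2. The function names, the world-to-camera pose convention, and the intrinsics parameters are hypothetical and are not taken from the GASE code.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]
Pose = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (R, t), world-to-camera


def project_point(
    point_world: Sequence[float],
    R: Sequence[Sequence[float]],
    t: Sequence[float],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Optional[Pixel]:
    """Project a 3D world point into one frame with a pinhole model.

    Assumes (R, t) maps world coordinates to camera coordinates:
    X_cam = R @ X_world + t. Returns None if the point is behind the
    camera or falls outside the image.
    """
    x, y, z = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )
    if z <= 1e-6:  # behind (or on) the camera plane
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def point_prompts(
    point_world: Sequence[float],
    poses: Sequence[Pose],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Optional[Pixel]]:
    """Per-frame pixel prompts for one object point (None where not visible)."""
    return [
        project_point(point_world, R, t, fx, fy, cx, cy, width, height)
        for R, t in poses
    ]


# Usage: one object point seen from two hypothetical camera poses.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, [0.0, 0.0, 0.0]), (identity, [0.5, 0.0, 0.0])]
prompts = point_prompts([0.0, 0.0, 2.0], poses, 500.0, 500.0, 320.0, 240.0, 640, 480)
```

This sketch ignores occlusion: a point that projects inside the image may still be hidden behind other geometry, so a practical pipeline would likely add a depth or visibility check before trusting the prompt.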
Video comparison for each scene and object: full scene (left) and object-removed scene (right).
3D models generated by GASE: kettle, cup, bag, bottle, box, and Coke.
Demonstration videos for each transfer method and task.