

Speed3R: Sparse Feed-forward 3D Reconstruction Models

CVPR 2026 Findings

Weining Ren1     Xiao Tan2     Kai Han1    
1HKU     2Baidu AMU    



Speed3R accelerates VGGT and \(\Pi^3\) with trainable sparse attention.

Abstract

While recent feed-forward models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes quadratic complexity, creating a computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: a sparse set of keypoints is sufficient for robust estimation. Speed3R features a dual-branch attention mechanism in which a compression branch builds a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a 12.4x inference speedup on 1000-view sequences while introducing a minimal, controlled trade-off in accuracy. Validated on standard benchmarks with both VGGT and \(\Pi^3\) backbones, our method delivers high-quality reconstructions at a fraction of the computational cost, paving the way for efficient large-scale scene modeling.
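A back-of-envelope comparison makes the quadratic bottleneck concrete. The sketch below counts query-key score computations for dense attention versus a coarse-then-top-k scheme; the token counts, block counts, and `top_k` value are illustrative assumptions, not the paper's measured numbers.

```python
# Back-of-envelope cost of dense vs. block-sparse global attention.
# All concrete numbers below are illustrative assumptions.
def attention_pairs(n_views, tokens_per_view, n_blocks=None, top_k=None):
    """Count query-key score computations.

    Dense: every token attends to every token -> T * T.
    Sparse: each token scores n_blocks pooled summaries (compression),
    then attends to the tokens of its top_k blocks (selection).
    """
    T = n_views * tokens_per_view
    if n_blocks is None:
        return T * T
    block = T // n_blocks
    return T * n_blocks + T * top_k * block

# Hypothetical 1000-view sequence with 256 tokens per view.
dense = attention_pairs(1000, 256)
sparse = attention_pairs(1000, 256, n_blocks=1000, top_k=16)
print(f"dense/sparse score ratio: {dense / sparse:.1f}")
```

Even this crude count shows an order-of-magnitude reduction in score computations; the actual wall-clock speedup additionally depends on kernel efficiency and the cost of the selection step.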

Method

Method Animation

Our model processes a sequence of input images through a shared feature encoder. The resulting tokens are then passed through a series of transformer blocks that alternate between local Frame Attention (within each view) and our proposed Global Sparse Attention (GSA, across all views). The GSA module efficiently integrates global information by decomposing attention into a Compression Branch for coarse context and a Selection Branch for fine-grained detail, guided by a Top-k selection mechanism. Finally, the updated tokens are passed to task-specific heads to predict camera poses and dense depth maps.
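The dual-branch idea can be sketched in a few lines of NumPy. In this sketch, the compression branch mean-pools keys into per-block summaries, and the selection branch lets each query attend only to the tokens of its top-k scoring blocks. The function names, the mean-pool compression, and the per-query (rather than per-head or per-frame) top-k are simplifying assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_sparse_attention(q, k, v, block_size=4, top_k=2):
    """Illustrative dual-branch sparse attention over (T, d) token arrays.

    Compression branch: mean-pool keys into coarse per-block summaries.
    Selection branch: each query attends only to tokens inside the
    top_k blocks ranked by its coarse scores.
    Assumes T is divisible by block_size.
    """
    T, d = q.shape
    n_blocks = T // block_size
    # Compression branch: (n_blocks, d) coarse keys.
    coarse_k = k.reshape(n_blocks, block_size, d).mean(axis=1)
    coarse_scores = q @ coarse_k.T / np.sqrt(d)        # (T, n_blocks)
    # Top-k block selection per query.
    sel = np.argsort(-coarse_scores, axis=1)[:, :top_k]
    out = np.zeros_like(q)
    for i in range(T):
        idx = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in sel[i]]
        )
        # Selection branch: fine-grained attention on selected tokens only.
        w = softmax(q[i] @ k[idx].T / np.sqrt(d))
        out[i] = w @ v[idx]
    return out
```

Per query, the cost drops from scoring all T tokens to scoring n_blocks summaries plus top_k * block_size selected tokens; when top_k covers every block, the result coincides with dense attention.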

Performance Comparison

Qualitative Comparison

BibTeX

@inproceedings{ren2026speed3r,
  title={Speed3R: Sparse Feed-forward 3D Reconstruction Models},
  author={Ren, Weining and Tan, Xiao and Han, Kai},
  booktitle={arXiv},
  year={2026}
}

Acknowledgements

This work is supported by the Hong Kong Research Grants Council General Research Fund (Grant No. 17213825). Weining Ren is supported by the Hong Kong PhD Fellowship Scheme (HKPFS). We are grateful to Hongjun Wang for helpful discussions, and to Zhiheng Wu and Yumeng Zhang for GPU support.