Abstract
Bio-inspired spike cameras, which produce spike streams with high temporal resolution, offer a new perspective on common challenges in depth estimation (e.g., high-speed motion blur). In this paper, we propose a novel problem setting, spike-based stereo depth estimation, and make the first attempt to explore an end-to-end transformer-based network that learns stereo depth estimation for spike cameras, named Spike-based Stereo Depth Estimation Transformer (SSDEFormer). We first build a hybrid camera platform and provide a new stereo depth estimation dataset (i.e., PKU-Spike-Stereo) with spatiotemporally synchronized labels. Then, we propose a novel spike representation to effectively exploit the spatiotemporal information in spike streams. Finally, we design a transformer-based network that generates dense depth maps without a fixed-disparity cost volume. Experiments show that our approach is effective on both synthetic and real-world datasets. The results verify that spike cameras can perform robust depth estimation even in fast-motion scenarios where conventional cameras and event cameras fail.