Combining Events and Frames Using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction


Gehrig, Daniel; Ruegg, Michelle; Gehrig, Mathias; Hidalgo-Carrio, Javier; Scaramuzza, Davide (2021). Combining Events and Frames Using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction. IEEE Robotics and Automation Letters, 6(2):2822-2829.

Abstract

Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous “events”. They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
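The core idea from the abstract — per-modality encoders feeding a shared hidden state that is updated whenever any sensor delivers data and can be queried at arbitrary times — can be illustrated with a small sketch. This is a toy model under assumed details (random linear encoders, an exponential state decay over the time gap, and a tanh update), not the paper's actual architecture:

```python
import numpy as np

class RAMSketch:
    """Toy sketch of an asynchronous multimodal recurrent update:
    each sensor ("events", "frames") has its own encoder, all of them
    write into one shared hidden state, and a prediction can be read
    out at any timestamp. Encoder weights, the leaky decay, and the
    tanh update rule here are illustrative assumptions."""

    def __init__(self, hidden_dim, modality_dims, decay=0.1):
        self.h = np.zeros(hidden_dim)            # shared hidden state
        rng = np.random.default_rng(0)
        # one random linear encoder per modality, mapping input -> hidden
        self.encoders = {name: rng.standard_normal((hidden_dim, d)) * 0.1
                         for name, d in modality_dims.items()}
        self.decay = decay
        self.t = 0.0                             # time of the last update

    def update(self, modality, x, t):
        """Asynchronous update, called whenever *any* sensor produces data."""
        dt = t - self.t
        self.h = self.h * np.exp(-self.decay * dt)   # decay stale state
        self.h = np.tanh(self.h + self.encoders[modality] @ np.asarray(x))
        self.t = t
        return self.h

    def query(self, t):
        """A prediction can be requested at any time between updates."""
        dt = t - self.t
        return self.h * np.exp(-self.decay * dt)

# Events and frames arrive at irregular, unsynchronized times:
net = RAMSketch(hidden_dim=8, modality_dims={"events": 4, "frames": 16})
net.update("frames", np.ones(16), t=0.00)   # a full intensity frame
net.update("events", np.ones(4),  t=0.01)   # an event tensor in between frames
net.update("events", np.ones(4),  t=0.02)
features = net.query(t=0.025)               # query at an arbitrary time
```

In the paper's actual system the hidden state feeds a decoder that regresses a dense depth map; the sketch stops at the shared state, which is the part that distinguishes RAM networks from a conventional RNN tied to a single synchronous input stream.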


Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Scopus Subject Areas: Physical Sciences > Control and Systems Engineering
Physical Sciences > Biomedical Engineering
Physical Sciences > Human-Computer Interaction
Physical Sciences > Mechanical Engineering
Physical Sciences > Computer Vision and Pattern Recognition
Physical Sciences > Computer Science Applications
Physical Sciences > Control and Optimization
Physical Sciences > Artificial Intelligence
Language: English
Date: 2022
Deposited On: 17 Feb 2022 09:41
Last Modified: 26 Feb 2024 02:49
Publisher: Institute of Electrical and Electronics Engineers
ISSN: 2377-3766
OA Status: Green
Publisher DOI: https://doi.org/10.1109/LRA.2021.3060707
Other Identification Number: merlin-id:22156
Content: Accepted Version