Neuromorphic Optical Flow and Real-time Implementation with Event Cameras

Optical flow provides information on relative motion that is an important component in many computer vision pipelines. Neural networks provide high accuracy optical flow, yet their complexity is often prohibitive for application at the edge or in robots, where efficiency and latency play a crucial role. To address this challenge, we build on the latest developments in event-based vision and spiking neural networks. We propose a new network architecture, inspired by Timelens, that improves the state-of-the-art self-supervised optical flow accuracy when operated both in spiking and non-spiking mode. To implement a real-time pipeline with a physical event camera, we propose a methodology for principled model simplification based on activity and latency analysis. We demonstrate high speed optical flow prediction with almost two orders of magnitude reduced complexity while maintaining accuracy, opening the path for real-time deployments.


Introduction
Optical flow is defined as the apparent motion of objects, edges and surfaces in a visual scene registered by the observer. It is caused by the relative motion between the observer and the scene and does not distinguish between actual motion in the visual scene and a change in the observer's pose. The applications of optical flow in the field of computer science include motion estimation and video compression [1,7]. In machine perception as well as in robotics, optical flow is used for object detection and tracking [3,14,27], robot navigation and even for control of micro air vehicles [9,17,31].
Deployment of optical flow in real-time robotic scenarios requires low-latency processing and energy efficiency. Existing algorithms usually calculate optical flow at discrete rates based on frames obtained from conventional cameras [13]. Neuromorphic dynamic vision sensors (DVS) operate similarly to the eye's retina by providing a continuous stream of events representing brightness changes, rather than absolute measurements at fixed time intervals [15]. Since optical flow computation relies on regions and time instants where brightness changes, DVS represent a viable alternative for fast optical flow prediction, as demonstrated in recent works [6,26]. Moreover, the sparsity of the events can be exploited by spiking neural networks (SNNs) as opposed to artificial neural networks (ANNs). The advantage of SNNs deployed on neuromorphic hardware is low latency and energy efficiency coming from sparse computations [4].
Recently, researchers have presented an approach to produce sparse optical flow from event data with SNNs [18]. However, there is a disconnect between large-scale architecture modelling and real-time deployments in efficient hardware. Here, we present a novel approach of a Timelens [29]-like network for sparse optical flow prediction. Apart from surpassing the optical flow baseline in terms of the average endpoint error (AEE) [18], we also address the deployment aspect through systematic model reduction and demonstrate real-time operation with a physical DVS camera, as schematically illustrated in Fig. 1.
This paper makes the following contributions:
1. We design an optical flow architecture inspired by the Timelens architecture, enriched with spiking neurons operating with DVS-based inputs.
2. We surpass the state-of-the-art self-supervised optical flow accuracy in both spiking and non-spiking modes of operation.
3. We propose a principled model reduction methodology and demonstrate a real-time optical flow pipeline with a physical DVS camera.

Related Work

Deep learning of SNNs
In recent years, SNN popularity in machine learning has been increasing owing to research advancements that enabled easy modelling and training in deep learning frameworks [23,30]. Beyond the standard Leaky Integrate-and-Fire (LIF) model, an even wider variety of neuro-inspired spiking models has been explored. In particular, a framework around the so-called Spiking Neural Unit (SNU) includes the plain SNU with LIF dynamics and typical axo-dendritic synapses, as well as variants that model further biological aspects, such as axo-axonic and axo-somatic synapses in SNUo and SNUa, respectively. These variants demonstrated improvements in large-scale speech recognition models [5]. In the context of optical flow, modifications of a LIF implementation were also proposed, called ALIF, XLIF, and PLIF [18].

Architectures for optical flow
Successful training of neural networks relies on a proper loss definition; historically, supervised losses were used [13,16,28]. Due to the challenges of obtaining a large number of high-quality labels, it is beneficial to reformulate the training in terms of a self-supervised loss [21,32]. In [18], the optical flow prediction task was posed as a self-supervised contrast maximization problem. This training approach can be applied to popular network architectures for optical flow prediction, including EV-FlowNet [32] and FireNet [25]. State-of-the-art SNN implementations are based on adapting these architectures to inputs from event-based cameras and to the operation with spiking neurons [18].

Timelens
The Timelens architecture [29] was proposed in the context of event-based video frame interpolation. The design itself was inspired by the hourglass network with skip connections for frame-based video interpolation, a problem initially posed in [19]. A peculiarity of the network architecture is the reduction of the spatial dimensionality in the encoding part using a pooling operator rather than strided convolutions. Another feature is the larger kernel sizes of the initial two convolutions compared with the rest of the encoding/decoding blocks.

Network model
We propose an architecture for the prediction of optical flow based on SNNs receiving an event stream from a DVS. Design choices, such as spatial down- and up-sampling, channel dimensions, kernel sizes and skip connections, are inspired by the Timelens network [29]. Our network is reformulated as an SNN by incorporating spiking spatial convolutions featuring stateful neural cells and layer recurrency. An overall architectural diagram is presented in Fig. 2 and the details are described in the following subsections.

Neuron models
First, we implement an SNN using state equations that describe the common neuroscientific LIF model in a form trainable within the realm of deep learning [18,23,30]:

s_t = d ⊙ s_{t−1} ⊙ (1 − y_{t−1}) + W x_t + H y_{t−1},   (1)
y_t = h(s_t − v_th),   (2)

where s_t is the state (the membrane potential of the neuron), W and H are the input and optional recurrent weights, respectively, d is the membrane potential decay factor, y_t is the output, v_th is a firing voltage threshold, and h is the step activation function. The model is trainable with backpropagation-through-time assuming a smooth surrogate derivative arctanspike(x, a) = 1/(1 + a · x²) for h, with a = 10. Trainable parameters include W, H, d and v_th. This neuronal model is our main focus and we will denote architectures using it with the SNN prefix.

Secondly, we consider a more advanced biologically-inspired extension of the basic LIF: the so-called SNUo unit, which models the concept of axo-axonic synapses that enrich neuronal connectivity by modulating the neuronal outputs [5]. From an implementation perspective, this leads to the emission of sparse analog-valued spikes, or graded spikes in the nomenclature of Intel's Loihi 2 implementation [24]. The equations of SNUo are [5]:

s_t = g(W x_t + H y_{t−1} + d ⊙ s_{t−1} ⊙ (1 − ỹ_{t−1})),   (3)
ỹ_t = h(s_t − v_th),   (4)
y_t = ỹ_t ⊙ o(W_o x_t + H_o y_{t−1} + b_o),   (5)

where ỹ_t is the unmodulated neuron output used for resetting the membrane potential, y_t is the modulated output propagating to downstream units, and g is an additional activation function that we set to leaky ReLU with a leak of 0.1. W_o, H_o and b_o are additional trainable parameters. We use the sigmoid function as the activation o for output modulation in Eq. 5 to mimic the inhibitory character of the axo-axonic synapses, as suggested in [5]. We will denote networks using this approach with the SNUo prefix.

Lastly, the benefits of neuromorphic internal dynamics were demonstrated also in the non-spiking mode, by operating with real values in the so-called soft SNU (sSNU) approach [30]. The idea is to replace the step activation function h in Eq. 2 with a sigmoid function. As sSNU operates by continuously outputting real values, we benchmark it against non-spiking baselines. We will denote networks using this unit with the sSNU prefix.
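As a concrete illustration, a single time step of the LIF/SNU dynamics described above can be sketched in NumPy as follows (a minimal sketch; the function name and single-step interface are ours, not from the paper):

```python
import numpy as np

def snu_step(x, s_prev, y_prev, W, H, d, v_th, spiking=True):
    """One time step of SNU/LIF dynamics: the membrane potential
    decays by factor d, is reset by the previous spike y_prev, and
    integrates the input current W @ x plus the recurrent current
    H @ y_prev. The output is a binary spike (step activation)."""
    s = d * s_prev * (1.0 - y_prev) + W @ x + H @ y_prev
    if spiking:
        y = (s > v_th).astype(x.dtype)           # step activation h
    else:
        y = 1.0 / (1.0 + np.exp(-(s - v_th)))    # sSNU: sigmoid instead of step
    return s, y
```

With `spiking=False` the unit behaves like the sSNU mode, emitting continuous values while keeping the same internal decay-and-reset dynamics.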

Network structure
Spiking convolutions work similarly to conventional convolutions found in ANNs, except for the neural dynamics applied to their outputs. The per-pixel, per-channel outputs of the 2D convolutions serve as input currents (W x_t in Eq. 1) for the spiking neural units. Simultaneously, layer-wise recurrency (H y_{t−1} in Eq. 1) is an additional feature to capture temporal dependencies that is not always considered in SNN modelling. We explicitly mention whenever we include this term.
The network structure is illustrated in Fig. 2. Its first stage comprises two spiking 2D convolutions expanding the N_in input channels to 32 output channels featuring 7×7 kernels. While the spatial dimension is retained for the spiking convolution by using stride 1 and appropriate zero padding, spatial down-sampling is performed afterwards using 2D average pooling with kernel size 2×2. The remaining encoding part of the network consists of five similar encoding blocks, each comprising two spiking convolutions followed by a pooling operator. The kernel sizes are 3×3, except for the first encoding block (5×5). For each encoding block, the number of output channels is doubled while the spatial resolution is halved. For the two spiking convolutions in each encoding block, we consider all combinations of layer-wise recurrency, as marked in Fig. 2.
For decoding, five identical decoding blocks are used. Each consists of 2D bilinear up-sampling by a factor of 2, followed by two spiking convolutions. The number of output channels gets halved with each decoding block and the convolutional kernel sizes are 3×3. Skip connections between each encoder/decoder pair of the same resolution provide values which are concatenated channel-wise before the second spiking convolution in each decoding block.
To obtain continuous optical flow values, the final layer is a 1×1 convolution with tanh activation. This layer reduces the 32 base channels to N_out = 2 channels representing the optical flow components u and v that correspond to horizontal and vertical optical flow magnitudes, respectively.
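The channel/resolution bookkeeping of the encoder and decoder described above can be sketched in plain Python (an illustrative sketch; the helper name is ours, and the initial pooling is folded into the first listed shape):

```python
def timelens_shapes(n_stages=5, base_channels=32, in_res=128):
    """Tabulate (channels, resolution) after each encoding and
    decoding block of the Timelens-like network: each encoding
    block doubles the channels and halves the resolution (2x2
    average pooling); each decoding block inverts this (bilinear
    up-sampling by a factor of 2)."""
    shapes = [(base_channels, in_res)]      # after the initial 7x7 convolutions
    c, r = base_channels, in_res
    for _ in range(n_stages):               # encoding blocks
        c, r = c * 2, r // 2
        shapes.append((c, r))
    for _ in range(n_stages):               # decoding blocks
        c, r = c // 2, r * 2
        shapes.append((c, r))
    return shapes
```

For the full 5-stage model on 128×128 inputs, the bottleneck reaches 1024 channels at 4×4 resolution, and the decoder returns to 32 channels at full resolution before the 1×1 prediction head.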

Input coding
The DVS event stream contains events of the form e_i = (x_i, y_i, t_i, p_i), where x_i and y_i represent the pixel coordinates, t_i the timestamp and p_i the ON/OFF polarity of the event. Different encoding strategies have been proposed to process the raw event stream prior to feeding it into a neural network.
Commonly used input coding techniques are the count encoding [22] and the voxel grid encoding [34], depicted in Fig. 3. The count encoding loses the temporal information of single events within the aggregation window: events get accumulated per pixel and per polarity for the entire window width. On the other hand, a voxel-based representation discretizes the time span of the aggregation window and uses temporal bilinear interpolation to populate the bins with events. Polarity is not treated as a separate channel; instead, negative OFF events (−1) and positive ON events (+1) are summed in a single channel.
For our spiking architectures, we opted for the voxel grid input coding. The number of discrete time bins is an additional hyperparameter. Choosing the number of bins too high yields overly sparse inputs, while for a low number of bins the encoding collapses to a count representation with a single channel. In the latter case, positive and negative events can annihilate each other, leading to information loss. For our spiking network, performance peaked at six time bins (N_in = 6). However, when operating in the non-spiking mode of sSNU, the count encoding with separate ON/OFF channels (N_in = 2) performed better, so we use it for sSNU-based networks. To ensure a fair comparison, the aggregation window width is fixed and the set of encoded events is therefore the same for both encoding approaches.
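The two encodings can be sketched as follows (a minimal NumPy sketch assuming events as (x, y, t, p) tuples with p ∈ {+1, −1}; the function names are ours):

```python
import numpy as np

def count_encoding(events, H, W):
    """Per-pixel, per-polarity event counts: 2 channels (ON, OFF),
    discarding the event timestamps within the window."""
    img = np.zeros((2, H, W))
    for x, y, t, p in events:
        img[0 if p > 0 else 1, y, x] += 1
    return img

def voxel_grid(events, H, W, n_bins, t0, t1):
    """Voxel grid: the window [t0, t1] is split into n_bins, each
    event's polarity (+1/-1) is distributed over the two nearest
    bins by temporal bilinear interpolation, in a single channel."""
    vox = np.zeros((n_bins, H, W))
    for x, y, t, p in events:
        tn = (n_bins - 1) * (t - t0) / (t1 - t0)  # fractional bin coordinate
        b = int(np.floor(tn))
        frac = tn - b
        vox[b, y, x] += p * (1.0 - frac)
        if b + 1 < n_bins:
            vox[b + 1, y, x] += p * frac
    return vox
```

Note how an ON and an OFF event at the same pixel cancel in a single voxel bin, whereas the count encoding keeps them in separate channels, which is exactly the information-loss trade-off discussed above.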

Training setup
All models are trained in a self-supervised fashion on the UZH-FPV Drone Racing Dataset [11], using the approach and configurations from [18]. Specifically, a contrast maximization loss is applied to compensate the motion and predict optical flow from the input events. The loss is

L = L_contrast + λ L_smooth,

where L_contrast is the contrast maximization loss proposed in [32,34], L_smooth is a smoothness regularizer [8], and λ = 0.001 is a balancing constant. Truncated back-propagation through time (TBPTT) is performed after every 10 forward passes.
In the original approach [18], the loss included different spatial resolutions of the optical flow maps. We analogously extended our architecture with 2D convolutions with tanh activation to produce optical flow predictions of different resolutions at each decoding block. These intermediate optical flow maps are then up-sampled to the initial spatial dimension using nearest neighbour interpolation for the loss computation. Simultaneously, they are concatenated to the input channels of the subsequent decoding blocks.
However, in contrast to the prior work, we also considered an architecture with the loss applied only to the last output layer's prediction. This approach is simpler and faster, and turned out to be beneficial for our architecture.
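The TBPTT schedule can be sketched in PyTorch as follows (a generic sketch, not the authors' training code; the recurrent cell and the self-supervised loss are placeholders):

```python
import torch

def run_tbptt(cell, inputs, loss_fn, optimizer, k=10):
    """Truncated BPTT: accumulate the loss over k forward passes,
    then backpropagate, update the weights, and truncate the
    computational graph by detaching the recurrent state."""
    state, total = None, 0.0
    for step, x in enumerate(inputs, start=1):
        y, state = cell(x, state)
        total = total + loss_fn(y)
        if step % k == 0:               # backprop every k forward passes
            optimizer.zero_grad()
            total.backward()
            optimizer.step()
            state = state.detach()      # truncate the graph here
            total = 0.0
    return state
```

Detaching the state after every k-th step bounds the memory and compute of backpropagation while still propagating the hidden state forward across the whole sequence.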

Simulation results
The quantitative performance and generalization abilities of the trained models (self-supervised on the UZH-FPV Drone Racing Dataset) are evaluated on the MVSEC dataset [33], following the comparison approach from [18]. The predicted sparse optical flow is compared against the ground truth optical flow provided by [32]. The ground truth labels are available at timestamps corresponding to the conventional camera's frames and quantify the optical flow over one (dt = 1) or four (dt = 4) frames.
The well-established average endpoint error (AEE) in pixels is used to evaluate the four sequences of the dataset: outdoor day1 (od1), indoor flying1 (if1), indoor flying2 (if2), indoor flying3 (if3). For easier comparability, we introduce a weighted average endpoint error (WAEE) to combine the four metrics into a single scalar value:

WAEE = Σ_i w_i · AEE_i,   (8)

where the four weights w_i, normalized to sum to one, are based on the average AEE of the best-performing spiking architectures of the prior art [18].

First, we evaluated the configurations of the layer recurrency in the convolutional blocks, visualized in Fig. 2. As each block comprises two spiking convolutions, there are four different combinations of recurrent (R) and feed-forward (F) convolutions: R/F, F/R, R/R and F/F. Table 1 reports the results in terms of WAEE and the average percentage of outliers %Outlier. When operating in the spiking mode, having one convolution with layer recurrency per block is favourable. In particular, the best performance is achieved with recurrent layers in the first convolution (R/F). On the contrary, in the non-spiking mode of sSNU, double layer recurrency (R/R) is beneficial. We use these best configurations for the final models.

We also evaluated an implementation of the multi-resolution loss described in the training section. For both settings of dt = 1 and dt = 4, the WAEE values reported in Table 2 demonstrate that the simpler setup, with the loss applied only at the last layer, is preferred for our architecture. A possible interpretation of the observed deterioration is that the multi-layer loss function trains the deeper decoders to encode down-sampled optical flow rather than to develop higher-level features. Furthermore, such a formulation is inconsistent with the ultimate task of the network, which is predicting high-resolution optical flow at the last layer rather than outputting flow predictions at multiple intermediate stages. Imposing a loss only on the last layer omits this restriction. We use this approach for all our models.
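A weighted average of this form can be computed as follows (a sketch; the concrete weight values come from the prior art and are not reproduced here):

```python
def waee(aee_per_seq, weights):
    """Weighted average endpoint error over the four MVSEC
    sequences (od1, if1, if2, if3). The weights are normalized,
    so WAEE reduces to the common AEE when all sequences agree."""
    total_w = sum(weights)
    return sum(w * e for w, e in zip(aee_per_seq, weights)) / total_w
```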
The resulting AEEs, WAEEs and outlier percentages (AEE > 3 pixels) for our Timelens-based architecture with spiking (SNN), analog-valued spiking (SNUo) and non-spiking (sSNU) units are reported in Table 3. Our model is compared with the state-of-the-art spiking and non-spiking architectures trained in the identical self-supervised setting [18]. For an extended comparison with EV-FlowNet [32,34] and Hybrid-EV-FlowNet [20], which use different training datasets and setups, see Supplementary Note 2.
For spiking neural networks, our SNN-Timelens surpasses the performance of the LIF- and XLIF-EV-FlowNet by 9.7% and 6.7% with regard to WAEE and lowers the percentage of outliers %Outlier by 30.5% and 24.1% for dt = 1, respectively. The improvement over the XLIF-EV-FlowNet is 5.6% for dt = 4, so that the average prediction error is reduced by 6.1%. Table 3 shows that our SNNs are not only better on average, but outperform the comparable state-of-the-art on each MVSEC sequence for dt = 1 and dt = 4.
Operating with analog-valued spikes, SNUo-Timelens achieves a further substantial reduction in WAEE of 18.3%.

Lastly, the best performing model is the sSNU-Timelens, incorporating neuromorphic dynamics into the non-spiking mode of operation. Despite featuring 8.5% fewer parameters than the best baseline EV-FlowNet (32.9M), our sSNU-Timelens (30.1M) outperforms it with regard to WAEE for both dt = 1 and dt = 4, by 6.4% and 4.1%, respectively. The average improvement over the state-of-the-art for comparable non-spiking models therefore equals 5.5%.

Model reduction for real-time operation
The state-of-the-art models listed in Table 3 involve tens of millions of parameters and are executed on high-end GPUs. To close the gap between large-scale architecture modelling and real-time deployment, model reduction is required. We propose a principled approach for model reduction that includes analysis of the network activity and of the relationships between the number of parameters and the inference speed at different stages of the architecture.
We focus our exploration on the SNN-Timelens, which could benefit the most from efficient implementation on SNN chips, such as TrueNorth [2], Loihi [10] or Kraken's SNE [12], that support the LIF equations used in the SNU. If support for analog-valued spikes increases, as in Loihi 2, the SNUo-Timelens architecture could become appealing.

Spiking activity analysis
A spiking activity analysis was conducted to gain insight into the importance of the different network building blocks. For a test sequence of the MVSEC dataset, the fraction of neurons that produced spikes was registered for each network layer: the input layer, the initial convolutional layers, the encoding layers s[i], the decoding layers u[i] and the final prediction layer. Fig. 4 shows the spiking activity of the SNN-Timelens architecture with 5 encoding/decoding blocks (left) compared with a network reduced to 3 encoding/decoding blocks (right). In the following, we refer to the number of encoding/decoding blocks as the number of stages of the Timelens model.
In general, the fraction of non-zero outputs, which corresponds to the fraction of neurons that spike, is almost constant until time step 210. At this time step, the drone in the DVS recording lifts off and the incoming events start to originate from movement rather than static noise. The spiking activity of all layers fluctuates between 0 and 0.5 when optical flow is predicted from the actual movement. Note that the activity of the last layer is 1.0 at all times, since the final prediction layer features a continuous tanh activation rather than a step function.
For the larger model comprising 5 stages, the fraction of non-zero outputs does not vary at all for the deep encoding layers s3-s5 and decoding layers u1-u3. However, evaluation of the gradients indicates that the weights do get updated during training. The question therefore arises whether these deep layers are crucial for the overall model performance. Reducing the number of stages from 5 to 3 indeed shows almost on-par performance, with only a 2.6% WAEE drop on MVSEC, while the spiking activity varies for all layers. The smaller model features only 1.75M parameters, which is 14.5 times fewer than the initial SNN architecture with 25.35M. The constant spiking activity of these layers can be interpreted as a quasi-identity mapping between early encoding and late decoding layers. Thus, dropping them has only a minor effect on the network capabilities.
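The activity metric used here, the fraction of non-zero outputs per layer, can be computed as follows (a NumPy sketch; in a PyTorch model this would typically be collected with forward hooks):

```python
import numpy as np

def spiking_fraction(layer_output):
    """Fraction of neurons that emitted a spike (non-zero output)
    in one forward pass of a given layer."""
    out = np.asarray(layer_output)
    return np.count_nonzero(out) / out.size
```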

Network profiling
In deep CNNs there is no simple linear relationship between the number of parameters and the inference latency. Therefore, we profiled the contributions of the components of the model to assess how the number of stages (encoding/decoding blocks) and the convolution sizes impact the inference frequency in frames per second (fps). Model performance is monitored throughout the process to find a balance between speed and quality of the predicted optical flow. The fps values are calculated from timings of 100 forward passes on 128×128 DVS inputs using PyTorch code executed on a single core of an Intel Core i7 2.6 GHz CPU.
Reducing channels. Network profiling has revealed that the first convolution and the first encoding block are particularly costly in terms of computations. On the one hand, this is due to the large spatial input dimension; on the other hand, it is influenced by the big convolutional kernels (7×7 and 5×5).
Nevertheless, decreasing the number of output channels effectively reduces the computational costs. Fig. 5 illustrates the trade-off between the number of channels and performance in terms of WAEE and fps for the SNN-Timelens model with 5 stages. Note the non-linear relationship between convolutional channels and network parameters.
Reducing stages. The spiking analysis showed that fewer than 5 stages, e.g. 3 stages, are sufficient to obtain reasonable optical flow predictions. Table 4 extends the analysis, reporting the WAEE (dt = 1), the number of network parameters and the model inference frequency for different numbers of channels and stages. Comparing the WAEE between 5 and 2 stages, we observe minor performance degradation: 0.84 versus 0.86. The 2-stage model comes with 44.4 times fewer parameters and increases the evaluation frequency by 93.2%. For further speedup, the number of channels of the 2-stage SNN-Timelens model can be decreased at the cost of degrading WAEE performance.
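The timing protocol can be sketched as follows (a sketch mirroring the 100-forward-pass measurement described above; `forward` stands in for the model and is our placeholder):

```python
import time

def measure_fps(forward, x, n_passes=100):
    """Inference frequency in frames per second, averaged over
    n_passes forward passes on a fixed input."""
    t0 = time.perf_counter()
    for _ in range(n_passes):
        forward(x)
    elapsed = time.perf_counter() - t0
    return n_passes / elapsed
```

Averaging over many passes amortizes timer resolution and one-off costs such as cache warm-up, which matters when comparing small, fast models.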

Model reduction results
The comparison of our architecture before and after reduction is presented for a set of selected configurations in Fig. 6. While our initial SNN-Timelens featured 5 stages with 32 channels and used 25.35M parameters, our model after reduction features only 2 stages with 32 channels and 0.57M parameters, thus reducing the number of trainable parameters by a factor of 44.4. This involves a trade-off in terms of WAEE performance, which degrades by just 2.6% (0.84 versus 0.86). Remarkably, it still remains better than the prior state-of-the-art large models LIF- and XLIF-EV-FlowNet (20.4M), with WAEE of 0.90 and 0.93, respectively. Note that to match the prior art performance (WAEE 0.90), our SNN-Timelens needs only 2 stages with 24 channels (0.32M), featuring 63.75 times fewer parameters.

Qualitative results
For qualitative performance assessment and validation of the generalization ability of the last proposed network with 2 stages and 24 channels, a complete real-time pipeline was implemented to process the event stream of a DVS128 camera from iniVation AG. Fig. 7 shows the optical flow predictions for different hand movements in front of the DVS. While color-coding is used to encode the optical flow, a sparse arrow grid is additionally superimposed on the optical flow for instant intuitive validation of the predictions. Arrow angles and lengths represent the direction and magnitude of the largest optical flow within a local 10×10 neighborhood of pixels, respectively.
The predicted flow in Fig. 7 looks reasonable and coincides with the expected dislocations caused by the moving hand. Linear motion is correctly captured (left plots) and the model generalizes well to more challenging scenarios, such as a rotating or approaching hand (right plots).
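The arrow overlay described above can be sketched as follows (a NumPy sketch; per 10×10 cell it picks the flow vector of largest magnitude, and the function name is ours):

```python
import numpy as np

def arrow_grid(u, v, cell=10):
    """For each cell x cell neighborhood of the dense flow (u, v),
    return the flow vector of largest magnitude, keyed by the
    top-left corner of the neighborhood."""
    H, W = u.shape
    arrows = {}
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            mu = u[i:i + cell, j:j + cell]
            mv = v[i:i + cell, j:j + cell]
            mag = mu ** 2 + mv ** 2
            k = np.unravel_index(np.argmax(mag), mag.shape)
            arrows[(i, j)] = (float(mu[k]), float(mv[k]))
    return arrows
```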

Conclusion
In this work we proposed a neuromorphic solution for optical flow estimation comprising an event camera combined with a Timelens-inspired architecture. We demonstrated SNN, SNUo and sSNU versions of our architecture, operating with different biologically inspired neuron models. By tuning the architectural design, the event encoding, the placement of recurrent connections, and the loss function formulation, we improved the performance in comparison with prior art models on the MVSEC dataset. Our architecture surpassed both SNN and ANN baselines when operating in spiking and real-valued modes, respectively. Remarkably, when operating with analog-valued spikes, it demonstrated performance comparable to the ANN baseline. Furthermore, a principled model reduction approach was proposed to meet realistic real-time hardware constraints. Our SNN-Timelens model reduced to 0.32M parameters achieves WAEE on par with the state-of-the-art while decreasing the number of parameters by almost two orders of magnitude. Finally, a real-time pipeline was demonstrated with a physical DVS camera. Future work includes deployment of the proposed architecture on a neuromorphic SNN chip to further decrease the latency and increase the energy efficiency.

Supplementary Notes for Neuromorphic Optical Flow and Real-time Implementation with Event Cameras
Yannick Schnider 1,2 , Stanisław Woźniak 1 , Mathias Gehrig

Additional comparison
Table 1 includes an extended comparison with additional prior art non-spiking models. In particular, EV-FlowNet PM [3] was trained in a setting comparable to ours, but used a photometric loss (PM). Its results were only reported for the dt = 1 mode. Furthermore, several prior art architectures were trained in a different setup directly on the MVSEC dataset, as opposed to our architectures that were trained on the UZH-FPV Drone Racing Dataset and evaluated on the MVSEC dataset. Results for models trained directly on MVSEC, delimited by dashed lines, include:
• EV-FlowNet PM-MVSEC [4], trained in a self-supervised manner with the photometric loss (PM),
• EV-FlowNet CM-MVSEC [5], trained in a self-supervised manner with a contrast maximisation loss (CM),
• Hybrid-EV-FlowNet MVSEC [2], trained in a self-supervised manner with the photometric loss.
Considering the extended comparison with non-spiking ANN prior art models, the EV-FlowNet CM-MVSEC [5] yields the best performance on all MVSEC sequences for dt = 1 with regard to WAEE and the percentage of outliers. Its WAEE of 0.67 is 8.2% lower than the 0.73 of our sSNU-Timelens. In turn, the Hybrid-EV-FlowNet MVSEC [2] is outperformed by our sSNU-Timelens by 26.0% (0.73 vs. 0.92).
In summary, the EV-FlowNet CM-MVSEC [5] and the Hybrid-EV-FlowNet MVSEC [2] perform best for MVSEC evaluations with dt = 1 and dt = 4, respectively. Remarkably, our sSNU-Timelens is a runner-up in both cases, despite being trained without access to the examples from the MVSEC dataset.

Table 1. Extended evaluation on MVSEC: AEE (the lower, the better ↓), the percentage of outliers %Out. (↓) per sequence, and the overall WAEE (↓) as defined in Eq. 1, as well as the average percentage of outliers %Out. (↓). Best scores are in bold, while runner-ups are underlined. Horizontal lines delimit the spiking and the non-spiking models. Dashed lines delimit not directly comparable prior art setups.

Figure 1. Optical flow estimation from DVS events: we propose a Timelens-based neural network architecture that, in comparison with prior art, provides lower error and higher real-time framerates.

Figure 2. Spiking architecture inspired by Timelens: in the encoding part, we consider different layer-wise recurrency configurations.

Figure 3. Different input event encodings: count encoding (per polarity, per pixel) and voxel grid encoding via temporal bilinear interpolation of combined events into N time bins.

Figure 4. Spiking activity for an MVSEC test sequence: the fraction of spiking neurons is registered for all layers from the input to the prediction layer. Left: an architecture with 5 encoding and decoding blocks; right: a reduced architecture with 3 encoding and decoding blocks.

Figure 6. Model reduction results: SNN-Timelens compared with state-of-the-art (SOTA) in our CPU setup. WAEE plotted versus frames per second (fps); circle size indicates model size. For readability, only selected SNN-Timelens configurations from Table 4 are labeled.

Figure 7. Real-time predictions: DVS events aggregated over one aggregation window and the corresponding optical flow from the reduced SNN-Timelens (0.32M) applied to different movements of a hand: (a) to the right, (b) to the left, (c) rotation, (d) approaching the camera.

Table 1. Effects of layer recurrency placement in the encoding blocks on WAEE (the lower, the better ↓) and %Outlier (↓). See Supplementary Note 1 for the values for each dt setting. Best scores are in bold, while runner-ups are underlined.

Table 2. Effects of the multi-layer loss function on intermediate up-sampled flow predictions for different layer recurrency placements in the encoder: WAEE (↓) and its relative increase with regard to the last-layer loss in Table 1, for SNN-Timelens.

Table 3. Evaluation on the MVSEC dataset for comparable models trained on the UZH-FPV Drone Racing Dataset: AEE (the lower, the better ↓), the percentage of outliers %Out. (↓) per sequence, and the overall WAEE (↓) as defined in Eq. 8, as well as the average percentage of outliers %Out. (↓). Best scores are in bold, while runner-ups are underlined. Horizontal lines delimit the spiking and the non-spiking models.

Table 4. Impact of convolutional channel count: WAEE, number of network parameters (in millions [M]), and inference frequency (in frames per second [fps]) of our Timelens-based SNNs for 5, 3 and 2 stages (encoding/decoding blocks).
