From Chaos Comes Order: Ordering Event Representations for Object Recognition and Detection



Introduction
Event cameras are biologically inspired vision sensors that function in a fundamentally distinct way [12]. Unlike traditional cameras that capture images at a fixed rate, these cameras measure brightness changes independently for each pixel, and these changes are referred to as events. The events encode the time, location, and polarity (sign) of the brightness changes. Event cameras offer several advantages over frame-based cameras, including exceptionally high temporal resolution (on the order of µs), a high dynamic range, and low power consumption. These benefits make them attractive for a wide range of applications like robotics, autonomous vehicles, and virtual reality. However, due to the sparse and asynchronous nature of events, applying classical computer vision algorithms remains challenging.
Many state-of-the-art deep learning models address this challenge by converting sparse and asynchronous events into dense grid-like representations before processing them with off-the-shelf deep neural networks. Such methods enjoy the advantages of mature learning algorithms, network architectures, and optimized hardware, but they require a non-trivial choice of event representation. In fact, the computer vision and robotics fields are witnessing a surge in the number of research papers utilizing event-based vision, resulting in a plethora of new event representations being proposed. Despite this, extensive comparisons of these representations remain rare, making it unclear whether these newer representations should be adopted.
No efficient methodology exists for comparing these representations. Conventionally, comparing event representations involves training a fixed deep-learning model for each event representation separately and subsequently selecting the optimal one based on a validation score. This process is very time-intensive since it requires network training in the loop, which often takes hours or days (Fig. 1a, top).
In this work, we propose a fast method to compare event representations which circumvents the need to train a neural network and instead computes the Gromov-Wasserstein Discrepancy (GWD) between the raw events and the event representation (Fig. 1a, bottom). This metric effectively measures the distortion that is introduced through converting raw events to representations and thus puts an upper bound on the amount of information that can be accessed by downstream neural networks. We show extensive experimental evidence that this metric preserves the task-performance ranking across a wide range of input representations for several datasets, neural network backbones and tasks (Fig. 1b). Due to its low computational cost, we apply the GWD to, for the first time, explicitly optimize over a large family of event representations, which reveals a new and powerful representation, which we term 12-channel Event Representation through Gromov-Wasserstein Optimization (ERGO-12). For the task of object detection, networks trained with this representation outperform other representations by 1.9 mAP on the 1 Mpx dataset and 0.3 mAP on Gen1, even outperforming state-of-the-art methods by 2.1 mAP on Gen1 and state-of-the-art feed-forward methods by 6.0 mAP on the 1 Mpx dataset. On object recognition, we find that our representation outperforms state-of-the-art representations by 3.8%. We believe that the GWD is a powerful tool that opens up a new research field that searches for optimized event representations. Our contributions are summarized as follows:
• We introduce a novel, efficient approach for comparing dense event representations using the Gromov-Wasserstein Discrepancy (GWD).
• We show extensive experimental evidence that it preserves the task performance ranking of neural networks trained with these representations across datasets, neural network backbones and tasks.
• We use it to, for the first time, conduct a hyperparameter search on a vast family of event representations, unveiling novel and powerful event representations that outperform the current state-of-the-art representations on the object detection and object classification tasks.

Related Work
In the field of event-based vision, two primary groups of representations exist: sparse and dense. Methods that use sparse representations [35,51,23,37] preserve the sparsity in the events but do not yet scale to more complex tasks due to a lack of specialized hardware and mature neural network architectures. This frequently results in lower performance on downstream tasks. In contrast, dense representations [31,48,53,50] offer improved performance since they can leverage mature machine learning algorithms and neural network architectures. Sparse representations, pioneered by asynchronous SNNs [35,23,37], are limited by the lack of specialized hardware and computationally efficient backpropagation algorithms. Point cloud encoders [47,40,11] have been used due to the spatio-temporal nature of event data, but can be computationally expensive and noisy. Graph neural networks [26,46,4,3,33,10] are scalable and have achieved high performance on various vision tasks but are still less accurate than dense methods for event-based vision. In this study, we focus on dense event representations and aim to achieve better task performance by utilizing existing efficient learning algorithms that are appropriate for current hardware.
Early dense representations converted events to histograms [31], generated time surfaces [48] or combined both [52] while relying on standard neural network backbones to process them. However, these representations only capture a low-dimensional representation of events since they typically only use a few channels. Later approaches have tried to capture more event information by either computing higher-order moments [1] or stacking multiple time windows of events [53]. These methods still stack events based on fixed time windows, which is problematic when the event rate becomes too large or too small, and this led to the introduction of stacking based on the number of events [50]. In parallel, a bio-inspired approach led to the introduction of Time Ordered Recent Event Volumes (TORE) [2], which aggregate events into queues. However, they are slow to compute and perform similarly to existing Voxel Grids [53].
Most recently, a powerful representation was proposed by Nam et al. [34], which divides events into multiple overlapping windows that halve the number of events at each stage and is more robust to varying scene dynamics.
Few papers study the effect of event representations on task performance. While [38] and [21] show small-scale ablation studies to select event representations, only Gehrig et al. [13] performed a large-scale investigation of event representations by training models on various inputs for multiple tasks. Their study demonstrated the advantages of splitting polarities and incorporating timestamps into representations, and it introduced a learnable representation. However, training for a single task was still computationally expensive, which limited the number of representations that could be compared. For this reason, their study did not cover a large number of representations, and in particular, did not consider different window sizes as is done in later work [50,34], or more advanced aggregations and measurements like in [1]. Our method instead introduces an efficient metric to compare event representations that solves these limitations, allowing us to perform a search over a large family of event representations and go beyond the representations in [50,1,34,14], including even non-differentiable hyperparameters.

Method
In this section, we first introduce the preliminaries on computing event representations (Sec. 3.1) and then propose the metric we use to measure the discrepancy between events and their representation based on the GWD (Sec. 3.2), before concluding with Sec. 3.3, where we use Bayesian optimization to find an optimal event representation.

Preliminaries
Event cameras measure brightness changes as an asynchronous stream of events. Each event is triggered when the intensity $L(u)$ at the pixel $u = (x, y)$ changes by the contrast threshold $C$ at time $t$, and thus satisfies

$$L(u, t) - L(u, t - \Delta t) = pC,$$

where $p \in \{-1, 1\}$ is the polarity of the event, and $t - \Delta t$ is the time of the last event. Within a time window $\Delta T$, an event camera thus generates an ordered set of events $E = \{e_k\}_{k=0}^{N_e-1}$, with each event $e_k = (u_k, t_k, p_k) \in \mathbb{R}^4$. To bridge the gap between asynchronous events and dense neural networks, they are usually converted to a dense event representation which has features $f_x \doteq R(x) \in \mathbb{R}^{N_f}$ indexed by the integer-valued pixel location $x$. The above representation thus generates a set of features $F = \{f_x\}_{x \in \Omega}$, where $\Omega$ denotes the image domain and has size $|\Omega| = N_f$, with $N_f$ being the number of pixels in the image. In what follows, we will derive a measure to quantify the distortion between events $E$ and features $F$ based on the GWD.
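As an illustration of turning an event set into dense per-pixel features, the following minimal sketch builds a two-channel polarity histogram, the simplest representation discussed later in the paper. The function name `event_histogram`, the array layout, and the toy events are assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np

def event_histogram(x, y, p, height, width):
    """Count events per pixel, split by polarity.

    Channel 0 holds negative events, channel 1 positive events.
    (Hypothetical helper; the layout is an assumption of this sketch.)
    """
    hist = np.zeros((2, height, width))
    channel = (p > 0).astype(np.int64)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(hist, (channel, y, x), 1.0)
    return hist

# Four hypothetical events on a 3x4 sensor: pixel columns, rows, polarities.
x = np.array([0, 1, 1, 3])
y = np.array([0, 2, 2, 1])
p = np.array([1, -1, -1, 1])
hist = event_histogram(x, y, p, height=3, width=4)
```

Timestamps are ignored here, which is exactly the kind of information loss that the distortion measure derived next is meant to quantify.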

Gromov-Wasserstein Discrepancy
Converting raw events to a representation invariably distorts the events by removing important distinguishing features from the stream. We would like to measure this distortion since we expect it correlates strongly with a neural network's ability to extract features from these events. In what follows, we will derive a measure of this distortion rate based on the GWD.
We show an overview of the GWD in Fig. 2. We start by measuring the similarity between a set of events and their representation by building a soft correspondence between events $e_i$ and features $f_{x_j}$, which we denote as $T_{ij}$. This transport plan effectively moves each event to a corresponding feature, thereby distorting the original set and destroying information. Importantly, such a plan can be interpreted as follows: we transport the events with a total weight of 1, i.e. a per-event weight of $1/N_e$, to the output features, which also need to receive a total weight of 1, i.e. a per-feature weight of $1/N_f$. By this construction, $T_{ij}$ needs to satisfy $\sum_i T_{ij} = 1/N_f$ and $\sum_j T_{ij} = 1/N_e$. This means that a total weight of $1/N_e$ moves from each event, and each feature receives a total weight of $1/N_f$.
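The two marginal constraints on the transport plan can be checked numerically. The sketch below is illustrative only: it uses the uniform plan (every event spreads its weight equally over all features), which is always feasible but generally not optimal, and the helper names are hypothetical.

```python
import numpy as np

def uniform_plan(n_events, n_features):
    """Feasible (not optimal) plan: each entry carries weight 1 / (N_e * N_f)."""
    return np.full((n_events, n_features), 1.0 / (n_events * n_features))

def satisfies_marginals(T, atol=1e-12):
    """Check that each event sends 1/N_e and each feature receives 1/N_f."""
    n_events, n_features = T.shape
    rows_ok = np.allclose(T.sum(axis=1), 1.0 / n_events, atol=atol)
    cols_ok = np.allclose(T.sum(axis=0), 1.0 / n_features, atol=atol)
    return rows_ok and cols_ok

T = uniform_plan(5, 3)
feasible = satisfies_marginals(T)
```

Any plan returned by an optimal transport solver should pass the same check, which makes it a cheap sanity test when experimenting with different solvers.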
In the next step, we measure the distortion introduced by this transport plan by considering pairwise similarities of input events and features. Let $e_i, e_k \in E$ be a pair of events and $f_{x_j}, f_{x_l} \in F$ a pair of features, with similarity scores $C^e_{ik} \doteq C^e(e_i, e_k)$ and $C^f_{jl} \doteq C^f(f_{x_j}, f_{x_l})$ between events and features respectively. Next, consider how the transport plan $T$ acts on these pairs: generally, a weight $T_{ij}$ is moved from event $e_i$ to feature $f_{x_j}$. Similarly, the weight $T_{kl}$ is moved from event $e_k$ to feature $f_{x_l}$. Ideally, such a transport plan should preserve the similarity between pairs of source events and target features, and thus the difference in similarity scores between pairs $(i, k)$ and $(j, l)$ can be used as a measure of distortion. For each event pair and feature pair, we define the distortion as

$$L(C^e_{ik}, C^f_{jl})\, T_{ij} T_{kl},$$

where $L$ denotes some disparity measurement between $C^e_{ik}$ and $C^f_{jl}$. Summing over all possible pairs of events and features we thus arrive at the transportation cost:

$$\sum_{i,k} \sum_{j,l} L(C^e_{ik}, C^f_{jl})\, T_{ij} T_{kl}.$$

Minimizing over transport plans, we arrive at the GWD:

$$\mathrm{GWD}(E, F) = \min_{T} \sum_{i,k} \sum_{j,l} L(C^e_{ik}, C^f_{jl})\, T_{ij} T_{kl}, \tag{5}$$

which can be optimized efficiently using [39]. Since the above metric is defined for a single time window of events, we may average it over multiple samples to find:

$$\mathrm{GWD}_N = \frac{1}{N} \sum_{n=1}^{N} \mathrm{GWD}(E_n, F_n). \tag{6}$$

$\mathrm{GWD}_N$ can be interpreted as an average distortion rate from raw events to event representations. In Sec. 4.2, we show that $\mathrm{GWD}_N$ correlates with a NN's performance with that representation across network backbones, datasets, and event representations. It is also efficient to compute, taking 9 seconds for 50,000 events. In what follows, $\mathrm{GWD}_N$ will denote the average over $N$ samples, while GWD denotes the average over the whole validation set.

Similarity scores and distortion function: As similarity scores, we choose Gaussian radial basis functions [45] for both events and image features. In detail, both kernels use data-dependent variances and a bandwidth parameter $h = 0.7$. The selection of data-dependent variances normalizes the distances between pairs of events and features such that the similarity score is robust to the dimensionality of the data and the number of samples in the source and target domain. These details are discussed in [7,45,36]. While more complex similarity scores could be used, this simple function already achieved good results.
As the distortion function, we chose the KL-Divergence, i.e.

$$L(C^e_{ik}, C^f_{jl}) = C^e_{ik} \log \frac{C^e_{ik}}{C^f_{jl}}. \tag{9}$$

As a result, the optimization already rejects terms for which $C^e_{ik} \approx 0$, i.e. pairs of events that are far apart. We found that this property was also beneficial in improving the convergence of the optimization problem in Eq. (5).

Improving Convergence of Eq. (5): We found that three features improved convergence and sped up the optimization: (i) normalization of the event coordinates and timestamps by the sensor size and time window respectively, (ii) concatenation of the normalized pixel position to the image features, and (iii) sparsification of image features. Both (i) and (ii) make the optimization more numerically stable. In fact, without concatenating position information to image features, randomly pixel-wise shuffled event representations would retain the same GWD, although intuitively, neural networks would have a harder time learning from such representations since they typically process nearby features together. Thus, reintroducing the position removes this ambiguity and improves the convergence. Finally, (iii) removes image features with $\|f_x\| = 0$ since these correspond to pixels where no events were triggered. This step significantly sped up computation by reducing the size of the pair-wise similarity score matrix $C^f$, with a small impact on convergence.
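To make the transportation cost concrete, the sketch below evaluates it for a fixed plan on a tiny example. It rests on several assumptions that are not confirmed by the text: the Gaussian kernel is normalized by the mean squared pairwise distance as a stand-in for the data-dependent variances, the KL-style distortion is taken as $a \log(a/b)$, and all function names are hypothetical. It does not perform the minimization over transport plans, for which the paper relies on the solver of [39], and the dense four-index tensor only fits in memory for very small sets.

```python
import numpy as np

def gaussian_similarity(points, h=0.7):
    """Pairwise Gaussian RBF similarities between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Assumption: mean squared distance as the data-dependent scale.
    scale = sq_dist.mean() + 1e-12
    return np.exp(-sq_dist / (2.0 * h ** 2 * scale))

def kl_distortion(a, b, eps=1e-12):
    """Assumed KL-style distortion a * log(a / b); vanishes as a -> 0."""
    return a * np.log((a + eps) / (b + eps))

def transport_cost(C_e, C_f, T):
    """Sum over i, k, j, l of L(C_e[i, k], C_f[j, l]) * T[i, j] * T[k, l]."""
    L = kl_distortion(C_e[:, :, None, None], C_f[None, None, :, :])
    return np.einsum("ikjl,ij,kl->", L, T, T)

rng = np.random.default_rng(0)
events = rng.random((6, 3))           # six toy events (x, y, t), already normalized
C_e = gaussian_similarity(events)

# Mapping a set onto itself with the identity plan introduces no distortion.
cost_identity = transport_cost(C_e, C_e, np.eye(6) / 6)
# The uniform plan is feasible as well, but mixes all pairs together.
cost_uniform = transport_cost(C_e, C_e, np.full((6, 6), 1.0 / 36))
```

The identity case is a useful sanity check: when source and target similarities coincide and each point is matched to itself, every term has $\log 1 = 0$ and the cost is zero.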

Optimizing over Event Representations
With a fast method to measure the effectiveness of an event representation, we can now search for the optimal representation by minimizing Eq. (6) over a space of possible representations with a set of hyperparameters $p$.
To simplify this optimization, we first describe a very general parametrization, which defines a large family of event representations, extending the family described in [13] in a few ways. These hyperparameters are a small set of categorical variables which can subsequently be optimized using Bayesian optimization.

Parametrization of Event Representations
We illustrate the hyperparameters in Fig. 3. In general, we assume an event representation comprises a stack of features, indexed by $c$ (right side), each of which is derived from (i) a specific time window $w_c$ of events, (ii) a specific measurement $m_c$ of events such as polarity or timestamps, and (iii) a specific aggregation $a_c$, such as summation or averaging. We write such a representation as $N_c$ concatenated feature maps:

$$R = [R_0 \,\|\, R_1 \,\|\, \dots \,\|\, R_{N_c-1}], \quad \text{with} \quad R_c = a_c(m_c(w_c(E))).$$

Here $w_c$ is a windowing function which selects events within an interval, $m_c$ is the measurement function which selects an event feature, and $a_c$ is the aggregation function, which aggregates measurements into a single feature map. Note that in our formulation each channel can have an independent set of parameters, unlike in [13], which assumes a shared aggregation and measurement function for all channels. This makes our parametrization substantially more expressive than the one in [13]. The number of (non-learnable) representations is $(|\mathcal{A}||\mathcal{M}||\mathcal{W}|)^{N_c} \approx 3.21 \times 10^{27}$, since each channel can be configured independently. Moreover, while $a_c$ and $m_c$ were already discussed in [13], the windowing function is a more general concept, illustrated in Fig. 3 (left). While these windows can be non-overlapping $(w_3, w_4, w_2)$, as for Voxel Grids [53], they can also be overlapping $(w_0, w_1, w_2)$, as in Mixed-Density Event Stacks [34], or describe windows of a constant event count or constant time [50]. In this work we allow each feature channel to select from a basis of windows $\mathcal{W}$, which can combine all types of windows, unifying these concepts. In summary, a representation is parametrized by:

$$p = \{(a_c, m_c, w_c)\}_{c=0}^{N_c-1}, \quad a_c \in \mathcal{A},\; m_c \in \mathcal{M},\; w_c \in \mathcal{W}.$$

For the set of aggregation functions we select $\mathcal{A} = \{\max, \text{sum}, \text{mean}, \text{variance}\}$ and for the measurement functions $\mathcal{M} = \{t_+, t_-, t, p, c_+, c_-, c\}$, which are most commonly used. Here $c$, $p$, and $t$ denote event count (discarding polarity), polarity, and timestamp. The subscripts $+/-$ select only positive or negative events. For the basis of time windows, we select three equally spaced, non-overlapping windows from [53] and four overlapping windows from [34], including the global window. These are illustrated in Fig. 7 (right).
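The size of this search space follows directly from the sets above: four aggregations, seven measurements, and seven windows per channel, chosen independently for twelve channels. The short sketch below reproduces the count; the string labels (in particular the window names `w0` to `w6`) are placeholders chosen for this example.

```python
from itertools import product

AGGREGATIONS = ("max", "sum", "mean", "variance")
MEASUREMENTS = ("t+", "t-", "t", "p", "c+", "c-", "c")
# Placeholder names for the three non-overlapping and four overlapping windows.
WINDOWS = tuple(f"w{i}" for i in range(7))
N_CHANNELS = 12

# Every channel independently picks one (aggregation, measurement, window) triple.
per_channel = list(product(AGGREGATIONS, MEASUREMENTS, WINDOWS))
n_representations = len(per_channel) ** N_CHANNELS

print(len(per_channel))            # 196 triples per channel
print(f"{n_representations:.2e}")  # about 3.21e+27 representations
```

A space of this size rules out exhaustive evaluation, even with a fast metric, which motivates the stage-wise search described next.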

Optimization Procedure
The aforementioned parametrization generates redundant combinations that can be obtained by permuting channels or selecting the same feature for different channels. To address this issue and expedite convergence, we propose a stage-wise optimization procedure. Initially, we start with a volume consisting of zeros with $N_c$ channels and optimize over $a_0$, $w_0$, and $m_0$ to fill in the first channel. Next, we optimize the feature for the second channel while keeping the first fixed. With this iterative process, we incrementally fill up the representation, avoiding the selection of redundant representations and resulting in faster optimization. At each stage, we use Gryffin [18], a specialized Bayesian optimizer for categorical variables.
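The stage-wise idea can be sketched in a few lines. This is not the authors' procedure: random sampling of candidates stands in for the Gryffin Bayesian optimizer, and `toy_score` is a made-up objective that merely rewards distinct channels, standing in for the GWD of the partially filled representation. All names are hypothetical.

```python
import random
from itertools import product

def stagewise_search(candidates, score_fn, n_channels, n_trials, seed=0):
    """Fill channels one at a time, keeping earlier channels fixed.

    At each stage, `n_trials` candidate triples are sampled (a stand-in for
    a Bayesian optimizer) and the one with the lowest score is kept.
    """
    rng = random.Random(seed)
    chosen = []
    for _ in range(n_channels):
        trials = rng.sample(candidates, min(n_trials, len(candidates)))
        best = min(trials, key=lambda cand: score_fn(chosen + [cand]))
        chosen.append(best)
    return chosen

def toy_score(channels):
    """Made-up objective: lower is better, duplicates bring no improvement."""
    return -len(set(channels))

candidates = list(product(("sum", "mean"), ("t", "p", "c"), ("w0", "w1")))
result = stagewise_search(candidates, toy_score, n_channels=4,
                          n_trials=len(candidates))
```

Because each stage only searches over a single triple, the cost grows linearly with the number of channels instead of exponentially, at the price of a greedy, possibly suboptimal, solution.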

Experiments
In Sec. 4.1, we first connect the GWD and event distortion, demonstrating its behavior on several toy examples. Next, in Sec. 4.2, we showcase the correlation between the metric and task performance across multiple tasks, datasets, backbones, and representations. Lastly, in Sec. 4.3, we show the outcome of Bayesian Optimization on the GWD and offer insights on the acquired event representation before comparing it against the state-of-the-art.

Figure 4: (top) … [53] and Mixed-Density Event Stacks [34] with an increasing number of channels. (bottom) Effect of applying Gaussian blur with different blur kernels to the event representation.

Toy Example
As explained in Sec. 3.2, the GWD measures the distortion rate from raw events to event representation. Two experiments were designed to test this claim: (1) analyzing the behavior of the metric when varying the number of bins in Voxel Grid [53] and Mixed-Density Event Stack (MDES) [34] representations, and (2) blurring the event representation with increasing standard deviations before measuring the GWD. We perform experiments on the validation set of Gen1 [9] and report the results in Fig. 4.
Fig. 4 (top) confirms that the GWD decreases as the number of channels increases in both Voxel Grids and Mixed-Density Event Stacks, which aligns with our intuition that using more channels preserves more information from raw events, resulting in a lower distortion rate. Similarly, the bottom part illustrates that with a growing blur radius, more edges in the event representation are removed, leading to an increase in the GWD. This verifies that the GWD measures the distortion from raw events to event representations.

Object Detection
Next, we investigate the relationship between GWD and the task performance of a NN trained for object detection. We choose two widely used object detection datasets, the Gen1 [9] and 1 Mpx [38] Automotive Detection Datasets, which both deal with event streams featuring labeled bounding boxes for pedestrians, cars, and other vehicles. While the former has a resolution of 304 × 240, the latter has a resolution of 1280 × 720. We report the GWD computed over the validation set of each dataset and then train an off-the-shelf object detection framework based on YOLOv6 [24], pre-trained on the Microsoft Common Objects in Context (MS-COCO) [28], on different input representations as in [13]. To accommodate the varying number of channels, we replace the 3-channel input convolution with an $N_c$-channel input convolution, where $N_c$ represents the number of channels in the representation. To show the generality of the result, we also vary the detection backbone between ResNet-50 [17], EfficientRep [24] and Swin Transformer V2 [30], and report results for each.

Representations: We test a range of common representations listed below. We compute 12 channels for each representation, except for the 2D Histogram, which has 2.

Voxel Grid: Here, the event time window is split into $N_c$ equal, non-overlapping time windows, and then events in the same time window are aggregated by summing their polarity on a per-pixel basis using bilinear voting [53].

Mixed-Density Event Stack (MDES): For each channel $c$, this representation selects the most recent $N_e / 2^c$ events, where $N_e$ is the number of events in the time window. For each window, it aggregates the polarity at that pixel [34].

Event Histograms: Events in the time window are split by polarity and then summed into two channels [31].

Time Surface: We convert events into time surfaces from [22] with a decay constant of $\tau = 5$ ms. We then sample it at $N_c / 2 = 6$ equally spaced timestamps, once for positive and once for negative events, resulting in $N_c = 12$ channels.

TORE: The Time-Ordered Recent Event Volumes store event timestamps in per-pixel queues. We use queues with capacity $N_c / 2 = 6$, one for positive and one for negative events, and concatenate them to $N_c = 12$ channels.

Learned representation: Event Spike Tensor (EST) [13] is a learnable event representation that employs a Quantization Layer featuring a trainable kernel. This approach enables the model to effectively transform raw events, optimizing their performance for a given task.

Training Details: We adopt the training procedure from YOLOv6 [24]. For each backbone, representation, and dataset, we train for 100 epochs, using Stochastic Gradient Descent with Nesterov momentum increasing from 0.5 to 0.83 over the first 2 epochs. We use a batch size of 32 and a cosine learning rate schedule, starting at 0.00323 and ending at 0.000387 after 100 epochs. We adopt the classification and box regression losses in [24].
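The halving windows of the Mixed-Density Event Stack can be sketched as follows. This is a simplified illustration rather than the implementation of [34]: it assumes events are sorted by increasing timestamp, uses integer division for $N_e / 2^c$, and sums polarities per pixel; the function name and toy data are hypothetical.

```python
import numpy as np

def mdes_polarity_stack(x, y, p, height, width, n_channels):
    """Channel c sums the polarity of the most recent N_e // 2**c events.

    Assumes the event arrays are sorted by increasing timestamp.
    """
    n_events = len(x)
    stack = np.zeros((n_channels, height, width))
    for c in range(n_channels):
        n_recent = n_events // (2 ** c)
        if n_recent == 0:
            continue  # window is empty once 2**c exceeds the event count
        recent = slice(n_events - n_recent, n_events)
        # np.add.at handles repeated pixel indices within the window.
        np.add.at(stack[c], (y[recent], x[recent]), p[recent].astype(float))
    return stack

# Eight hypothetical positive events, all at pixel (0, 0).
x = np.zeros(8, dtype=np.int64)
y = np.zeros(8, dtype=np.int64)
p = np.ones(8, dtype=np.int64)
stack = mdes_polarity_stack(x, y, p, height=2, width=2, n_channels=4)
print(stack[:, 0, 0])  # [8. 4. 2. 1.]
```

Because the windows are defined by event count rather than duration, later channels always contain events, even when the scene dynamics change the event rate.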
Results: Fig. 5 summarizes the results of the above experiments. For both datasets, Gen1 and 1 Mpx, and all backbones, there is a clear correlation between the GWD and the task performance, i.e. task performance increases as GWD decreases. This conclusion holds for all three network backbones. In particular, MDES with a SwinV2 detection backbone consistently achieves the highest mAP with 0.43 on Gen1 and 0.39 on 1 Mpx. It also consistently achieves the lowest GWD with 0.38 on Gen1 and 0.40 on 1 Mpx. Utilizing the learnable EST in YOLOv6 with the SwinV2 backbone, we achieved a 45.31 mAP score on the Gen1 validation set and a GWD score of 0.3552, positioning EST between MDES and ERGO-12 in Fig. 1b, confirming the expected ranking. We also see that the SwinV2 backbone outperforms other backbones on both datasets for all tested representations. We conclude that while the representation affects task performance, the neural network also has an influence. However, we see that the overall ranking of the representations is preserved.

Using Fewer Samples: While in the previous section we reported the GWD over the validation set of the Gen1 and the 1 Mpx datasets, averaging this metric over such a large dataset still incurs a high computational cost and would make such a metric infeasible for optimization. Therefore, here we investigate if we can use smaller sample sizes to speed up the computation of GWD. In Fig. 6, we show the GWD for the representations in Sec. 4.2 while varying the number of samples. We see that as the sample number decreases, the mean values of the metric change, but the ranking between representations is still preserved reliably after around 100 samples, and thus we use $\mathrm{GWD}_{100}$ to optimize over representations. Below this number, the ranking of representations can fluctuate. However, we found that this happens due to a bad convergence of Eq. (5).
Timing Results: We time the computation of the GWD over 100 samples of the Gen1 validation set, where each sample comprises 50,000 events. We run our experiment on an AMD EPYC 7702 32-core server-grade CPU with 32 GB RAM and achieve a runtime of 15 minutes. Note that computing the GWD does not require a GPU. By contrast, training the models in Fig. 5 requires 2 days on a single Tesla V100 GPU with 32 GB of memory, making GWD computation 192 times faster.
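The reported speedup can be verified with a short calculation that combines the per-sample cost stated in Sec. 3.2 (9 seconds for 50,000 events) with the figures above. The variable names are chosen for this example only.

```python
seconds_per_sample = 9        # GWD runtime for one sample of 50,000 events
n_samples = 100               # samples used for GWD_100
gwd_minutes = seconds_per_sample * n_samples / 60

training_minutes = 2 * 24 * 60  # two days of training on one GPU
speedup = training_minutes / gwd_minutes

print(gwd_minutes)  # 15.0
print(speedup)      # 192.0
```

The comparison is between CPU time for the metric and GPU time for training, so it reflects wall-clock time rather than equal hardware budgets.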

Optimization of Event Representations
Here we report the results of optimizing the event representation according to the procedure in Sec. 3.3. We optimize $N_c = 12$ channels, and at each optimization cycle for channel $c$, the Bayesian Optimizer selects 100 configurations and then keeps the best-performing configuration.
We show the result of this optimization procedure in Fig. 7. The left shows how the GWD decreases as new channels are added to the representation. We see that after 2 channels, our method outperforms 2D Histograms, after 6 channels we outperform Time Surfaces, after 7 channels we outperform Voxel Grids / TORE, after 9 channels we outperform Mixed-Density Event Stacks and finally, after 12 channels, we achieve a $\mathrm{GWD}_{100}$ of 0.47. On the right, we show the different windows (columns) and measurement functions (rows) that are selected. We do not show the order since our representation is unique up to a random channel permutation. However, in the appendix, we show which features are selected at each stage. We see that all windows and all measurement functions are selected at least once, showing how our representation tries to diversify as much as possible. Moreover, timestamp-based measurements often show multiple aggregations, which we argue are necessary to replicate their complex continuous signal.
Comparison with State-of-the-Art: Here we compare our method against state-of-the-art recurrent and feed-forward methods on the Gen1 and 1 Mpx test sets. We summarize the results in Tab. 1. Recurrent methods in[…]

Figure 7: Gromov-Wasserstein Discrepancy for 100 samples (left). At each channel, a Bayesian optimizer selects the next best hyperparameter triple. The chosen hyperparameters are broken down by window and measurement function (right).
We compare these methods to our best-performing YOLOv6 detector from Sec. 4.2 with a SwinV2 transformer backbone trained on various input representations. These include the ones analyzed in Sec. 4.2, i.e. the 2D Histogram, Time Surface, TORE, Voxel Grid, and MDES, as well as the optimized representation ERGO-12. Note that these methods do not include data augmentation. We also trained our model with Mixup and Mosaic augmentation from [24], which is indicated by an asterisk * in Tab. 1.

Results
On the Gen1 dataset, we see that YOLOv6 with the SwinV2 backbone, ERGO-12 input, and data augmentation outperforms all state-of-the-art methods by up to 2.9% mAP by achieving an mAP of 0.504. The runner-up is RVT-B, which uses a recurrent vision transformer. Even without data augmentation, our network with ERGO-12 achieves 0.493, which improves the mAP by 2.1% compared to RVT-B. Compared to the other feed-forward methods based on YOLOv6, ERGO-12 achieves a 0.3 higher mAP, the next best being YOLOv6 with Time Surface [22] with 0.490. Interestingly, as indicated in Fig. 1 (b), the difference in mAP performance between ERGO-12 and Time Surface is 1.8 mAP on the validation set of the Gen1 dataset, which is substantially larger than on the test set.
On 1 Mpx, the best-performing method is ASTMNet with an mAP of 0.483, followed by RVT-B with 0.474. Among the feed-forward methods, YOLOv6 with ERGO-12 and data augmentation achieves the highest score with 0.406, outperforming the runner-up, YOLOv6 with Time Surfaces, by 2.3% mAP. Even without augmentation, ERGO-12 achieves a 1.7% higher score than YOLOv6 with Time Surfaces. Compared to state-of-the-art feed-forward methods, ERGO-12 with data augmentation achieves a 6.0% higher score, the runner-up being Events+YOLOv3 with 0.346. On this dataset, recurrent methods are known to perform better since many sequences include stops or slow-motion scenarios. This is challenging for feed-forward methods since they are not able to maintain long-term memory. In general, the improvement of ERGO-12 on 1 Mpx is higher compared to Gen1.

Object Classification
As an additional task, we investigate the relationship between GWD and the object classification task performance using the ResNet-34 [17] backbone classifier, pre-trained on ImageNet [44], while changing the input convolution to be compatible with $N_c = 12$ channel representations. We use the large-scale neuromorphic variant of the ImageNet dataset [44], captured from an event camera that observes monitor-displayed images from ImageNet. Event sequences were recorded using the 480 × 640 resolution Samsung DVS Gen3 event camera [49].
We report the GWD computed over the validation set of the Gen1 dataset. Subsequently, we train the model on Mini N-ImageNet. We opt for GWD on the Gen1 dataset to show the generalization capability of GWD across diverse datasets.
Representations: Excluding the learned representation, we evaluate the same representations as in object detection. Each representation is computed over 12 channels, except for the 2D Histogram, which uses two channels.
Training Details: Adopting the methodology from the N-ImageNet study [21], all inputs are resized to 224 × 224, reducing GPU memory usage and inference time. Training is initialized from scratch with a learning rate set at $3 \cdot 10^{-4}$ and spans 100 epochs. The Adam optimizer with a Nesterov momentum of 0.9 and a weight decay of 0.0001 is used alongside a batch size of 64.
Results: Table 2 summarizes the results of the experiments.
There is a clear correlation between the GWD (even on different datasets, in this case, Gen1) and the task performance, i.e. task performance increases as GWD decreases. This conclusion holds for all tested representations on the Mini N-ImageNet dataset. We obtained the following validation set accuracies: 2D Histogram (46.10%), Time Surface (57.58%), TORE (54.64%), Voxel Grid (52.40%), MDES (53.30%), ERGO-12 (61.4%). Their ranking is consistent with the GWD ranking in Fig. 5. Therefore, the GWD can also be used for classification.

Conclusion
State-of-the-art event-based deep learning methods typically need to convert raw events into dense input representations before they can be processed by standard networks. However, selecting this representation is very expensive since it requires training a separate neural network for each representation and comparing the validation scores. In this work, we circumvent this bottleneck by measuring the quality of event representations with the Gromov-Wasserstein Discrepancy (GWD), which is about 200 times faster to compute. We validated extensively on multiple tasks, datasets and neural network backbones that the performance of neural networks trained with a representation correlates with its GWD. We then used this metric to, for the first time, optimize over a large family of representations, revealing a new, powerful representation, ERGO-12. With it, we outperform state-of-the-art representations by 1.9 mAP on the 1 Mpx dataset and 0.3 mAP on the Gen1 dataset, two object detection benchmarks. We also exceed existing representations by 3.8% on the task of classification. Moreover, we even outperform the state-of-the-art by 2.1 mAP on Gen1 and state-of-the-art feed-forward methods by 6.0 mAP on the 1 Mpx dataset. This work thus opens a new unexplored field of explicit representation optimization that will push the limits of event-based learning methods.

Appendix
Here we add additional qualitative results and proofs to support the work in the main manuscript. We will refer to sections, equations, figures, and tables in the main manuscript with the prefix "M-", while referring to those in the appendix with "A-". We start by providing additional details about ERGO-12 and GWD in Sec. A-7.1 and include two proofs regarding the robustness of the Gromov-Wasserstein Discrepancy (GWD) in Sec. A-7.2. Afterward, we provide more results with fewer optimized channels in Sec. A-7.3. Finally, we show the qualitative results of our method on the Gen1 and 1 Mpx datasets in Sec. A-7.4.

Additional Details on ERGO-12 and GWD
ERGO-12 details: We provide more details of our optimized representation in Fig. A-9. The top sub-figure shows the optimized channels in more detail than Figure M-7. At each new step, there is a decrease in GWD, which demonstrates that additional channels reduce the distance. We calculated the GWD using 100 samples from the Gen1 [9] validation dataset and plotted the results for chosen representations as dashed horizontal lines. The blue line shows the GWD reached by the optimization process after each channel addition. We can observe that, for example, our optimized representation outperforms the Voxel Grid after seven channels and MDES after nine channels. Furthermore, we found that the optimization process initially selected the time function, which capitalizes on the high temporal resolution of event cameras to minimize GWD. Subsequently, counts and polarity were used.
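The channel-by-channel procedure described above can be sketched as a greedy forward selection. This is only an illustrative sketch: the exhaustive search over candidates, the function name `greedy_channel_selection`, and the `toy_score` objective are assumptions made for the example, and `toy_score` is a hypothetical stand-in, not the GWD.

```python
from itertools import product


def greedy_channel_selection(windows, measurements, aggregations, score_fn, num_channels):
    """Greedy forward selection of channels.

    At each step, add the (window, measurement, aggregation) triple that gives
    the lowest score for the enlarged channel list, and record that score.
    """
    candidates = list(product(windows, measurements, aggregations))
    selected, history = [], []
    for _ in range(num_channels):
        best = min(candidates, key=lambda c: score_fn(selected + [c]))
        selected.append(best)
        history.append(score_fn(selected))
    return selected, history


def toy_score(channels):
    """Hypothetical stand-in for the GWD: lower when channels are more diverse."""
    return 1.0 / (1.0 + len(set(channels)))


# Illustrative option names only; the actual option sets are defined in the main text.
selected, history = greedy_channel_selection(
    windows=["w0", "w1"],
    measurements=["timestamp", "polarity"],
    aggregations=["max", "mean", "sum", "variance"],
    score_fn=toy_score,
    num_channels=3,
)
print(selected)
print(history)
```

Recording the score after every addition mirrors the per-step curve described in the text; with a real discrepancy as `score_fn`, the same loop would produce one value per added channel.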
In the bottom sub-figure of Fig. A-9, we visualize the channels of ERGO-12 (our optimized representation after 12 channels). For visualization, we min-max normalized the channels to the range 0-255. Each channel emphasizes different parts of the image. For instance, the last channel highlights the left edges of the pedestrian, while the seventh channel emphasizes the right part. Our optimization process enables us to capture as much information as possible at different scales and resolutions (spatial and temporal), which is highly advantageous when training with common object detectors. The optimized representation achieves an mAP of over 50% on the Gen1 dataset, and it yields the first non-recurrent neural network architecture that scores over 40% mAP on the 1 Mpx [38] dataset.
Mathematical properties of the GWD: The GWD introduced in [39] and used in this work does not satisfy all axioms of a distance measure and is thus not a metric. It is a generalization of the GW Distance that is specifically designed for spaces where an L2 metric comparison is not suitable, as in this work, where we compare raw events and representations. [39] showed that using the KL-divergence (Eq. 9) with the kernel in Eq. 7 can effectively discard outliers, which we leverage in our work. Due to this more general formalism, the GW Discrepancy satisfies neither symmetry nor the triangle inequality (due to the KL-divergence in Eq. 9), but it ensures non-negativity and is 0 only for equal sets. Absolute scalability is also not satisfied (see Eq. 7), but it is not a common property of distance measures.
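The loss of symmetry caused by a KL-type loss can be illustrated numerically. The sketch below assumes an element-wise generalized KL divergence of the form a log(a/b) - a + b, which is one common choice for such losses; the exact form of Eq. 9 is not reproduced in this section, so the function `kl_loss` should be read as an assumption.

```python
import math


def kl_loss(a, b):
    """Generalized KL divergence between two positive scalars (assumed form):
    a * log(a / b) - a + b."""
    return a * math.log(a / b) - a + b


forward = kl_loss(0.2, 0.8)
backward = kl_loss(0.8, 0.2)

# Swapping the arguments changes the value, so the loss is not symmetric.
print(forward, backward)

# The loss is non-negative and vanishes when both arguments are equal.
print(kl_loss(0.5, 0.5))
```

Because a discrepancy built from such a term inherits this asymmetry, swapping the two compared sets can change its value, which is why it is called a discrepancy rather than a distance.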

Invariances of the GWD for Events
In this section, we go over some basic properties of the GWD for events. In particular, we show that it is invariant to affine feature transformations, concatenation with a constant, and duplication of the features. Throughout, we use the definition of the GWD for events from Eq. M-5, with the similarity matrices $C^e_{ik}$ and $C^f_{jl}$ defined in Eqs. M-7 and M-8.
Affine transformation: We expect that if we apply an affine transformation to the event representation, the score should not change, since the information in the representation remains distinctive. Moreover, we do not want the GWD to be sensitive to the scale of the features. Replacing the representation features with their affinely transformed counterparts changes only the similarity matrix $C^f_{jl}$. The norms and the data-dependent variances that enter this matrix are rescaled by the same factor, so the transformed similarity matrix equals $C^f_{jl}$ (Eq. 24), i.e., the similarity matrix does not change. The minimizer of Eq. M-5 thus also does not change, which means the GWD is invariant to this affine transformation. This invariance is only possible through the use of a data-dependent variance, which highlights its advantage.
Invariance to concatenation: In the case of concatenation, we consider the transformation $f_x \to [f_x \,\|\, c_x]$, where $[\cdot \,\|\, \cdot]$ denotes concatenation and $c_x \in \mathbb{R}^C$ denotes a pixel-dependent additional feature. Again, only the similarity matrix $C^f_{jl}$ is affected, and in particular only the norm and the variance, which gain additional terms that depend on $c_x$. We consider two special cases: $c_x = c$, a constant vector, and $c_x = f_x$, the same feature. In the first case, the additional terms become 0, meaning that the norm does not change, and thus the GWD stays the same. In the second case, the norm transforms as in the affine case, multiplying the squared norm and the variance by 2. For the same reasons as before, the GWD also stays the same. Extending this result to general $c_x$ remains future work.
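The three invariances can be checked numerically with a short sketch. It assumes a Gaussian-kernel similarity of the form exp(-||f_j - f_l||^2 / (2 var)) with a data-dependent variance taken as the mean squared distance of the features to their mean, and it assumes a scalar scale and shift for the affine map. Both are assumptions made for illustration and may differ in detail from Eq. M-8.

```python
import math
import random


def similarity_matrix(features):
    """Gaussian-kernel similarities with a data-dependent variance (assumed form).

    C[j][l] = exp(-||f_j - f_l||^2 / (2 * var)), where var is the mean squared
    distance of the feature vectors to their mean.
    """
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = sum(sum((f[d] - mean[d]) ** 2 for d in range(dim)) for f in features) / n

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [[math.exp(-sq_dist(fj, fl) / (2.0 * var)) for fl in features] for fj in features]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


random.seed(0)
feats = [[random.random() for _ in range(3)] for _ in range(5)]
base = similarity_matrix(feats)

variants = {
    "affine": [[2.5 * x - 1.0 for x in f] for f in feats],  # f -> a * f + b
    "constant": [f + [0.7, -0.3] for f in feats],           # [f || c], c constant
    "duplicate": [f + f for f in feats],                    # [f || f]
}

diffs = {name: max_abs_diff(base, similarity_matrix(v)) for name, v in variants.items()}
print(diffs)
```

In each case the squared distances and the variance change by the same factor (a squared scale, 1, and 2, respectively), so their ratio, and hence the similarity matrix under this assumed kernel, is unchanged up to floating-point error.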

Fewer optimized channels
Figure 8 depicts the correlation between the GWD (given on the x-axis, computed on the Gen1 validation dataset with 100 samples) and the task performance (mAP on the object detection task). Since the Swin V2 backbone outperforms all other backbones, it is the only backbone shown in the plot, and the 2D Histogram, which is the poorest-performing method, is omitted. The results demonstrate that our optimized representation with nine and seven channels performs better than MDES and Voxel Grid, respectively, which is consistent with the findings in Figure 9. Furthermore, we observe that the results on 1 Mpx correlate with the GWD computed on the Gen1 validation dataset with 100 samples, which highlights the generalization capabilities of our approach.
Figure 8: Correlation of the Gromov-Wasserstein Discrepancy with the mAP (higher is better) for object detection on the Gen1 [9] (top) and 1 Mpx [38] (bottom) datasets. ERGO-12, ERGO-9, and ERGO-7 represent our optimized representations with twelve, nine, and seven channels. The mAP is reported on the validation set, while the Gromov-Wasserstein Discrepancy is reported on the Gen1 validation dataset with 100 chosen samples.

Qualitative results
We present qualitative object detection results on the 1 Mpx and Gen1 datasets in Figs. 10 and 11, respectively. In some cases, our approach detects objects that are missing from the ground-truth annotations.

Figure 2: Overview of the Gromov-Wasserstein Discrepancy (GWD) between raw events and representations. Events $E$ are converted to event representations, i.e., a set of features $F$ at pixel locations $x$. The GWD is defined as the solution to an optimal transport problem which transports event pairs $(e_i, e_k)$ to feature pairs $(f_{x_j}, f_{x_l})$ via the transport plan $T_{ij}, T_{kl}$. If the transport plan preserves the similarities $C^e_{ik}$ and $C^f_{jl}$ between event and feature pairs, this results in a low GWD.

Figure 3: Overview of the hyperparameters we use to construct an event representation (right). For each channel $c$, we select one of several event time windows $w_c \in W$ (in color, left), measurement functions $m_c \in M$ (timestamp, polarity, positive timestamps, etc.), and aggregation functions $a_c \in A$ (max, mean, sum, variance), resulting in $3N_c$ parameters.

Figure 4: Validation of our metric on the Gen1 validation set. (top) Our metric for Voxel Grids [53] and Mixed-Density Event Stacks [34] with an increasing number of channels. (bottom) Effect of applying Gaussian blur with different blur kernels to the event representation.

Figure 5: Correlation of the Gromov-Wasserstein Discrepancy with the mAP (higher is better) for object detection on the Gen1 [9] (top) and 1 Mpx [38] (bottom) datasets. Note the spliced y-axis on 1 Mpx, due to the high GWD of the 2D Histogram.

Figure 6: Gromov-Wasserstein Discrepancy for a varying number of samples from the validation set.
This work was supported by the Swiss National Science Foundation through the National Centre of Competence in Research (NCCR) Robotics (grant number 51NF40 185543), and the European Research Council (ERC) under grant agreement No. 864042 (AGILE-FLIGHT).

Figure 9: Visualization of the channels of ERGO-12, min-max normalized to the range 0-255. The channels are ordered in row-major order, and the hyperparameters selected are shown in the top left of each subfigure.

Figure 10: Qualitative results of our method with ERGO-12 input on the 1 Mpx [38] dataset. (top row) Predictions, and (bottom row) ground truth. Note that sometimes our method detects objects that do not appear in the ground truth.

Figure 11: Qualitative results of our method with ERGO-12 input on the Gen1 [9] dataset. (top row) Predictions, and (bottom row) ground truth. Note that sometimes our method detects objects that do not appear in the ground truth.

Table 2: Mini N-ImageNet validation accuracy evaluated on various event representations.