LiteEdge: Lightweight Semantic Edge Detection Network

Scene parsing is a critical component for understanding complex scenes in applications such as autonomous driving. Semantic segmentation networks are typically reported for scene parsing, but semantic edge networks have also become of interest because of the sparseness of their output maps. This work presents an end-to-end trained lightweight deep semantic edge detection architecture called LiteEdge suitable for edge deployment. By utilizing hierarchical supervision and a new weighted multi-label loss function to balance different edge classes during training, LiteEdge predicts category-wise binary edges with high accuracy. Our LiteEdge network, with only ≈3M parameters, has a semantic edge prediction accuracy of 52.9% mean maximum F (MF) score on the Cityscapes dataset. This accuracy was evaluated on the network trained to produce a low-resolution edge map. The network can be quantized to 6-bit weights and 8-bit activations and shows only a 2% drop in the mean MF score. This quantization leads to a memory footprint savings of 6X for an edge device.


Introduction
Scene parsing and semantic segmentation [2] are fundamental problems in computer vision research. They can provide information about major landmarks in the surrounding environment, as well as objects of interest in the foreground. The resulting segmented output can be utilized for downstream applications, such as navigation [11] and indoor autonomous mobile robotics [3,22]. Meanwhile, the classical edge detection task has been shown beneficial for solving many computer vision tasks such as 3D reconstruction [29], 3D shape recovery [20], medical image processing [24], as well as semantic segmentation [4,6].

* Current affiliation: Donders Centre for Cognition, Radboud University, Nijmegen, The Netherlands. This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 899287.
Semantic edge detection combines edge detection and semantic classification, associating edge pixels with one or more object categories [13,33]. For every pixel in an image, the task determines whether the pixel lies on an edge and, if so, which class(es) it belongs to. It is modeled as a multi-label learning problem [33] since one pixel can belong to multiple classes; e.g., boundaries distinguishing vehicles on a road have the semantic edge classes 'vehicle' and 'road' at the same time. Moreover, for each class, the output map masks out all pixels which do not fall on the outline of this class, resulting in a highly sparse binary output. With rich semantic information in addition to the basic edge location information, semantic edges can be directly used or easily extended to solve many tasks such as refined object detection [12], image-based geopositioning [26], panoptic segmentation [17] and vision restoration applications in brain-machine interface research [9,27].
Previous work on semantic edge detection [1,16,33] uses heavy backbone networks. Because many of these networks have a large memory footprint and require high compute, they cannot be deployed for real-time performance on an edge device. This work proposes a lightweight, well-performing end-to-end semantic edge detection network, LiteEdge, which is suitable for deployment on the edge. We implement two ways of keeping the network light but accurate, as validated on the Cityscapes dataset. First, we start with the architecture of a state-of-the-art semantic segmentation network [10] as the backbone and further reduce the output image dimensions to lower the required compute. Second, to improve the accuracy, we incorporate an additional hierarchical supervision architecture to generate sparse edge segmentation maps for each class, as well as edge class weights in the loss computation. We show that our network, while maintaining good prediction quality, runs at reasonable inference speeds on edge devices. Moreover, by applying quantization-aware training (QAT) [19], we are able to compress the model size by 6 times without much loss in accuracy.
The main contributions of this work are as follows:
• A novel end-to-end semantic edge detection network architecture named LiteEdge, which gives competitive prediction accuracy on Cityscapes and uses only ≈3M parameters, resulting in a high throughput of 112 frames per second (FPS) on an Nvidia RTX 2080Ti GPU.
• A hierarchical supervision module using only binary edges that improves the semantic edge accuracy of LiteEdge by 12.0%.
• A new weighted multi-label loss function to address the class pixel imbalance problem. This loss takes into consideration the difference of segmentation pixel counts in different classes across the dataset, allowing for improved semantic edge learning.
• By adding one segmentation branch to LiteEdge, the new network (LiteEdgeSeg) outputs both semantic edge and segmentation results simultaneously while maintaining a similar inference speed to LiteEdge.

Related works
Edge detection Traditional edge detection algorithms such as Canny [5] use convolutional filters, which generate category-free edges. In addition, a wide variety of deep neural networks have been applied to edge detection, such as DeepContour [30] and holistically-nested edge detection (HED) [32]. However, these methods produce low-level edges, unlike semantic edge detection, which requires both geometric edges and semantic understanding.
Semantic edge detection The idea of semantic edge detection was first proposed in [25]. In the work of [13], the Semantic Boundaries Dataset (SBD) is introduced and an inverse detector is proposed. The inverse detector can detect category-aware semantic edges because it combines information from a bottom-up edge detector and a top-down object detector. Many semantic segmentation works can be loosely regarded as addressing the semantic edge detection task, since employing an edge detector on segmentation results can produce semantic edges. For example, the "High-for-Low" approach (HFL) [4] employs VGG to extract binary semantic edges and uses the features from semantic segmentation networks to obtain category labels. However, these methods are typically not end-to-end and need additional post-processing.
The CASENet [33] architecture is trained end-to-end using ResNet-101 [14] as the backbone. It combines both low- and high-level features with a designed multi-label loss function to produce semantic edges. In the work of STEAL [1], the authors propose a new thinning layer and loss, which can be added on top of any end-to-end edge detector. The DFF model [16] is the first work to use an adaptive weight fusion module to dynamically generate location-specific fusion weights that are conditioned on the image content. These location-specific weights are applied to fuse both the high-level and low-level response maps in order to predict the semantic edges with higher accuracy.
A few studies have also combined the task of semantic segmentation with semantic edge detection, such as JSENet [18], which simultaneously predicts the semantic segmentation point mask and the semantic edge point map. However, all the architectures mentioned are relatively heavy, either because of the large backbone model or the addition of multiple modules for increasing accuracy. Previous end-to-end trained semantic edge detection models reported inference speeds that are below 10 FPS on a GPU. We believe that LiteEdge is the first model that runs above 10 FPS and reaches more than 100 FPS on an Nvidia RTX 2080Ti GPU.

Network
Problem formulation. The main goal of semantic edge detection is to compute the semantic edge maps for each category. Formally, given an input image x ∈ R^(H×W×Ch) with C defined semantic categories, the model predicts C edge maps Ȳ = (ȳ_1, ȳ_2, ..., ȳ_C). The model is trained on input images and ground-truth semantic edge maps, each represented as Y = (y_1, y_2, ..., y_C). Each y_c and ȳ_c is a binary map in {0, 1}^(H×W): pixels in y_c or ȳ_c with value 1 belong to the c-th category, and those with value 0 do not.
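The multi-label nature of this formulation can be made concrete with a toy example: the ground truth is a stack of C binary maps, and a single pixel (e.g. one on a vehicle/road boundary) may be active in more than one of them. The shapes and class indices below are purely illustrative.

```python
import numpy as np

# Toy ground truth for a 4x4 image with C = 3 categories: a stack of
# C binary edge maps, one per class.
H, W, C = 4, 4, 3
Y = np.zeros((C, H, W), dtype=np.uint8)

# A pixel on the boundary between a vehicle and the road carries both
# labels at once (illustrative class indices).
Y[0, 2, 2] = 1   # class 0, e.g. 'road'
Y[1, 2, 2] = 1   # class 1, e.g. 'vehicle'

# The set of labels at that pixel: both class 0 and class 1.
labels_at_pixel = np.nonzero(Y[:, 2, 2])[0]
```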

Basic architecture
The basic architecture is a modified form of the semantic segmentation network LiteSeg [10]. The details of LiteSeg are shown in the light green box of Figure 1. It consists of three main parts: the backbone network, the Deeper Atrous Spatial Pyramid Pooling (DASPP) module, and a decoder module. The input and output of the DASPP module are concatenated using a short connection, and the output of the 3rd block of the encoder is connected to the decoder using a long connection.
The backbone network. The task accuracy and computational efficiency of this network are highly dependent on the chosen backbone network which can be any type of convolution network, such as VGG16 [31], MobileNet [15,28] and ResNet [14]. For real-time segmentation, MobileNet v2 [28] can provide a good trade-off between accuracy and computational efficiency in this architecture.
The DASPP module [10] (Figure 2) is based on the Atrous Spatial Pyramid Pooling (ASPP) module in DeepLabv3 [7]. It comprises a set of convolution blocks with increasing dilation rates, which helps the network to capture object features as well as useful image context at multiple scales. Compared with the ASPP module, the DASPP module has an additional standard 3×3 separable convolution after each 3×3 atrous separable convolution to refine the features, and a short residual connection to fuse the features from the input and output of the DASPP module. The number of filters for the convolution layers in ASPP is reduced from 256 to 96 to further improve the efficiency.
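A minimal PyTorch sketch of a DASPP-like block may help make the structure concrete: parallel atrous separable-convolution branches, each refined by an extra standard 3×3 separable convolution, fused with a short residual connection. The dilation rates and channel sizes here are illustrative assumptions, not the exact values used in LiteSeg/LiteEdge.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise 3x3 conv (optionally atrous) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class DASPPSketch(nn.Module):
    """DASPP-like block: parallel atrous branches, each refined by an extra
    standard 3x3 separable conv, fused with a short residual connection.
    Dilation rates are assumed for illustration."""
    def __init__(self, in_ch, branch_ch=96, rates=(3, 6, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(SeparableConv(in_ch, branch_ch, dilation=r),
                          SeparableConv(branch_ch, branch_ch))  # refinement conv
            for r in rates)
        self.project = nn.Conv2d(branch_ch * len(rates), in_ch, 1, bias=False)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.project(out)   # short residual connection
```

The short residual connection keeps the block's output at the input channel count, so it can be dropped into the encoder without changing surrounding shapes.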
The decoder module is modified from DeepLabv3+ [7]. It is a simple architecture that only contains four convolution blocks, one upsampling step and a concatenation step. As shown in Figure 1, the concatenation step (long residual connection) combines the information from the 3rd block in the backbone network (MobileNet v2) and the feature map after the upsampling step to further merge low-level and high-level features. Fusion of low-level features, which often include edges or color blobs from bottom layers, with high-level features, which often capture semantic information from top layers, is helpful for semantic edge prediction.

LiteEdge architecture
The LiteEdge model shown in Figure 1 uses the basic architecture described in Sec. 3.1 as the backbone network. It incorporates modified versions of the feature extraction and the hierarchical supervision modules of JSENet [18]. These modules are added to the side outputs of the backbone network. We also include a new fuse classification module on the output of the basic architecture.
Hierarchical supervision We use hierarchical supervision to regularize the side feature extraction during learning of the binary edges. This supervision is useful for two reasons. First, the context information learned in the bottom layers plays a vital role in semantic classification; it helps augment the top-layer classifications, so merging the information from side outputs can improve the MF score of edge prediction. Second, the spatial detail retained in the deeper layers is limited, and the network can lose detailed pixel-wise information at this stage. Thus, it is beneficial to give a supervision signal about semantic edges in the early stages of the network. Unlike JSENet [18], where the first three side outputs are supervised by binary edges and the last two side outputs are supervised by the semantic segmentation map, our LiteEdge model uses side feature extraction modules that are all supervised by binary edge labels.
Side feature extraction module As motivated by JSENet [18], features from the bottom layers help to improve the accuracy of the classification and segmentation tasks, but they need to be processed before being combined with the features from the main backbone. The architecture of the side output feature extraction module is shown in Figure 3a. These modules differ from those in JSENet: they include an additional 3×3 convolution block, and the deconvolution is replaced by an upsampling block.
Fuse classification module In order to fuse the feature maps from the side feature extraction modules and the backbone branch, we add a fuse classification module to the decoder of LiteEdge. This module (see Figure 3b) has a shared concatenation layer [33] and two convolution blocks. The shared concatenation layer fuses the feature maps from the side outputs and the features from the main backbone. The last layer of the fuse classification module is a 1×1 group convolution.
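One plausible sketch of the shared-concatenation-plus-group-convolution pattern (following the description in CASENet [33]) is shown below: the same side features are concatenated with each class-specific backbone activation, and a 1×1 group convolution (one group per class) produces each class's edge map. The channel layout is an assumption for illustration.

```python
import torch
import torch.nn as nn

class FuseClassifySketch(nn.Module):
    """Sketch of a fuse classification module: shared concatenation of
    side-output edge features with per-class backbone activations,
    followed by a 1x1 group convolution with one group per class.
    Channel sizes and layout are illustrative assumptions."""
    def __init__(self, num_classes, side_ch):
        super().__init__()
        self.num_classes = num_classes
        # Each class group sees 1 backbone class map + side_ch shared features.
        self.group_conv = nn.Conv2d(num_classes * (1 + side_ch),
                                    num_classes, 1, groups=num_classes)

    def forward(self, class_maps, side_feats):
        # class_maps: (N, C, H, W); side_feats: (N, side_ch, H, W).
        # Shared concatenation: the same side features are replicated
        # next to every class-specific activation.
        groups = [torch.cat([class_maps[:, c:c + 1], side_feats], dim=1)
                  for c in range(self.num_classes)]
        return self.group_conv(torch.cat(groups, dim=1))
```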

LiteEdgeSeg architecture
LiteEdge can be extended to a new model, LiteEdgeSeg, which has two branches with a shared encoder. The first branch predicts the semantic segmentation map, and the second branch outputs the semantic edge maps. The structure is shown in Figure 1. Apart from the fuse classification module, the decoder for the semantic segmentation branch has the same structure as the decoder for the semantic edge detection branch. Compared to LiteEdge, LiteEdgeSeg has only 6% more parameters.

Weighted multi-label loss function
Inspired by the class weighting scheme in [23], we propose a multi-label semantic edge loss term L_SE.
First, we define w_pos as the ratio of non-edge pixels to edge pixels. The value of w_pos is calculated per image. Second, we introduce w_cls as the weighted class frequency for each class. Algorithm 1 shows how to calculate w_cls. The proposed loss term L_SE is a modified version of the multi-label loss in [33]. Our proposed loss integrates the class weights w_cls ∈ R^C into the loss to overcome the class pixel count imbalance problem; each class's loss term is scaled by its weight in w_cls.
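Since the loss equation itself is not reproduced here, the following is a sketch of one plausible form, adapting the CASENet multi-label loss: per-class binary cross-entropy whose positive (edge) term is scaled by w_pos, the per-image non-edge/edge pixel ratio, with each class term further scaled by w_cls[c]. The exact formulation in LiteEdge may differ.

```python
import numpy as np

def weighted_multilabel_edge_loss(pred, target, w_cls, eps=1e-7):
    """Sketch of a weighted multi-label edge loss (assumed form, adapted
    from the CASENet multi-label loss). The positive term is scaled by
    w_pos (per-image ratio of non-edge to edge pixels) and each class
    term is scaled by w_cls[c].

    pred:   (C, H, W) edge probabilities in (0, 1)
    target: (C, H, W) binary ground-truth edge maps
    w_cls:  (C,) per-class weights (e.g. from Algorithm 1)
    """
    loss = 0.0
    for c in range(pred.shape[0]):
        y = target[c]
        p = np.clip(pred[c], eps, 1 - eps)
        n_pos = max(y.sum(), 1)
        w_pos = (y.size - y.sum()) / n_pos   # non-edge / edge pixel ratio
        bce = -(w_pos * y * np.log(p) + (1 - y) * np.log(1 - p))
        loss += w_cls[c] * bce.mean()
    return loss
```

Scaling the sparse positive term by w_pos keeps the few edge pixels from being drowned out by the overwhelming majority of non-edge pixels.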

Experiments
We compare our LiteEdge model with other state-of-the-art semantic edge detection models, including CASENet [33] and DFF [16], on the Cityscapes dataset. LiteEdge was trained with a 512x1024 input size instead of the original input size because of the lengthy training time. The corresponding output size is 128x256 to further reduce the needed compute. Both CASENet and DFF were trained on a 1024x2048 input image size. The resulting 1024x2048 output was downsampled by 4x on both axes for evaluation at a comparable resolution.

Dataset
We use the Cityscapes [8] dataset, which comprises complex and diverse stereo video sequences recorded in 50 different cities in Europe. There are 5000 images with high-quality pixel-level annotations and 20000 images with only coarse annotations. Of the 5000 finely annotated images, 2975 are in the training set, 500 in the validation set and 1525 in the test set. For evaluation, we use the validation set.

Evaluation protocol
The maximum F (MF) measure at the optimal dataset scale (ODS) [13] is the metric used for the evaluation of semantic edges. For each point on the precision-recall curve, the F-score is calculated as F = 2 · Precision · Recall / (Precision + Recall), and the MF is the maximum value of these F-scores. The ODS metric uses a fixed threshold value that gives the maximum F-score on the validation dataset. MF is computed for each class, and the mean MF is the average value of the MFs across all classes. The matching pixel distance tolerance is the maximum margin allowed for correct matches of edges to ground truth during evaluation. The distance tolerance is often measured as a proportion of the length of the image diagonal, which we set to 0.0035 in the experiments, as in [16].

1 https://github.com/anirudh-chakravarthy/CASENet
2 https://github.com/Lavender105/DFF
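The metric above can be sketched as a threshold sweep: binarize the predicted edge probabilities at each threshold, compute precision and recall against the ground truth, and keep the best F-score. This toy version matches pixels exactly (zero distance tolerance), whereas the benchmark allows matches within 0.0035 of the image diagonal.

```python
import numpy as np

def max_f_score(pred_prob, gt, thresholds=np.linspace(0.01, 0.99, 99)):
    """Toy sketch of the maximum-F (MF) measure for one class: sweep a
    fixed set of thresholds, compute precision/recall of the binarized
    edge map against the binary ground truth gt, and keep the best
    F-score. Exact pixel matching only (no distance tolerance)."""
    best = 0.0
    for t in thresholds:
        pred = pred_prob >= t
        tp = np.logical_and(pred, gt).sum()
        if tp == 0:
            continue
        precision = tp / pred.sum()
        recall = tp / gt.sum()
        f = 2 * precision * recall / (precision + recall)
        best = max(best, f)
    return best
```

In the ODS variant, a single threshold is chosen for the whole dataset rather than per image, so the sweep runs over accumulated dataset-level counts.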

Training setup
We used Stochastic Gradient Descent (SGD) with Nesterov momentum, a momentum value of 0.9 and a weight decay of 0.0005. We train the network for 100 epochs using a batch size of 4. A step learning rate policy is used: the initial learning rate, lr_1, is set to 0.01, and the learning rate changes every 5 epochs.

Table 1 compares the mean MF score, network parameter size, and runtime metrics, i.e., floating point operations (FLOPs) and frames per second (FPS), of our LiteEdge model with current state-of-the-art models.
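The optimizer configuration described above can be sketched in PyTorch as follows. The per-step decay factor is an assumption (the exact update rule is not reproduced here), and the one-layer model is a stand-in for LiteEdge.

```python
import torch

# Stand-in model for LiteEdge (illustrative only).
model = torch.nn.Conv2d(3, 8, 3)

# SGD with Nesterov momentum 0.9, weight decay 0.0005, initial lr 0.01,
# as described in the training setup.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True,
                            weight_decay=0.0005)

# The learning rate changes every 5 epochs; the decay factor 0.9 here
# is an assumed placeholder, not the paper's exact schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)
```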

Results
For LiteSeg [10] with Canny, we took the model evaluated at a 512x1024 input, applied Canny to its segmentation output, and then downsampled the result to 256x512. Our LiteEdge model uses at least 10X fewer parameters than DFF and CASENet. The inference time of our LiteEdge model is 22X faster than CASENet for the same input resolution. The mean MF score of LiteEdge falls behind the heavy-backbone CASENet by around 5%, while enabling deployment on edge devices. Directly adding a Canny edge detector to a semantic segmentation network dramatically decreases the processing speed; note that Canny edge detection runs on the CPU. Figure 4 shows the MF scores of LiteEdge on different classes. In many small object classes, where it is difficult to extract intact and clean edges, LiteEdge has the best MF (ODS) score. The comparison to other models is further described in Section 4.5.

LiteEdgeSeg. By adding one extra branch, we create a new model which can produce the semantic edges and the semantic segmentation map simultaneously. Pixel accuracy (PA), mean intersection over union (MIoU) and frequency-weighted intersection over union (FWIoU) are three common evaluation metrics for the semantic segmentation task. PA reports the percentage of pixels in an image that are correctly classified; IoU measures the percentage overlap between the target mask and the prediction mask, and MIoU is the average value across all classes; FWIoU weights each class's IoU by the class frequency and sums over all classes. The PA, MIoU and FWIoU of LiteEdgeSeg's semantic segmentation results are 93.80%, 66.94% and 88.83%, which are close to the performance of the LiteSeg model. The mean MF of the semantic edge evaluation for LiteEdgeSeg is 53.0%, which is as good as the LiteEdge model's performance (52.9%).
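The three segmentation metrics can all be computed from a single C×C confusion matrix, as in this short sketch:

```python
import numpy as np

def seg_metrics(conf):
    """Compute PA, MIoU and FWIoU from a C x C confusion matrix `conf`,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    conf = conf.astype(float)
    total = conf.sum()
    pa = np.trace(conf) / total                    # pixel accuracy
    # IoU_i = TP_i / (predicted_i + true_i - TP_i)
    iou = np.diag(conf) / (conf.sum(0) + conf.sum(1) - np.diag(conf))
    miou = np.nanmean(iou)                         # mean IoU over classes
    freq = conf.sum(1) / total                     # true-class pixel frequency
    fwiou = np.nansum(freq * iou)                  # frequency-weighted IoU
    return pa, miou, fwiou
```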

Ablation study
We performed an ablation study to determine the importance of the class weights and the hierarchical supervision module. The results are presented in Table 2. By using class-weighted labels (following the calculation method in Section 3.4) to train the network, we encourage the network to better predict edges of rare or small classes. Figure 4 gives the MF score of each class. The scores on small objects (such as traffic light, person, rider and bike) and on classes that are rare in the training set (such as truck and bus) are improved, while the prediction on large objects (such as road and building) is slightly affected. The overall MF score increases by around 6.4% with the use of class-weighted labels.

Table 2: Ablation study of the LiteEdge model.
To determine the need for the hierarchical supervision on bottom side outputs, we directly concatenate the side outputs with the decoder without supervising them to learn binary edges. Comparing the results in the first two rows of Table 2, we see that the hierarchical supervision improves the mean MF score by 11%.

Model compression study
Quantization has become a significant method for optimizing deep-learning models so that they can accelerate inference when deployed on embedded systems with restricted memory footprint and computing resources. In this work, we use the Neural Network Intelligence toolkit. To retain network accuracy, we use the quantization-aware training (QAT) method [19]: we started with the trained model and further refined the model with quantized parameters for 100 epochs.
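The core operation in QAT can be illustrated with a symmetric uniform fake-quantization step: weights are rounded to a b-bit grid in the forward pass and de-quantized back to float, so the network learns to tolerate the rounding. This is a generic sketch; the exact scheme used by the NNI QAT trainer may differ.

```python
import numpy as np

def fake_quantize(w, num_bits):
    """Sketch of symmetric uniform fake quantization: round weights to a
    num_bits signed grid scaled by the max absolute weight, then
    de-quantize back to float. Illustrative, not NNI's exact scheme."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([-0.8, -0.1, 0.0, 0.05, 0.7])
w6 = fake_quantize(w, 6)   # 6-bit weights, as in the compressed LiteEdge
```

With 6-bit weights the grid has 64 levels, which is where the reported 6X memory saving (relative to 32-bit floats, with some overhead) comes from.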
The results in Table 3 show that the quantized model with 8-bit weights and activations has only a small drop of 2.2% in the mean MF score. The model size of the quantized network is around 4X smaller than the full-precision network. In addition, QAT introduces weight sparsity: 2.65% of the parameters are zero after quantization. The results for the 6-bit and 4-bit weight quantized models are also presented in Table 3. Compared with the 8-bit weight quantized model, the 6-bit weight quantized model shows a drop of 0.2% in the mean MF score but achieves around 4 times higher weight sparsity. The 4-bit weight quantized network shows a larger drop in the mean MF score but has more than 38% zero parameters. In addition, its model size is 13X smaller than the full-precision model.

Real-time edge performance
The proposed models are further deployed on an Nvidia Jetson Nano to evaluate their runtime performance on an edge device. This device comes with a 128-core integrated Nvidia Maxwell GPU and a quad-core 64-bit ARM CPU. It has 5W and 10W power modes. TensorRT 7.1.3 is used to generate optimized FP16 run-time engines for the models. The 10W power mode is activated before inference. The results are averaged over 100 runs. Table 4 shows the frame rates for the full-precision models deployed on the Jetson Nano. Both LiteEdge and LiteEdgeSeg can achieve a frame rate of 15-18 FPS when using an image resolution of 256 × 512. By comparison, the inference frame rate of CASENet is 10X lower.

Conclusion
We present LiteEdge, an end-to-end lightweight semantic edge detection model suitable for edge deployment. It achieves a mean MF score of 52.9% on the Cityscapes validation set with a reduced input and output size to address the accuracy versus compute tradeoff. The model gives 22X and 10X higher frame rates compared to previous models on a desktop GPU and an edge device, respectively. By adding the hierarchical supervision module and a new weighted multi-label loss, we could increase the mean MF of this network, which has a lower output resolution. By adding one additional semantic segmentation branch, we extend LiteEdge to LiteEdgeSeg, which outputs both the semantic edge and semantic segmentation maps. The 6-bit weight quantized LiteEdge model shows only a small drop of 2% in mean MF score and has a memory footprint savings of 6X. The modules added to LiteSeg [10] can also be applied to other segmentation networks. Preliminary results show that when they are added to a recently reported segmentation network (FSFNet [21]), the mean MF score of the edge prediction increases by 1.2% compared to using the Canny edge detector on segmented maps.