Open-Set Face Recognition with Maximal Entropy and Objectosphere Loss

Open-set face recognition characterizes a scenario where unknown individuals, unseen during the training and enrollment stages, appear at operation time. This work concentrates on watchlists, an open-set task that is expected to operate at a low False Positive Identification Rate and generally includes only a few enrollment samples per identity. We introduce a compact adapter network that benefits from additional negative face images when combined with distinct cost functions, such as Objectosphere Loss (OS) and the proposed Maximal Entropy Loss (MEL). MEL modifies the traditional Cross-Entropy loss in favor of increasing the entropy for negative samples and attaches a penalty to known target classes in pursuit of gallery specialization. The proposed approach adopts pre-trained deep neural networks (DNNs) for face recognition as feature extractors. The adapter network then takes the deep feature representations and acts as a substitute for the output layer of the pre-trained DNN, enabling agile domain adaptation. Promising results have been achieved following open-set protocols on three different datasets (LFW, IJB-C, and UCCS), as well as state-of-the-art performance when supplementary negative data is properly selected to fine-tune the adapter network.


Introduction
In open-set face recognition, there is no guarantee that a person caught on camera has been previously enrolled in the gallery of known individuals. Within the open-set task, there are watchlists, a scenario that must operate at a very low False Positive Identification Rate (FPIR) since the majority of queried individuals are not expected to be registered in the gallery set. When a detected face is mistakenly assigned to one of the enrolled identities, it raises a false alarm (a false positive identification) that usually triggers human action and, therefore, must be avoided to decrease both the operational cost and the personal discomfort of innocent citizens [1,2]. Additionally, subjects of interest may be missed by the face detector, erroneously classified as unknown individuals, or assigned a different identity.
Face biometric systems using deep convolutional neural networks have matured into an age of ubiquitous deployment and high performance in recent years. However, most researchers have left open-set problems aside and channeled their efforts into closed-set identification and verification applications. Recently, a prominent vendor of face recognition technology suffered considerable criticism for matching USA congress members to mugshots of criminals [3]. The incident became an eye-opener on the risks of such commercial identification systems, as false alarms can substantially bias security personnel while increasing the responsibility of officers to thoroughly verify the results of the surveillance system. After all, no one would be content with innocent people being held up by law enforcement agencies due to a biometric system error.
Neural networks are biased toward the data they have been trained on and rarely work well with unknown classes. Fig. 1, adapted from Dhamija et al. [5], illustrates such behavior on a handwritten digit and character recognition task. Charts (a) and (b) demonstrate that unknown samples (gray dots) cover most of the known classes when the cross-entropy loss is employed, which proves insufficient for open-set problems. Contrarily, adopting a cost function that duly handles negative samples attains better class separation and achieves superior performance.
Although the behavior in Fig. 1(c) may hold true for elementary problems with abundant samples and very few classes, it is not guaranteed to propagate to more demanding biometric applications [2,6]. In favor of investigating neglected real-world face problems, this study evaluates how open-set loss functions assist neural networks when the training data consists of a few instances per identity. We propose Maximal Entropy Loss (MEL), a function that adds a penalty margin to known identities and increases the entropy for negative samples as it guides a network into differentiating unknown from known subjects. We also implement an adapter network that is quickly trained on deep features obtained with leading face architectures, avoiding the need to retrain deep backbones every time the gallery set is updated.
[Fig. 1 caption: LeNet++ network [4] topologies are trained on 10 MNIST classes (knowns, colored dots) together with additional negative data, EMNIST letters (negatives, black), and evaluated with unknown samples, Devanagari letters (unknowns, gray).]

This work discloses how a compact adaptation network, equipped with a few fully-connected layers, responds to open-set protocols on three different datasets, namely LFW, IJB-C and UCCS [7][8][9]. We exploit data that do not require domain adaptation to perform gallery specialization, in which the knowledge obtained from networks pre-trained on large face datasets is reused to boost performance on related face recognition tasks. We evaluate three architectures for feature extraction: AFFFE [10], a deep-feature extractor adapted to handle misaligned and blurry faces; VGGFace2 SEnet50 [11], a backbone that takes advantage of its squeeze-and-excitation blocks; and ArcFace [12], a ResNet-101 network that applies a special loss for producing better-suited face representations. The proposed approach differs from most investigations available in the literature.

Related Work
Most modern face recognition systems rely on deep convolutional neural networks (DNNs) [12][13][14][15][16][17][18][19]. Strategies have been designed to achieve better identification performance on difficult images, such as margin-based or triplet losses, and different network topologies [11,20,21]. However, DNNs are not usually designed to handle facial images with a low optical resolution, or even false-positive face detections. Besides, the aforementioned works do not "disregard" low-interest samples and, as a result, end up matching all unknown identities with their respective most similar subjects from the gallery set.
Vareto et al. [22] combined hashing functions to set up a vote-list histogram. Some researchers have adopted one-vs-all SVM or PLS models [23,24] whereas others explored clustering techniques [25,26]. The aforementioned methods neither implement the entire closed-set identification pipeline nor comply with the requirements of real-time or real-world applications. Most present-day methods aim at improving closed-set recognition or person re-identification and rarely consider open-set protocols [15][16][17]. Others typically focus on open-set recognition by providing better feature embeddings for face verification, which comprises a different biometric task [12,18,19]. Moreover, Hassen et al. [27] introduced a loss function that draws same-class samples near, and Zhou et al. [28] added an extra layer to store class-specific thresholds. Researchers have also explored adversarially-generated samples for "balanced" decision boundaries among known and unknown classes [29][30][31]. However, these approaches have been evaluated on datasets holding numerous samples per class and, as a consequence, they are not an accurate portrayal of real-world biometric problems.
The most commonly used datasets in non-face open-set recognition are CIFAR [32], MNIST [33], SVHN [34] and TinyImageNet, a subset of ImageNet [35], to name a few. They range from 5 to 20 classes in the known set, but each class encompasses myriads of samples. Approaches evaluated on such data are not hampered by a shortage of image samples available for training and, in fact, better preserve the inherent data distribution [36]. Labeled Faces in the Wild (LFW) [7] used to be the leading facial benchmark. LFW contains 13,233 images unevenly distributed among almost six thousand classes. As it was initially designed for verification, experts have proposed non-official open-set protocols [37,38]. IJB-C [8] contains two disjoint gallery partitions of known individuals that are merged together for closed-set recognition. The open-set protocol requires the use of a single gallery partition and, hence, half of the probe subjects have no corresponding match in the gallery set.
In contrast to IARPA's benchmarks, the original UnConstrained College Students (UCCS) dataset [39] and its extended version [9] require faces to be detected as part of the recognition pipeline. The UCCS dataset consists of images captured at a university campus under different weather conditions. UCCS's gallery set encompasses 1,085 known subjects, with approximately 20 instances per class, and countless face samples not labeled with any of the known identities. There are several partially occluded faces due to lamp posts and tree branches, along with accessories like sunglasses, hats, hoodies, or fur jackets, that make both detection and recognition in the UCCS benchmark a challenging task.
In summary, few works have designed methods to properly tackle open-set face recognition with mechanisms that enable the network to differentiate individuals of interest from unknown people in a scenario with thousands of identities but few samples per class. With that in mind, we evaluate our proposed approach on realistic face datasets as a meaningful contribution to the biometric discipline. Due to these fundamental properties and their intrinsic open-set nature, we use both IJB-C and UCCS datasets along with LFW in our experiments.

Proposed Approach
A watchlist application S generally consists of three sequential stages, S = S_d → S_r → S_c, and should raise an alarm only when probe samples belong to gallery set G. Subsystem S_d, the face detection and landmark localization method, locates faces in the original input image. For every detected face, the representation module S_r extracts a corresponding numerical feature vector. The identification subsystem S_c assigns one of the gallery identities g ∈ G to the probe face sample. As shown in Fig. 2, we introduce an additional adaptation module S_a that takes the original features from the representation stage and further transforms them into attributes that are better suited for the task at hand.
Template T_g = S_a(S_r(S_d(x_g))) corresponds to the mean representation of subject g when multiple samples are available per class. Similarly, F_p = S_a(S_r(S_d(x_p))) becomes the probe representation. The classification subsystem S_c computes a similarity score s(T_g, F_p) between F_p and template T_g for each known individual g ∈ G. Then, S_c rejects probe samples as unknown when they attain scores lower than θ for every subject of interest. Otherwise, F_p is assigned to the identity with the highest score, max_{g∈G} s(T_g, F_p).
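As a concrete illustration, the decision rule above can be sketched in a few lines (a minimal NumPy sketch; the function and variable names are illustrative and not taken from the authors' implementation):

```python
import numpy as np

def identify(templates, probe, theta):
    """Open-set decision: return the best-matching gallery identity,
    or None when every similarity falls below the threshold theta.

    templates: dict mapping identity g -> template vector T_g
    probe:     probe feature vector F_p
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Score the probe against every enrolled identity.
    scores = {g: cosine(t, probe) for g, t in templates.items()}
    best = max(scores, key=scores.get)
    # Reject as unknown unless the best score reaches the threshold.
    return best if scores[best] >= theta else None
```

Returning None corresponds to rejecting the probe as unknown; an enrolled identity is reported only when its similarity reaches the operating threshold θ.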
The compact adapter network S_a aims to establish a drastic difference between gallery subjects and unknown faces. Therefore, it is not possible to enroll new subjects in the gallery set without retraining. Since we rely on features extracted from the representational network S_r, retraining the adapter network S_a is fast and can be performed whenever a new subject needs to be enrolled, given that watchlists are oftentimes relatively stable over time.

Training
One of the fundamental aspects behind the procedure depicted in Fig. 2 is that any pre-trained network, such as VGGFace2 [11], AFFFE [10] and ArcFace [12], can be adopted as the pipeline's face representation subsystem S_r. Consequently, the proposed approach does not require time-consuming retraining of massive deep networks every time a new subject is inserted into gallery G, since a small adapter network S_a fits the extracted set of representations.

Adapter Network. The adapter network S_a consists of a multi-layer perceptron with fully-connected layers. In fact, S_a is composed of an input layer L_i, two hidden layers L_h1 and L_h2, and an output layer L_o. The input layer takes in feature vectors R extracted with a pre-trained DNN S_r and, therefore, its size varies according to the deep feature dimension. The first hidden layer L_h1 incorporates a non-linear hyperbolic tangent activation function that outputs values in the range −1.0 to +1.0, whereas L_h2 delivers a compact feature representation.
The learning strategy is similar to the training process followed by traditional face recognition systems: we set the output layer L_o to hold a size equal to the number of gallery-enrolled identities. In other words, each logit node of the last layer, denoted l_g ∈ L_o, stands for the corresponding activation of known subject g ∈ G. In general, these activations are employed for open-set face classification, but they present inferior performance when compared to the distance computation of deep features obtained with neural networks [2].
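The layer layout described above can be sketched as a plain forward pass (NumPy rather than the authors' PyTorch code; layer sizes and names follow the text, while the initialization scheme is an assumption):

```python
import numpy as np

def init_adapter(in_dim, n_classes, h1=512, h2=256, seed=0):
    # Layer sizes follow the paper: L_i -> L_h1 (tanh) -> L_h2 -> L_o.
    rng = np.random.default_rng(seed)
    dims = [(in_dim, h1), (h1, h2), (h2, n_classes)]
    return [(rng.standard_normal((i, o)) * 0.01, np.zeros(o)) for i, o in dims]

def adapter_forward(params, x):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.tanh(x @ W1 + b1)   # L_h1: tanh activation in (-1, +1)
    h2 = h1 @ W2 + b2           # L_h2: compact feature representation
    logits = h2 @ W3 + b3       # L_o: one logit per known identity
    return h2, logits           # both outputs are used during training
```

Both outputs matter: the compact L_h2 features feed the similarity stage, while the L_o logits feed the classification losses.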

The adapter network has been originally designed as a conventional multi-layer perceptron. Ordinarily, its output logit layer L_o could be associated with the Softmax activation function A_S and assume the role of the ultimate recognition phase. However, it performs differently considering that S_a also provides the discriminative feature representations required in the subsequent similarity classification subsystem S_c (see the blue rectangle in Fig. 2). The adapter network yields its two last layers during the training stage: features from L_h2 feed Objectosphere, while L_o logits feed the remaining loss functions, as detailed below.
Entropic Open-Set Loss (J_E) [5]. The Entropic Open-Set loss maximizes the uncertainty of negative samples by inducing their Softmax responses to lie uniformly distributed. J_E pursues the maximum-entropy distribution of uniform probabilities for negative samples over all |G| known classes registered in the gallery set G. In the classic Cross-Entropy loss, t_g represents a one-hot vector holding the value of one at the index that corresponds to known class g. Under the inclusion of negative instances, J_E attributes uniform values to the target vector over all g ∈ G in such a way that unseen samples are considered as equal members of each known identity:

J_E(x) = −log A_S(l_g)(x) if x belongs to known class g,
J_E(x) = −(1/|G|) Σ_{g∈G} log A_S(l_g)(x) if x is a negative sample.

Maximal Entropy Loss (J_M). The proposed Maximal Entropy loss associates the previously stated Entropic Open-Set loss with a margin-based Softmax (A_Sm) [40,41]. Equation (3) points out how A_Sm affixes a non-negative penalty margin m to A_S in order to decrease the intra-class distance and maximize the segregation among distinct classes:

A_Sm(l_g) = e^{l_g − m} / (e^{l_g − m} + Σ_{c∈G, c≠g} e^{l_c}).     (3)
As the penalty increases, the network learns parameters that push samples more firmly toward their class centroids. The margin m defines a distance among different classes and, consequently, draws same-class samples closer [40].
The Maximal Entropy Loss J_M combines the best of both worlds, since the Soft-Margin Softmax targets known training samples whereas the Entropic Open-Set term handles the negative instances available during the learning stage. More precisely, for x ∈ G, function J_M penalizes the activation of the correct target class in the interest of making the closed-set identification more rigorous and, as a result, equips the adapter network with more discriminative weights. The handicap parameter m establishes a decision boundary for a more appropriate separation of known individuals:

J_M(x) = −log A_Sm(l_g)(x) if x ∈ G with identity g,
J_M(x) = −(1/|G|) Σ_{g∈G} log A_S(l_g)(x) if x ∉ G.

For a negative sample x ∉ G, the designed loss uniformly distributes the target score among all g ∈ G subjects in an attempt to support the network in distinguishing gallery-enrolled subjects from unknown identities. Similar to the aforementioned J_E loss, the insight of equalizing logit values for unknown samples lies in not knowing anything about their corresponding identity; therefore, they hold an equivalent likelihood of being assigned to any subject registered in the gallery set. Analogous to Dhamija et al. [5], the overall error obtained with J_M is minimized when the Softmax responses A_S(•) of negative samples are equally distributed.
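The two cases of J_M can be sketched as follows (a single-sample NumPy illustration; the exact margin formulation and batching in the authors' implementation may differ):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def maximal_entropy_loss(logits, target, m=0.4):
    """Sketch of MEL for one sample.

    target: gallery class index when x is a known subject,
            or None when x is a negative sample (x not in G).
    """
    if target is not None:
        # Known sample: subtract the margin m from the target logit
        # before the cross-entropy, tightening the class boundary.
        z = logits.astype(float).copy()
        z[target] -= m
        return -log_softmax(z)[target]
    # Negative sample: uniform target over all |G| known classes;
    # minimized when the softmax responses are equally distributed.
    return -log_softmax(logits).mean()
```

For uniformly distributed logits, the negative-sample branch attains its minimum value log |G|, matching the maximum-entropy objective described above.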
Objectosphere Loss (J_O) [5]. Objectosphere dissociates representations of known and negative samples by directly modifying their feature magnitudes. Since J_E cannot guarantee that low-magnitude features would be produced for unknown samples, Objectosphere modifies the network weights to drive negative instances toward the feature-space origin, as Fig. 1(c) illustrates. This is achieved by forcing the magnitude of negative features ||L_h2(R_x)||_2 to be close to zero while simultaneously pushing known feature magnitudes to at least ξ, a required hyperparameter for Objectosphere.
Larger ξ values scale up deep features, including those extracted from unknown samples, which can be compensated by lower weights in the last layer L_o; however, what actually makes a difference is the increased separation among known, negative and, ultimately, unknown samples.
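The magnitude term that drives this behavior can be sketched per sample (illustrative; in the full Objectosphere loss it is weighted by λ and added to the Entropic Open-Set term):

```python
import numpy as np

def objectosphere_magnitude_term(feature, is_known, xi=1.0):
    """Magnitude penalty of the Objectosphere loss for one sample:
    push known feature norms to at least xi, negatives toward zero."""
    mag = np.linalg.norm(feature)
    if is_known:
        # Penalize known features only when they fall short of xi.
        return max(xi - mag, 0.0) ** 2
    # Negative samples are driven toward the feature-space origin.
    return mag ** 2
```

A known feature whose norm already exceeds ξ incurs no penalty, whereas any nonzero negative feature does.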
Additional Garbage Class (J_G). With the high demand for open-set recognition systems and the practicality of the Cross-Entropy loss, available in every deep learning framework, a common strategy is to add an extra class |G| + 1 to encompass all negative samples. We refer to the adapter network S_a trained with J_G as the Garbage approach in the experimentation section.

Enrollment and Inference
The enrollment of subjects of interest is illustrated in Fig. 2. It starts with the extraction of compact features from all gallery samples in the interest of creating a gallery of templates T. Equation (7) shows that, for each known identity g ∈ G, a unique template T_g is established by averaging the normalized compact features obtained with the adapter network:

T_g = (1/|K_g|) Σ_{x∈K_g} F_x / ||F_x||_2,     (7)

where |K_g| is the number of enrollment samples available for subject g.
Analogous feature vectors F_p = S_a(S_r(S_d(x_p))) are obtained for probe images x_p ∈ P during the inference stage by employing the very same representational and adaptation networks utilized in the enrollment phase. Then, the classification module S_c computes similarity scores between probes and all gallery-enrolled identities through the angular cosine similarity:

s(T_g, F_p) = (T_g · F_p) / (||T_g||_2 ||F_p||_2).

It is worth mentioning that we have also investigated other similarity-based functions that make use of probe feature magnitudes [2,5]; however, they introduce several issues that have not been addressed in this work.
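The enrollment and scoring steps above reduce to averaging L2-normalized features and a cosine comparison (a minimal NumPy sketch; naming is illustrative):

```python
import numpy as np

def build_template(features):
    """Average the L2-normalized compact features of the |K_g|
    enrollment samples of one identity into a template T_g."""
    normed = [f / np.linalg.norm(f) for f in features]
    return np.mean(normed, axis=0)

def cosine_score(template, probe):
    # Angular cosine similarity s(T_g, F_p) used by S_c.
    return float(np.dot(template, probe) /
                 (np.linalg.norm(template) * np.linalg.norm(probe)))
```

Normalizing before averaging keeps every enrollment sample equally weighted regardless of its feature magnitude.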

Experiments
This section presents the experimental evaluation of the approaches described in Section 3. It starts by detailing the adopted evaluation metrics and assessed methods, followed by a description of the experimental setup along with the explored datasets. Further, it provides an experimental assessment of the obtained feature magnitudes and a comparison between the traditional Cross-Entropy loss and the negative-based cost functions, namely Entropic Open-Set, Objectosphere, and the proposed Maximal Entropy Loss.

Evaluation Metrics
We adopt the open-set ROC curve [42][43][44], which plots the True Positive Identification Rate (TPIR) against the False Positive Identification Rate (FPIR) by varying the rejection threshold θ. TPIR is computed solely on probe samples of known subjects K, considering a probe to be correctly identified if the similarity to the correct identity g* is the highest among all gallery identities and lies above the operating threshold θ:

TPIR(θ) = |{x_p ∈ K : argmax_{g∈G} s(T_g, F_p) = g* ∧ s(T_{g*}, F_p) ≥ θ}| / |K|.

FPIR corresponds to the false alarm rate triggered by unknown samples U. A false positive identification occurs when the similarity of an unknown sample F_p to any of the known subject templates T_g is larger than threshold θ:

FPIR(θ) = |{x_p ∈ U : max_{g∈G} s(T_g, F_p) ≥ θ}| / |U|.

An optimal open-set face identification system presents a TPIR of 1 at an FPIR of 0. By varying the threshold θ, the open-set ROC curve can be created.
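Given per-probe scores, both rates can be computed directly (a NumPy sketch with illustrative names):

```python
import numpy as np

def tpir_fpir(known_scores, known_correct, unknown_scores, theta):
    """Compute one (TPIR, FPIR) operating point.

    known_scores:   max similarity per known probe
    known_correct:  whether the argmax identity was the right one
    unknown_scores: max similarity per unknown probe
    """
    k = np.asarray(known_scores)
    c = np.asarray(known_correct, dtype=bool)
    u = np.asarray(unknown_scores)
    tpir = float(np.mean(c & (k >= theta)))  # correct AND above threshold
    fpir = float(np.mean(u >= theta))        # unknown accepted as known
    return tpir, fpir
```

Sweeping θ over the observed score range and collecting the (FPIR, TPIR) pairs traces the open-set ROC curve.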

Evaluated Datasets
We utilize a data partition [26,37] that splits LFW into three disjoint groups: 602 known, 1,070 negative, and 4,096 unknown identities. We use the hand-labeled landmarks provided with the LFW dataset during the alignment process. For IJB-C, we train the method on gallery A only, so that all probe identities matching gallery B act as unknown face samples. Additionally, LFW is incorporated as the negative set since none of its classes are encountered in IJB-C. UCCS metadata provides bounding boxes and identity labels, containing either known subject identities or negative labels for unknown faces. We incorporate the MTCNN face detector [45] as the default detection system S_d on the IJB-C and UCCS benchmarks, employing the very same face detector throughout the experiments to standardize the face detection stage. Following the evaluation protocol, all background detections of MTCNN serve as additional unknown samples during testing on the UCCS dataset [9].

Evaluated Approaches
In the interest of comparing the proposed adapter network, trained with the Maximal Entropy and Objectosphere loss functions, to other methods, we incorporate four additional approaches: Baseline, SoftMax, Garbage and Entropic. Apart from Baseline, all evaluated methods run the complete pipeline depicted in Fig. 2, in which the template gallery consists of feature vectors extracted from the adapter network S_a. In addition to the adapter network, we also investigate whether it is beneficial to fine-tune the entire feature backbone on the gallery data, which has been shown to be beneficial for larger datasets. Since this training is much more time-consuming, we restrict these experiments to the largest and most difficult dataset, i.e., IJB-C. The seven evaluated techniques are:

• Baseline consists of creating a template set with the original features extracted from the representational system S_r and computing the cosine similarity.
• SoftMax follows the proposed pipeline by training the adapter network S a with Cross-Entropy loss, without exploiting any negative samples (negative-free).
• Garbage extends SoftMax as it creates a template T g for each known individual g ∈ G along with an exclusive template T |G|+1 holding negative samples.
• Entropic also follows the proposed pipeline, but this time the Entropic Open-set loss is adopted to handle known and negative samples (negative-based).
• MaxEntropy consists of training the adapter network S_a with the proposed Maximal Entropy loss, using m = 0.40 as the default value of its hyperparameter.
• Objectosphere likewise trains the adapter network S_a following the proposed pipeline, adopting the Objectosphere loss and its magnitude hyperparameters ξ and λ to handle negative samples.
• Finetuning involves training all layers of the adopted architecture on the evaluated IJB-C dataset.

Network Setup
The network S_a benefits from the representation systems S_r, that is, AFFFE, ArcFace and VGGFace2 [10][11][12], with 1000-, 512- and 2048-dimensional deep features, respectively. The feature extraction relies on Bob's [46,47] biometric pipeline, which handles face detection, alignment and feature extraction. The adapter network S_a topology is a compact fully-connected network with 512 and 256 neurons in the two hidden layers. Given the aforementioned hyperparameters, VGGFace2 represents the worst-case scenario, in which the adapter network holds no more than 1.7 million trainable weights, a small fraction of the total of 138 million parameters contained in the deep backbone (98% less than VGG-16).
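The parameter budget claimed above can be checked with simple arithmetic (layer sizes from the text; the output size depends on the gallery, here taken as UCCS's 1,085 identities):

```python
def adapter_params(in_dim, n_classes, h1=512, h2=256):
    # Weights + biases of the three fully-connected layers
    # L_i -> L_h1 -> L_h2 -> L_o.
    return (in_dim * h1 + h1) + (h1 * h2 + h2) + (h2 * n_classes + n_classes)

# Worst case from the text: VGGFace2's 2048-d features with 1,085 identities.
assert adapter_params(2048, 1085) < 1_700_000
```

This gives about 1.46 million trainable weights, consistent with the "no more than 1.7 million" figure in the text.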
The pipeline is built upon the PyTorch framework and consists of 500 training epochs for all datasets. Convergence on the validation set was commonly achieved within the first 150 epochs; only minor improvements have been encountered after 200 epochs. When disregarding the feature extraction process performed in S_r, the training procedure takes around 20 minutes for LFW, 80 minutes for UCCS and no more than three hours for IJB-C on a regular multicore desktop computer with a single Nvidia Titan X GPU. If more training speed is required, the network topology can be simplified, the number of epochs can be reduced, or more GPU resources can be added.

Comparison to the State of the Art
In the interest of showing the advantage of MaxEntropy and Objectosphere over SoftMax, Garbage and Entropic, the adapter network S_a is trained on different face datasets with the very same topology and hyperparameters for all dependent methods. Figures 3 through 7 depict the evaluated approaches, all of which, except Baseline, rely on S_a. Additionally, Tab. 1 provides a detailed list of TPIR values for selected FPIR operating points, evaluated on all three network topologies and all three datasets. The results obtained on the three evaluated datasets are described in the following paragraphs.

Labeled Faces in the Wild. Fig. 3 portrays the investigation on LFW considering different feature representations: VGGFace2, AFFFE and ArcFace. Baseline presents an outstanding performance using the VGGFace2 representation module in Fig. 3(a), implying that no supplementary data is required for LFW due to its innate characteristics. Plots (b) and (c) point out a comparable performance between Baseline and the negative-based cost functions.
There is an equivalent behavior with AFFFE when the false-positive proportion exceeds three per thousand samples (3 × 10^−3). The ArcFace backbone equipped all approaches with discriminative feature vectors, so that very little can be concluded in terms of accuracy. Note that four methods attained open-set performance greater than 95% in (c) when FPIR surpasses 2 × 10^−3. However, results are substantially inferior under the SoftMax and Garbage approaches.
Unlike most recent face datasets, LFW consists of reasonably good-quality images of cropped faces that help deep networks deliver satisfactory feature representations. As a consequence, computing the cosine similarity among original feature vectors, as performed by Baseline, is sufficient to approach the state of the art. The small amount of data (three images per subject) seems insufficient to train the adapter network with traditional cost functions. The adopted non-official protocol [37] holds nearly 9,300 samples in the probe set and, therefore, the threshold at an FPIR below 10^−3 is estimated from no more than 10 images. Moreover, the TPIR performance score is not reliable in such low FPIR regions due to the natural threshold fluctuation over scarce samples.
UnConstrained College Students. Fig. 4 discloses the experimental evaluation on the UCCS benchmark. Along with identities composing the gallery set, UCCS data encompasses both false-positive detections (misdetections) and faces from unknown subjects. MaxEntropy seems capable of attenuating the domain difference between the source data used to train the representation network S_r and the student population present in the UCCS dataset. On the other hand, the domain adaptation seems less impactful for ArcFace features, which indicates that the ArcFace architecture can be used in various domains.
Fig. 4(a) reveals that our approach can benefit from the addition of negative samples, as the best overall result was achieved when the adapter network was trained with the proposed Maximal Entropy loss. Fig. 4(b) also signals significant accuracy gains through the addition of negative samples. The chart indicates that AFFFE face representations are better adapted to low-resolution images than VGGFace2; however, both are surpassed by ArcFace's robust feature vectors. Fig. 4(c) shows that the negative-exploring cost functions obtain analogous performance, with a slight dominance of MaxEntropy when FPIR is between 10^−3 and 10^−1. Although Baseline prevails in the interval [10^−1, 10^0], it attains lower accuracy in the aforementioned range along with the other methods.
IARPA Janus Benchmark C series. Fig. 5 presents experiments on IJB-C merged with more than 13,000 negative samples acquired from LFW. The discrepancy 2 in image resolution and pose variation between both datasets ends up reflecting on the results, as LFW does not play a decisive role in enhancing the proposed adapter network's identification performance on the IJB-C benchmark.
The three plots suggest that the Finetune approach could not maintain the generalization capability of the original backbone, whether combined with the cosine similarity (Baseline) or with the adapter network. According to Fig. 5(a), negative samples do not seem to provide significant improvement when evaluating VGGFace2 feature vectors and, in fact, they turn out to impair Objectosphere's accuracy. Fig. 5(b) corresponds to experiments containing AFFFE features and shows that Baseline outperforms all other approaches. MaxEntropy attains comparable performance at low false positive identification rates, in the range from 1 × 10^−3 to 3 × 10^−3. The ArcFace experiments in Fig. 5(c) also demonstrate the dominance of the Baseline approach, with approximate accuracy reached by the MaxEntropy method when FPIR lies below 2 × 10^−3.

Discussion
This section examines the effect of training the adapter network with data from different distributions. Tab. 1 provides a complete view of the results for different FPIRs.

Differences between IJB-C and LFW
Dhamija et al. [5] pointed out that the choice of negative samples plays an important role when training an open-set network. Fig. 6(a) shows that LFW does not follow the same feature distribution as IJB-C. As revealed in Fig. 5, selecting LFW to compose the set of negative samples could not provide further improvements over the baseline method on the IJB-C dataset, except for the experiments containing VGGFace2 representations.
2 IJB-C contains images without standardized traits whereas LFW comprises mostly good-quality images with close-to-frontal faces.
IJB-C probe samples and its enrollment data are distributed differently: gallery-enrolled samples contain mostly good-quality still photos whereas probe samples are mainly composed of low-resolution still images or blurred video frames. We tend to believe that the adapter network S_a over-adapts to good-quality enrollment samples when it is trained only on high-standard data. Therefore, module S_a ends up lowering the performance on IJB-C by rejecting many probe samples as unknown. Selecting enrollment and probe data with similar distributions is likely to increase performance.

Fig. 7 discloses an additional set of experiments on the IJB-C benchmark in which gallery A populates the known set and gallery B composes the negative set for Entropic, MaxEntropy and Objectosphere. This scenario affords a related data distribution between both training subsets. Despite probe and enrollment data differing in capture quality, the results show that appropriate negative samples significantly improve the open-set face recognition pipeline. All charts indicate a dominance of MaxEntropy over negative-free methods when FPIR lies below 10^−1. In fact, using the ArcFace backbone achieves the highest accuracy of all experiments conducted on the IJB-C dataset.
We reckon that real-world watchlist applications would scarcely ever contain negative identities overlapping with unknown face samples. However, the assessment displayed in Fig. 7 provides a reference point on the maximal identification correctness. Results show a recurring superiority of MaxEntropy regardless of the adopted representation network. Unlike Fig. 5, where gallery and negative samples hold contrasting data distributions, the resemblance between both disjoint IJB-C galleries delivers discriminative class boundaries. Distribution-alike data is a must-have for "negative-based" error functions: negative samples only make a meaningful contribution to the open-set recognition pipeline when they resemble the operational data.
Although the experiments shown in Fig. 7 do not adhere to the official IJB-C protocol, there are scenarios in which this training scheme would be appropriate. For instance, an enterprise may have premium clients that must be treated differently from regular customers: they could be addressed by name and offered a comfortable room on the premises. Privileged customers constitute the known classes whereas the remaining ones are placed in the negative set. Prospective customers (unknowns) lie somewhere in between and shall be treated better than ordinary ones, but not as well as premium clients. Consequently, the face recognition system is supposed to raise an alert whenever premium customers come over.

Deep Feature Magnitudes
Objectosphere loss aims to push the feature magnitudes extracted from unknown samples toward very low values. It simultaneously attempts to shift the magnitude of known samples toward a specified value ξ. Robust open-set methods are expected to achieve high accuracy on different datasets with consistent parameters. This requirement plays an essential role in biometrics as it is not possible to anticipate the visual traits of all probe samples. Best results have been attained on UCCS when setting Objectosphere parameters ξ = 1 and λ = 0.01, and we have verified that these parameters also work well on LFW and IJB-C. Fig. 8 displays deep feature magnitude histograms for UCCS evaluation data. Original VGGFace2 features hold a considerable magnitude overlap between unknown and known subjects as well as false-positive detections in the background. The intersection remains when training the adapter network with SoftMax, but Objectosphere reduces the overlapping area between known and unknown samples. In essence, weights learned with Objectosphere can distinguish enrolled subjects from unknown identities during the testing stage. Known samples are distributed closer to the desired target magnitude whereas negative (background) samples have a peak close to 0, but are distributed throughout the range of magnitudes.
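The magnitude behavior observed in the histograms follows directly from the structure of the loss. As a rough per-sample sketch (ours, not the authors' implementation), with the parameters adopted here (ξ = 1, λ = 0.01), Objectosphere can be written as:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def objectosphere_loss(logits, feature, label, xi=1.0, lam=0.01):
    """Per-sample Objectosphere loss (illustrative sketch).

    label >= 0 marks a known class index; label == -1 marks a
    negative/unknown training sample."""
    probs = softmax(logits)
    mag = np.linalg.norm(feature)        # deep feature magnitude
    if label >= 0:
        ce = -np.log(probs[label])       # standard cross-entropy
        ring = max(0.0, xi - mag) ** 2   # pull magnitude up toward xi
    else:
        ce = -np.mean(np.log(probs))     # entropic term: uniform target
        ring = mag ** 2                  # push magnitude toward zero
    return ce + lam * ring
```

For a negative sample, the loss is minimized by a uniform posterior and a zero-magnitude feature, which is precisely the behavior visible in Fig. 8, where background samples peak near magnitude 0.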
As indicated in Fig. 6(a), LFW images provide higher magnitudes but IJB-C instances result in low-magnitude representations. Deep networks may misclassify probe samples since image quality has an impact on the acquired feature vectors. Due to the lack of similarity between IJB-C and LFW, the latter is not capable of guiding the adapter network S_a in discriminating IJB-C probe samples. Fig. 6(b) and (c) present probe feature magnitudes when S_a is trained with negatives proceeding from LFW or IJB-C's gallery B. Note that the magnitudes are well above the intended separation threshold ξ = 1 and, hence, appropriate negatives might help to separate further.

Proposed Approach Applicability
MaxEntropy requires a distance margin m in the interval [0, 1], whereas Objectosphere includes sphere-related and regularization parameters (ξ and λ, respectively). As a result, combining both losses would culminate in the specification of three hyper-parameters, not counting those of the adaptation network, such as the number of neurons, learning rate, and batch size, to name a few. Cost functions that require the adjustment of multiple parameters are hard to deploy in both academic and realistic scenarios. Consequently, we do not combine MaxEntropy with Objectosphere, since the large number of tunable parameters turns their joint calibration into an optimization problem of its own.
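To make the entropy-maximization intuition behind MaxEntropy concrete: for C gallery classes, the Shannon entropy of a softmax posterior is bounded by log C, and the bound is reached only by the uniform distribution, so driving negatives toward maximal entropy flattens their class confidence. A small illustrative check (the helper name is ours, not from the paper):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

C = 10
uniform = np.full(C, 1.0 / C)             # the maximal-entropy posterior
peaked = np.array([0.91] + [0.01] * 9)    # a confidently classified sample
print(entropy(uniform))                   # close to log(C)
print(entropy(peaked))                    # much lower
```

A negative sample that reaches the log C ceiling is, by construction, assigned no preferred gallery identity, which is the rejection behavior a watchlist system needs at low FPIR.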
We acknowledge that a desirable open-set face recognition approach would only require the enrollment of subjects of interest, without the need to fine-tune the deep representation backbone. However, the three evaluated datasets comprise numerous identities with very few samples per class in the training set, a common trait of watchlist problems. Since applying an untouched pre-trained representation model to dissimilar data distributions regularly results in a substantial accuracy loss, the designed adapter network S_a offers a flexible trade-off between computational time and correctness.

Conclusion
Pre-trained deep networks usually require considerable time to be adapted and retrained for new domains, especially when the training data is constantly updated. This is the scenario in which the proposed compact adapter network comes in handy, as it serves as a quickly trainable replacement for the output layer. Moreover, the evaluated cost functions take advantage of supplementary information when negative samples are added to the training stage. Experiments have shown that additional samples play an important role in "identifying" the unknown when they are sufficiently representative of the uninvestigated feature space.
The proposed approach is suited for watchlists and transfer-learning tasks since the adaptation network can be attached to the output of any pre-trained deep network model and be quickly adjusted to different data distributions. Retraining large deep backbones, such as ArcFace and VGGFace2, every time a new identity is added to the gallery set becomes categorically infeasible and has proven to be counterproductive. The ArcFace network, for instance, contains nearly 50 million weights, in contrast to 394,850 parameters in the adapter network when trained on LFW and inputting 512-dimensional feature embeddings from ArcFace/ResNet-100.
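The parameter gap can be made tangible with a small counting helper. The layer sizes below are placeholders (the exact adapter architecture is not spelled out in this section beyond its 512-dimensional input), so the sketch only demonstrates how a fully connected adapter of a few hundred thousand weights compares against a roughly 50-million-parameter backbone:

```python
def mlp_param_count(layer_sizes):
    """Weights + biases of a fully connected network, e.g. [512, 512, 100]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical adapter: 512-d embeddings -> one hidden layer -> C known classes.
adapter = mlp_param_count([512, 512, 100])
backbone = 50_000_000  # approximate ArcFace/ResNet-100 size from the text
print(adapter, backbone // adapter)
```

Even with generous layer widths, the adapter stays two orders of magnitude smaller than the backbone, which is what makes retraining it on every gallery update practical.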
Experiments carried out on the open-set face recognition protocols of LFW, UCCS, and IJB-C have provided a comprehensive analysis of the compact network and the employed loss functions. The evaluation has shown that the association of the adapter network with Objectosphere or the proposed Maximal Entropy loss is capable of outperforming the original deep features in many cases. As detailed in the literature comparison, part of the adopted negative images clearly boosted the performance of our method, whereas others, encompassing distinct domains as well as different data distributions, were not adequate and contributed little to the overall accuracy. How to obtain or generate more effective negative samples will be investigated in future work.

Figure 1:
Figure 1: Boosting unknown detection with negative samples. The behavior of three different approaches when trained with

To the best of our knowledge, no genuine open-set face recognition work has been evaluated on the IJB-C benchmark. Most methods typically aim at improving open-set recognition by providing better feature embeddings for face verification, which comprises a different biometric task. Moreover, MEL is the first loss function to simultaneously penalize known and negative samples. Its distinctiveness drives the network toward learning more discriminative face embeddings as it meticulously searches for enhanced parameters. The major contributions of our work are: (a) we evaluate distinct cost functions and propose MEL, a novel loss function that maximizes the entropy in order to make training more rigorous; (b) we further analyze the Objectosphere loss [2] to verify how it modifies the feature vector norm of training and test face samples; (c) we present an adapter network that accelerates the computationally expensive retraining or fine-tuning of deep convolutional neural networks; (d) we conduct a detailed open-set analysis of all evaluated cost functions on datasets containing thousands of identities but few samples per class; (e) we run experiments to verify whether the proposed approach is effective when combined with distinct deep feature extractors and evaluated on well-known open-set face recognition datasets.

The remainder of this work is organized as follows: Section 2 provides an overview of related work. Section 3 describes the proposed approach: a compact adaptation network combined with MEL or other open-set loss functions. Section 4 presents the experimental evaluation on three different face datasets. Section 5 discusses the attained results, and Section 6 concludes the work.

Figure 3 :
Figure 3: LFW Evaluation. Open-set ROC charts are shown for VGGFace2, AFFFE, and ArcFace features. Due to the small size of the LFW dataset, FPIR values smaller than 10⁻³ cannot be reliably computed and are, hence, left out. With only three training samples per identity, the adapter network is not able to provide more meaningful features than the representation network S_r.

Figure 4 :
Figure 4: UCCS evaluation. UCCS is a more challenging dataset than LFW as it comprises a surveillance, unconstrained domain. Therefore, training the adapter network using UCCS known and negative samples improves the performance over the Baseline, especially when training and evaluation samples hold an equivalent distribution.

Figure 5 :
Figure 5: IJB-C + LFW evaluation. Open-set ROC charts are shown for AFFFE, ArcFace, and VGGFace2 features. This evaluation follows IJB-C's open-set protocol test 4 with the addition of the entire LFW dataset as negative samples. Negative data that diverges from the set distribution seems incapable of contributing to the method's performance.

Figure 7 :
Figure 7: IJB-C evaluation. Open-set ROC charts are shown for AFFFE, ArcFace, and VGGFace2 features. Negative samples are obtained from gallery set B of the IJB-C dataset. This evaluation does not adhere to IJB-C's open-set protocol test 4. When negative samples embody the same distribution as known samples, "negative-based" cost functions along with the adapter network outperform the Baseline.
Figure 6: IJB-C and LFW magnitudes. Chart (a) shows training feature magnitudes obtained with VGGFace2, in which knowns and unknowns come from IJB-C and negatives derive from LFW. Plots (b) and (c) demonstrate how the adapter network S_a combined with Objectosphere behaves on the evaluation data when trained with negatives coming either from LFW or IJB-C gallery B. Note that gallery B provides better separation between knowns and unknowns, whereas LFW is not sufficient to push the distributions apart.

Table 1 :
Table 1: Open-set ROC evaluation. Open-set ROC results are shown for AFFFE, ArcFace, and VGGFace2 feature representations on the three evaluated datasets, namely LFW, UCCS, and IJB-C. Each cell consists of the True-Positive Identification Rate at the False-Positive Identification Rate indicated in the first column (TPIR@FPIR). Best values per model are highlighted in bold, second best in italics.

Figure 8: UCCS magnitudes. All plots portray results obtained with probe data: Knowns designates subjects registered in the watchlist, Unknowns specifies probe samples without a corresponding identity in the gallery set, and Background refers to face misdetections.