Abstract
Traditional approaches to structured semantic segmentation use appearance-based classifiers to estimate a class likelihood at each spatial location, then post-process these likelihoods with a Markov Random Field (MRF) to enforce label smoothness and structure in the output space. The spatial support of such classifiers is usually a patch of pixels, so predictions are over-smoothed because object borders are not explicitly taken into account. MRF post-processing with the standard Potts model exacerbates the problem, since it tends to smooth predictions further at boundaries. In this paper, we propose a different but related approach: at test time, we optimize an energy function to find the optimal combination of small ground truth (GT) tiles from the training data over the predictions, effectively solving a puzzle. The optimization starts from an initial configuration given by the output of a Convolutional Neural Network (CNN).
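To make the over-smoothing argument concrete, the following minimal sketch evaluates the energy of a labeling under a standard 4-connected Potts MRF. It illustrates the baseline criticized above, not the tile-based energy proposed in this paper. The function name `potts_energy`, the weight `lam`, and the use of negative log-likelihoods as unary costs are assumptions made for illustration.

```python
import numpy as np

def potts_energy(labels, unary, lam=1.0):
    """Energy of a labeling under a 4-connected Potts MRF (illustrative sketch).

    labels: (H, W) integer array of class indices.
    unary:  (H, W, C) array of per-pixel class costs,
            e.g. negative log-likelihoods from a classifier (assumption).
    lam:    weight of the pairwise term (hypothetical default).
    """
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    # Data term: cost of the chosen class at every pixel.
    data_term = unary[rows, cols, labels].sum()
    # Potts pairwise term: a constant penalty for every pair of
    # neighboring pixels whose labels differ, regardless of image content.
    vertical = np.count_nonzero(labels[1:, :] != labels[:-1, :])
    horizontal = np.count_nonzero(labels[:, 1:] != labels[:, :-1])
    return float(data_term + lam * (vertical + horizontal))
```

Because the pairwise term charges the same penalty for every label change, a labeling that removes a thin structure or straightens a jagged boundary lowers the pairwise cost, which is one way to see why Potts-based post-processing tends to smooth object borders.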