We aim to jointly estimate height and assign semantic labels from monocular aerial images. In remote sensing, these two tasks are traditionally addressed separately, despite their strong correlation. A model that learns height and classes jointly therefore seems advantageous, and we propose a multitask Convolutional Neural Network (CNN) architecture with two losses: one for semantic labeling, and another for predicting the normalized Digital Surface Model (nDSM) from the pixel values. Since the nDSM/height information is used only in the second loss, no nDSM map is needed at test time, and the model can estimate height automatically on new images. We test the proposed method on a set of sub-decimeter resolution images and show that our model matches the performance of two separate models, but at the cost of a single one.
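The two-loss objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the squared-error form of the height term, the `height_weight` balancing factor, and all function names are assumptions made for illustration, since the text only states that one loss handles semantic labeling and the other predicts the nDSM.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one pixel, given raw class scores and the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def multitask_loss(class_logits, labels, pred_heights, true_heights, height_weight=1.0):
    """Combine a per-pixel labeling loss with a per-pixel height regression loss.

    class_logits: list of per-pixel class score lists
    labels:       list of per-pixel true class indices
    pred_heights: list of per-pixel predicted nDSM values
    true_heights: list of per-pixel reference nDSM values (training only)
    height_weight: hypothetical balancing factor (assumption, not from the text)
    """
    n = len(labels)
    semantic = sum(
        softmax_cross_entropy(z, y) for z, y in zip(class_logits, labels)
    ) / n
    # Mean squared error is an assumed choice for the height term.
    height = sum((p - t) ** 2 for p, t in zip(pred_heights, true_heights)) / n
    return semantic + height_weight * height, semantic, height


if __name__ == "__main__":
    total, semantic, height = multitask_loss(
        class_logits=[[0.0, 0.0]], labels=[0],
        pred_heights=[1.0], true_heights=[0.0],
    )
    print(total, semantic, height)
```

In this sketch the reference heights enter only through the second term, which mirrors why no nDSM map is required at test time: inference needs the network outputs alone, not the loss.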