Predicting dark matter halo formation in N-body simulations with deep regression networks

Dark matter haloes play a fundamental role in cosmological structure formation. The most common approach to model their assembly mechanisms is through N-body simulations. In this work we present an innovative pathway to predict dark matter halo formation from the initial density field using a Deep Learning algorithm. We implement and train a Deep Convolutional Neural Network (DCNN) to solve the task of retrieving Lagrangian patches from which dark matter halos will condense. The volumetric multi-label classification task is turned into a regression problem by means of the euclidean distance transformation. The network is complemented by an adaptive version of the watershed algorithm to form the entire protohalo identification pipeline. We show that splitting the segmentation problem into two distinct sub-tasks allows for training smaller and faster networks, while the predictive power of the pipeline remains the same. The model is trained on synthetic data derived from a single full N-body simulation and achieves deviations of ~10% when reconstructing the dark matter halo mass function at z=0. This approach represents a promising framework for learning highly non-linear relations in the primordial density field. As a practical application, our method can be used to produce mock dark matter halo catalogues directly from the initial conditions of N-body simulations.


INTRODUCTION
The fundamental information source of gravitational dynamics is the cosmic density field described by its non-linear evolution. Cosmological N-body simulations of cold dark matter show how initially over-dense regions collapse through the competition of cosmic expansion and gravity to form virialized structures called dark matter haloes. These haloes form the building blocks of large-scale structure as they define the landscape of potential wells in which baryonic matter flows to form galaxies, groups and clusters of galaxies (e.g. Wechsler & Tinker 2018;Guo et al. 2010). The structure and formation of dark matter haloes is thus an important mechanism to understand when building a complete model of galaxy formation and evolution as intrinsic galaxy properties depend on the host dark matter halo mass and morphology (Wechsler & Tinker E-mail: mauro.bernardini@uzh.ch † E-mail: lmayer@physik.uzh.ch 2018; Feldmann et al. 2019).
The source of structure formation lies in the primordial perturbations seeded in the matter field of the early Universe, where initially small density peaks grow linearly through mass accretion and mergers of smaller structures. As the evolution of the matter density field progresses into the non-linear regime, the continuing gravitational accretion starts to form distinct virialized objects, that strongly affect their corresponding environments and substructures. As individual systems transition into the fully non-linear regime, the assembly history of dark matter haloes becomes generally difficult as complicated gravitational affects like merging or tidal stripping of larger structures can occur. In this regime, N-body simulations provide the only reliable tool for accurately describing structure formation processes.
The development of analytical approximations for describing where and how structure forms in an initial random field has opened the possibility to understand the main physical aspects that drive halo assembly. The fundamental analytical work by Press & Schechter (1974) suggests that dark matter collapse occurs once the spherically smoothed linear density field exceeds a cosmology-dependent threshold value. This idea is complemented by the excursion set formalism by Bond et al. (1991) which describes gravitational collapse as a statistical framework based on density trajectories on varying scales. This framework was further developed by relaxing the assumption of spherical collapse and calibrating with numerical simulations. These semi-analytical schemes have been shown to reproduce halo statistics with acceptable error margins providing a fast method for sampling mock dark matter halo catalogues (Sheth et al. 2001;Reed et al. 2003).
In recent time, a wide variety of machine learning techniques have been investigated to solve tasks related to non-linear structure formation (e.g. Berger & Stein 2019;Aragon-Calvo 2019;He et al. 2019;Lucie-Smith et al. 2018;Zhang et al. 2019;Agarwal et al. 2017). Generally, the approach of examining fully non-linear collapse dynamics by running N-body simulations is computationally expensive. Incorporating machine learning is therefore a natural step, since these algorithms are very adaptive to a large variety of problems and arrive at predictions comparatively fast.
Convolutional Neural Networks (CNNs) form a branch of Deep Learning models that have drastically changed computer vision problems (Krizhevsky et al. 2012). As many other deep learning architectures, CNNs use the compositions of non-linear functions to model complex dependencies between input features and target variables. An interesting subsection of CNNs deals with the task of image segmentation, where networks are trained to identify and retrieve structures in input images. Through the combination of layered non-linear activation functions, CNNs learn derivatives in the input space to isolate objects of interest. Networks trained on segmentation tasks have a large variety of applications that range from identification of every day objects (e.g. Liu et al. 2015;Noh et al. 2015) up to nuclei segmentation in biological images of cancer tissue (e.g. Milletari et al. 2016;Naylor et al. 2019;Isensee et al. 2018).
In this work we present an application of machine learning based on a CNN to identify Lagrangian patches (termed protohaloes) in the primordial density field, from which dark matter haloes condense. We formulate the problem in such a way that the pipeline can be used to produce mock dark matter halo catalogues directly from the initial conditions (IC) of N-body simulations. We design the classification task of distinguishing between collapsing and background regions as a regression problem in the distance space. This alternative technique was first proposed by Naylor et al. (2019) and shows great success for images with overlapping regions. The precision of the entire pipeline is measured by comparing the reconstructed dark matter halo mass function to its corresponding ground truth.
Since identifying collapsing regions in the primordial density field is a highly non-trivial task, the success of training a machine learning model strongly depends on how this mapping is formulated numerically. Lucie-Smith et al. (2018) trained a Random Forest on data retrieved on a particle by particle basis to predict whether or not a particle will be a member of a dark matter halo at z = 0. They successfully showed that the primordial density field contains enough information to predict halo membership as the precision of their algorithm outperformed analytical frameworks like the Extended Press-Schechter formalism. They also showed that the density contrast is an information carrier to predict the final distance of particles from their corresponding halo density peaks.
Berger & Stein (2019) presented a powerful binary classification network (that inspired our own implementation) and paired it with an algorithm to extract halo regions from the reconstructed probability map. Compared to the particle information in Lucie-Smith et al. (2018), their neural network operates upon gridded density contrast information. Regarding halo formation, this approach has the advantage that the network learns to identify the important features in the input map and derivatives thereof itself, rather than being presented with a fixed feature vector as in Lucie-Smith et al. (2018). This allows the network to scan the exact shape of the local neighborhood around density peaks on different scales, in order to assess the underlying collapse dynamics. The downside of formulating halo membership in a binary approach is, that retrieving individual haloes strongly depends on border pixels having smaller probability values then centrally located cells, in order to minimize the undesired effect of haloes artificially clumping together. This indicates that the success of this method relies on the imperfection of the neural network prediction regarding the reconstructed probability map as halo borders are identified by varying probability thresholds.
Inspired by this approach, we seek to formulate a mapping where the network is trained on a target that incorporates the information of protohalo borders by construction. We deliberately focus on retrieving halo information from the initial conditions rather than from the z = 0 snapshot for two very important reasons. First, the central location of haloes changes over time through gravitational interactions as they gradually divert from the initial locations of the primordial density peaks. Second, the final halo sizes at z = 0 in full N-body simulations are small compared to the box size and thus span only a very small subsection of the simulation volume, making the target field sparsely populated. Training on sparsely populated target fields of the exact halo position and mass is a very difficult task as has been pointed out by Zhang et al. (2019) (although in a slightly different application). For these reasons we choose to design the network mapping where input and target both display information at the initial conditions. We show that by introducing gradient fields in the target map, and in particular the distance information, one can train the network to predict membership of central pixels more accurately than border cells. This gives us the advantage of well-defined borders in the target map, as information of individual objects is retained. This paper is structured as follows. In section 2 we discuss the methods we used to construct the training samples for the supervised regression problem from cosmological Nbody simulations. Then, we outline the mapping the network is trained on and in section 3 we present our network implementation and training strategy in great detail. In section 4 we describe an adaptive algorithm to reconstruct the segmentation map from the network output, in order to predict the dark matter halo distribution for a single simulation box. We assess the network precision by comparing the predicted and true halo statistics in form of the halo mass function at z = 0. We furthermore discuss distinct strengths and weak-nesses of the network at its current state when confronted with specific cases of density configurations. We propose future ideas for further improvements and conclude in section 5.

DEEP LEARNING FOR STRUCTURE IDENTIFICATION
In this section, we present the details of our implemented pipeline in the following sequence: • We present the numerical details of the N-body simulations used in this work as well as the identification algorithm of dark matter haloes at z = 0.
• We introduce the concept of a protohalo at the initial conditions, how these objects are retrieved from the simulation and how they are used in the prediction algorithm.
• We describe the numerical concept of the distance transformation as a metric in the context of protohalo boundaries and how this information allows to construct the data samples for the network mapping.

Cosmological N-body simulations
We adopt two N-body simulations of the same ΛCDM cosmology (Ω m = 0.279, Ω Λ = 0.721) produced by pkdgrav (Stadel et al. 2000), one for training and one for testing the algorithm. The comoving box size is 100 h −1 Mpc with a particle resolution of 512 3 , where each individual dark matter particle carries a physical mass of 8.84 × 10 8 M . We use the HOP halo finder by Eisenstein & Hut (1998) to retrieve dark matter halo catalogues from the simulations. This group finding algorithms first associates a density estimate for all particles. Each particle is then iteratively linked to its densest nearest neighbor until the algorithm reaches a particle which is its own densest neighbor. The collection of particles that are traced to the same densest particle forms a single group. In this way, there are no constraints on the morphology of individual haloes and since this method does not rely on a linking length, the undesired effect of bridging discrete haloes together is avoided entirely. We note that the neural network mapping is unaltered when choosing a different halo finder as the network learns the concept of a protohalo from the data samples itself.
We chose a lower halo mass threshold of 4 × 10 12 M (corresponding to 4525 particles) as smaller haloes are not sufficiently resolved for the underlying task. Running the halo finder over the corresponding z = 0 snapshots results in catalogues of 6D phase-space information for each halo. The training simulation contains 1418 dark matter haloes ranging up to ∼10 15 M where the validation simulation contains 1630 thereof.

Formulating the network mapping
The primordial density field is the source of input information from which the network learns to retrieve the collapsing protohalo regions. We define a protohalo as the collection of particles in the initial conditions that will all end up in the same halo at z = 0. The entire particle set at the ICs can therefore be split into background particles (that will not be part of any virialized structure) and multiple progenitor clumps that will collapse into distinct dark matter haloes.
Due to the fact that CNNs operate upon gridded data, we switch from the spatial particle distribution to a gridbased density contrast by means of the cloud-in-cell algorithm (e.g. Howlett et al. 2015). The grid resolution is chosen to be 512 3 , so that each cell contains approximately one particle at the initial conditions. We transform the deposited density contrast by unit-variance scaling, where the data is divided by the standard deviation of all pixel values. The distribution of the resulting dataset is centered around 0 with a standard deviation of 1. This data pre-processing step has been shown to help in faster convergence of deep learning models (LeCun et al. 1998). Naylor et al. (2019) presented a novel approach to turn a pixel-wise classification problem into a distance-based regression task. Let us assume that an image harbors multiple well-defined objects defined by different labels as seen in figure 2 (subplot b). All pixels are associated a label that is unique of the enclosing object they belong to, whereas pixels not belonging to any substructure typically carry a zero label. We define the set K x as the collection of pixels x that belong to a specific object in the image. The residual cells denoted as y define the background of this substructure, i.e. y K x . The euclidean distance transformation (EDT) of a multi-labelled map B is then defined as B d = d(B), where each cell x is assigned the euclidean distance to the closest background pixel as defined above, i.e. (2) The transformation is conducted with the edt 1 package, which implements euclidean distance transformation for multi-label regions. The main problem in binary segmentation is that close or overlapping objects tend to be segmented as one single region. Instead of predicting the protohalo membership of individual pixels in a binary fashion, each cell is assigned the nearest distance to the border of the cluster it resides within. We therefore construct the distance map of protohalo regions in the following way. Apart from general 6D phasespace information, each dark matter halo in the catalogue also carries the unique ID's of its member particles. Conversely, we record for each particle the mass of the halo it resides in at z = 0. If a particle does not belong to a halo it is assigned a halo mass label of 0. This mass information is then traced back for each particle to the initial conditions at z = 100, where it is deposited in the following hierarchical fashion. Each cell of the target map is assigned the largest halo mass label present inside it. If multiple labels are present in a single cell, the deposited value corresponds to the mass of the largest halo, where cells with no halo particles inside are assigned a zero-value corresponding to the background. This top-down approach prioritizes large scale protohaloes by construction, but since at z = 100 approximately one particle resides in each cell, almost all of the halo information is preserved. Moreover, as the labels Figure 1. A schematic overview of the entire pipeline. We describe the network mapping and how the training samples of density contrast δ and EDT (defined in eq. 2) are constructed in section 2. We discuss the network architecture, training strategy and results in section 3. In section 4 we introduce the watershed algorithm as a key element in the post-processing of the neural network predictions and describe how the halo mass function is reconstructed.
are directly retrieved from the halo catalogue at z = 0, the masses of dark matter haloes and their corresponding protohaloes equal in very good approximation. The entire pipeline is schematically shown in figure 2.
Regressing the distance information as the mapping target offers the major advantage that the networks attention is primarily drawn to the central cluster regions where the density contrast is generally largest. As outlined in section 3.2, the training loss function compares the predicted and target distance values in a direct way, such that the prediction on central cells is more weighted compared to border pixels. Additionally, as the dark matter haloes mass ranges from 4 · 10 12 M to ∼10 15 M the size of regions, and thus the distance values, vary across different scales. We therefore decide to normalize the distance map of each individual protohalo cluster, such that the value of the central cell equals 1 for all objects. The individually normalized distance map B d,n forms the regression target in the network mapping. We also investigated training on the target of non-normalized distance maps and found that the network primarily focuses on predicting the correct distance value for large scale objects while often failing on small scales. We show a sample result of these operations in figure 2.

Generating synthetic training samples
We construct the training and validation samples from the two deposited particle fields described above. The input i is the unit-variance scaled density contrast whereas the ground target g corresponds to the normalized distance map. The network is trained on regressing the correct distance value from the density contrast, where both fields represent the state at the initial conditions. The data pairs (i, g) are constructed by a tiling strategy, which is similar to the one presented by Berger & Stein (2019). We divide the entire 512 3 domain into subvolumes of 128 3 where adjacent samples overlap by 64 cells to avoid edge effects. In this manner, the two simulations are each divided into 512 samples, where the innermost 64 3 cells are unique to each subvolume. Each cube can be rotated in 6 random directions (2 per axis), where for each direction exist an additional 4 possibilities to orient the cube around the given axis, giving a total augmentation factor of 24. We augment the training and validation sets by where the associated halo mass is deposited hierarchically as described in section 2.2. This results in distinct protohalo regions marked by different colors to better convey individual structures. We then apply the EDT transformation to the entire label image and normalize on an object by object basis to produce the distance map B d, n shown in the third subplot. Also shown in green is the actual and normalized distance value (12 and 1) to the closest background cell for the innermost cell for an example protohalo. Background cells (shown in black) have zero distance. The normalized euclidean distance information conveys that the target map retains individual objects even though distinct patches might be connected. the aforementioned transformations resulting in a total set of 12'288 samples for training and validation respectively.

NEURAL NETWORK IMPLEMENTATION
CNNs achieve feature extraction from samples by convolving the inputs with filters to produce so called feature maps. The information stored in these feature maps is subsequently down-sampled by pooling operations, which are designed to preserve the important information signals. As data descends deeper into the network, the feature representations generally grow in numbers, while the size of the feature maps decreases due to information pooling. The network architecture is designed in a way that the information of individual feature maps flows together inside deeper convolution layers, allowing the network to learn non-linear combinations of identified features.

U-net architecture
In this work we make use of the U-net first introduced by Ronneberger et al. (2015) for solving bio-medical image segmentation tasks. The network itself is a fully-convolutional autoencoder consisting of two main branches, an encoding and decoding part. As in the general autoencoder case, the information compression is realized by convolutions followed by pooling operations. For image segmentation the output and input dimensions must match. In order to recover the original input dimensions, the decoding blocks are constructed by a single transpose convolution followed by consecutive convolution operations, which results in upsampled feature maps. A schematic depiction of the entire network is shown in figure 3, where the data flows in a U-shape form from the upper left to the lower middle and then reascends along the right hand side. In this process the network is able to extract advanced features, but looses localization information at the same time. In order to correct for this loss, Ronneberger et al. (2015) introduced skip connections (often termed fine-grained feature forwarding connections), that copy and concatenate the information from the corresponding encoder level with the up-flowing data in the decoder part. In this manner, the spatial information from the contraction path is directly transferred to the expanding branch without being passed through the bottleneck and deconvolution operation. Additionally, Ronneberger et al. (2015) also found that the skip connections greatly reduce training time regarding model convergence.
We implement and train our own U-net version in keras, a high-level neural networks API (Chollet et al. 2015). The individual convolution blocks are constructed by two subsequent convolutions of filter size (3 × 3 × 3) and stride 1. The first network layer deploys n f convolution filters, where we choose n f =12 in our implementation. After each convolution a non-linear activation function is applied to the data tensors. We choose LReLu, a leaky variation of ReLu (Rectifier-Linear-Unit), with an α-value of 0.05. The main reason for this choice is that deep networks with native ReLu activations have been found to occasionally suffer from vanishing gradients (e.g. Bengio et al. 1994). The main advantage of LReLu lies in its non-zero gradient, which makes the network more robust as training should in principle never stop. After the second activation in each layer the data is copied and split along two paths. The first path is the skip connection, which concatenates to the up-flowing data on the reascending network part. Along the second path the data flows through a Dropout layer (Srivastava et al. 2014) and finally passes through a single MaxPool operation with filter size (2 × 2 × 2) and stride 1. As the data descends the network reaching deeper levels, the architecture of the convolution blocks remains the same with the exception that a total of (2l × n f ) filters are applied, where l = [1, ..., L] is the corresponding layer number. The final network is constructed with a total of 5 convolution blocks (i.e. L = 5). For deeper layers the network becomes more sensitive to large scale structures in the input image since the pooling operations cut the corresponding image dimensions in half for U-net (L = 5, n f = 12) architecture/operation occurrence no. per layer l total no.
each additional layer. As the physical size of one simulation cell is ∼0.2 Mpc, the network downsampling increases the receptive field by a factor of 2 5 = 32. The deepest network filters of size (3 × 3 × 3) can therefore learn ∼20 Mpc features in the density field. In the upsampling branch features of all scales are used in conjunction with the spatial information provided by the skip connections to assemble the final prediction. We choose a native ReLu for the final activation function as the target only contains values ≥ 0.

Training the neural network
The training is conducted on individual batches of size 16 (train-on-batch strategy). For loss function minimization we use the Adaptive momentum optimizer (Adam) provided in the keras (Chollet et al. 2015) package. By the gradient descent algorithm, the network learns relevant relations to accurately predict the distance map from the input density field. However, since a training sample only covers an area of 128 3 cells, one expects the network to be less accurate when predicting the target value for pixels close to the outer border, because important parts of the density field are stored in the next adjacent subbox. Hence, the approach of splitting the entire simulation domain into subboxes introduces an obstacle in the training algorithm caused by edge effects. To correct for this, we implement a custom loss function based on the normalized L 1 loss (normalized mean-absolute-error), that is only sensitive to pixels inside a predefined receptive field. The buffer size is chosen to be 16, so that only the innermost 96 3 cube is considered when evaluating the cost function. We choose a smaller buffer size compared to the buffer of the individual subboxes, because otherwise larger structures would not entirely fit inside the receptive field of the loss function and their identification would have to span across neighboring subboxes. This is undesired, since the neural network is only presented with a single instance during prediction time. Formally, the selective loss function is constructed aŝ where g k and p k denote individual cells of ground truth and prediction. We choose to normalize the loss function by dividing by the mean ground truth distanceḡ, since different subboxes show largely varying distance values
In figure 4 we show the evolution of the loss function, where after ∼2000 iterations loss minimization stagnates and training is stopped. Despite the comparatively low number of training samples we see that the current strategy is successful as the loss steadily decreases with no overfitting occurring. The network reaches a loss value of ∼0.6 in a relatively short training period. The subsequent gradient descent however is slow resulting in the characteristic exponential tail that is often encountered when training deep learning algorithms. The small training set of only 512 unique samples presents another challenge as the global minimum may be surrounded by local minima that are characteristic for the training set. Moreover, the comparatively small batch-size of 16 as well as the train-on-batch strategy make the gradient descent less smooth. Even though in principle this can be overcome by changing the training strategy, the shortcoming is the loss of quick access control to the training algorithm because the network weights do not get updated as frequently as in the train-on-batch Figure 4. The evolution of the smoothed loss functionL 1,sel for both training and validation simulation over the entire ∼2000 training iterations, after which the loss improvements become marginally small and training is stopped. The dropout rate is set to 0.5 throughout the entire training process.

strategy.
The trained network is then used to predict the distance map for the validation simulation. A slice of the validation box is displayed in figure 5. In general, the network manages to identify the important regions of gravitational collapse and to estimate the corresponding protohalo sizes. For large and intermediate scales the general sizes of connected structures in the distance map agree well with the ground truth. There are however individual cases where the compact and intermingled structure of many nearby clusters poses a great challenge for the network as it is not able to accurately split the individual regions in the correct way. Occasionally, two intermediately sized protohalo regions will be merged by the network to produce a single cluster and vice versa. We show an example of such a faulty network prediction in the zoomed-in subplot A in figure 5, where the network does infact predict the right size of the collapsing regions but splits the density patch into two sub protohaloes. Accurately predicting the distances of small scale structure poses the largest challenge for the network as there are at least three possible interplaying reasons for this behavior: (i) The lack of numerical resolution induces signal-tonoise ratios in the density field that are too low for the network to retrieve any collapse information on small scales.
(ii) The corresponding region in the input density field does in fact show signs of an overdensity but through gravitational dynamics the small cluster is either dispersed or merges with a larger nearby cluster and its isolated signal is lost.
(iii) Regions that will form halos at z = 0 with masses slightly below the halo mass barrier of 4 · 10 12 M will not appear in the ground truth distance map, even though the signal is present in the density field at the initial conditions. This is a natural shortcoming of imposing a hard threshold on small scale structure.

FROM DISTANCE INFORMATION TO SEGMENTATION MAPS
In the following section we describe how the output from the neural network is post-processed to predict the mass of individual protohalo regions and to construct the final halo mass function.
Having trained the network to predict the desired distance information from the density contrast, we construct an adaptive algorithm that retrieves the boundaries of individual protohalo regions as defined by the metric in equation 2. We make use of the watershed algorithm, a fast and reliable tool used in digital image processing for segmenting data with overlapping regions. The name refers metaphorically to a geological watershed, which separates adjacent drainage basins, as it treats the image it operates upon like a topographic map. With the brightness of each pixel representing its height (or distance as defined in equation 2), the algorithm finds the lines that run along the tops of ridges and is able to separate adjacent regions from one another. We suggest that the interested reader consult Kornilov & Safonov (2018) for a detailed overview.
We adopt the marker-based version of the watershed algorithm, which is extremely useful for this purpose as the distance map is the only needed ingredient. In a preliminary step, the algorithm runs a local-maxima-finder to identify the positions of distance peaks, which play the important role of preselected markers. In a subsequent step, sources are placed at the marker positions, from which the image is flooded until water basins attributed to different markers meet on watershed lines. The resulting set of segmented regions constitutes a watershed segmentation by flooding.

Generating halo catalogs and bias correction
The native watershed algorithm however has a natural shortcoming when reconstructing images of structures with different scales. An example is shown in figure 6, where two clusters of different sizes are very close to one another. The middle pane shows the reconstruction from the native watershed algorithm. The general problem is that by simply flooding the image, any region where pixels have nonzero values is filled and gets assigned to one of the two clusters, which generally results in overestimated sizes when compared to the ground truth protohalos. This behaviour introduces a bias that changes with mass scale as it originates from the imperfect network predictions at cluster borders. We correct for this overestimation by introducing an adaptive element to the native watershed output that restricts the regions, within which the algorithm can fill the image. This is achieved by thresholding the distance map of each individual cluster with an adaptive value that is found by calibration with the ground truth halo mass function in the following strategy. We switch back to the unnormalized cell-based distance d by means of a spherical approximation for each cluster, where V denotes the uncorrected size in terms of individual cells andd is the normalized distance value from the prediction. By thresholding the distance map of a cluster with increasing values, we trace the evolution of the mass and the averaged distance values inside it. This results in unique trajectories in the (V,d)-space for every protohalo region (see figure 7). A rapid change in mass, while the mean distance remains constant is an indication for a smeared border that is not well defined. The information of all trajectories is accumulated in a 2-dimensional histogram that represents the occurrences of intersections in each (V,d)-segment. The ground truth halo mass function yields the information of how many halos are situated in a given size bin. Through direct calibration we then identify for each mass scale, the distance bin that offers the smallest residual in terms of halo number with respect to the ground truth. The identified bins (shown as blue data points in figure 7) are then fitted with a piece-wise linear function to understand the bias introduced by watershed as a corrective relation between V andd. We are constricted to choose a comparatively low number of bins, as the small sample number of ∼1500 haloes does not offer enough statistics and is unreliable for finer binning.
We implement the entire adaptive element as a post- Figure 5. An entire slice of the validation box displaying the input density contrast at z = 100 (left), raw and unprocessed prediction of the neural network (right) and ground truth (middle) of the normalized distance map. Also shown are insets of two regions conveying the performance of the neural network in more detail. In case B the network manages to predict the exact number of distinct clusters as well as a good estimate on the corresponding sizes, whereas case A shows a problematic prediction regarding protohalo identification where a major density patch is split into two sub-structures. This behaviour is unwanted as the reconstructed region will suffer from oversegmentation. Figure 6. An example protohalo region that conveys very well the need for an adaptive version of the watershed algorithm. On the left a zoomed in slice of two predicted clusters is shown. The middle pane shows the raw cluster reconstruction from the native watershed algorithm, whereas the result of the adaptive postprocessed version is shown on the right. Given the true protohalo boundaries overlayed in green in all three plots, it is evident that the adaptive watershed algorithm yields better estimates regarding the individual sizes of neighboring clusters, as problematic regions are filtered away.
processing step following the native watershed flooding, meaning that the entire cluster size retrieval includes the following two steps. First, the segmented image is reconstructed by the native watershed algorithm where sizes are generally overestimated. In a second step, the different reconstructed regions are thresholded depending on the mass bin where their corresponding trajectories intersect the piece-wise linear fit. The numerical reason behind this approach is that the correction remains minimal because the actual masses are held as large as possible since where it is apparent that the overestimation follows the same trend in both datasets. We manually fit a piece-wise linear function (red) through the training scatter to retrieve the adaptive relation between V andd. The corrected protohalo sizes then correspond to the bin where their trajectories intersect the fitted relation.
only the outermost borders are cut away. In this manner we construct an algorithm that is sensitive to different cluster scales as small and intermediate scale structures are prevented from being merged with nearby clusters. We find that this adaptive thresholding algorithm recovers the individual sizes much better compared to a uniform threshold value, which would suffice in the case of the neural network achieving close to perfect precision. An example of this correction is showed in figure 6. We emphasize that this step is purely due to the imperfections of the network predictions. Training deeper models with larger training sets at higher resolution may eradicate the need of this adaptive post-processing element entirely. The collection of reconstructed protohalo sizes is then converted to the final mass catalogue by means of the background density at the ICs, i.e. M =ρV. This strategy works very well in the linear regime (δ ≈ 0). In fact, as the constructed grid matches the particle resolution of 512 3 , there is approximately one particle in each grid cell, such that final halo masses at z = 0 equal the protohalo masses in very good agreement . Thus, predicting the protohalo masses allows to reconstruct the halo mass function of dark matter haloes at z = 0. In figure 8 we show the true and predicted accumulated halo mass functions as well as the percentage deviation from the ground truth for the training and validation simulation. The reconstructed statistics match the ground truth catalogs for most mass bins within an uncertainty of 10%. The deviation is largest for very massive protohalo regions of the order ∼10 14 M and beyond. Given that the total numbers of these large scale objects in the entire simulation box is comparatively low, the variety of cases that the neural network is presented with during training is especially limited on these scales. We retrieve a total of 1484 protohalo regions for the training simulation and 1541 thereof in the validation case (5% deviation in both cases). The neural network as well as the piece-wise linear correction fit generalize well to unseen data samples from the validation set ( figure  5 and 7).

CONCLUSIONS AND FUTURE WORK
We have presented a framework for the task of identifying protohalo regions on a one-by-one basis directly from the initial conditions of N-body simulations. We designed this challenging task as a hybrid approach consisting of pure Deep Learning (U-net) paired with an image reconstruction technique (watershed algorithm). We showed that by reformulating the underlying classification task as a distance-based regression problem, the comparatively small network with ∼14.5·10 6 trainable parameters is in fact able to retrieve collapsing regions despite being trained on data samples from only one simulation box.
We argued that the native marker-based watershed algorithm is not sensitive enough for reconstructing the exact halo masses as it generally overpredicts the sizes of individual regions due to blindly flooding the 3D image. We implemented a computationally fast addition to complement the native algorithm with an adaptive post-processing element that is calibrated with the training simulation. We find that this hybrid model is able to reconstruct the halo mass function at z = 0 within an uncertainty of ∼10% for most mass bins. In addition to the statistical precision, the model also achieves good agreement when comparing the actual sizes and positions of protohalo regions on a one-by-one basis as seen in figure 5. The hybrid model is designed to be robust against many complications that arise from fully non-linear dynamics in N-body simulations. The pixel-wise euclidean distance map as the target is completely free from any assumptions on the cluster morphology (e.g. spherical approximation), and bypasses the difficulties of training a multi-label classification task, where each protohalo region would be treated as a different class. Generally, the network achieves good precision on the regressed distance information and is able to isolate distinct protohalo regions.
However, a weakness of the current pipeline is that it occasionally fails to predict halo formation if multiple density patches ending up in a single halo originate from disconnected regions (as seen in case A in figure 5). It is possible that these faulty predictions originate from cases where the density configuration alone might not yield enough information regarding how patches merge or split. For this problem, we propose to include the velocity field as an additional information carrier when training the neural network as it could provide missing information on the fragmentation of individual density patches. The current setup can easily be extended to predict more halo properties since this technique only depends on how the target fields are constructed numerically. Future installments will investigate to what extent an improved hybrid model can predict merger scenarios based on information about the locations of overdensities and their relative velocities and gradients thereof. We expect that training deeper networks on larger datasets with higher resolution will decrease the systematic errors on the halo mass function. In our training strategy we made use of the normalized L 1 loss function, which measures deviations on a pixel-by-pixel level. Formulating a loss function that is more receptive and physically motivated regarding the underlying task of protohalo identification could offer a large opportunity for directing the learning process more towards the physics behind collapse dynamics (e.g. Karpatne et al. 2017).
The problem of fully non-linear structure formation across a wide range of scales offers a challenging opportunity to test and fine-tune different machine learning approaches. From the results of this work, we conclude that Deep Learning models are capable of learning the relevant combinations from the input signals to identify collapsing regions in full N-body simulations. In this sense, the presented hybrid model offers a powerful base framework towards designing more precise algorithms with the eventual objective to build cosmic emulators for fast sampling of dark matter halo catalogues. is the number of haloes with a mass greater than M. The corresponding ground truths are shown in black and the color shaded regions denote the Poisson uncertainty in the corresponding mass bin. Also shown as insets are the differential versions with the predicted and true halo number counts as well as the percentage deviation from the ground truth halo mass functions.