Population assignment from cancer genome profiling data

For a variety of human malignancies, incidence, treatment efficacy and overall prognosis show considerable variation between different populations and ethnic groups. Disentangling the effects related to particular population backgrounds can help in both understanding cancer biology and in tailoring therapeutic interventions. Because self-reported or inferred patient data can be incomplete or misleading due to migration and genomic admixture, a data-driven ancestry estimation should be preferred. While tools to map and utilize ancestry information from healthy individuals have been introduced, a population assignment based on genotyping data from somatic variation profiling of cancer samples is still missing. We analyzed sequencing-based variation data from the 1000 Genomes project, containing 2504 individuals out of 5 continental groups. This reference was then used to extract population-biased SNPs used in genotyping array platforms of varying resolutions. We found that despite widespread and extensive somatic mutations of cancer profiling data, more than 90% of cancer samples can be correctly mapped to one of the population group when compared to their paired unmutated normals. Prefiltering samples for admixed individuals increased the accuracy to 96%. This work provides a data-driven approach to estimate the population background from cancer genome profiling data. This proof-of-concept study will facilitate efforts to understand the interplay between population and ethnicity related genetic background and differences in understanding statistical and molecular differences in cancer entities with respect to possible hereditary contributions. The docker version of the tool is provided through "baudisgroup/tum2pop" in DockerHub and deposited in "baudisgroup/tum2pop-mapping" in GitHub.


Introduction
Cancer arises from the accumulation of genomic aberrations in dividing cells of virtually all types of tissues (somatic variations). The irregular cellular expansion and other hallmarks of cancer (1) can result from a plethora of mechanisms affecting multiple cellular processes. Some of the individual oncogenetic pathways can be initiated by exogenous factors, e.g. tobacco smoke or ultraviolet radiation (2). However, exposure to these carcinogenic factors contribute differently for people with different genetic background, which suggests that somatic variations can be influenced by inherited ("germline") variations (3,4). Statistics on cancer report considerable variation in incidence and prognosis between ethnicity groups (5)(6)(7)(8). While such differences have been attributed to unequal social and economical circumstances influencing risk factors and therapeutic interventions, they may also reflect the impact of population specific genomic variants with predisposing effects on malignant transformation and phenotypic behaviour. Due to the late onset of most cancers, even high-penetrance Mendelian-type variants may not be purged by natural selection and can accumulate in particular populations. Such variants may play key roles in cancer development (9). Notably, mutations on BRCA1/2 genes confer a high risk to develop breast and ovarian carcinomas. Three founder mutations in Ashkenazi Jewish population cause the BRCA1/2 mutation prevalence to be 10-fold higher than all sporadic mutations in the general population (10,11). Mitochondrial aldehyde dehydrogenase (ALDH2) encodes an enzyme in alcohol metabolism. Its "oriental" variant with 36% prevalence in East Asians, ALDH2*504Lys, increases risk for alcoholrelated liver, colorectal and esophageal cancer by alcohol consumption (12,13).
Many other studies have reported prevalent genetic variants in specific population groups which may contribute to the "racial" disparities in occurrence and prognosis (14)(15)(16). Other than these monogenic determinants, polygenic variation models for breast cancer which estimate the combined effect of multiple loci to be highly discriminatory in risk assessment, suggest the benefits of exploring genome-wide risk profiles (17). The potential impact of understanding the germline background of cancer genomes has also been demonstrated in a study which identified disease-associated chromosomal regions from only seven individual samples by using genome-wide relatedness/linkage mapping (18). This type of studies can be conducted population-wise, with sufficient number of samples from the same population/ethnicity group.
With the increasing number of available genome profiles and the decreasing cost to genotype clinical samples, the stratification between patients' genetic backgrounds has become feasible with the promise to guide therapeutic strategies and improve the clinical prognoses. Since several studies have demonstrated the relevance of considering an individual's genomic origin for preventive screening (reviewed in Foulkes et al. 2015 (11)), information about the population background of cancer patients may be an additional factor for individual therapeutic decisions as well as for the stratification of clinical study cohorts. A meta-analysis addressing the interplay between genetic background, cancer development and ther-

D R A F T
apeutic responses is desirable, not only for robust statistical associations in molecular target identification, but also for the rational design of studies incorporating informative biosamples. The "population group" of a sample can be determined based on a geographical location associated with the sample; from self-reported "race", as commonly used in U.S. census data, or based on a computational estimate of ancestry by modelling population related genomic variants. In the context of anonymized or pseudonymized research data, an approximation of a biosample's geographic origin can be achieved by using the location of the study's research facility or alternatively the contact address of its main authors. However, while these data can be easily retrieved, they may not provide an accurate representation of patients' origins for the purpose of population-specific ancestry mapping. Self-reported data is often inconsistent across studies, vague in category description (e.g. "white", "black" v.s. "Caucasian", "African") and misleading when patients do not know the migration and admixture histories of their ancestors. Overall, when associating oncogenic molecular signatures with germline variations, information from the above sources lacks in relevant detail and consistency. A better approach to population assessment would be the direct inference from genomic data. This has been shown previously for germline profiles, achieving 90% accuracy by using as few as 100 population-diverging single nucleotide polymorphisms (SNPs) (19), and nowadays is the standard methodology behind a number of commercial "ancestry" services. We hypothesise that a similar strategy can be applied to cancer genome data, despite the additional cancer-related somatic mutations which leads to both information loss (e.g. large scale homozygous or allelic deletions) and added noise (e.g. somatic mutations masking germline variants). An example of a cancer genome containing copy number loss and copy-neutral loss of heterozygocity (CN-LOH) events and its paired normal sample is shown in Figure 1. Additionally to a general test of feasibility, we also set out to benchmark population mapping procedures for heterogeneous datasets from different genotyping platforms, with varying SNP content.

Results and Discussion
We retrieved the genomic reference data from the 1000 Genomes Project (20), containing sequencing data of 2,504 individuals from 26 populations of five continental ancestries. SNPs of the selected array platforms were extracted from the sequencing data for reference samples. In order to achieve between-study consistency for selection of informative SNPs, we used a model-based approach (21) where an admixture model is optimized with the reference set for each genotyping platform. The allele frequency and ancestry fraction parameters were projected to the incoming cancer dataset of the same platform. Applying a random forest classification, we assigned the population label to the highest voted group and produced a score for the difference between highest and second highest percentage votes ( Figure  2). We benchmarked our method with various normal and cancer datasets to demonstrate the feasibility and reliability of this approach.
Cross-platform benchmarking. We first used the original data from 1000 Genomes to validate the level of resolution needed for accurate population assignment from the pipeline. The number of SNPs per platform ranged from 10,204 (Affymetrix Mapping10K) to 934,946 SNPs (Affymetrix Genome Wide SNP 6). For all nine genotyping platforms (of seven levels of resolution), the model performed equally well in capturing the informative SNPs and predicting the population category. The 26 population groups are assigned into the 5 continental categories with low margin error for all genotyping platforms (less than 12 in 2,504 individuals) (Figure 3). When evaluating the 12 mis-classified individuals by cross-validation, we discovered that they were repetitively assigned into the same aberrant category; therefore, they were removed for the final implementation (Additional File 1). In addition, we removed 396 individuals from the random forest assignment based on the admixture background.
Benchmarking normal genome profile assignment with HapMap data. To validate the general ability to map population origins from non-cancer SNP datasets, we used 112 samples found in Gene Expression Omnibus (GEO (22)) belonging to the HapMap project (23) but not included in the reference set. While most assigned labels matched the metadata from HapMap, five samples labeled "European" were assigned to the "American" category (Table 1).  score threshold for cancer samples to > 0.2, 99.0% of the now 721 remaining samples could be matched correctly (Figure 4). This comparison suggests that a correct assignment of cancer samples to a population category can be achieved, and that the level of accuracy increases with a lower admixture background of the individual. TCGA data. We performed a similar measurement with 436 randomly selected individuals from the TCGA project (24), where at least one normal and one tumor sample per individual were available. 433 out of 436 (99.1%) individuals had matched tumor/normal categories, with the three outliers switching from EUR to AMR between samples 5. Additionally, we compared our results with the values of the "race" attribute provided in the TCGA metadata. There, six categories are being distinguished: "American Indian or Alaska native", "Asian", "Black or African American", "Native Hawaiian or other Pacific Islander", "White or Not Reported". The relevant ratio of these groups is shown in Figure 5. Most assignments were accurate: EUR samples were mostly "White"; AMR has mostly "White" and a few "African American" samples; SAS are all "Asian" samples. Additionally to "Asian" samples, "American Indian or Alaska Natives" or "Pacific Islanders" were all assigned to EAS.
Together, these two validation tests confirmed that a high assignment accuracy for cancer samples can be achieved, mirroring the assignment of their corresponding germline samples. Since samples with scores lower than the threshold had a highly admixed background, we also noted a mixture between AMR and EUR observed in both normal samples and cancer samples. We attribute this to the complex, recent admixture events in the last hundreds of years in the locations from where individuals were recruited as AMR in reference data (e.g. Colombia, Puerto Rico).

Self-reported ethnicity metadata.
After the validation of the method, we tested the accuracy of self-reported metadata from various sources deposited in GEO. We retrieved a total of 1724 samples with interpretable self-reported metadata. Out of those, 1523 samples (88.3%) were correctly assigned. When setting a threshold of the assignment score to 0.2, 92.7% (1310 out of 1412) samples were correctly assigned ( Figure 6). This increase on matched assignments is more limited compared to the previous validation tests on GEO data, suggesting the contribution of curation errors and inaccuracies in self-reported ethnicity metadata.

Conclusions
We demonstrate the feasibility and accuracy of assigning population group provenance based on SNP genotyping array data of cancer samples, where somatic mutations obfuscate parts of the ancestry related SNP signal. This work can facilitate meta-analysis of available cancer data with respect to the association of the genetic background to cancer specific mutations or, as proxy, to the correct assignment of sample provenance. In addition, our method provides the basis for subsequent haplotype phasing and refinement of genomic landscape for emerging somatic variation. Concerning the delicate balance between data utilization and confidentiality protection, it has not escaped our notice that the relative feasibility of such an approach suggests the potential of individual re-identification from cancer genotyping data.

Methods
Data preparation. Reference sequencing data are provided by 1000 Genomes Project, a publicly available reference catalogue of human genotype variation. Data used for benchmarking are accessed through arrayMap database (25), using a collection of re-processed genotyping series from the GEO repository, and the TCGA data repository.
Reference data preparation. The SNP positions for each platform were acquired from Affymetrix annotation files. The allele information was extracted for all positions with vcftools. The 12 mislabeled or admixed individuals were removed from the reference dataset, leaving 2,492 individuals. The SNP positions with duplicated rsIDs in annotation files were removed. The reference/alternative alleles were flipped according to the SNP array annotation. Sites with minor allele frequency (MAF) of less than 5 percent were removed.
SNPs were subsequently pruned with variance inflation factor(VAF) at 1.5 with a sliding window of 50bp and a 5bp shift of window at each step using PLINK 1.9. The result files were stored as PLINK output for each platform in .bed, .bim and .fam formats, of which the .bim files were used to extract SNP positions from target data.
Target data preparation. The SNP array data were processed with ACNE R package (26) to extract allele-specific copy numbers as B-allele frequencies (BAF). SNPs were labeled as homozygous A, heterozygous AB or homozygous B by the BAF value in ranges 0-0.15, 0.15-0.85 or 0.85-1, respectively, to allow both for noise and expected aneuploidy in the biosamples.
Admixture model. While many approaches use principle component analysis (PCA) to select informative SNPs for population assignment, deriving them prior to clustering methods, either by removing correlated SNPs (with Pearson's r > 0.99) or by global fixation index (Fst > 0.45), results in D R A F T varying remaining SNPs between datasets. We used the allele information output from the reference panel to generate an admixture statistical model (21), which estimates the contribution of each SNP to the population category by alternately updating allele frequency and ancestry fraction parameters. Models were built with choosing the number of theoretical ancestor, K = 6, considering the crossvalidation error, iteration steps and runtime. The ancestry fraction plot for reference individuals demonstrates a proper information extraction to distinguish the five continental categories. By projecting a correspondingly learned model derived from the reference dataset to a new sample with the corresponding platform, a robust and consistent output with 6 ancestry fractions was generated.
Random Forest label assignment. 396 samples were selected out of the random forest training set, due to the admixture structure found in the ancestry fraction (Additional File 2). 2,096 (2,492 less 396) samples were used as training set. The six ancestry fractions from the reference population per platform were used to build a random forest model to predict the five population categories. The model was then used to predict a label to the population category from 6 fractions of the target sample. The score was calculated as the difference in percentage votes between the best and the second best predicted labels.