Header

UZH-Logo

Maintenance Infos

Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers


Hofmann, Ariane L; Behr, Jonas; Singer, Jochen; Kuipers, Jack; Beisel, Christian; Schraml, Peter; Moch, Holger; Beerenwinkel, Niko (2017). Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinformatics, 18(1):8.

Abstract

BACKGROUND Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment costs. However, clinicians and scientists cannot take full advantage of the generated data because the accuracy of analysis pipelines is limited. This particularly concerns the reliable identification of subclonal mutations in a cancer tissue sample with very low frequencies, which may be clinically relevant. RESULTS Using simulations based on kidney tumor data, we compared the performance of nine state-of-the-art variant callers, namely deepSNV, GATK HaplotypeCaller, GATK UnifiedGenotyper, JointSNVMix2, MuTect, SAMtools, SiNVICT, SomaticSniper, and VarScan2. The comparison was done as a function of variant allele frequencies and coverage. Our analysis revealed that deepSNV and JointSNVMix2 perform very well, especially in the low-frequency range. We attributed false positive and false negative calls of the nine tools to specific error sources and assigned them to processing steps of the pipeline. All of these errors can be expected to occur in real data sets. We found that modifying certain steps of the pipeline or parameters of the tools can lead to substantial improvements in performance. Furthermore, a novel integration strategy that combines the ranks of the variants yielded the best performance. More precisely, the rank-combination of deepSNV, JointSNVMix2, MuTect, SiNVICT and VarScan2 reached a sensitivity of 78% when fixing the precision at 90%, and outperformed all individual tools, where the maximum sensitivity was 71% with the same precision. CONCLUSIONS The choice of well-performing tools for alignment and variant calling is crucial for the correct interpretation of exome sequencing data obtained from mixed samples, and common pipelines are suboptimal. We were able to relate observed substantial differences in performance to the underlying statistical models of the tools, and to pinpoint the error sources of false positive and false negative calls. These findings might inspire new software developments that improve exome sequencing pipelines and further the field of precision cancer treatment.

Abstract

BACKGROUND Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment costs. However, clinicians and scientists cannot take full advantage of the generated data because the accuracy of analysis pipelines is limited. This particularly concerns the reliable identification of subclonal mutations in a cancer tissue sample with very low frequencies, which may be clinically relevant. RESULTS Using simulations based on kidney tumor data, we compared the performance of nine state-of-the-art variant callers, namely deepSNV, GATK HaplotypeCaller, GATK UnifiedGenotyper, JointSNVMix2, MuTect, SAMtools, SiNVICT, SomaticSniper, and VarScan2. The comparison was done as a function of variant allele frequencies and coverage. Our analysis revealed that deepSNV and JointSNVMix2 perform very well, especially in the low-frequency range. We attributed false positive and false negative calls of the nine tools to specific error sources and assigned them to processing steps of the pipeline. All of these errors can be expected to occur in real data sets. We found that modifying certain steps of the pipeline or parameters of the tools can lead to substantial improvements in performance. Furthermore, a novel integration strategy that combines the ranks of the variants yielded the best performance. More precisely, the rank-combination of deepSNV, JointSNVMix2, MuTect, SiNVICT and VarScan2 reached a sensitivity of 78% when fixing the precision at 90%, and outperformed all individual tools, where the maximum sensitivity was 71% with the same precision. CONCLUSIONS The choice of well-performing tools for alignment and variant calling is crucial for the correct interpretation of exome sequencing data obtained from mixed samples, and common pipelines are suboptimal. We were able to relate observed substantial differences in performance to the underlying statistical models of the tools, and to pinpoint the error sources of false positive and false negative calls. These findings might inspire new software developments that improve exome sequencing pipelines and further the field of precision cancer treatment.

Statistics

Citations

Dimensions.ai Metrics
2 citations in Web of Science®
1 citation in Scopus®
3 citations in Microsoft Academic
Google Scholar™

Altmetrics

Downloads

35 downloads since deposited on 10 Jan 2017
33 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Journal Article, refereed, original work
Communities & Collections:04 Faculty of Medicine > University Hospital Zurich > Institute of Pathology and Molecular Pathology
Dewey Decimal Classification:610 Medicine & health
Language:English
Date:3 January 2017
Deposited On:10 Jan 2017 17:22
Last Modified:08 Dec 2017 22:01
Publisher:BioMed Central
ISSN:1471-2105
Free access at:PubMed ID. An embargo period may apply.
Publisher DOI:https://doi.org/10.1186/s12859-016-1417-7
PubMed ID:28049408

Download

Download PDF  'Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers'.
Preview
Content: Published Version
Filetype: PDF
Size: 1MB
View at publisher
Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)