Header

UZH-Logo

Maintenance Infos

Positional bias in variant calls against draft reference assemblies


Briskine, Roman V; Shimizu, Kentaro K (2017). Positional bias in variant calls against draft reference assemblies. BMC Genomics, 18(1):263.

Abstract

Background: Whole genome resequencing projects may implement variant calling using draft reference genomes assembled de novo from short-read libraries. Despite lower quality of such assemblies, they allowed researchers to extend a wide range of population genetic and genome-wide association analyses to non-model species. As the variant calling pipelines are complex and involve many software packages, it is important to understand inherent biases and limitations at each step of the analysis.
Results: In this article, we report a positional bias present in variant calling performed against draft reference assemblies constructed from de Bruijn or string overlap graphs. We assessed how frequently variants appeared at each position counted from ends of a contig or scaffold sequence, and discovered unexpectedly high number of variants at the positions related to the length of either k-mers or reads used for the assembly. We detected the bias in both publicly available draft assemblies from Assemblathon 2 competition as well as in the assemblies we generated from our simulated short-read data. Simulations confirmed that the bias causing variants are predominantly false positives induced by reads from spatially distant repeated sequences. The bias is particularly strong in contig assemblies. Scaffolding does not eliminate the bias but tends to mitigate it because of the changes in variants’ relative positions and alterations in read alignments. The bias can be effectively reduced by filtering out the variants that reside in repetitive elements.
Conclusions: Draft genome sequences generated by several popular assemblers appear to be susceptible to the positional bias potentially affecting many resequencing projects in non-model species. The bias is inherent to the assembly algorithms and arises from their particular handling of repeated sequences. It is recommended to reduce the bias by filtering especially if higher-quality genome assembly cannot be achieved. Our findings can help other researchers to improve the quality of their variant data sets and reduce artefactual findings in downstream analyses.

Abstract

Background: Whole genome resequencing projects may implement variant calling using draft reference genomes assembled de novo from short-read libraries. Despite lower quality of such assemblies, they allowed researchers to extend a wide range of population genetic and genome-wide association analyses to non-model species. As the variant calling pipelines are complex and involve many software packages, it is important to understand inherent biases and limitations at each step of the analysis.
Results: In this article, we report a positional bias present in variant calling performed against draft reference assemblies constructed from de Bruijn or string overlap graphs. We assessed how frequently variants appeared at each position counted from ends of a contig or scaffold sequence, and discovered unexpectedly high number of variants at the positions related to the length of either k-mers or reads used for the assembly. We detected the bias in both publicly available draft assemblies from Assemblathon 2 competition as well as in the assemblies we generated from our simulated short-read data. Simulations confirmed that the bias causing variants are predominantly false positives induced by reads from spatially distant repeated sequences. The bias is particularly strong in contig assemblies. Scaffolding does not eliminate the bias but tends to mitigate it because of the changes in variants’ relative positions and alterations in read alignments. The bias can be effectively reduced by filtering out the variants that reside in repetitive elements.
Conclusions: Draft genome sequences generated by several popular assemblers appear to be susceptible to the positional bias potentially affecting many resequencing projects in non-model species. The bias is inherent to the assembly algorithms and arises from their particular handling of repeated sequences. It is recommended to reduce the bias by filtering especially if higher-quality genome assembly cannot be achieved. Our findings can help other researchers to improve the quality of their variant data sets and reduce artefactual findings in downstream analyses.

Statistics

Altmetrics

Downloads

9 downloads since deposited on 07 Apr 2017
9 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Journal Article, refereed, original work
Communities & Collections:04 Faculty of Medicine > Functional Genomics Center Zurich
07 Faculty of Science > Institute of Evolutionary Biology and Environmental Studies
Dewey Decimal Classification:570 Life sciences; biology
590 Animals (Zoology)
Uncontrolled Keywords:Reseqencing Variants Polymorphisms SNPs Positional bias Draft reference genome Repetitive elements
Language:English
Date:2017
Deposited On:07 Apr 2017 10:01
Last Modified:07 Aug 2017 02:11
Publisher:BioMed Central
ISSN:1471-2164
Free access at:PubMed ID. An embargo period may apply.
Publisher DOI:https://doi.org/10.1186/s12864-017-3637-2
Related URLs:https://doi.org/10.6084/m9.figshare.c.3728611
PubMed ID:28351369

Download

Preview Icon on Download
Preview
Content: Published Version
Filetype: PDF
Size: 1MB
View at publisher
Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)
Preview Icon on Download
Preview
Content: Supplemental Material
Filetype: PDF
Size: 1MB

TrendTerms

TrendTerms displays relevant terms of the abstract of this publication and related documents on a map. The terms and their relations were extracted from ZORA using word statistics. Their timelines are taken from ZORA as well. The bubble size of a term is proportional to the number of documents where the term occurs. Red, orange, yellow and green colors are used for terms that occur in the current document; red indicates high interlinkedness of a term with other terms, orange, yellow and green decreasing interlinkedness. Blue is used for terms that have a relation with the terms in this document, but occur in other documents.
You can navigate and zoom the map. Mouse-hovering a term displays its timeline, clicking it yields the associated documents.

Author Collaborations