Batch effects in a multiyear sequencing study: false biological trends due to changes in read lengths


Leigh, D M; Lischer, H E L; Grossen, C; Keller, L F (2018). Batch effects in a multiyear sequencing study: false biological trends due to changes in read lengths. Molecular Ecology Resources, 18(4):778-788.

Abstract

High-throughput sequencing is a powerful tool, but suffers from biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects: technical errors only present in subsets of data due to procedural changes within a study. If batch effects are overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long-term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high-throughput sequencing studies.
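The confounding described in the abstract (populations distributed nonrandomly across read lengths) can be checked before any downstream analysis. The following is a minimal sketch, not the authors' method: the sample sheet, the population names, the read lengths, and the function names `crosstab` and `confounded_populations` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample sheet: (sample_id, population, read_length_bp).
# These values are invented for illustration and are not from the study.
SAMPLES = [
    ("s1", "pop_A", 100),
    ("s2", "pop_A", 100),
    ("s3", "pop_B", 100),
    ("s4", "pop_B", 125),
    ("s5", "pop_C", 125),
    ("s6", "pop_C", 125),
]


def crosstab(samples):
    """Count samples per (population, read length) combination."""
    table = defaultdict(Counter)
    for _sample_id, population, read_length in samples:
        table[population][read_length] += 1
    return table


def confounded_populations(samples):
    """Return populations that were sequenced at a single read length only.

    For these populations, a read-length artefact cannot be separated
    from a genuine population difference using the data alone.
    """
    table = crosstab(samples)
    return sorted(pop for pop, counts in table.items() if len(counts) == 1)


if __name__ == "__main__":
    for population, counts in sorted(crosstab(SAMPLES).items()):
        print(population, dict(counts))
    print("Fully confounded with read length:", confounded_populations(SAMPLES))
```

In this toy sample sheet, only pop_B spans both read lengths, so it is the only population where allele frequencies could be compared between batches directly. A table like this does not detect false SNPs by itself; it only shows where batch and population cannot be told apart.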

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 07 Faculty of Science > Institute of Evolutionary Biology and Environmental Studies
08 Research Priority Programs > Evolution in Action: From Genomes to Ecosystems
Dewey Decimal Classification: 570 Life sciences; biology
590 Animals (Zoology)
Uncontrolled Keywords: GWAS; RADseq; genotyping error; long-term data; outlier; sequencing error
Language: English
Date: 30 July 2018
Deposited On: 19 Aug 2018 14:00
Last Modified: 24 Sep 2019 23:33
Publisher: Wiley-Blackwell Publishing, Inc.
ISSN: 1755-0998
OA Status: Green
Publisher DOI: https://doi.org/10.1111/1755-0998.12779
PubMed ID: 29573184
