Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences


Schaper, Elke; Kajava, Andrey V; Hauser, Alain; Anisimova, Maria (2012). Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research, 40(20):10005-10017.

Abstract

Tandem repeats (TRs) represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite the longstanding interest, TR detection is still not resolved. Our large-scale tests reveal that current detectors produce different, often nonoverlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of TRs in genomic data. Our simulations show that the power of detecting TRs depends on the degree of their divergence and on repeat characteristics such as the length of the minimal repeat unit and the number of units in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum-likelihood estimation of the repeat divergence. Applied in conjunction with state-of-the-art detectors, our statistical classification scheme for inferred repeats makes it possible to filter out false-positive predictions. Since different algorithms appear to specialize in predicting TRs with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats.
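The recommendation to apply multiple detectors and then filter can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each prediction already carries a p-value from some statistical test of repeat quality. The `Prediction` class, the detector names, the coordinates, and the 0.01 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    detector: str   # hypothetical name of the tool that produced the prediction
    start: int      # start coordinate in the sequence, inclusive
    end: int        # end coordinate in the sequence, exclusive
    p_value: float  # assumed to come from a statistical test of repeat quality


def filter_predictions(predictions, alpha=0.01):
    """Keep only predictions whose p-value is at or below the threshold."""
    return [p for p in predictions if p.p_value <= alpha]


def merge_overlapping(predictions):
    """Group overlapping regions and keep the lowest p-value per group."""
    merged = []
    cluster_end = None
    for p in sorted(predictions, key=lambda q: (q.start, q.end)):
        if merged and p.start < cluster_end:
            # Same cluster: extend it and keep the better-supported prediction.
            cluster_end = max(cluster_end, p.end)
            if p.p_value < merged[-1].p_value:
                merged[-1] = p
        else:
            # No overlap with the current cluster: start a new one.
            merged.append(p)
            cluster_end = p.end
    return merged


# Made-up predictions pooled from three hypothetical detectors.
pooled = [
    Prediction("detector_a", 0, 30, 0.001),
    Prediction("detector_b", 10, 40, 0.0005),
    Prediction("detector_a", 100, 120, 0.2),
    Prediction("detector_c", 200, 230, 0.005),
]

genuine = merge_overlapping(filter_predictions(pooled))
```

Pooling before filtering keeps repeats that only one detector finds, while the significance threshold discards weakly supported calls regardless of which tool made them.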


Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: National licences > 142-005
Dewey Decimal Classification: 570 Life sciences; biology
610 Medicine & health
Scopus Subject Areas: Life Sciences > Genetics
Language: English
Date: 1 November 2012
Deposited On: 08 Nov 2018 17:22
Last Modified: 26 Jan 2022 17:55
Publisher: Oxford University Press
ISSN: 0305-1048
OA Status: Gold
Free access at: PubMed ID. An embargo period may apply.
Publisher DOI: https://doi.org/10.1093/nar/gks726
PubMed ID: 22923522
  • Content: Published Version
  • Language: English
  • Description: National licence 142-005
  • Licence: Creative Commons: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)