Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations


Boulesteix, Anne-Laure; Bender, Andreas; Bermejo, Justo Lorenzo; Strobl, Carolin (2012). Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations. Briefings in Bioinformatics, 13(3):292-304.

Abstract

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, namely that common variants are systematically favoured by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present article is 3-fold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, that employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation based); (ii) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants; and (iii) to summarize these results and previously investigated properties of random forest VIMs in the context of genetic association studies and to make practical recommendations regarding the choice of the random forest and variable importance type. All our analyses can be reproduced using R code available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/.
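
The effect described in the abstract can be illustrated with a small null simulation. The sketch below is not the authors' R code from the companion website; it is a minimal, standard-library Python toy under stated assumptions: genotypes coded 0/1/2 and drawn under Hardy-Weinberg equilibrium, a balanced binary phenotype that is independent of the SNP, and a single node rather than a full forest. The helper names (`gini`, `best_gini_decrease`, `mean_null_decrease`) and all parameter values are hypothetical. It compares the average best Gini impurity decrease of an uninformative SNP across minor allele frequencies; it shows one way such a preference can arise and makes no claim about which mechanisms the article identifies.

```python
import random


def gini(pos, n):
    """Gini impurity 2p(1-p) of a node with `pos` cases among `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, labels):
    """Largest impurity decrease over the two ordered splits of a 0/1/2 SNP."""
    n = len(labels)
    n_pos = sum(labels)
    parent = gini(n_pos, n)
    best = 0.0
    for threshold in (0, 1):  # split: genotype <= threshold vs. > threshold
        left = [y for g, y in zip(genotypes, labels) if g <= threshold]
        n_left = len(left)
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue  # this split is not available in the sample
        pos_left = sum(left)
        pos_right = n_pos - pos_left
        # Sample-size-weighted impurity of the two child nodes.
        children = (n_left * gini(pos_left, n_left)
                    + n_right * gini(pos_right, n_right)) / n
        best = max(best, parent - children)
    return best


def mean_null_decrease(maf, n_samples=100, n_reps=3000, seed=1):
    """Average best Gini decrease for a SNP unrelated to the phenotype.

    Assumptions (illustrative only): Hardy-Weinberg genotypes with minor
    allele frequency `maf`, and a phenotype with 50% cases drawn
    independently of the genotype.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Two independent allele draws per individual give a 0/1/2 genotype.
        genotypes = [(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n_samples)]
        labels = [int(rng.random() < 0.5) for _ in range(n_samples)]
        total += best_gini_decrease(genotypes, labels)
    return total / n_reps


if __name__ == "__main__":
    for maf in (0.05, 0.25, 0.5):
        print(f"MAF {maf}: mean null Gini decrease {mean_null_decrease(maf):.5f}")
```

In this toy setting the SNP carries no signal at any frequency, so any difference in the averages reflects the splitting criterion rather than a true association. A rare SNP often lacks rare homozygotes in the sample and then offers only one candidate split, whereas a common SNP almost always offers two.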


Citations

15 citations in Web of Science®
18 citations in Scopus®

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Psychology
Dewey Decimal Classification: 150 Psychology
Date: 2012
Deposited On: 26 Nov 2012 09:50
Last Modified: 05 Apr 2016 16:06
Publisher: Oxford University Press
Series Name: Briefings in Bioinformatics
ISSN: 1467-5463
Additional Information: Simultaneously published as: Boulesteix, Anne-Laure; Bender, Andreas; Lorenzo Bermejo, Justo; Strobl, Carolin (2011): Random forest Gini importance favors SNPs with large minor allele frequency. Department of Statistics: Technical Reports, No. 106. See: http://www.zora.uzh.ch/67135/
Free access at: Related URL. An embargo period may apply.
Publisher DOI: https://doi.org/10.1093/bib/bbr053
Related URLs: http://epub.ub.uni-muenchen.de/12224/

Download

Full text not available from this repository. View at publisher.
