UZH-Logo

Distance-based clustering of CGH data


Liu, J; Mohammed, J; Carter, J; Ranka, S; Kahveci, T; Baudis, M (2006). Distance-based clustering of CGH data. Bioinformatics, 22(16):1971-1978.

Abstract

MOTIVATION: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer types will generally belong to the same cluster as their underlying CGH profiles will be similar. RESULTS: We focus on distance-based clustering strategies. We do this in two steps. (1) Distances of all pairs of CGH samples are computed. (2) CGH samples are clustered based on this distance. We develop three pairwise distance/similarity measures, namely raw, cosine and sim. Raw measure disregards correlation between contiguous genomic intervals. It compares the aberrations in each genomic interval separately. The remaining measures assume that consecutive genomic intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well known clustering algorithms, bottom-up, top-down and k-means with and without centroid shrinking. Our results show that sim consistently performs better than the remaining measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH datasets. The combination of sim with top-down clustering emerged as the best approach. AVAILABILITY: All software developed in this article and all the datasets are available from the authors upon request. CONTACT: juliu@cise.ufl.edu.

MOTIVATION: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer types will generally belong to the same cluster as their underlying CGH profiles will be similar. RESULTS: We focus on distance-based clustering strategies. We do this in two steps. (1) Distances of all pairs of CGH samples are computed. (2) CGH samples are clustered based on this distance. We develop three pairwise distance/similarity measures, namely raw, cosine and sim. Raw measure disregards correlation between contiguous genomic intervals. It compares the aberrations in each genomic interval separately. The remaining measures assume that consecutive genomic intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well known clustering algorithms, bottom-up, top-down and k-means with and without centroid shrinking. Our results show that sim consistently performs better than the remaining measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH datasets. The combination of sim with top-down clustering emerged as the best approach. AVAILABILITY: All software developed in this article and all the datasets are available from the authors upon request. CONTACT: juliu@cise.ufl.edu.

Citations

33 citations in Web of Science®
41 citations in Scopus®
Google Scholar™

Altmetrics

Downloads

95 downloads since deposited on 09 Jun 2009
19 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Journal Article, refereed, original work
Communities & Collections:04 Faculty of Medicine > Functional Genomics Center Zurich
07 Faculty of Science > Institute of Molecular Life Sciences
04 Faculty of Medicine > University Hospital Zurich > Division of Cellular and Molecular Pathology (former)
04 Faculty of Medicine > Institute of Molecular Cancer Research
07 Faculty of Science > Institute of Molecular Cancer Research

08 University Research Priority Programs > Systems Biology / Functional Genomics
Dewey Decimal Classification:570 Life sciences; biology
610 Medicine & health
Language:English
Date:2006
Deposited On:09 Jun 2009 14:20
Last Modified:05 Apr 2016 13:15
Publisher:Oxford University Press
ISSN:1367-4803
Additional Information:Oxford Journals – free final text
Publisher DOI:10.1093/bioinformatics/btl185
PubMed ID:16705014
Permanent URL: http://doi.org/10.5167/uzh-18924

Download

[img]
Preview
Filetype: PDF
Size: 1MB
View at publisher

TrendTerms

TrendTerms displays relevant terms of the abstract of this publication and related documents on a map. The terms and their relations were extracted from ZORA using word statistics. Their timelines are taken from ZORA as well. The bubble size of a term is proportional to the number of documents where the term occurs. Red, orange, yellow and green colors are used for terms that occur in the current document; red indicates high interlinkedness of a term with other terms, orange, yellow and green decreasing interlinkedness. Blue is used for terms that have a relation with the terms in this document, but occur in other documents.
You can navigate and zoom the map. Mouse-hovering a term displays its timeline, clicking it yields the associated documents.

Author Collaborations