Automatic topography of high-dimensional data sets by non-parametric density peak clustering


d’Errico, Maria; Facco, Elena; Laio, Alessandro; Rodriguez, Alex (2021). Automatic topography of high-dimensional data sets by non-parametric density peak clustering. Information Sciences, 560:476-492.

Abstract

Data analysis in high-dimensional spaces aims at obtaining a synthetic description of a data set, revealing its main structure and its salient features. We here introduce an approach providing this description in the form of a topography of the data, namely a human-readable chart of the probability density from which the data are harvested. The approach is based on an unsupervised extension of Density Peak clustering and on a non-parametric density estimator that measures the probability density in the manifold containing the data. This allows finding automatically the number and the height of the peaks of the probability density, and the depth of the “valleys” separating them. Importantly, the density estimator provides a measure of the error, which allows distinguishing genuine density peaks from density fluctuations due to finite sampling. The approach thus provides robust and visual information about the density peaks height, their statistical reliability and their hierarchical organization, offering a conceptually powerful extension of the standard clustering partitions. We show that this framework is particularly useful in the analysis of complex data sets.
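The general idea behind density-peak clustering can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain k-nearest-neighbour density estimate computed in the embedding dimension (the paper instead estimates the density on the manifold containing the data), a rough error of about 1/sqrt(k) on the log-density, and it omits the step that uses this error to merge peaks that are not statistically significant. The names `knn_log_density`, `density_peaks` and the parameter `k` are hypothetical.

```python
import numpy as np


def knn_log_density(X, k):
    """Log-density (up to an additive constant) from the k-th neighbour distance.

    Assumption: a plain kNN estimator in the embedding dimension d,
    log rho_i = log k - d * log r_k(i) + const, not the paper's estimator.
    """
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)           # column 0 is the point itself
    neighbours = order[:, 1:k + 1]             # indices of the k nearest neighbours
    r_k = dist[np.arange(n), order[:, k]]      # distance to the k-th neighbour
    log_rho = np.log(k) - d * np.log(r_k)
    log_rho_err = np.full(n, 1.0 / np.sqrt(k))  # rough error on log rho (assumption)
    return log_rho, log_rho_err, dist, neighbours


def density_peaks(X, k=15):
    """Find putative density peaks and assign every point to one of them.

    A point is a putative peak if its density exceeds that of all its k
    nearest neighbours. Every other point inherits the label of its nearest
    point of higher density. Assumes distinct points and no exact density ties.
    """
    log_rho, log_rho_err, dist, neighbours = knn_log_density(X, k)
    n = len(X)
    is_peak = log_rho > log_rho[neighbours].max(axis=1)
    peaks = np.flatnonzero(is_peak)

    labels = np.full(n, -1)
    labels[peaks] = np.arange(len(peaks))
    # Visit points from highest to lowest density, so the nearest
    # higher-density point is always labelled before it is needed.
    for i in np.argsort(-log_rho):
        if labels[i] >= 0:
            continue
        higher = np.flatnonzero(log_rho > log_rho[i])
        j = higher[np.argmin(dist[i, higher])]
        labels[i] = labels[j]
    return labels, peaks, log_rho, log_rho_err
```

Because of finite sampling, this sketch typically reports more putative peaks than there are genuine ones. Comparing each peak's height with the density at the border to neighbouring clusters, in units of the returned error, is the kind of test the abstract describes for separating genuine peaks from fluctuations.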

Statistics

Citations

25 citations in Web of Science®
24 citations in Scopus®

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 04 Faculty of Medicine > Functional Genomics Center Zurich
Dewey Decimal Classification: 570 Life sciences; biology
    610 Medicine & health
Scopus Subject Areas: Physical Sciences > Software
    Physical Sciences > Control and Systems Engineering
    Physical Sciences > Theoretical Computer Science
    Physical Sciences > Computer Science Applications
    Social Sciences & Humanities > Information Systems and Management
    Physical Sciences > Artificial Intelligence
Uncontrolled Keywords: Artificial Intelligence, Information Systems and Management, Computer Science Applications, Theoretical Computer Science, Control and Systems Engineering, Software
Language: English
Date: 1 June 2021
Deposited On: 26 Jan 2022 05:35
Last Modified: 26 Apr 2024 01:39
Publisher: Elsevier
ISSN: 0020-0255
OA Status: Closed
Publisher DOI: https://doi.org/10.1016/j.ins.2021.01.010
Full text not available from this repository.