Data summarization and clustering are key techniques to query and analyze large amounts of multidimensional data. However, the effectiveness of existing methods is limited by high intermediate memory cost and difficult to choose input parameters. This Ph.D. thesis develops a novel approach to hierarchical data summarization and clustering that overcomes these limitations.

We propose the AD-Tree, an innovative data summary structure that summarizes multidimensional data in terms of its density on a hierarchy of grids. The computation of the AD-Tree is an iterative process that requires a minimal amount of intermediate memory. We start computing the AD-Tree from a sparse initial grid, which gives a rough estimate of the density function of the data. We iteratively increase the estimation quality by splitting cells along dimensions where the density function is non-linear. This ensures a minimal consumption of intermediate memory: instead of overestimating the density on a fine grid and afterwards removing unnecessary grid points, we put new grid points only in places where it increases the precision of the estimation. The key challenges of our approach are the identification of areas and dimensions where the density function exhibits a non-linear behavior and a fast organization of new grid points into a hierarchy of grids that ensures an optimal memory utilization. We introduce shape error, dimensional split, grouping and compact representation of multidimensional grids to successfully solve these problems: the shape error measures the deviation of the density function from being linear on a grid, the dimensional split implements the splitting of cells in selected dimensions, the grouping organizes new grid points into large grids, and the compact representation of multidimensional grids reduces their storage costs by a factor of the dimensionality. We develop an efficient solution to approximately answer aggregate range queries from the AD-Tree.

We develop CORE, a novel clustering technique that clusters multidimensional data without any input parameters. The salient property of CORE is the explicit computation and representation of local density maxima, which permits a high-quality nonparametric clustering. CORE uses the local density maxima to determine cores of clusters. The AD-Tree, rectangular neighborhoods and gradients enable the efficient and robust computation of cores: the AD-Tree provides a uniform and compact estimation of the density of the data, the rectangular neighborhood localizes stationary points in the AD-Tree, and gradients distinguish local maxima from other types of stationary points and connect maximal cores. CORE is the first clustering technique that bases the clustering on a semantically rich data summary structure.

We investigate overlapping clusters and develop an efficient solution to separate them. The separation of overlapping clusters makes it necessary to cluster the data at all levels of the density and to consider the orientation of clusters. We use the AD-Tree, which allows CORE to find fragments and overlapping cores at all levels of the density. We restore complete cores from their fragments with the help of gradient paths. Gradient paths connect fragments through overlaps and quantify the orientation of fragments.

We analytically investigate our techniques and confirm the results with extensive experimental evaluations on synthetic and real world datasets. The results show the advantage of our techniques compared to existing methods

Taliun, A. *Hierarchical summarization of multidimensional data.* 2009, Free University of Bozen-Bolzano, Italy.

## Abstract

Data summarization and clustering are key techniques to query and analyze large amounts of multidimensional data. However, the effectiveness of existing methods is limited by high intermediate memory cost and difficult to choose input parameters. This Ph.D. thesis develops a novel approach to hierarchical data summarization and clustering that overcomes these limitations.

We propose the AD-Tree, an innovative data summary structure that summarizes multidimensional data in terms of its density on a hierarchy of grids. The computation of the AD-Tree is an iterative process that requires a minimal amount of intermediate memory. We start computing the AD-Tree from a sparse initial grid, which gives a rough estimate of the density function of the data. We iteratively increase the estimation quality by splitting cells along dimensions where the density function is non-linear. This ensures a minimal consumption of intermediate memory: instead of overestimating the density on a fine grid and afterwards removing unnecessary grid points, we put new grid points only in places where it increases the precision of the estimation. The key challenges of our approach are the identification of areas and dimensions where the density function exhibits a non-linear behavior and a fast organization of new grid points into a hierarchy of grids that ensures an optimal memory utilization. We introduce shape error, dimensional split, grouping and compact representation of multidimensional grids to successfully solve these problems: the shape error measures the deviation of the density function from being linear on a grid, the dimensional split implements the splitting of cells in selected dimensions, the grouping organizes new grid points into large grids, and the compact representation of multidimensional grids reduces their storage costs by a factor of the dimensionality. We develop an efficient solution to approximately answer aggregate range queries from the AD-Tree.

We develop CORE, a novel clustering technique that clusters multidimensional data without any input parameters. The salient property of CORE is the explicit computation and representation of local density maxima, which permits a high-quality nonparametric clustering. CORE uses the local density maxima to determine cores of clusters. The AD-Tree, rectangular neighborhoods and gradients enable the efficient and robust computation of cores: the AD-Tree provides a uniform and compact estimation of the density of the data, the rectangular neighborhood localizes stationary points in the AD-Tree, and gradients distinguish local maxima from other types of stationary points and connect maximal cores. CORE is the first clustering technique that bases the clustering on a semantically rich data summary structure.

We investigate overlapping clusters and develop an efficient solution to separate them. The separation of overlapping clusters makes it necessary to cluster the data at all levels of the density and to consider the orientation of clusters. We use the AD-Tree, which allows CORE to find fragments and overlapping cores at all levels of the density. We restore complete cores from their fragments with the help of gradient paths. Gradient paths connect fragments through overlaps and quantify the orientation of fragments.

We analytically investigate our techniques and confirm the results with extensive experimental evaluations on synthetic and real world datasets. The results show the advantage of our techniques compared to existing methods

## Downloads

## Additional indexing

Item Type: | Dissertation |
---|---|

Referees: | Böhlen M, Mazeika A |

Communities & Collections: | 03 Faculty of Economics > Department of Informatics |

Dewey Decimal Classification: | 000 Computer science, knowledge & systems |

Language: | English |

Date: | 2009 |

Deposited On: | 06 Mar 2010 15:43 |

Last Modified: | 05 Apr 2016 14:02 |

Official URL: | http://www.inf.unibz.it/~andtaliun/_files/thesis.pdf |

## Download

TrendTerms displays relevant terms of the abstract of this publication and related documents on a map. The terms and their relations were extracted from ZORA using word statistics. Their timelines are taken from ZORA as well. The bubble size of a term is proportional to the number of documents where the term occurs. Red, orange, yellow and green colors are used for terms that occur in the current document; red indicates high interlinkedness of a term with other terms, orange, yellow and green decreasing interlinkedness. Blue is used for terms that have a relation with the terms in this document, but occur in other documents.

You can navigate and zoom the map. Mouse-hovering a term displays its timeline, clicking it yields the associated documents.