Measuring structural similarity of semistructured data based on information-theoretic approaches


Helmer, Sven; Augsten, Nikolaus; Böhlen, Michael Hanspeter (2012). Measuring structural similarity of semistructured data based on information-theoretic approaches. VLDB Journal, 21(5):677-702.

Abstract

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.
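The two-step procedure can be illustrated with a minimal sketch. The abstract does not specify the linearization or the exact measures, so everything below is an assumption made for illustration: the structure is linearized as a pre-order sequence of opening and closing tags with all text content dropped, and the Kolmogorov-complexity-based measure is approximated by the normalized compression distance (NCD) computed with zlib, a common computable stand-in. The function names `linearize`, `compressed_size`, `ncd`, and `structural_distance` are hypothetical and not taken from the paper. The Shannon-entropy variant is not shown.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (assumed form): flatten the element tree into a tag sequence.

    Text content and attributes are ignored so that only structure remains.
    """
    root = ET.fromstring(xml_text)
    tokens = []

    def visit(node):
        tokens.append("<" + node.tag + ">")
        for child in node:
            visit(child)
        tokens.append("</" + node.tag + ">")

    visit(root)
    return "".join(tokens).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Compressed length, used as a rough proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2 (assumed measure): normalized compression distance."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def structural_distance(doc_a: str, doc_b: str) -> float:
    """Distance between two documents based on their structure only."""
    return ncd(linearize(doc_a), linearize(doc_b))
```

Under these assumptions, two documents with identical markup but different text linearize to the same byte string, so they receive the same distance as a document compared with itself. A general-purpose compressor adds fixed overhead, so distances on very short documents are noisy and should be read as rough indications only.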

Statistics

Citations

3 citations in Web of Science®
9 citations in Scopus®

Downloads

1 download since deposited on 29 Jan 2013
0 downloads in the past 12 months

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Language: English
Date: 2012
Deposited On: 29 Jan 2013 07:55
Last Modified: 05 Apr 2016 16:25
Publisher: Springer
ISSN: 1066-8888
Publisher DOI: https://doi.org/10.1007/s00778-012-0263-0
Other Identification Number: merlin-id:7762