UZH-Logo

Maintenance Infos

The pq-gram distance between ordered labeled trees


Augsten, N; Böhlen, M; Gamper, J (2010). The pq-gram distance between ordered labeled trees. ACM Transactions on Database Systems (TODS), 35(1):4:1-4:36.

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.

Citations

13 citations in Web of Science®
46 citations in Scopus®
Google Scholar™

Altmetrics

Downloads

83 downloads since deposited on 11 Feb 2011
35 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Journal Article, refereed, original work
Communities & Collections:03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification:000 Computer science, knowledge & systems
Date:February 2010
Deposited On:11 Feb 2011 16:58
Last Modified:05 Apr 2016 14:43
Publisher:Association for Computing Machinery
ISSN:0362-5915
Additional Information:© ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Trans. Database Syst., VOL 35, ISS 1, (2010)
Publisher DOI:https://doi.org/10.1145/1670243.1670247
Other Identification Number:1441

Download

[img]
Preview
Filetype: PDF
Size: 822kB
View at publisher

TrendTerms

TrendTerms displays relevant terms of the abstract of this publication and related documents on a map. The terms and their relations were extracted from ZORA using word statistics. Their timelines are taken from ZORA as well. The bubble size of a term is proportional to the number of documents where the term occurs. Red, orange, yellow and green colors are used for terms that occur in the current document; red indicates high interlinkedness of a term with other terms, orange, yellow and green decreasing interlinkedness. Blue is used for terms that have a relation with the terms in this document, but occur in other documents.
You can navigate and zoom the map. Mouse-hovering a term displays its timeline, clicking it yields the associated documents.

Author Collaborations