UZH-Logo

Maintenance Infos

Approximate matching of hierarchical data using pq-grams


Augsten, Nikolaus; Böhlen, Michael H; Gamper, Johann (2005). Approximate matching of hierarchical data using pq-grams. In: 31st International Conference on Very Large Data Bases, Trondheim, Norway, 30 August 2005 - 2 September 2005, 301-312.

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.

When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.

Citations

Downloads

0 downloads since deposited on 29 May 2012
0 downloads since 12 months

Additional indexing

Item Type:Conference or Workshop Item (Paper), refereed, original work
Communities & Collections:03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification:000 Computer science, knowledge & systems
Event End Date:2 September 2005
Deposited On:29 May 2012 08:55
Last Modified:05 Apr 2016 15:26
Publisher:VLDB Endowment
Series Name:VLDB '05
ISBN:1-59593-154-6
Official URL:http://dl.acm.org/citation.cfm?id=1083592.1083630
Related URLs:http://www.vldb2005.org/
Other Identification Number:merlin-id:5868
Permanent URL: http://doi.org/10.5167/uzh-56101

Download

[img]
Content: Published Version
Filetype: PDF - Registered users only
Size: 252kB

TrendTerms

TrendTerms displays relevant terms of the abstract of this publication and related documents on a map. The terms and their relations were extracted from ZORA using word statistics. Their timelines are taken from ZORA as well. The bubble size of a term is proportional to the number of documents where the term occurs. Red, orange, yellow and green colors are used for terms that occur in the current document; red indicates high interlinkedness of a term with other terms, orange, yellow and green decreasing interlinkedness. Blue is used for terms that have a relation with the terms in this document, but occur in other documents.
You can navigate and zoom the map. Mouse-hovering a term displays its timeline, clicking it yields the associated documents.

Author Collaborations