Header

UZH-Logo

Maintenance Infos

Approximate Joins for Data-Centric XML


Augsten, Nikolaus; Böhlen, Michael Hanspeter; Dyreson, Curtis; Gamper, Johann (2008). Approximate Joins for Data-Centric XML. In: ICDE 2008: 24th International Conference on 7-12 April, Auckland, New Zealand, 23 August 2008 - 28 August 2008, 814-823.

Abstract

In data integration applications, a join matches elements thatare common to two data sources. Often, however, elements are represented slightly different in each source, so an approximate join must be used. For XML data, most approxi- mate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate win- dowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.

Abstract

In data integration applications, a join matches elements thatare common to two data sources. Often, however, elements are represented slightly different in each source, so an approximate join must be used. For XML data, most approxi- mate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate win- dowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.

Statistics

Citations

17 citations in Web of Science®
31 citations in Scopus®
Google Scholar™

Altmetrics

Downloads

1 download since deposited on 14 Mar 2013
0 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Conference or Workshop Item (Paper), refereed, original work
Communities & Collections:03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification:000 Computer science, knowledge & systems
Language:English
Event End Date:28 August 2008
Deposited On:14 Mar 2013 11:36
Last Modified:08 Aug 2017 10:27
Publisher DOI:https://doi.org/10.1109/ICDE.2008.4497490
Other Identification Number:merlin-id:2297

Download

Preview Icon on Download
Filetype: PDF - Registered users only
Size: 1MB
View at publisher