Federated SPARQL Query Processing Reconciling Diversity, Flexibility and Performance on the Web of Data


Basca, C. Federated SPARQL Query Processing Reconciling Diversity, Flexibility and Performance on the Web of Data. 2015, University of Zurich, Faculty of Economics.

Abstract

Querying the ever-growing Web of Data poses a significant challenge in today’s Semantic Web. The complete lack of any centralised control leads to potentially arbitrary data distribution, high variability of latency between hosts participating in query answering, and, in the extreme, even the (sudden) unavailability of some hosts during query execution. In this thesis we address the question of how to efficiently query the Web of Data while taking into account its scale, diversity and unreliable and uncontrollable nature. We begin by introducing Avalanche, a federated SPARQL engine which: 1) makes no assumptions about how RDF data is distributed across SPARQL endpoints, 2) is adaptive to changing network conditions, i.e., can adapt to slow network connections or endpoint unavailability, 3) retrieves up-to-date results from SPARQL endpoints, and 4) is flexible by making only limited assumptions about the structure of participating triple stores.

Tailored to address the semantic heterogeneity derived from the Web of Data’s rich and broad semantic diversity, coupled with its characteristic lack of guarantees, Avalanche employs a fragmented query planning approach, under a concurrent and parallel execution model. By fragmented execution we mean that the original SPARQL query is rewritten as the union of all fragments which comprise it. A query fragment is defined as the conjunction of all query triple patterns, where each triple pattern is resolved by only one endpoint.
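
To make the fragment definition concrete, the following is a minimal Python sketch, not Avalanche's actual planner. It assumes a hypothetical mapping from each triple pattern to the endpoints that may answer it (the names `candidates` and `enumerate_fragments`, the patterns and the endpoint URLs are all illustrative) and enumerates every fragment as one assignment of each pattern to a single endpoint.

```python
from itertools import product

def enumerate_fragments(candidates):
    """Yield every fragment: one endpoint chosen per triple pattern.

    `candidates` maps a triple pattern (a string) to the list of
    endpoints assumed able to answer it (hypothetical input).
    """
    patterns = list(candidates)
    for choice in product(*(candidates[p] for p in patterns)):
        # A fragment is the conjunction of all patterns, each bound to one endpoint.
        yield list(zip(patterns, choice))

# Illustrative data only; patterns and endpoints are made up.
candidates = {
    "?person foaf:name ?name": ["http://a.example/sparql", "http://b.example/sparql"],
    "?person ex:worksAt ?org": ["http://b.example/sparql", "http://c.example/sparql"],
}

fragments = list(enumerate_fragments(candidates))
print(len(fragments))  # 2 * 2 = 4 fragments
```

Under this reading, the answer to the original query corresponds to the union of the results of all enumerated fragments. The number of fragments grows multiplicatively with the number of candidate endpoints per pattern.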

As the Web of Data continues to grow, we postulate that so does the likelihood that large numbers of endpoints will index data sharing the same vocabularies, thus forming semantically homogeneous partitions of the Semantic Web. Focusing on this scenario and in order to address some of Avalanche’s limitations, we introduce x-Avalanche, an extension of our original system. Here, we add support for disjunctions by using a distributed union operator capable of scaling to hundreds or thousands of endpoints. Furthermore, we enhance the distributed state management with: a) remote caches aimed at reducing the high latency typical of SPARQL endpoints, b) multicast parallel bind-joins exploiting the SPARQL 1.1 VALUES clause, and c) proxy-based execution of x-Avalanche operators.
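
As an illustration of the bind-join idea in b), the sketch below builds a SPARQL 1.1 query that ships already-known bindings to a remote endpoint inside a VALUES clause, so that the endpoint evaluates the pattern only for those bindings. This is a simplified, assumed formulation and not x-Avalanche's operator: the function name `build_bind_join_query`, the variable names and the IRIs are hypothetical, and only IRI bindings for a single variable are handled.

```python
def build_bind_join_query(pattern, var, iris):
    """Return a SPARQL 1.1 query restricting `var` to the given IRIs.

    Assumptions: `pattern` is a basic graph pattern string, `var` is a
    variable name without '?', and `iris` are absolute IRIs.
    """
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "SELECT * WHERE {\n"
        f"  VALUES ?{var} {{ {values} }}\n"
        f"  {pattern}\n"
        "}"
    )

# Bindings produced by an earlier join step (illustrative data).
bindings = ["http://example.org/alice", "http://example.org/bob"]
query = build_bind_join_query(
    "?person <http://example.org/worksAt> ?org .", "person", bindings
)
print(query)
```

In a multicast setting, the same generated query would presumably be sent to several endpoints in parallel and the results merged. That part is omitted here.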

Finally, in x-Avalanche, we introduce a novel and parallel-friendly optimisation paradigm designed not only to offer an optimal tradeoff between total query execution time and fast first results, but also to consider an extended planning space unexplored so far, thus taking the fragmented execution model first introduced in Avalanche to its logical conclusion. Combined, x-Avalanche’s enhancements and optimisations can lead to dramatic performance improvements over top-performing state-of-the-art federated SPARQL engines. To conclude, our results show that on average x-Avalanche can be more than one order of magnitude faster when executing SPARQL queries.

Additional indexing

Item Type: Dissertation
Referees: Bernstein Abraham
Communities & Collections: 03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Language: English
Date: 2015
Deposited On: 15 Jan 2016 07:10
Last Modified: 23 Jun 2016 15:07
Number of Pages: 148
Other Identification Number: merlin-id:12955

Download

Content: Accepted Version
Filetype: PDF
Size: 3MB
