Navigation auf zora.uzh.ch

Search

ZORA (Zurich Open Repository and Archive)

Estimating the selectivity of approximate string queries

Mazeika, Arturas; Böhlen, Michael H; Koudas, Nick; Srivastava, Divesh (2007). Estimating the selectivity of approximate string queries. ACM Transactions on Database Systems, 32(2):12.

Abstract

Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures.We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.

Additional indexing

Item Type:Journal Article, refereed, original work
Communities & Collections:03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification:000 Computer science, knowledge & systems
Scopus Subject Areas:Physical Sciences > Information Systems
Scope:Discipline-based scholarship (basic research)
Language:English
Date:2007
Deposited On:01 Jun 2012 12:57
Last Modified:19 Jul 2024 03:33
Publisher:Association for Computing Machinery
ISSN:0362-5915
OA Status:Closed
Publisher DOI:https://doi.org/10.1145/1242524.1242529
Other Identification Number:merlin-id:2809

Metadata Export

Statistics

Citations

Dimensions.ai Metrics
17 citations in Web of Science®
28 citations in Scopus®
Google Scholar™

Altmetrics

Downloads

1 download since deposited on 01 Jun 2012
0 downloads since 12 months
Detailed statistics

Authors, Affiliations, Collaborations

Similar Publications