Permanent URL to this publication: http://dx.doi.org/10.5167/uzh-56109
Mazeika, Arturas; Böhlen, Michael H (2006). Cleansing databases of misspelled proper nouns. In: CleanDB 2006, Seoul, Korea, 11 September 2006.
The paper presents a data cleansing technique for string databases. We propose and evaluate an algorithm that identifies a group of strings consisting of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of the group. Our method targets proper noun databases, including names and addresses, which are not handled by dictionaries. At the technical level we give an efficient solution for computing the center of a group of strings and determining the border of the group. We use inverse strings together with sampling to efficiently identify and cleanse a database. The experimental evaluation shows that for proper nouns the center calculation and border detection algorithms are robust and that even very small sample sizes yield good results.
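The grouping idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the paper's center calculation, border detection, inverse strings, and sampling are not specified on this page, so the sketch assumes the simplest possible stand-ins. The group center is taken to be the most frequent unassigned string, the border is a fixed Levenshtein radius, and the names `edit_distance`, `cleanse`, and `radius` are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cleanse(strings, radius=1):
    """Replace each string by the most frequent string of its group.

    Assumption (not from the paper): a group is every still-unassigned
    string within `radius` edits of the most frequent unassigned string.
    """
    counts = Counter(strings)
    # Visit candidates from most to least frequent; ties broken alphabetically.
    candidates = sorted(counts, key=lambda s: (-counts[s], s))
    replacement = {}
    for center in candidates:
        if center in replacement:
            continue  # already absorbed into an earlier group
        replacement[center] = center
        for other in candidates:
            if other not in replacement and edit_distance(center, other) <= radius:
                replacement[other] = center
    return [replacement[s] for s in strings]
```

Under these assumptions, `cleanse(["Zurich", "Zurich", "Zurich", "Zurick", "Zuric", "Seoul", "Seoul", "Seol"])` maps every variant to `"Zurich"` or `"Seoul"`. A fixed radius is a crude substitute for border detection, and the pairwise comparison is quadratic in the number of distinct strings; the abstract indicates that the paper addresses both points differently.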
|Item Type:||Conference or Workshop Item (Paper), refereed, original work|
|Communities & Collections:||03 Faculty of Economics > Department of Informatics|
|Dewey Decimal Classification:||000 Computer science, knowledge & systems|
|Event End Date:||11 September 2006|
|Deposited On:||29 May 2012 09:19|
|Last Modified:||17 Oct 2012 15:38|
|Other Identification Number:||merlin-id:6164|