
Machine learning for cross-gazetteer matching of natural features


Acheson, Elise; Volpi, Michele; Purves, Ross S (2019). Machine learning for cross-gazetteer matching of natural features. International Journal of Geographical Information Science:1-27.

Abstract

Defining and identifying duplicate records in a dataset is a challenging task which grows more complex when the modeled entities themselves are hard to delineate. In the geospatial domain, it may not be clear where a mountain, stream, or valley ends and begins, a problem carried over when such entities are catalogued in gazetteers. In this paper, we take two gazetteers, GeoNames and SwissNames3D, and perform matching – identifying records in each that are about the same entity – across a sample of natural feature records. We first perform rule-based matching, establishing competitive results, then apply machine learning using Random Forests, a method well-suited to the matching task. We report on the performance of a wider array of matching features than has been previously studied, including domain-specific ones such as feature type, land cover class, and elevation. Our results show an increase in performance using machine learning over rules, with a notable performance gain from considering feature types, but negligible gains from other specialized matching features. We argue that future work in this area should strive to be more reproducible and report results on a realistic testing pipeline including candidate selection, feature extraction, and classification.
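The abstract describes a pipeline of candidate selection, feature extraction over record pairs, and classification with Random Forests. As a rough illustration only, and not the authors' code, a minimal sketch of that kind of pair-classification setup in Python with scikit-learn might look like the following; the field names, similarity measures, and matching features are illustrative assumptions, and candidate pairs are assumed to have been selected already.

```python
# Hypothetical sketch of cross-gazetteer pair matching in the spirit of the
# paper's pipeline (candidate selection -> feature extraction -> Random Forest).
# Field names, features, and parameters are assumptions for illustration.
from difflib import SequenceMatcher
from math import radians, sin, cos, asin, sqrt

from sklearn.ensemble import RandomForestClassifier

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def pair_features(rec_a, rec_b):
    """Turn one candidate pair (e.g. a GeoNames record and a SwissNames3D record)
    into a feature vector: name similarity, spatial distance, feature-type
    agreement, and elevation difference."""
    return [
        SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio(),
        haversine_km(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]),
        1.0 if rec_a["feature_type"] == rec_b["feature_type"] else 0.0,
        abs(rec_a.get("elevation", 0.0) - rec_b.get("elevation", 0.0)),
    ]

def train_matcher(candidate_pairs, labels):
    """Fit a Random Forest on labelled candidate pairs.

    candidate_pairs: list of (record_a, record_b) tuples from a candidate-selection step
    labels: 1 for a confirmed match, 0 for a non-match (e.g. from a hand-labelled sample)
    """
    X = [pair_features(a, b) for a, b in candidate_pairs]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf
```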

Statistics

Citations

10 citations in Web of Science®
16 citations in Scopus®

Downloads

102 downloads since deposited on 05 Sep 2019
17 downloads in the last 12 months

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 07 Faculty of Science > Institute of Geography
Dewey Decimal Classification: 910 Geography & travel
Scopus Subject Areas: Physical Sciences > Information Systems; Social Sciences & Humanities > Geography, Planning and Development; Social Sciences & Humanities > Library and Information Sciences
Uncontrolled Keywords: Geography, Planning and Development, Library and Information Sciences, Information Systems
Language: English
Date: 22 April 2019
Deposited On: 05 Sep 2019 15:07
Last Modified: 22 Sep 2023 01:45
Publisher: Taylor & Francis
ISSN: 1365-8816
OA Status: Hybrid
Publisher DOI: https://doi.org/10.1080/13658816.2019.1599123
Project Information:
  • Funder: SNSF
  • Grant ID: 200021E-166788
  • Project Title: Extraction and visually driven analysis of geography and dynamics of people's reaction to events

Download:
  • Content: Published Version
  • Language: English
  • Licence: Creative Commons: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)