Identifying landscape relevant natural language using actively crowdsourced landscape descriptions and sentence-transformers


Baer, Manuel F; Purves, Ross S (2023). Identifying landscape relevant natural language using actively crowdsourced landscape descriptions and sentence-transformers. Künstliche Intelligenz, 37(1):55-67.

Abstract

Natural language has proven to be a valuable source of data for various scientific inquiries, including landscape perception and preference research. However, large, high-quality, landscape-relevant corpora are scarce. Here we propose and discuss a natural language processing workflow to identify landscape-relevant documents in large collections of unstructured text. Using a small, curated, high-quality collection of actively crowdsourced landscape descriptions, we identify and extract similar documents from two different corpora (Geograph and WikiHow) using sentence-transformers and cosine similarity scores. We show that 1) sentence-transformers combined with cosine similarity calculations successfully identify similar documents in both Geograph and WikiHow, effectively opening the door to the creation of new landscape-specific corpora, 2) the proposed sentence-transformer approach outperforms traditional Term Frequency - Inverse Document Frequency based approaches, and 3) the identified documents capture similar topics when compared to the original high-quality collection. The presented workflow is transferable to various scientific disciplines in need of domain-specific natural language corpora as underlying data.
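The retrieval step described in the abstract (embed a small seed collection, embed the candidate documents, keep candidates whose cosine similarity to the seeds is high) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the threshold value, the choice of scoring each candidate by its best-matching seed, and above all `toy_embed`, which is a simple word-count stand-in so that the example runs without external dependencies. In an actual workflow the embeddings would come from a sentence-transformer model as dense vectors, and the cosine function would operate on those vectors instead of on word counts.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence embedding: lowercase word counts.

    A real workflow would replace this with dense vectors produced by a
    sentence-transformer model; this placeholder only keeps the sketch runnable.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as word -> count."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(seed_docs, candidate_docs, threshold=0.5, embed=toy_embed):
    """Return (document, score) pairs whose best seed similarity meets the threshold.

    Scoring by the maximum similarity over all seeds and the default threshold
    are illustrative assumptions, not the procedure reported in the paper.
    """
    seed_vectors = [embed(doc) for doc in seed_docs]
    selected = []
    for doc in candidate_docs:
        vector = embed(doc)
        best = max(cosine(vector, seed) for seed in seed_vectors)
        if best >= threshold:
            selected.append((doc, best))
    # Highest-scoring documents first
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    seeds = ["green hills and a quiet lake"]
    candidates = [
        "a quiet lake below green hills",
        "how to fix a flat tire",
    ]
    for doc, score in select_similar(seeds, candidates):
        print(f"{score:.3f}  {doc}")
```

The `embed` parameter is kept pluggable so that the word-count placeholder can be swapped for a real embedding function, provided `cosine` is adapted to the vector type that function returns.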

Statistics

Downloads

4 downloads since deposited on 03 Nov 2023
4 downloads since 12 months

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 07 Faculty of Science > Institute of Geography
06 Faculty of Arts > Zurich Center for Linguistics
Dewey Decimal Classification: 910 Geography & travel
Scopus Subject Areas: Physical Sciences > Artificial Intelligence
Uncontrolled Keywords: Artificial Intelligence
Language: English
Date: 1 March 2023
Deposited On: 03 Nov 2023 15:48
Last Modified: 29 Apr 2024 01:40
Publisher: Springer
ISSN: 0933-1875
OA Status: Hybrid
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.1007/s13218-022-00793-3
Project Information:
  • Funder: URPP - Language and Space
  • Funder: University of Zurich
  • Content: Published Version
  • Language: English
  • Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)