ZORA (Zurich Open Repository and Archive)

Curating global datasets of structural linguistic features for independence

Graff, Anna; Chousou-Polydouri, Natalia; Inman, David; Skirgård, Hedvig; Lischka, Marc; Zakharko, Taras; Barbieri, Chiara; Bickel, Balthasar (2025). Curating global datasets of structural linguistic features for independence. Scientific Data, 12(1):106.

Abstract

The increasing availability of cross-linguistic databases dedicated to documenting morphosyntactic, lexical and phonological features has proliferated the use of such data for studies on language evolution and human history. However, most of these databases were not designed to ensure independence of features, such that it is not valid to jointly use all their features in large-scale statistical analyses assuming independence of inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset), and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies of features and forms of strong statistical dependencies that go beyond phylogenetic and geographical signal. They are also made available in densified form, reducing the proportion of missing data. We document our curation principles and workflows to ensure reusability of this framework with other inputs or thresholds of independence. Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Department of Comparative Language Science
07 Faculty of Science > Institute of Evolutionary Biology and Environmental Studies
06 Faculty of Arts > Zurich Center for Linguistics
Special Collections > NCCR Evolving Language
Special Collections > Centers of Competence > Center for the Interdisciplinary Study of Language Evolution
08 Research Priority Programs > Evolution in Action: From Genomes to Ecosystems
Dewey Decimal Classification: 490 Other languages
410 Linguistics
890 Other literatures
590 Animals (Zoology)
570 Life sciences; biology
Language: English
Date: 18 January 2025
Deposited On: 22 Jan 2025 14:49
Last Modified: 28 Feb 2025 02:43
Publisher: Nature Publishing Group
ISSN: 2052-4463
OA Status: Gold
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.1038/s41597-024-04319-4
PubMed ID: 39827249
Project Information:
  • Funder: Max-Planck-Institut für Evolutionäre Anthropologie
  • Grant ID:
  • Project Title:
  • Funder: Department of Linguistic and Cultural Evolution, Max Planck Institute for Evolutionary Anthropology
  • Grant ID:
  • Project Title:
  • Funder: CB was supported by the University Research Priority Program "Evolution in Action" of the University of Zurich
  • Grant ID:
  • Project Title:
Full text (PDF):
  • Content: Published Version
  • Language: English
  • Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)
