Publication:

Lexedata: A toolbox to edit CLDF lexical datasets

Date

2022
Journal Article
Published version
cris.virtual.orcid: 0000-0002-8155-9089
cris.virtual.orcid: 0000-0002-5693-975X
cris.virtualsource.orcid: 590d98c9-1b4c-4b84-b89e-eefe9cfbf98d
cris.virtualsource.orcid: 18ce48f2-74b5-4864-86c4-b301d8896017
dc.contributor.institution: University of Zurich
dc.date.accessioned: 2022-06-16T12:47:15Z
dc.date.available: 2022-06-16T12:47:15Z
dc.date.issued: 2022-04-20
dc.description.abstract

Lexedata is a collection of tools to support the editing process of comparative lexical data. Wordlists are a comparatively easily collected type of language documentation that is nonetheless quite data-rich and useful for the systematic comparison of languages (List et al., 2021). They are an important resource in comparative and historical linguistics, including their use as raw data for language phylogenetics (Gray et al., 2009; Grollemund et al., 2015).

The lexedata package uses the “Cross-Linguistic Data Format” (CLDF; Forkel et al., 2018, 2021) as the main data format for a relational database containing forms, languages, concepts, and etymological relationships. The CLDF specification builds on the W3C's CSV on the Web (CSVW; Pollock et al., 2015) specification, so a CLDF dataset consists of one or more comma-separated value (CSV) files that get their semantics from a metadata file in JSON format.
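
The pairing of CSV files with a JSON metadata file can be illustrated with a minimal sketch that uses only the Python standard library. The metadata below is a heavily trimmed, hypothetical example (a real CLDF metadata file declares more properties), the two forms are invented, and `read_table` is an illustrative helper, not part of Lexedata or of any CLDF library.

```python
import csv
import io
import json

# Hypothetical, trimmed-down metadata: it names one table and its columns.
# Real CLDF metadata files carry further properties that are omitted here.
METADATA = json.loads("""
{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
  "tables": [
    {
      "url": "forms.csv",
      "tableSchema": {
        "columns": [
          {"name": "ID", "datatype": "string"},
          {"name": "Language_ID", "datatype": "string"},
          {"name": "Parameter_ID", "datatype": "string"},
          {"name": "Form", "datatype": "string"}
        ]
      }
    }
  ]
}
""")

# Invented example content standing in for the file forms.csv.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
f1,lang_a,hand,mano
f2,lang_b,hand,main
"""


def read_table(metadata, url, text):
    """Read CSV text, checking its header against the columns in the metadata."""
    table = next(t for t in metadata["tables"] if t["url"] == url)
    declared = [column["name"] for column in table["tableSchema"]["columns"]]
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != declared:
        raise ValueError(f"{url}: header {reader.fieldnames} != declared {declared}")
    return list(reader)


forms = read_table(METADATA, "forms.csv", FORMS_CSV)
for row in forms:
    print(row["Language_ID"], row["Parameter_ID"], row["Form"])
```

The sketch only checks column names; in practice a CSVW-aware library would also interpret datatypes and references between tables.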

Implemented in Python as a set of command line tools, Lexedata provides various helper functions to address issues that frequently arise when working with comparative wordlists for multiple languages, as shown in Figure 1. These include importing from and exporting to formats more familiar to linguists, as well as bulk edit functions and associated integrity checks. For example, there are scripts for importing data from MS Excel sheets of various common formats into CLDF, checking for homophones, manipulating etymological judgements, and exporting coded datasets for use in phylogenetic software.
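
As a rough illustration of what a homophone check involves, the following sketch groups forms by language and form string and reports those linked to more than one concept. It is not Lexedata's implementation: the function name `find_homophones`, the column names, and the sample rows are assumptions made for this example.

```python
from collections import defaultdict


def find_homophones(forms):
    """Map (language, form) to its concepts wherever one form covers several concepts."""
    concepts_by_shape = defaultdict(set)
    for row in forms:
        key = (row["Language_ID"], row["Form"])
        concepts_by_shape[key].add(row["Parameter_ID"])
    return {
        key: sorted(concepts)
        for key, concepts in concepts_by_shape.items()
        if len(concepts) > 1
    }


# Invented sample rows; the column names are assumed for this sketch.
rows = [
    {"ID": "f1", "Language_ID": "lang_a", "Parameter_ID": "arm", "Form": "ruka"},
    {"ID": "f2", "Language_ID": "lang_a", "Parameter_ID": "hand", "Form": "ruka"},
    {"ID": "f3", "Language_ID": "lang_a", "Parameter_ID": "foot", "Form": "noga"},
    {"ID": "f4", "Language_ID": "lang_b", "Parameter_ID": "hand", "Form": "ruka"},
]

homophones = find_homophones(rows)
for (language, form), concepts in homophones.items():
    print(language, form, concepts)
```

Identical forms in different languages are deliberately not grouped, since homophony is a property within a single language.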

dc.identifier.doi: 10.21105/joss.04140
dc.identifier.issn: 2475-9066
dc.identifier.uri: https://www.zora.uzh.ch/handle/20.500.14742/196193
dc.language.iso: eng
dc.subject.ddc: 400 Language
dc.subject.ddc: 490 Other languages
dc.subject.ddc: 890 Other literatures
dc.subject.ddc: 410 Linguistics
dc.subject.ddc: 910 Geography & travel
dc.title

Lexedata: A toolbox to edit CLDF lexical datasets

dc.type: article
dcterms.accessRights: info:eu-repo/semantics/openAccess
dcterms.bibliographicCitation.journaltitle: Journal of Open Source Software
dcterms.bibliographicCitation.number: 72
dcterms.bibliographicCitation.originalpublishername: Open Journals
dcterms.bibliographicCitation.pagestart: 4140
dcterms.bibliographicCitation.volume: 7
dspace.entity.type: Publication (en)
uzh.contributor.author: Kaiping, Gereon A
uzh.contributor.author: Steiger, Melvin S
uzh.contributor.author: Chousou-Polydouri, Natalia
uzh.contributor.correspondence: Yes
uzh.contributor.correspondence: No
uzh.contributor.correspondence: No
uzh.document.availability: published_version
uzh.eprint.datestamp: 2022-06-16 12:47:15
uzh.eprint.lastmod: 2022-06-16 12:47:18
uzh.eprint.statusChange: 2022-06-16 12:47:15
uzh.harvester.eth: Yes
uzh.harvester.nb: No
uzh.identifier.doi: 10.5167/uzh-219030
uzh.jdb.eprintsId: 42654
uzh.oastatus.unpaywall: gold
uzh.oastatus.zora: Gold
uzh.publication.citation: Kaiping, G. A., Steiger, M. S., & Chousou-Polydouri, N. (2022). Lexedata: A toolbox to edit CLDF lexical datasets. Journal of Open Source Software, 7, 4140. https://doi.org/10.21105/joss.04140
uzh.publication.freeAccessAt: doi
uzh.publication.originalwork: original
uzh.publication.publishedStatus: final
uzh.workflow.doaj: uzh.workflow.doaj.true
uzh.workflow.eprintid: 219030
uzh.workflow.fulltextStatus: public
uzh.workflow.revisions: 6
uzh.workflow.rightsCheck: keininfo
uzh.workflow.source: Crossref:10.21105/joss.04140
uzh.workflow.status: archive
Files

Original bundle

Name:
ZORA_joss_04140.pdf
Size:
312.82 KB
Format:
Adobe Portable Document Format