Abstract
Lexedata is a collection of tools that support the editing of comparative lexical data. Wordlists are a type of language documentation that is comparatively easy to collect, yet data-rich and useful for the systematic comparison of languages (List et al., 2021). They are an important resource in comparative and historical linguistics, and they serve as raw data for language phylogenetics (Gray et al., 2009; Grollemund et al., 2015).
The lexedata package uses the “Cross-Linguistic Data Format” (CLDF; Forkel et al., 2018, 2021) as its main data format, treating a dataset as a relational database of forms, languages, concepts, and etymological relationships. The CLDF specification builds on the W3C's CSV on the Web (CSVW; Pollock et al., 2015) specification: a dataset consists of one or more comma-separated values (CSV) files that get their semantics from a metadata file in JSON format.
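To make the relationship between the CSV files and the JSON metadata concrete, the following minimal sketch reads a tiny form table using only the Python standard library. It is an illustration under stated assumptions, not Lexedata's own code: the metadata is hand-written and heavily abbreviated, the two forms are made-up example rows, and `read_table` is a hypothetical helper. Real datasets would normally be read with a dedicated CLDF library.

```python
import csv
import io
import json

# Hand-written, abbreviated metadata in the spirit of a CLDF Wordlist.
# Each column's "propertyUrl" ties a CSV column name to a CLDF term.
METADATA_JSON = """
{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
  "tables": [
    {
      "url": "forms.csv",
      "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable",
      "tableSchema": {
        "columns": [
          {"name": "ID",
           "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"},
          {"name": "Language_ID",
           "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference"},
          {"name": "Parameter_ID",
           "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference"},
          {"name": "Form",
           "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#form"}
        ]
      }
    }
  ]
}
"""

# Made-up example rows standing in for the contents of forms.csv.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
f1,lang_a,hand,mano
f2,lang_b,hand,main
"""


def read_table(metadata, url, csv_text):
    """Hypothetical helper: return rows keyed by CLDF term, not column name."""
    table = next(t for t in metadata["tables"] if t["url"] == url)
    # Map each column name to the short term after '#' in its propertyUrl.
    semantics = {
        column["name"]: column["propertyUrl"].rsplit("#", 1)[-1]
        for column in table["tableSchema"]["columns"]
    }
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({semantics[name]: value for name, value in row.items()})
    return rows


metadata = json.loads(METADATA_JSON)
forms = read_table(metadata, "forms.csv", FORMS_CSV)
for form in forms:
    print(form)
```

Because the semantics live in the metadata rather than in the column headers, a dataset could rename `Form` to some other header and the same lookup would still find the column that holds the form.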
Implemented in Python as a set of command-line tools, Lexedata provides helper functions that address issues frequently arising when working with comparative wordlists for multiple languages, as shown in Figure 1. These include importing from and exporting to formats more familiar to linguists, as well as bulk-edit functions and associated integrity checks. For example, there are scripts for importing data from Microsoft Excel sheets in various common formats into CLDF, checking for homophones, manipulating etymological judgements, and exporting coded datasets for use in phylogenetic software.
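As an illustration of the kind of integrity check mentioned above, the sketch below flags candidate homophones: identical forms within one language that are linked to different concepts. It is a simplified assumption about what such a check involves, not a description of Lexedata's implementation; `find_homophones` is a hypothetical function, the rows are made-up, and the column names follow common CLDF defaults.

```python
from collections import defaultdict


def find_homophones(forms):
    """Group forms by (language, form) and keep groups with several concepts."""
    concepts_by_shape = defaultdict(set)
    for form in forms:
        key = (form["Language_ID"], form["Form"])
        concepts_by_shape[key].add(form["Parameter_ID"])
    # Only identical forms attached to more than one concept are reported.
    return {
        key: sorted(concepts)
        for key, concepts in concepts_by_shape.items()
        if len(concepts) > 1
    }


# Made-up example rows; a real check would read them from the form table.
forms = [
    {"ID": "f1", "Language_ID": "lang_a", "Parameter_ID": "arm", "Form": "ruka"},
    {"ID": "f2", "Language_ID": "lang_a", "Parameter_ID": "hand", "Form": "ruka"},
    {"ID": "f3", "Language_ID": "lang_b", "Parameter_ID": "hand", "Form": "ruka"},
    {"ID": "f4", "Language_ID": "lang_a", "Parameter_ID": "foot", "Form": "noga"},
]

homophones = find_homophones(forms)
for (language, shape), concepts in homophones.items():
    print(f"{language}: '{shape}' is used for {', '.join(concepts)}")
```

A flagged pair is only a candidate for review: the two entries may be genuine homophones, a single polysemous word, or a data-entry error, and deciding between these remains an editorial judgement.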