Abstract
How can infants detect where words or morphemes start and end in the continuous stream of speech? Previous computational studies have investigated this question mainly for English, where morpheme and word boundaries are often isomorphic. Yet in many languages, words are often multimorphemic, such that word and morpheme boundaries do not align. In Experiment 1, we used corpora of two languages that differ in the complexity of their inflectional morphology, Chintang (Sino-Tibetan) and Japanese. In Experiments 2 and 3, we used corpora of artificial languages varying in morphological complexity, as measured by the ratio and distribution of morphemes per word. We applied two baselines and three conceptually diverse word segmentation algorithms, two of which rely purely on sublexical information in the form of distributional cues and one of which builds a lexicon. The algorithms' performance was evaluated on both word- and morpheme-level representations of the corpora. Segmentation results were better for the morphologically simpler languages than for the more complex ones, in line with the hypothesis that languages with greater inflectional complexity could be more difficult to segment into words. We further show that the effect of morphological complexity is relatively small compared to that of algorithm and evaluation level. We therefore recommend that infancy researchers look for signatures of the different segmentation algorithms and strategies before looking for differences in infant segmentation landmarks across languages varying in complexity.