Abstract
We use the information-theoretic measure of surprisal to analyze the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, and then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure for the idiom principle, because it expresses reader expectations and text entropy. Since strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyze differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language (L2) as compared with native language (L1). We thus test the hypothesis of Pawley and Syder (1983) that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with the uniform information density (UID) of Levy and Jaeger (2007), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
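
To make the central measure concrete, the following is a minimal sketch of how per-word surprisal, the negative log probability of a word given its context, could be computed. The abstract does not specify the language model, so the bigram model, the add-one smoothing, the toy corpus and all function names here are illustrative assumptions rather than the method used in the study.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration only, not data from the study).
tokens = "on the other hand we see that on the other hand it is clear that".split()

# Counts for a bigram model: how often each word occurs as a left context,
# and how often each (previous word, word) pair occurs.
contexts = Counter(tokens[:-1])
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = set(tokens)

def surprisal(prev, word):
    """Surprisal in bits of `word` given `prev`, using add-one smoothing."""
    prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(prob)

def mean_surprisal(sequence):
    """Average bigram surprisal over a token sequence (needs >= 2 tokens)."""
    pairs = list(zip(sequence, sequence[1:]))
    return sum(surprisal(p, w) for p, w in pairs) / len(pairs)

# A recurring bundle versus the same words in an unattested order.
formulaic = ["on", "the", "other", "hand"]
novel = ["hand", "the", "on", "other"]
print(mean_surprisal(formulaic), mean_surprisal(novel))
```

Under these assumptions the recurring sequence receives lower mean surprisal than the reordered one, which is the sense in which low surprisal can stand in for bundleness without listing individual bundles.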