From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language


Schneider, Gerold; Grigonyte, Gintare (2018). From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language. In: Kopaczyk, Joanna; Tyrkkö, Jukka. Applications of Pattern-driven Methods in Corpus Linguistics. Amsterdam: Benjamins, 15-56.

Abstract

We exploit the information-theoretic measure of surprisal to analyze the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, then we argue that abstracting to surprisal as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity is an appropriate measure for the idiom principle, as it expresses reader expectations and text entropy. As strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use, and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder (1983)’s hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle), which can be measured with Levy and Jaeger (2007)’s uniform information density (UID), which is a principle of minimizing comprehension difficulty. Our goal to abstract away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
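As an illustration of the measure named in the abstract, the surprisal of a word is the negative log probability of that word given its context. The following is a minimal sketch only, assuming a bigram model with add-alpha smoothing; the abstract does not specify the authors' model, and the function name `bigram_surprisal`, the parameter `alpha`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bigram_surprisal(train_tokens, test_tokens, alpha=1.0):
    """Per-token surprisal (in bits) of test_tokens under an add-alpha bigram model.

    Illustrative assumption: surprisal(w_i) = -log2 P(w_i | w_{i-1}),
    with P estimated from bigram counts in train_tokens.
    """
    vocab = set(train_tokens) | set(test_tokens)
    v = len(vocab)
    # How often each word occurs as the left-hand context of a bigram.
    contexts = Counter(train_tokens[:-1])
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    surprisals = []
    for prev, word in zip(test_tokens, test_tokens[1:]):
        p = (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * v)
        surprisals.append(-math.log2(p))
    return surprisals


def mean(values):
    return sum(values) / len(values)


# Toy data (hypothetical): a bundle seen in training vs. a scrambled sequence.
train = "on the other hand we see that on the other side".split()
formulaic = bigram_surprisal(train, "on the other hand".split())
scrambled = bigram_surprisal(train, "hand the on other".split())
print(mean(formulaic), mean(scrambled))
```

Under these assumptions, a sequence that recurs in the training data receives lower mean surprisal than a scrambled one, which is the sense in which surprisal can serve as a gradient measure of bundleness.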

Additional indexing

Item Type: Book Section, original work
Communities & Collections: 06 Faculty of Arts > English Department
06 Faculty of Arts > Institute of Computational Linguistics
06 Faculty of Arts > Center for Linguistics
Dewey Decimal Classification: 820 English & Old English literatures
Language: English
Date: 2018
Deposited On: 07 May 2018 08:48
Last Modified: 07 May 2018 08:48
Publisher: Benjamins
Series Name: Studies in Corpus Linguistics
Number: 82
Publisher DOI: https://doi.org/10.1075/scl.82.02sch
Related URLs: https://www.benjamins.com/catalog/scl.82 (Publisher)
