ZORA (Zurich Open Repository and Archive)

From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language

Schneider, Gerold; Grigonyte, Gintare (2018). From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language. In: Kopaczyk, Joanna; Tyrkkö, Jukka. Applications of Pattern-driven Methods in Corpus Linguistics. Amsterdam: Benjamins, 15-56.

Abstract

We exploit the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure of the idiom principle, as it expresses reader expectations and text entropy. As strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use, and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder's (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair's idiom principle) and expressiveness (Sinclair's open-choice principle), which can be measured with Levy and Jaeger's (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
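As a rough illustration of the measure named in the abstract: the surprisal of a word given its context is commonly defined as the negative log probability, -log2 P(word | context). The sketch below estimates it from bigram counts with add-one smoothing on a toy corpus. The corpus, the function names, the bigram context and the smoothing are all assumptions made for illustration; this is not the authors' implementation or data.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    context_counts = Counter()  # how often each token occurs as a left context
    bigram_counts = Counter()   # how often each (previous, current) pair occurs
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens  # "<s>" marks the sentence start
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def surprisal(prev, cur, context_counts, bigram_counts, vocab_size):
    """Surprisal in bits: -log2 P(cur | prev), with add-one smoothing (an assumption)."""
    p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
    return -math.log2(p)


def mean_surprisal(tokens, context_counts, bigram_counts, vocab_size):
    """Average per-word surprisal of a token sequence."""
    padded = ["<s>"] + tokens
    values = [
        surprisal(prev, cur, context_counts, bigram_counts, vocab_size)
        for prev, cur in zip(padded, padded[1:])
    ]
    return sum(values) / len(values)


# Hypothetical toy corpus, for illustration only.
corpus = [
    ["on", "the", "other", "hand"],
    ["on", "the", "other", "side"],
    ["in", "the", "end"],
]
context_counts, bigram_counts, vocab = train_bigram_counts(corpus)

formulaic = mean_surprisal(
    ["on", "the", "other", "hand"], context_counts, bigram_counts, len(vocab)
)
scrambled = mean_surprisal(
    ["hand", "other", "the", "on"], context_counts, bigram_counts, len(vocab)
)
print(f"formulaic: {formulaic:.2f} bits, scrambled: {scrambled:.2f} bits")
```

Under these assumptions, a sequence that repeats frequent bundles receives a lower mean surprisal than the same words in an unattested order, which is the intuition behind using surprisal as a gradient measure of bundleness.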

Additional indexing

Item Type: Book Section, original work
Communities & Collections: 06 Faculty of Arts > English Department; 06 Faculty of Arts > Institute of Computational Linguistics; 06 Faculty of Arts > Zurich Center for Linguistics
Dewey Decimal Classification: 820 English & Old English literatures
Scopus Subject Areas: Social Sciences & Humanities > Language and Linguistics; Social Sciences & Humanities > Linguistics and Language; Social Sciences & Humanities > Education; Social Sciences & Humanities > Management of Technology and Innovation
Language: English
Date: 2018
Deposited On: 07 May 2018 08:48
Last Modified: 23 Nov 2024 04:39
Publisher: Benjamins
Series Name: Studies in Corpus Linguistics
Number: 82
OA Status: Closed
Publisher DOI: https://doi.org/10.1075/scl.82.02sch
Related URLs: https://www.benjamins.com/catalog/scl.82 (Publisher)

Statistics

Downloads

10 downloads since deposited on 07 May 2018
0 downloads in the past 12 months