Abstract
How well can we predict reading times, and thus cognitive processing load? This study first assesses correlations between reading times and predictor variables, and then uses linear regression to predict reading times in two corpora. We suggest noise reduction methods that use reader means and medians to obtain generalisations across individuals; this leads to much higher correlations, prediction accuracy, and model fit. Our best models reach a prediction accuracy that is, on average, 37% off the observed reading time, and explain up to 54% of the variance in our data according to R^2. As this average deviation is smaller than the standard deviation, we can accurately predict a potential reader. As predictors we use surprisal from a language model, part-of-speech (POS) tags, syntactic features, and many other features such as word length. Discourse-related features, which we operationalise via distributional semantic similarity and the distance to previous occurrences, are shown to play a significant role. Morphosyntactic (POS tags) and syntactic features (dependency labels) are also significant, though with smaller weights. We also observe that fast readers correlate better with surprisal and with our models.
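The pipeline the abstract describes can be illustrated with a small sketch: average reading times over readers to reduce noise, fit a linear regression on word-level predictors, and report R^2 and mean percentage error. All data below is synthetic, and the two predictors (word length and a surprisal stand-in) are illustrative placeholders, not the study's actual feature set.

```python
# Hedged sketch of the approach: per-word reader means as noise reduction,
# then ordinary least squares, evaluated with R^2 and mean % error.
# Synthetic data only; feature choice is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_words, n_readers = 200, 10

word_len = rng.integers(1, 12, size=n_words).astype(float)
surprisal = rng.gamma(2.0, 2.0, size=n_words)  # stand-in for LM surprisal

# Assumed underlying per-word reading time (ms) plus per-reader noise
true_rt = 150 + 20 * word_len + 15 * surprisal
per_reader = true_rt[None, :] + rng.normal(0, 80, size=(n_readers, n_words))

# Noise reduction: collapse the reader dimension to a per-word mean
# (a per-word median would work analogously)
rt_mean = per_reader.mean(axis=0)

# Linear regression on [intercept, word length, surprisal]
X = np.column_stack([np.ones(n_words), word_len, surprisal])
coef, *_ = np.linalg.lstsq(X, rt_mean, rcond=None)
pred = X @ coef

ss_res = np.sum((rt_mean - pred) ** 2)
ss_tot = np.sum((rt_mean - rt_mean.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
mape = np.mean(np.abs(rt_mean - pred) / rt_mean) * 100

print(f"R^2 = {r2:.2f}, mean % error = {mape:.1f}%")
```

Averaging over readers shrinks the per-reader noise by roughly the square root of the number of readers, which is why the abstract's pooled models fit so much better than per-individual ones.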