Part-Of-Speech in Historical Corpora: Tagger Evaluation and Ensemble Systems on ARCHER


Schneider, Gerold; Hundt, Marianne; Oppliger, Rahel (2016). Part-Of-Speech in Historical Corpora: Tagger Evaluation and Ensemble Systems on ARCHER. In: KONVENS 2016, Bochum, 19 September 2016 - 21 September 2016.

Abstract

Tagger accuracy deteriorates when a tagger is applied to texts that differ from its training corpus, e.g. in register or time period. On historical data, accuracy can drop to 90% or below. We are tagging and parsing ARCHER, a historical corpus sampled from British and American texts from 1600 to 1999. We improve tagging accuracy by (1) using a version of the corpus that has been automatically mapped to Present-Day English (PDE) spelling with VARD, (2) combining several part-of-speech taggers in an ensemble system, which improves tagging by about 1% over CLAWS and 2% over Tree-Tagger, and (3) using a small amount of human intervention, which allows us to reach 98% accuracy from 1700 on.
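The abstract does not say how the ensemble combines its taggers. As a rough illustration only, the sketch below uses per-token majority voting, which is one common way to combine taggers and is not necessarily the method used in the paper. The function name `ensemble_tags`, the tagger names, and the simplified tags are all hypothetical, and the sketch assumes every tagger has been run on the same tokenization and mapped to a shared tagset.

```python
from collections import Counter


def ensemble_tags(tagger_outputs, priority):
    """Combine per-token tags from several taggers by majority vote.

    tagger_outputs: dict mapping a (hypothetical) tagger name to its list
        of tags, one tag per token, all over the same tokenization.
    priority: list of tagger names; ties are resolved in favour of the
        earliest tagger in this list whose tag is among the tied winners.
    """
    lengths = {len(tags) for tags in tagger_outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    n_tokens = lengths.pop()

    combined = []
    for i in range(n_tokens):
        # Count how many taggers proposed each tag for token i.
        votes = Counter(tagger_outputs[name][i] for name in priority)
        best_count = max(votes.values())
        winners = {tag for tag, count in votes.items() if count == best_count}
        # Tie-break: take the tag of the highest-priority tagger among winners.
        for name in priority:
            if tagger_outputs[name][i] in winners:
                combined.append(tagger_outputs[name][i])
                break
    return combined


# Illustrative input: three hypothetical taggers over a three-token sentence.
outputs = {
    "tagger_a": ["DET", "ADJ", "NOUN"],
    "tagger_b": ["DET", "NOUN", "NOUN"],
    "tagger_c": ["DET", "ADJ", "VERB"],
}
result = ensemble_tags(outputs, ["tagger_a", "tagger_b", "tagger_c"])
```

The priority list matters only when votes are tied; in practice one might put the tagger with the best standalone accuracy first.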

Additional indexing

Item Type: Conference or Workshop Item (Paper), not refereed, original work
Communities & Collections: 06 Faculty of Arts > English Department
06 Faculty of Arts > Institute of Computational Linguistics
06 Faculty of Arts > Center for Linguistics
Dewey Decimal Classification: 820 English & Old English literatures
Language: English
Event End Date: 21 September 2016
Deposited On: 16 Feb 2017 08:15
Last Modified: 16 Feb 2017 08:15
Publisher: RUB
