Unsupervised Text Segmentation for Automated Error Reduction


Furrer, Lenz (2014). Unsupervised Text Segmentation for Automated Error Reduction. In: KONVENS 2014, Hildesheim, 8 October 2014 - 10 October 2014, 178-185.

Abstract

Challenging the assumption that traditional whitespace/punctuation-based tokenisation is the best solution for any NLP application, I propose an alternative approach to segmenting text into processable units. The proposed approach is nearly knowledge-free, in that it does not rely on language-dependent, man-made resources. The text segmentation approach is applied to the task of automated error reduction in texts with high noise. The results are compared to conventional tokenisation.

Downloads

57 downloads since deposited on 03 Dec 2014
40 downloads in the past 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Uncontrolled Keywords: Unsupervised Segmentation OCR Error Correction
Language: English
Event End Date: 10 October 2014
Deposited On: 03 Dec 2014 17:18
Last Modified: 20 May 2016 21:17
Publisher: Universität Hildesheim
ISBN: 978-3-934105-46-1
Free access at: Official URL. An embargo period may apply.
Official URL: http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus-2893
Permanent URL: https://doi.org/10.5167/uzh-101471

Download

Content: Accepted Version
Language: English
Filetype: PDF
Size: 961kB
Licence: Creative Commons: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
