
Cutter – a Universal Multilingual Tokenizer


Graën, Johannes; Bertamini, Mara; Volk, Martin (2018). Cutter – a Universal Multilingual Tokenizer. In: Swiss Text Analytics Conference, Winterthur, 12 June 2018 - 13 June 2018. CEUR-WS, 75-81.

Abstract

Tokenization is the process of splitting running text into minimal meaningful units. In writing systems where a space character is used for word separation, this blank character typically acts as a token boundary. A simple tokenizer that only splits text at space characters already achieves notable accuracy, although it misses unmarked token boundaries and erroneously splits tokens that contain space characters.
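As an illustration (a minimal Python sketch, not the tokenizer presented in this paper), both failure modes of such a space-splitting baseline are easy to reproduce:

    def whitespace_tokenize(text):
        """Naive baseline: split on runs of whitespace only."""
        return text.split()

    # Misses unmarked token boundaries: punctuation stays attached to words.
    whitespace_tokenize("Hello, world!")   # ['Hello,', 'world!']

    # Erroneously splits tokens that legitimately contain spaces,
    # e.g. a number written with spaces as thousands separators.
    whitespace_tokenize("1 000 000")       # ['1', '000', '000']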
Different languages use the same characters for different purposes. Tokenization is thus a language-specific task (with code-switching being a particular challenge). Extralinguistic tokens, however, are similar in many languages. These tokens include numbers, XML elements, email addresses and identifiers of concepts that are idiosyncratic to particular text variants (e.g., patent numbers).
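Such language-independent tokens can often be recognized by surface patterns alone. The regular expressions below are simplified assumptions for illustration only, not the rules used by Cutter:

    import re

    # Hypothetical patterns for language-independent token classes;
    # real rule sets would be considerably more elaborate.
    EXTRALINGUISTIC = {
        "xml_tag": re.compile(r"</?\w+(?:\s[^<>]*)?/?>"),
        "email":   re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
        "number":  re.compile(r"\d+(?:[.,]\d+)*"),
    }

    def classify(token):
        for name, pattern in EXTRALINGUISTIC.items():
            if pattern.fullmatch(token):
                return name
        return "word"

    classify("<br/>")            # 'xml_tag'
    classify("jane.doe@uzh.ch")  # 'email'
    classify("1,000.50")         # 'number'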
We present a framework for tokenization that makes use of language-specific and language-independent token identification rules. These rules are stacked and applied recursively, yielding a complete trace of the tokenization process in the form of a tree structure. Rules are easily adaptable to different languages and text types. Unit tests reliably detect if new token identification rules conflict with existing ones and thus ensure consistent tokenization when extending the rule sets.
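The sketch below illustrates the stacked-rules idea under invented assumptions: the two rules, the tree encoding, and the helper function are hypothetical and only meant to show how recursive rule application yields a tree-shaped trace, with a unit-test-style assertion at the end in the spirit described above:

    import re

    # Hypothetical two-rule stack: whitespace separates tokens and is
    # dropped; trailing punctuation is split off and kept as a token.
    RULES = [
        ("space",       re.compile(r"\s+"),       False),
        ("final_punct", re.compile(r"[.!?,;:]$"), True),
    ]

    def tokenize(span, rules=RULES):
        """Apply the first matching rule, recurse on the parts, and
        return the trace of the process as a tree."""
        for name, pattern, keep in rules:
            m = pattern.search(span)
            if m and m.group(0) != span:        # proper split only
                parts = [span[:m.start()], span[m.end():]]
                children = [tokenize(p, rules) for p in parts if p]
                if keep:                        # separator is itself a token
                    children.insert(1 if parts[0] else 0,
                                    {"token": m.group(0)})
                return {"rule": name, "span": span, "children": children}
        return {"token": span}                  # leaf: a minimal unit

    def leaves(node):
        """Read the final token sequence off the leaves of the trace."""
        return ([node["token"]] if "token" in node else
                [t for c in node["children"] for t in leaves(c)])

    # Regression test: a rule change must not alter this tokenization.
    assert leaves(tokenize("Hello, world!")) == ["Hello", ",", "world", "!"]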

Additional indexing

Item Type: Conference or Workshop Item (Other), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Scopus Subject Areas: Physical Sciences > General Computer Science
Language: English
Event End Date: 13 June 2018
Deposited On: 25 Oct 2018 08:36
Last Modified: 13 Feb 2022 08:05
Publisher: CEUR-WS
Series Name: CEUR Workshop Proceedings
Number: 2226
ISSN: 1613-0073
OA Status: Green
Free access at: Publisher DOI. An embargo period may apply.
Official URL: http://ceur-ws.org/Vol-2226/
Other Identification Number: urn:nbn:de:0074-2226-7
  • Content: Published Version
  • Language: English
  • Licence: Creative Commons: Public Domain Dedication: CC0 1.0 Universal (CC0 1.0)