Abstract
Tokenization is the process of splitting running text into minimal meaningful units. In writing systems where a space character is used for word separation, this blank character typically acts as a token boundary. A simple tokenizer that splits text only at space characters already achieves notable accuracy, although it misses unmarked token boundaries and erroneously splits tokens that contain space characters.
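To make these limitations concrete, a minimal sketch of such a space-only tokenizer might look as follows; this is an illustration, not part of the presented framework, and the example sentences are hypothetical:

```python
def whitespace_tokenize(text):
    """Split running text at whitespace only."""
    return text.split()

# Misses unmarked token boundaries: punctuation stays attached to words.
print(whitespace_tokenize("Hello, world!"))
# -> ['Hello,', 'world!']

# Erroneously splits tokens that contain space characters,
# e.g. a number written with a space as a thousands separator.
print(whitespace_tokenize("costs 10 000 euros"))
# -> ['costs', '10', '000', 'euros']
```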
Different languages use the same characters for different purposes. Tokenization is thus a language-specific task (with code-switching being a particular challenge). Extralinguistic tokens, however, are similar in many languages. These tokens include numbers, XML elements, email addresses and identifiers of concepts that are idiosyncratic to particular text variants (e.g., patent numbers).
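As an illustration of what language-independent identification of extralinguistic tokens could look like, the following sketch uses simplified regular expressions; the patterns and the names `EXTRALINGUISTIC_RULES` and `find_extralinguistic` are assumptions made for this example, not the rules of the framework:

```python
import re

# Simplified, language-independent patterns for extralinguistic tokens.
EXTRALINGUISTIC_RULES = {
    "number": re.compile(r"\d+(?:[.,]\d+)*"),
    "xml_element": re.compile(r"</?\w+(?:\s+[^<>]*)?/?>"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

def find_extralinguistic(text):
    """Yield (rule name, matched substring) pairs for every match."""
    for name, pattern in EXTRALINGUISTIC_RULES.items():
        for match in pattern.finditer(text):
            yield name, match.group()

print(list(find_extralinguistic("Mail info@example.com, see <ref id='1'/>.")))
# -> [('number', '1'), ('xml_element', "<ref id='1'/>"),
#     ('email', 'info@example.com')]
```

Note that the digit inside the XML attribute also matches the number pattern; such overlaps are exactly what motivates applying rules in a fixed order and recursively, as described below.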
We present a framework for tokenization that makes use of language-specific and language-independent token identification rules. These rules are stacked and applied recursively, yielding a complete trace of the tokenization process in the form of a tree structure. The rules are easily adaptable to different languages and text types. Unit tests reliably detect whether new token identification rules conflict with existing ones and thus ensure consistent tokenization when the rule sets are extended.
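A minimal sketch of how stacked rules could be applied recursively to produce a tree-shaped trace is given below; the rule list and the node representation are hypothetical simplifications, not the authors' implementation:

```python
import re
from pprint import pprint

# Stacked rules, tried in order: earlier rules take precedence.
RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("number", re.compile(r"\d+(?:[.,]\d+)*")),
    ("space", re.compile(r"\s+")),
]

def tokenize(text):
    """Recursively split `text`; return a (label, text, children) trace tree."""
    for name, pattern in RULES:
        match = pattern.search(text)
        if match:
            children = []
            left, right = text[:match.start()], text[match.end():]
            if left:
                children.append(tokenize(left))
            if name != "space":  # matched separators leave no token behind
                children.append((name, match.group(), []))
            if right:
                children.append(tokenize(right))
            # Inner node: records which rule fired and on what substring.
            return ("split:" + name, text, children)
    # Leaf: no rule applies, so the remaining substring is a token.
    return ("token", text, [])

pprint(tokenize("Mail info@example.com on 10.5.2024"))
```

Because every rule application becomes an inner node, the resulting tree records the complete derivation; in such a setup, a conflicting new rule would surface in a unit test as a changed tree for a previously fixed input.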