Publication:

Cutter – a Universal Multilingual Tokenizer

Date
2018
Conference or Workshop Item
Published version
cris.lastimport.scopus: 2025-05-24T03:34:57Z
cris.virtual.orcid: https://orcid.org/0000-0002-2063-4516
cris.virtual.orcid: https://orcid.org/0000-0002-0459-5086
cris.virtualsource.orcid: 8fbbe5f4-ab2a-4bbe-a533-e6a4112e86d8
cris.virtualsource.orcid: 21fff778-6a26-4132-abdc-3ceff710ddb2
dc.contributor.institution: University of Zurich
dc.date.accessioned: 2018-10-25T08:36:56Z
dc.date.available: 2018-10-25T08:36:56Z
dc.date.issued: 2018-06-13
dc.description.abstract:

Tokenization is the process of splitting running text into minimal meaningful units. In writing systems that use a space character for word separation, this blank character typically acts as a token boundary. A simple tokenizer that only splits text at space characters already achieves notable accuracy, although it misses unmarked token boundaries and erroneously splits tokens that contain space characters. Different languages use the same characters for different purposes. Tokenization is thus a language-specific task (with code-switching being a particular challenge). Extralinguistic tokens, however, are similar in many languages. These tokens include numbers, XML elements, email addresses and identifiers of concepts that are idiosyncratic to particular text variants (e.g., patent numbers). We present a framework for tokenization that makes use of language-specific and language-independent token identification rules. These rules are stacked and applied recursively, yielding a complete trace of the tokenization process in the form of a tree structure. Rules are easily adaptable to different languages and text types. Unit tests reliably detect whether new token identification rules conflict with existing ones and thus ensure consistent tokenization when the rule sets are extended.
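
The idea of stacked rules that are applied recursively and leave a tree-shaped trace can be illustrated with a minimal sketch. This is not Cutter's actual implementation or rule format: the names `Node`, `RULES`, `tokenize` and `leaves`, as well as the four regular expressions, are assumptions made purely for illustration. Each rule in the stack identifies one kind of token; the text to the left and right of a match is handed back to the stack, and every split is recorded as a node.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step of the trace: a text span, the rule that split it, its parts."""
    text: str
    rule: str = ""        # name of the rule that matched inside this span
    token: bool = False   # True if this node is a finished token
    children: list[Node] = field(default_factory=list)


# Hypothetical rule stack, tried top to bottom. The first two rules stand in
# for language-independent (extralinguistic) tokens; language-specific rules
# would be inserted into the same list.
RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("number", re.compile(r"\d+(?:[.,]\d+)*")),
    ("word", re.compile(r"\w+")),
    ("punct", re.compile(r"[^\w\s]")),
]


def tokenize(text: str, level: int = 0) -> Node:
    """Apply the rule stack from `level` downward and recurse on the remainders."""
    node = Node(text)
    for i in range(level, len(RULES)):
        name, pattern = RULES[i]
        match = pattern.search(text)
        if match is None:
            continue
        node.rule = name
        left, right = text[:match.start()], text[match.end():]
        # The leftmost match was taken, so rule i cannot match in `left`;
        # `right` may still contain further matches of rule i.
        if left.strip():
            node.children.append(tokenize(left, i + 1))
        node.children.append(Node(match.group(), name, token=True))
        if right.strip():
            node.children.append(tokenize(right, i))
        break
    return node


def leaves(node: Node) -> list[str]:
    """Read the token sequence off the trace tree, left to right."""
    if node.token:
        return [node.text]
    return [tok for child in node.children for tok in leaves(child)]


trace = tokenize("Mail a.b@uzh.ch, 3.5 times!")
print(leaves(trace))
```

In this sketch the email rule fires first, so the address is protected from the later word and punctuation rules; reordering the stack changes the result. Pinning expected token sequences in unit tests for inputs like this one is one way to notice when a newly added rule conflicts with an existing one.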

dc.identifier.issn: 1613-0073
dc.identifier.other: urn:nbn:de:0074-2226-7
dc.identifier.scopus: 2-s2.0-85055475494
dc.identifier.uri: https://www.zora.uzh.ch/handle/20.500.14742/147003
dc.language.iso: eng
dc.subject.ddc: 000 Computer science, knowledge & systems
dc.subject.ddc: 410 Linguistics
dc.title:

Cutter – a Universal Multilingual Tokenizer

dc.type: conference_item
dcterms.accessRights: info:eu-repo/semantics/openAccess
dcterms.bibliographicCitation.journaltitle: CEUR Workshop Proceedings
dcterms.bibliographicCitation.number: 2226
dcterms.bibliographicCitation.originalpublishername: CEUR-WS
dcterms.bibliographicCitation.pageend: 81
dcterms.bibliographicCitation.pagestart: 75
dcterms.bibliographicCitation.url: http://ceur-ws.org/Vol-2226/
dspace.entity.type: Publication
oairecerif.event.endDate: 2018-06-13
oairecerif.event.place: Winterthur
oairecerif.event.startDate: 2018-06-12
uzh.contributor.affiliation: University of Zurich
uzh.contributor.affiliation: University of Zurich
uzh.contributor.affiliation: University of Zurich
uzh.contributor.author: Graën, Johannes
uzh.contributor.author: Bertamini, Mara
uzh.contributor.author: Volk, Martin
uzh.contributor.correspondence: Yes
uzh.contributor.correspondence: No
uzh.contributor.correspondence: No
uzh.contributor.editor: Cieliebak, Mark
uzh.contributor.editor: Tuggener, Don
uzh.contributor.editor: Benites, Fernando
uzh.contributor.editorcorrespondence: Yes
uzh.contributor.editorcorrespondence: No
uzh.contributor.editorcorrespondence: No
uzh.document.availability: published_version
uzh.eprint.datestamp: 2018-10-25 08:36:56
uzh.eprint.lastmod: 2025-05-24 03:34:57
uzh.eprint.statusChange: 2018-10-25 08:36:56
uzh.event.presentationType: other
uzh.event.title: Swiss Text Analytics Conference
uzh.event.type: conference
uzh.harvester.eth: Yes
uzh.harvester.nb: No
uzh.identifier.doi: 10.5167/uzh-157243
uzh.jdb.eprintsId: 35599
uzh.oastatus.zora: Green
uzh.publication.citation: Graën, Johannes; Bertamini, Mara; Volk, Martin (2018). Cutter – a Universal Multilingual Tokenizer. In: Swiss Text Analytics Conference, Winterthur, 12 June 2018 - 13 June 2018. CEUR-WS, 75-81.
uzh.publication.freeAccessAt: doi
uzh.publication.originalwork: original
uzh.publication.publishedStatus: final
uzh.publication.seriesTitle: CEUR Workshop Proceedings
uzh.scopus.impact: 6
uzh.scopus.subjects: General Computer Science
uzh.workflow.doaj: uzh.workflow.doaj.false
uzh.workflow.eprintid: 157243
uzh.workflow.fulltextStatus: public
uzh.workflow.revisions: 28
uzh.workflow.rightsCheck: offen
uzh.workflow.status: archive
Files

Original bundle

Name:
paper9.pdf
Size:
352.79 KB
Format:
Adobe Portable Document Format