Multilingual Workflows in Bullinger Digital: Data Curation for Latin and Early New High German


Ströbel, Phillip Benjamin; Fischer, Lukas; Müller, Raphael; Scheurer, Patricia; Schroffenegger, Bernard; Suter, Benjamin; Volk, Martin (2024). Multilingual Workflows in Bullinger Digital: Data Curation for Latin and Early New High German. Journal of Open Humanities Data, 10(12):12.

Abstract

This paper presents how we enhanced the accessibility and utility of historical linguistic data in the project Bullinger Digital. The project involved the transformation of 3,100 letters, primarily available as scanned PDFs, into a dynamic, fully digital format. The expanded digital collection now includes 12,000 letters: 3,100 edited, 5,400 transcribed, and 3,500 represented through detailed metadata and results from handwritten text recognition. Central to our discussion is the innovative workflow developed for this multilingual corpus. This includes strategies for text normalisation, machine translation, and handwritten text recognition, particularly focusing on the challenges of code-switching within historical documents. The resulting digital platform features an advanced search system, offering users various filtering options such as correspondent names, time periods, languages, and locations. It also incorporates fuzzy and exact search capabilities, with the ability to focus searches within specific text parts, like summaries or footnotes. Beyond detailing the technical process, this paper underscores the project’s contribution to historical research and digital humanities. While the Bullinger Digital platform serves as a model for similar projects, the corpus behind it demonstrates the vast potential for data reuse in historical linguistics. The project exemplifies how digital humanities methodologies can revitalise historical text collections, offering researchers access to and interaction with historical data. This paper aims to provide readers with a comprehensive understanding of our project’s scope and broader implications for the field of digital humanities, highlighting the transformative potential of such digital endeavours in historical linguistic research.

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 410 Linguistics; 000 Computer science, knowledge & systems
Uncontrolled Keywords: correspondence, digital humanities, editions, databases, digitisation, XML, TEI, code-switching, machine translation, handwritten text recognition
Language: English
Date: 24 January 2024
Deposited On: 28 Jan 2024 14:08
Last Modified: 29 Jan 2024 12:03
Publisher: Ubiquity Press
ISSN: 2059-481X
OA Status: Gold
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.5334/johd.174
  • Content: Published Version
  • Language: English
  • Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)