Tools and Methods for Processing and Visualizing Large Corpora


Schneider, Gerold; El-Assady, Menna; Lehmann, Hans Martin (2017). Tools and Methods for Processing and Visualizing Large Corpora. Studies in Variation, Contacts and Change in English, 19:online.

Abstract

We present several approaches and methods that we develop or use to create workflows from data to evidence. First, we look for specific items in large corpora, explore the overuse of particular items, and use off-the-shelf visualization such as GoogleViz. Second, we present the advanced visualization tools and pipelines that the Visualization Group at the University of Konstanz is developing. After an overview, we apply statistical visualizations, Lexical Episode Plots and Interactive Hierarchical Modeling to the vast historical linguistics data offered by the Corpus of Historical American English (COHA), which ranges from 1800 to 2000. On the one hand, we investigate the increase of noun compounds and visually illustrate correlations in the data over time. On the other hand, we compute and visualize trends and topics in society from 1800 to 2000, applying an incremental topic modeling algorithm to the extracted compound nouns to detect thematic changes throughout the investigated period of 200 years. Throughout the paper, we use various tailored analysis and visualization approaches to gain insight into the data from different perspectives.

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > English Department; 06 Faculty of Arts > Institute of Computational Linguistics; 06 Faculty of Arts > Center for Linguistics
Dewey Decimal Classification: 820 English & Old English literatures
Language: English
Date: December 2017
Deposited On: 23 Jan 2018 12:19
Last Modified: 31 Mar 2018 05:14
Publisher: Research Unit for Variation, Contacts, and Change in English
ISSN: 1797-4453
OA Status: Closed
Free access at: Official URL. An embargo period may apply.
Official URL: http://www.helsinki.fi/varieng/series/volumes/19/schneider_el-assady_lehmann/

Download

Full text not available from this repository.