Abstract
Data-driven approaches are increasingly used in linguistics and Digital Humanities, for instance to illustrate semantic and conceptual change over time. We address the question of how semantic relations can be visualized on maps. We compare several content analysis methods applied to the Corpus of Historical American English (COHA).
While the baseline method of Document Classification is straightforward to evaluate, the evaluation of Topic Modeling is more contested, and Distributional Semantics and Kernel Density Estimation are harder to evaluate. We propose internal coherence, external coherence, and global interpretation as evaluation criteria. We present an evaluation that also takes inter-annotator agreement into account.
Results indicate that each method brings different aspects to the surface. Document Classification delivers meaningful lexical features, but long lists need to be sifted. Topic Modeling manages to abstract from words to concepts, but the similarity between topics remains hidden. Distributional Semantics offers considerable detail but no overview. Kernel Density Estimation offers an overview and clusters by association rather than synonymy, which makes it apt for investigating social trends. It also supports global interpretation, in the sense that opposing ends of the map correspond to semantically opposed concepts.