Large-Scale Hierarchical Alignment for Data-driven Text Rewriting


Nikolov, Nikola I; Hahnloser, Richard H R (2019). Large-Scale Hierarchical Alignment for Data-driven Text Rewriting. In: RANLP 2019 : Recent Advances in Natural Language Processing, Varna, Bulgaria, 31 August 2019 - 6 September 2019, RANLP.

Abstract

We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on text simplification from the normal to the Simple Wikipedia. We show that pseudo-parallel sentences extracted with our method not only supplement existing parallel data, but can even lead to competitive performance on their own.
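The hierarchical search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the two thresholds, the best-match selection rule, and the dictionary layout of the documents are all assumptions made for illustration, and the embeddings are taken as already computed. The sketch first narrows the search to target documents similar to each source document, then looks for similar sentences only inside those candidates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hierarchical_align(src_docs, tgt_docs, doc_threshold=0.5, sent_threshold=0.7):
    """Extract pseudo-parallel sentence pairs in two stages.

    Each document is a dict with (hypothetical layout):
      "vec":   a pre-computed document embedding
      "sents": a list of (sentence_text, sentence_embedding) tuples
    Both thresholds are illustrative values, not taken from the paper.
    """
    pairs = []
    for src in src_docs:
        # Stage 1: document-level search keeps only similar target documents.
        candidates = [
            tgt for tgt in tgt_docs
            if cosine(src["vec"], tgt["vec"]) >= doc_threshold
        ]
        # Stage 2: sentence-level search restricted to the candidate documents.
        for src_text, src_vec in src["sents"]:
            best_text, best_score = None, sent_threshold
            for tgt in candidates:
                for tgt_text, tgt_vec in tgt["sents"]:
                    score = cosine(src_vec, tgt_vec)
                    if score >= best_score:
                        best_text, best_score = tgt_text, score
            if best_text is not None:
                pairs.append((src_text, best_text, best_score))
    return pairs
```

Restricting the sentence-level comparison to candidate documents is what keeps the search tractable on large corpora: a sentence is never compared against sentences from documents that were already rejected at the document level.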

Statistics

Downloads

15 downloads since deposited on 14 Feb 2020
6 downloads since 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 07 Faculty of Science > Institute of Neuroinformatics
Dewey Decimal Classification: 570 Life sciences; biology
Scopus Subject Areas: Physical Sciences > Software
  Physical Sciences > Computer Science Applications
  Physical Sciences > Artificial Intelligence
  Physical Sciences > Electrical and Electronic Engineering
Language: English
Event End Date: 6 September 2019
Deposited On: 14 Feb 2020 10:34
Last Modified: 16 Jun 2022 07:05
Publisher: RANLP
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: https://acl-bg.org/proceedings/2019/RANLP%202019/pdf/RANLP098.pdf
  • Content: Published Version