ZORA (Zurich Open Repository and Archive)

How to Compare Apples to Oranges : Integrating Heterogeneous Data Sources with Representation Learning

Baumgartner, Matthias. How to Compare Apples to Oranges : Integrating Heterogeneous Data Sources with Representation Learning. 2022, University of Zurich, Faculty of Economics.

Abstract

Data is collected, processed, and published in a multitude of distinct data sources. These must be integrated to gain a comprehensive, complete, and uniform understanding of the data ecosystem. Data integration therefore aims at identifying and linking objects across data sources that describe the same real-world concept. Previous work on data integration has focused on structured sources such as knowledge graphs (KGs) or databases. However, vast amounts of data reside in unstructured sources like text corpora. So far, such sources have been largely neglected, as they introduce the additional challenge of heterogeneous data representations to the data integration problem. This thesis studies how to integrate data in such a heterogeneous scenario. We employ learned numeric representations, specifically embedding methods, to compare objects across structured and unstructured sources. Embedding models learn a vector space for a given data source such that related objects have similar embedding vectors. In our work, we discuss three problem settings on how to exploit embedding methods for heterogeneous data integration. Our first problem is to convert data sources into an integrated latent representation. Specifically, we aim to learn an integrated embedding space over a KG and a document corpus. We present KADE, which jointly embeds a document corpus and a KG with off-the-shelf embedding methods for either source. We show that KADE embeddings can be used to find links across the two sources and perform better on per-source tasks than independently learned embeddings. Our second problem is to integrate multiple embedding spaces into a uniform space. We address two challenges of this problem: heterogeneity among embedding spaces and the scalability of prospective solutions. We present FedCoder, which learns a latent representation over the given embedding spaces and transformation functions between each source and the latent space. Our results show that FedCoder outperforms its baselines on heterogeneous KG embedding spaces, and that it prevails over the state of the art when many embedding spaces are integrated. Our third problem is how to convert insights gained in an integrated latent representation back into the data source's native representation. We study this problem in the KG completion task by introducing novel entities into an existing graph. Our solution leverages an embedding space that combines a document corpus with a KG to identify novel entities and reconstruct their triples in the graph. We demonstrate that our approach delivers better performance than baseline methods and that we can further increase its performance by exploiting user feedback and the graph's link statistics.
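The idea that related objects have similar embedding vectors can be illustrated with a minimal sketch. It is not the KADE or FedCoder method from the thesis; it only assumes that KG entities and documents have already been placed in one shared vector space, and it links each entity to its most similar document by cosine similarity. All identifiers and vectors below are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def link_candidates(kg_vectors, doc_vectors):
    """Map each KG entity to the document with the most similar vector."""
    links = {}
    for entity, entity_vec in kg_vectors.items():
        best_doc = max(
            doc_vectors,
            key=lambda doc: cosine(entity_vec, doc_vectors[doc]),
        )
        links[entity] = best_doc
    return links


# Hypothetical toy embeddings, assumed to live in the same 3-d space.
kg_vectors = {
    "kg:Apple": [0.9, 0.1, 0.0],
    "kg:Orange": [0.1, 0.9, 0.1],
}
doc_vectors = {
    "doc:apple_article": [0.8, 0.2, 0.1],
    "doc:orange_article": [0.0, 1.0, 0.2],
}

links = link_candidates(kg_vectors, doc_vectors)
for entity, doc in links.items():
    print(entity, "->", doc)
```

In this toy setting each entity is linked to the document about the same fruit. In practice, obtaining vectors that are comparable across sources is the hard part, which is the problem the thesis addresses.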

Additional indexing

Item Type: Dissertation (monographical)
Referees: Baumgartner Matthias, Cudré-Mauroux Philippe, Dell'Aglio Daniele
Communities & Collections: 03 Faculty of Economics > Department of Informatics; UZH Dissertations
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Scope: Discipline-based scholarship (basic research)
Language: English
Date: 2022
Deposited On: 11 Jan 2023 08:09
Last Modified: 03 Dec 2024 15:41
Number of Pages: 124
OA Status: Green
Other Identification Number: merlin-id:23102
Full Text: Published Version (PDF, English)

Statistics

Downloads

51 downloads since deposited on 11 Jan 2023
25 downloads in the past 12 months
