Abstract
Data is collected, processed, and published across a multitude of distinct data sources. These sources must be integrated to gain a comprehensive and uniform understanding of the data ecosystem. Data integration therefore aims to identify and link objects across data sources that describe the same real-world concept. Previous work on data integration has focused on structured sources such as knowledge graphs (KGs) or databases. However, vast amounts of data reside in unstructured sources like text corpora. So far, such sources have been largely neglected, as they add the challenge of heterogeneous data representations to the data integration problem. This thesis studies how to integrate data in such a heterogeneous scenario. We employ learned numerical representations, specifically embedding methods, to compare objects across structured and unstructured sources. Embedding models learn a vector space for a given data source such that related objects have similar embedding vectors. We discuss three problem settings in which embedding methods are exploited for heterogeneous data integration.

Our first problem is to convert data sources into an integrated latent representation. Specifically, we aim to learn an integrated embedding space over a KG and a document corpus. We present KADE, which jointly embeds a document corpus and a KG using off-the-shelf embedding methods for either source. We show that KADE embeddings can be used to find links across the two sources and that they perform better on per-source tasks than independently learned embeddings.

Our second problem is to integrate multiple embedding spaces into a uniform space. We address two challenges of this problem: the heterogeneity among embedding spaces and the scalability of prospective solutions. We present FedCoder, which learns a latent representation over the given embedding spaces together with transformation functions between each source space and the latent space. Our results show that FedCoder outperforms its baselines on heterogeneous KG embedding spaces and surpasses the state of the art when many embedding spaces are integrated.

Our third problem is how to convert insights gained in an integrated latent representation back into a data source's native representation. We study this problem in the context of the KG completion task, introducing novel entities into an existing graph. Our solution leverages an embedding space that combines a document corpus with a KG to identify novel entities and reconstruct their triples in the graph. We demonstrate that our approach outperforms baseline methods and that its performance can be further improved by exploiting user feedback and the graph's link statistics.
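To make the joint embedding idea behind the first problem concrete, the following is a minimal, hypothetical sketch of co-training a document space and a KG entity space with an alignment regularizer that pulls matched entity-document pairs together. The per-source losses, the pairing scheme, the weight 0.1, and all names are illustrative assumptions; the abstract does not fix KADE's exact formulation.

```python
import torch

# Toy setup: n matched entity-document pairs, d-dimensional embeddings.
n, d = 100, 32
doc_emb = torch.randn(n, d, requires_grad=True)  # document embeddings
ent_emb = torch.randn(n, d, requires_grad=True)  # KG entity embeddings

opt = torch.optim.Adam([doc_emb, ent_emb], lr=0.01)
for step in range(200):
    opt.zero_grad()
    # Placeholder per-source objectives, standing in for the off-the-shelf
    # document and KG embedding losses (e.g., a doc2vec- or TransE-style loss).
    doc_loss = doc_emb.norm(dim=1).sub(1.0).pow(2).mean()
    kg_loss = ent_emb.norm(dim=1).sub(1.0).pow(2).mean()
    # Alignment regularizer: pull each entity toward its matched document,
    # so that the two spaces become mutually comparable.
    align = (doc_emb - ent_emb).pow(2).sum(dim=1).mean()
    loss = doc_loss + kg_loss + 0.1 * align
    loss.backward()
    opt.step()
```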
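For the second problem, one plausible reading of "transformation functions between each source and the latent space" is an autoencoder-style adapter per source: an encoder into a shared latent space and a decoder back. The sketch below illustrates that reading under stated assumptions; the linear adapters, dimensions, and loss terms are hypothetical and not taken from FedCoder's actual design.

```python
import torch
import torch.nn as nn

dims = [32, 64]        # dimensionalities of two source embedding spaces
latent_d = 16
n_shared = 50          # entities aligned across both sources

encoders = nn.ModuleList([nn.Linear(d, latent_d) for d in dims])
decoders = nn.ModuleList([nn.Linear(latent_d, d) for d in dims])
spaces = [torch.randn(n_shared, d) for d in dims]  # fixed input embeddings

params = list(encoders.parameters()) + list(decoders.parameters())
opt = torch.optim.Adam(params, lr=0.01)
for step in range(300):
    opt.zero_grad()
    latents = [enc(x) for enc, x in zip(encoders, spaces)]
    # Reconstruction preserves each source space's structure; alignment
    # pulls the latent codes of the same entity from different spaces together.
    recon = sum(((dec(z) - x) ** 2).mean()
                for dec, z, x in zip(decoders, latents, spaces))
    align = ((latents[0] - latents[1]) ** 2).mean()
    loss = recon + align
    loss.backward()
    opt.step()
```

A design of this shape scales additively: integrating a new embedding space only requires training one new encoder/decoder pair against the existing latent space, not retraining all pairs of spaces.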
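For the third problem, a minimal, hypothetical sketch of the reconstruction step: place a novel entity in the joint space via its document embedding, retrieve its nearest existing entities, and propose triples from the relations those neighbors participate in, as a stand-in for exploiting the graph's link statistics. All names, sizes, and the scoring heuristic are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ent_emb = rng.normal(size=(200, 32))  # existing KG entities in the joint space
novel = rng.normal(size=32)           # novel entity's document embedding

# Nearest neighbors by cosine similarity in the joint space.
sims = ent_emb @ novel / (np.linalg.norm(ent_emb, axis=1) * np.linalg.norm(novel))
neighbors = np.argsort(-sims)[:5]

# Toy triple store: (head, relation, tail) index triples.
triples = rng.integers(0, 200, size=(1000, 3))
# The neighbors' most frequent relations suggest candidate triples
# for the novel entity.
neighbor_rels = triples[np.isin(triples[:, 0], neighbors)][:, 1]
rels, counts = np.unique(neighbor_rels, return_counts=True)
candidates = rels[np.argsort(-counts)][:3]
print("candidate relations for the novel entity:", candidates)
```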