The digitalization of our society creates a large number of data streams, such as stock tickers, tweets, and sensor data. Making use of these streams has tremendous value. In the Semantic Web context, live information is queried from the streams in real time. Knowledge is discovered by integrating streams with data from heterogeneous sources. Moreover, insights hidden in the streams are inferred and extracted by logical reasoning.
Handling large and complex streams in real time challenges the capabilities of current systems. Therefore, this thesis studies how to improve the efficiency of processing and reasoning over semantic streams. It is composed of three projects, each addressing a different research problem motivated by real-world use cases. We propose new methods to address these problems and implement systems to test our hypotheses on real datasets.
The first project focuses on the problem that sudden increases in the input stream rate overload the system, causing reduced or unacceptable performance. We propose an eviction technique that, when a spike in the input data rate occurs, discards data from the system to guarantee the response latency at the cost of lower recall. The novelty of our solution lies in a data-aware approach that carefully prioritizes the data and evicts the less important items first, so that result recall remains high.
The second project studies complex queries that need to integrate streams with remote, external background data (BGD). Accessing remote BGD is very expensive in terms of both latency and financial cost. We propose several methods to minimize this cost by exploiting patterns in the query and the data. Our system retrieves only the data that are more critical to answering the query and avoids wasting resources on the remaining data in the BGD.
Lastly, as noise is inevitable in real-world semantic streams, the third project investigates how to use logical reasoning to identify and exclude noise from high-volume streams. We adopt a distributed stream processing engine (DSPE) to achieve scalability. On top of the DSPE, we optimize the reasoning procedures by balancing the costs of computation and communication. As a result, reasoning tasks are compiled into efficient DSPE workflows that can be deployed across large-scale computing clusters.