As more aspects of our daily lives are computerized, ever larger amounts of data are produced at ever greater speeds. This data holds great value, and we need technologies that enable us to extract it. This thesis is concerned with one such technology: Distributed Stream Processing Systems (DSPS), systems consisting of many computers that jointly process, and thereby extract value from, large amounts of data at high speed.
This dissertation consists of three research projects that investigate two aspects of DSPS: two projects studied different approaches to increasing the efficiency of DSPS, and one project evaluated the value of increased efficiency in stream processing. All projects were conducted on real computer systems and are of a quantitative nature. The first study leveraged a graph partitioning algorithm to schedule the workload within a DSPS, reducing the communication load between hosts while maintaining or increasing the system's throughput. The second study was concerned with the auto-configuration of DSPS: we used Bayesian Optimization, a probabilistic black-box optimization strategy, to increase the throughput of DSPS through automated configuration tuning. The third study investigated the value of increased efficiency in a DSPS by building a DSPS-based entity-ranking system and evaluating the effect of timely data processing on the quality of the generated rankings.
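To illustrate the optimization strategy named in the second study, the following is a minimal sketch of a Bayesian Optimization loop over a single hypothetical configuration parameter. The Gaussian-process surrogate, the upper-confidence-bound acquisition function, the kernel length scale, and the synthetic throughput function are all illustrative assumptions, not the thesis's actual setup.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0):
    # Squared-exponential kernel between two 1-D arrays of points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6, length=1.0):
    # Gaussian-process posterior mean and variance at candidate points.
    K = rbf_kernel(x_obs, x_obs, length) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(x_obs, x_cand, length)
    Kss = rbf_kernel(x_cand, x_cand, length)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y_obs
    var = np.diag(Kss - Ks.T @ K_inv @ Ks)
    return mu, np.maximum(var, 0.0)

def throughput(x):
    # Hypothetical black-box objective: measured throughput as a
    # function of one normalized configuration parameter in [0, 1].
    return -(x - 0.65) ** 2 + 0.1 * np.sin(8 * x)

def bayes_opt(n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x_cand = np.linspace(0.0, 1.0, 200)
    x_obs = rng.uniform(0.0, 1.0, 2)      # two random initial configurations
    y_obs = throughput(x_obs)
    for _ in range(n_iter):
        mu, var = gp_posterior(x_obs, y_obs, x_cand)
        ucb = mu + kappa * np.sqrt(var)   # upper-confidence-bound acquisition
        x_next = x_cand[np.argmax(ucb)]   # most promising untried configuration
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, throughput(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best]

best_x, best_y = bayes_opt()
print(f"best configuration: {best_x:.2f}, throughput: {best_y:.3f}")
```

The key property this sketch shares with the approach in the study is that each expensive measurement (here, the synthetic `throughput` call; in the thesis, a real benchmark run of the configured DSPS) is chosen by an acquisition function that trades off exploring uncertain configurations against exploiting ones predicted to perform well.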