On November 26 last year I successfully defended my dissertation at Karlsruhe Institute of Technology. The full text has been published through the KIT library with the DOI 10.5445/IR/1000127719.

If you find any mistakes or have corrections of any kind, please email me at my name at this domain. I will add them to the list of corrections and remarks.

## Abstract

This dissertation focuses on three fundamental problem families in big data
systems, for which we develop communication-efficient probabilistic algorithms.
In the first part, we consider various *selection* problems, in the second,
*weighted sampling* problems, and the third, *checking* of basic operations in
big data systems. This is motivated by a growing need for communication
efficiency as the network and its usage increasingly dominate supercomputer
system cost and energy consumption as well as running times of distributed
applications. Surprisingly few communication-efficient algorithms are known for
fundamental big data problems, and we close several of these gaps.

We first consider different selection problems, starting with selecting the
element with rank *k* from a large, distributed input. There, we show how this
can be achieved without assuming a random distribution of the input by
redesigning the pivot selection step. Next, we show that selection from locally
sorted sequences—also known as *multisequence selection*—becomes considerably
easier if we allow the precise rank of the output to vary in some range. We then
describe how this can be used to construct a bulk priority queue, which we later
use to construct an algorithm for weighted reservoir sampling. Lastly, we
consider finding the globally most frequent objects as well as those whose
associated values add up to the highest sums with a sample-based approach.

In the weighted sampling chapter, we begin by giving new construction algorithms
for a classical data structure for weighted sampling, *alias tables*. After
presenting the first linear-time construction algorithm with only constant
memory overhead, we proceed to parallelise this algorithm for shared memory,
obtaining the first parallel construction algorithm for alias tables. We then
show how to approach the problem in distributed memory with a two-level
algorithm. Next, we present an output-sensitive algorithm for weighted sampling
with replacement, meaning that it requires time linear in the number of *unique*
items in the sample. This algorithm works sequentially as well as in shared and
distributed memory and is the first such algorithm in all three categories. We
subsequently adapt it to weighted sampling without replacement by combining it
with an estimator for the number of unique items in a sample with
replacement. Poisson sampling, a generalisation of Bernoulli sampling to
weighted items, can be reduced to integer sorting, and we show how an existing
approach can be parallelised. For sampling from data streams, we adapt a
sequential approach and show how it can be parallelised in a mini-batch model
using our bulk priority queue introduced in the selection chapter. The chapter
is concluded by an extensive evaluation of our alias table construction
algorithms, our output-sensitive algorithm for weighted sampling with
replacement, and our weighted reservoir sampling algorithm.

To probabilistically verify the correctness of distributed algorithms, we
propose *checkers* for fundamental operations of big data systems. We show that
checking numerous operations can be reduced onto two ‘core’ checkers, namely
checking aggregations and whether one sequence is a permutation of
another. While multiple approaches exist for the latter and can be easily
parallelised, our sum aggregation checker is a novel application of the same
data structure that underpins counting Bloom filters and count-min sketches. We
implemented both checkers in Thrill, a big data framework. Experiments with
deliberately introduced errors confirm the detection accuracy predicted by
theory, even when using real-world hash functions with non-ideal properties.
Scaling experiments on a supercomputer show that our checkers have very little
runtime overhead, in the area of 2 % for near-certainty in the correctness of
the result.