A simple random sampling strategy to produce a sample without replacement from a stream of data - that is, in one pass: O(N)

Want to sample s instances - uniformly at random without replacement - from a population size of n records, where n is not known.

Figuring out n would require 2 passes. Reservoir sampling achieves this in 1 pass.
A reservoir R here is simply an array of size s. Let D be data stream of size n

Algorithm:

  • Store first s elements into R.
  • for each element in position k = s+1 to n ,
    • accept it with probability s/k
    • if accepted, choose a random element from R to replace.

Partial analysis:

Base case is trivial. For the k+1st case, the probability a given element i with position <= k is in R is s/k. The prob. i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.

So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps)

= s/k * k/(k+1), which is s/(k+1).

So, when k+1 = n, any element is present with probability s/n.

Distributing Reservoir Sampling

It is very simple to distribute the reservoir sampling algorithm to n nodes.

Split the data stream into n partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate each reservoir into a final reservoir sample by carrying out reservoir sampling on them.

Lets say you split data of size n into 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.

Probability that a record will be in sub-reservoir is:
s / (n/2) = 2s/n

The Probability that a record will end up in the final reservoir given it is in a sub-reservoir is: s/(2s) = 1/2.

It follows that the probability any given record will end up in the final reservoir is:
2s/n * 1/2 = s/n