Efficiently selecting a set of random elements from a linked list

There’s a very nice and efficient algorithm for this using a method called reservoir sampling.

Let me start by giving you its history:

Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven’t been able to find Waterman’s original document, if it exists, which may be why you’ll most often see Knuth quoted as the source of this algorithm.

McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I’m aware of) that the algorithm works.

Vitter 1985 (2) reviews Algorithm R and then presents three additional algorithms which produce the same output, but with a twist. Rather than deciding to include or skip each incoming element, his algorithms predetermine the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each incoming element.

In pseudocode the algorithm is:

Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
  R[j]=I.pop()

elements_seen=s
while I is not empty:
  elements_seen+=1
  element=I.pop()
  j=random(1,elements_seen)       > Inclusive on both ends
  if j<=s:
    R[j]=element

Note that I’ve specifically written the code to avoid specifying the size of the input. That’s one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
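The pseudocode above translates almost line for line into Python. This is a sketch, not Knuth's original code; `reservoir_sample` is my own name for it, and it accepts any iterable (a linked list traversal, a generator, a file) precisely because the algorithm never needs the input's length:

```python
import random

def reservoir_sample(iterable, s):
    """Algorithm R: return s elements drawn uniformly at random
    from an iterable of unknown length, in a single pass."""
    it = iter(iterable)

    # Fill the reservoir with the first s elements.
    reservoir = []
    for _ in range(s):
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir  # input had fewer than s elements

    # Each subsequent element replaces a reservoir slot
    # with probability s / elements_seen.
    for elements_seen, element in enumerate(it, start=s + 1):
        j = random.randint(1, elements_seen)  # inclusive on both ends
        if j <= s:
            reservoir[j - 1] = element
    return reservoir
```

Because the reservoir is valid after every step, you can also pause the loop at any point and read off a fair sample of everything seen so far, which is what makes this usable as an online algorithm.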

Why does this work?

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It’s pretty, but it would be a bit difficult to reconstruct here. Therefore, I’ve written up an alternative proof which is easier to explain.

We proceed via proof by induction.

Say we want to generate a set of s elements and that we have already seen n>s elements.

Let’s assume that our current s elements have already each been chosen with probability s/n.

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

If the new element is chosen, each element already in our result set has a probability 1/s of being the one it replaces.

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

Thus, an element appears in the n+1-seen result set either because it was part of the n-seen result set and was not replaced—with probability (s/n)*n/(n+1)=s/(n+1)—or because it is the new element and was chosen—also with probability s/(n+1). Either way, each element survives with probability s/(n+1), which completes the inductive step.

The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the s-seen result set includes each element with probability s/n = 1, giving us the necessary base case for the induction.
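The claim that every element ends up in the sample with probability s/n is also easy to check empirically. The sketch below (my own illustration, not from the references) inlines the algorithm, runs it many times over a small input, and tallies how often each element lands in the reservoir; every frequency should hover around s/n:

```python
import random
from collections import Counter

def sample_once(n, s):
    """One run of Algorithm R over the input 0, 1, ..., n-1."""
    reservoir = list(range(s))  # first s elements fill the reservoir
    for elements_seen in range(s + 1, n + 1):
        j = random.randint(1, elements_seen)  # inclusive on both ends
        if j <= s:
            reservoir[j - 1] = elements_seen - 1  # incoming element's index
    return reservoir

def inclusion_frequencies(n, s, trials):
    """Fraction of trials in which each element appears in the sample."""
    counts = Counter()
    for _ in range(trials):
        counts.update(sample_once(n, s))
    return {x: counts[x] / trials for x in range(n)}

freqs = inclusion_frequencies(n=20, s=5, trials=20000)
# Each value should be close to s/n = 5/20 = 0.25.
```

With 20,000 trials the observed frequencies sit within a percent or two of 0.25 for every element, early and late alike, which is exactly the no-bias property the proof establishes.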

References

  1. McLeod, A. Ian, and David R. Bellhouse. “A convenient algorithm for drawing a simple random sample.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184.

  2. Vitter, Jeffrey S. “Random sampling with a reservoir.” ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.
