Reservoir sampling

I actually did not realize there was a name for this, so I proved and implemented this from scratch:

import random
def random_subset( iterator, K ):
    result = []
    N = 0

    for item in iterator:
        N += 1
        if len( result ) < K:
            result.append( item )
        else:
            s = int(random.random() * N)
            if s < K:
                result[ s ] = item

    return result

From: http://web.archive.org/web/20141026071430/http://propersubset.com:80/2010/04/choosing-random-elements.html

With a proof near the end.

Leave a Comment