Pythonic way to select list elements with different probability [duplicate]

Weights define a probability distribution function (pdf). Random numbers from any such pdf can be generated by applying its associated inverse cumulative distribution function to uniform random numbers between 0 and 1.

See also this SO explanation, or, as explained by Wikipedia:

If Y has a U[0,1] distribution then F⁻¹(Y) is distributed as F. This is
used in random number generation using
the inverse transform sampling-method.

import random
import bisect
import collections

def cdf(weights):
    total = sum(weights)
    result = []
    cumsum = 0
    for w in weights:
        cumsum += w
        result.append(cumsum / total)
    return result

def choice(population, weights):
    assert len(population) == len(weights)
    cdf_vals = cdf(weights)
    x = random.random()
    idx = bisect.bisect(cdf_vals, x)
    return population[idx]

weights=[0.3, 0.4, 0.3]
population = 'ABC'
counts = collections.defaultdict(int)
for i in range(10000):
    counts[choice(population, weights)] += 1
print(counts)

# % test.py
# defaultdict(<type 'int'>, {'A': 3066, 'C': 2964, 'B': 3970})

The choice function above uses bisect.bisect, so selection of a weighted random variable is done in O(log n) where n is the length of weights.

Note that as of version 1.7.0, NumPy has a Cythonized np.random.choice function. For example, this generates 1000 samples from the population [0,1,2,3] with weights [0.1, 0.2, 0.3, 0.4]:

import numpy as np
np.random.choice(4, 1000, p=[0.1, 0.2, 0.3, 0.4])

np.random.choice also has a replace parameter for sampling with or without replacement.

A theoretically better algorithm is the Alias Method. It builds a table which requires O(n) time, but after that, samples can be drawn in O(1) time. So, if you need to draw many samples, in theory the Alias Method may be faster. There is a Python implementation of the Walker Alias Method here, and a numpy version here.

More Related Contents:

Leave a Comment Cancel reply