One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let’s us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)
To create the alias lookup:
-
Normalize the weights such that they sum to
1.0
.(a:0.2 b:0.2 c:0.2 d:0.2 e:0.2)
This is the probability of choosing each weight. -
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions,
|p|
. Each partition represents a probability mass of1/|p|
. In this case, we create8
partitions, each able to contain0.125
. -
Take the variable with the least remaining weight, and place as much of it’s mass as possible in an empty partition. In this example, we see that
a
fills the first partition.(p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8)
with(a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
-
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4, until none of the weight from the original partition need be assigned to the list.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8)
with (a:0, b:0.15 c:0.2 d:0.2 e:0.2)
left to be assigned
At runtime:
-
Get a
U(0,1)
random number, say binary0.001100000
-
bitshift it
lg2(p)
, finding the index partition. Thus, we shift it by3
, yielding001.1
, or position 1, and thus partition 2. -
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is
0.5
, and0.5 < 0.6
, so returna
.
Here is some code and another explanation, but unfortunately it doesn’t use the bitshifting technique, nor have I actually verified it.