How does the HyperLogLog algorithm work?

The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer which binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).

That is, in a random stream of integers, ~50% of the numbers (in binary) starts with “1”, 25% starts with “01”, 12,5% starts with “001”. This means that if you observe a random stream and see a “001”, there is a higher chance that this stream has a cardinality of 8.

(The prefix “00..1” has no special meaning. It’s there just because it’s easy to find the most significant bit in a binary number in most processors)

Of course, if you observe just one integer, the chance this value is wrong is high. That’s why the algorithm divides the stream in “m” independent substreams and keep the maximum length of a seen “00…1” prefix of each substream. Then, estimates the final value by taking the mean value of each substream.

That’s the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it’s all well written in the paper. Sorry for the terrible english.

Leave a Comment