Probability of getting a duplicate value when calling GetHashCode() on strings

Large.

(Sorry Jon!)

The probability of getting a hash collision among short strings is surprisingly large, because GetHashCode() returns a 32-bit integer and so has only about four billion possible values. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.
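Those figures can be sanity-checked with the standard birthday-problem calculation. The sketch below is in Python rather than C#, and it assumes an idealized hash that spreads strings uniformly over all 2^32 possible 32-bit values; a real GetHashCode() implementation is not guaranteed to be that uniform, so treat the results as an estimate. The function name `collision_probability` is made up for this illustration.

```python
import math

def collision_probability(n: int, buckets: int = 2**32) -> float:
    """Chance that at least two of n items share a hash value,
    assuming each hash is uniform over `buckets` values (an idealization)."""
    if n > buckets:
        return 1.0  # pigeonhole: more items than possible hash values
    # P(no collision) = product over i of (1 - i/buckets), for i = 0 .. n-1.
    # Sum the logs instead of multiplying, for numerical stability.
    log_no_collision = sum(math.log1p(-i / buckets) for i in range(n))
    # 1 - exp(x), computed accurately when x is close to zero.
    return -math.expm1(log_no_collision)

for n in (10_000, 80_000):
    print(f"{n:>6} strings: {collision_probability(n):.1%}")
```

Under that uniformity assumption the calculation gives a little over 1% for ten thousand strings and a little over 50% for eighty thousand, in line with the numbers quoted above.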

For a graph showing the relationship between set size and probability of collision, see my article on the subject:

https://learn.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions
