I first heard about HyperLogLog on some Redis podcast quite a few years ago. I recall that I thought it seemed magical. They said that this data structure can accurately 'guess' number of distinct elements added to the set using some small constant memory. Of course, it would be a trivial problem for small sets, but apparently, HyperLogLog solved the problem even for enormous sets with billions of elements. At last, I decided to implement it myself. In this post, I demystify core algorithms of HyperLogLog and provide an educational C# implementation.
The problem of cardinality estimation
At first, let's step back and think about the problem HyperLogLog solves. The problem is called cardinality estimation. The task is to accurately estimate the number of elements in some growing set of elements. Solution to the problem is essential for tasks such as database query optimization based on statistics or calculation of the number of unique visitors to a page.
The optimal solution must have a small constant memory footprint and small linear cost of adding an element to set. HyperLogLog provides those properties and has few more that are very useful.
HyperLogLog core ideas
Even though the task may seem complex, there are just a few tricky parts to understand.First one is that algorithm is based on the idea that every bit of every object added to HyperLogLog is equally probable. Without that assumption, no property of the data structure would hold. It is imperative to hash every object before adding to HyperLogLog. Note, that because we are using hashes, theoretically we can get collisions i.e. two distinct objects could get the same hash and the data structure would consider them as one. That is why HyperLogLog is called cardinality estimator.
Second, and probably the most important idea is that by analysing bit patterns of some small constant memory it is possible to estimate the number of distinct elements in a set. The best real-world analogy I read about is the analogy of throwing a coin. If we upfront choose some pattern of results, like 'only heads', then, we can estimate how many series of throws person did only by asking "what was the longest
only heads streak?". Imagine that person responded with "my longest streak of heads was 3". Then we could guess, that he did not try that hard, it is almost certain that he had only a few tries, most probably 8. On the other hand, if the response would be much greater, let's say "my longest streak of heads was 20 throws", then we could estimate that the person had an enormous number of series to achieve that. The idea is always directly translated into code. A bucket is a place in memory when we keep the number of the longest streak of leading zeros. Our current estimate of the number of distinct elements in a set is just two to the power of value in a bucket.
The third one, a closing idea of the HyperLogLog is that in order to decrease variance we use many buckets, instead of one. That way is the result in one bucket is largely inaccurate, because, for example, the first value had 10 leading zeroes(which is very improbable), the error in that bucket is responsible only for part of the global result. A number of buckets is a tradeoff. The more the buckets you have, the more memory it takes to keep them, but the lower variance is.
To sum up, there are three ideas that constitute the core of HyperLogLog:
- using hashes in place of real objects, to make bits independent
- remembering just the longest streak of leading zeroes for a bucket
- using a set of buckets to lower the variance and increase estimation accuracy.
In practice, there are some details(like some constants apply here and there) and optimization(like using sparse tables for lower cardinalities), but those are more like implementation details then the core of the algorithm.
Education implementation of HyperLogLog in C# is available at my hyperloglog repository. The code is as simple as possible, well documented and tested.