written 5.8 years ago by |
Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k≤ N. For this purpose we use the DGIM algorithm.
The basic version of the algorithm uses O(log2 N) bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%.
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so they can be represented by log2 N bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window the bit with that timestamp is.
We divide the window into buckets, 5 consisting of:
The timestamp of its right (most recent) end.
The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.
To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right end. To represent the number of 1’s we only need log2 log2 N bits. The reason is that we know this number i is a power of 2, say 2j , so we can represent i by coding j in binary. Since j is at most log2 N, it requires log2 log2 N bits. Thus, O(logN) bits suffice to represent a bucket. There are six rules that must be followed when representing a stream by buckets.
The right end of a bucket is always a position with a 1.
Every position with a 1 is in some bucket.
No position is in more than one bucket.
There are one or two buckets of any given size, up to some maximum size.
All sizes must be a power of 2.
Buckets cannot decrease in size as we move to the left (back in time).