written 8.5 years ago by |
An alternative approach is to redefine the question so that we are not asking for a count of 1’s in a window. Rather, let us compute a smooth aggregation of all the 1’s ever seen in the stream, with decaying weights, so the further back in the stream, the less weight is given. Formally, let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element to arrive and at is the current element. Let c be a small constant, such as 10−6 or 10−9. Define the exponentially decaying window for this stream to be the sum
The effect of this definition is to spread out the weights of the stream elements as far back in time as the stream goes.
Finding the Most Popular Elements
Considering the example of finding the most popular movies in an stream of ticket sales:
We shall use an exponentially decaying window with a constant c, which you might think of as 10−9. That is, we approximate a sliding window holding the last one billion ticket sales. For each movie, we imagine a separate stream with a 1 each time a ticket for that movie appears in the stream, and a 0 each time a ticket for some other movie arrives. The decaying sum of the 1’s measures the current popularity of the movie.
We imagine that the number of possible movies in the stream is huge, so we do not want to record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so that if the popularity score for a movie goes below this number, its score is dropped from the counting. For reasons that will become obvious, the threshold must be less than 1, although it can be any number less than 1. When a new ticket arrives on the stream, do the following:
For each movie whose score we are currently maintaining, multiply its score by (1 − c).
Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that score. If there is no score for M, create one and initialize it to 1.
If any score is below the threshold 1/2, drop that score.
It may not be obvious that the number of movies whose scores are maintained at any time is limited. However, note that the sum of all scores is 1/c. There cannot be more than 2/c movies with score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus, 2/c is a limit on the number of movies being counted at any time. Of course in practice, the ticket sales would be concentrated on only a small number of movies at any time, so the number of actively counted movies would be much less than 2/c.