written 8.5 years ago by |
Multistage Algorithm:
The Multistage Algorithm improves upon PCY by using several successive hash tables to reduce further the number of candidate pairs. The tradeoff is that Multistage takes more than two passes to find the frequent pairs. An outline of the Multistage Algorithm is shown in Fig:
Figure: The Multistage Algorithm uses additional hash tables to reduce the number of candidate pairs
The first pass of Multistage is the same as the first pass of PCY. After that pass, the frequent buckets are identified and summarized by a bitmap.
But the second pass of Multistage does not count the candidate pairs. Rather, it uses the available main memory for another hash table, using another hash function. Since the bitmap from the first hash table takes up 1/32 of the available main memory, the second hash table has almost as many buckets as the first.
On the second pass of Multistage, we again go through the file of baskets. There is no need to count the items again, since we have those counts from the first pass.
However, we must retain the information about which items are frequent, since we need it on both the second and third passes. During the second pass, we hash certain pairs of items to buckets of the second hash table.
A pair is hashed only if it meets the two criteria for being counted in the second pass that is, we hash {i, j} if and only if i and j are both frequent, and the pair hashed to a frequent bucket on the first pass.
As a result, the sum of the counts in the second hash table should be significantly less than the sum for the first pass. The result is that, even though the second hash table has only 31/32 of the number of buckets that the first table has, we expect there to be many fewer frequent buckets in the second hash table than in the first.
After the second pass, the second hash table is also summarized as a bitmap, and that bitmap is stored in main memory. The two bitmaps together take up slightly less than 1/16th of the available main memory, so there is still plenty of space to count the candidate pairs on the third pass. A pair {i, j} is in C2 if and only if:
- i and j are both frequent items.
- {i, j} hashed to a frequent bucket in the first hash table.
- {i, j} hashed to a frequent bucket in the second hash table.
The Multihash Algorithm:
Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a single pass. This variation of PCY is called the Multihash Algorithm.
Instead of using two different hash tables on two successive passes, use two hash functions and two separate hash tables that share main memory on the first pass as suggested in the following figure:
Example: Suppose that if we run PCY, the average bucket will have a count s/10, where s is the support threshold. Then if we used the Multihash approach with two half-sized hash tables, the average count would be s/5. As a result, at most 1/5th of the buckets in either hash table could be frequent, and a random infrequent pair has at most probability (1/5)2 = 0.04 of being in a frequent bucket in both hash tables.
By the same reasoning, the upper bound on the infrequent pair being in a frequent bucket in the one PCY hash table is at most 1/10. That is, we might expect to have to count 2.5 times as many infrequent pairs in PCY as in the version of Multihash suggested above. We would therefore expect Multihash to have a smaller memory requirement for the second pass than would PCY.
There may be many fewer frequent buckets than the maximum for either algorithm, since the presence of some very frequent pairs will skew the distribution of bucket counts. However, this analysis is suggestive of the possibility that for some data and support thresholds, we can do better by running several hash functions in main memory at once.
For the second pass of Multihash, each hash table is converted to a bitmap, as usual.
The conditions for a pair {i, j} to be in C2, and thus to require a count on the second pass, are the same as for the third pass of Multistage: i and j must both be frequent, and the pair must have hashed to a frequent bucket according to both hash tables.