Multistage Frequent Itemset Mining Algorithm

60views

written 6.0 years ago by

teamques10 ★ 68k

Multistage Algorithm:

The Multistage Algorithm improves upon PCY by using several successive hash tables to reduce further the number of candidate pairs. The tradeoff is that Multistage takes more than two passes to find the frequent pairs. An outline of the Multistage Algorithm is shown in Fig:

enter image description here

Figure: The Multistage Algorithm uses additional hash tables to reduce the number of candidate pairs

The first pass of Multistage is the same as the first pass of PCY. After that pass, the frequent buckets are identified and summarized by a bitmap.

But the second pass of Multistage does not count the candidate pairs. Rather, it uses the available main memory for another hash table, using another hash function.

Since the bitmap from the first hash table takes up 1/32 of the available main memory, the second hash table has almost as many buckets as the first.

On the second pass of Multistage, we again go through the file of baskets. There is no need to count the items again, since we have those counts from the first pass.

However, we must retain the information about which items are frequent, since we need it on both the second and third passes. During the second pass, we hash certain pairs of items to buckets of the second hash table.

A pair is hashed only if it meets the two criteria for being counted in the second pass that is, we hash {i, j} if and only if i and j are both frequent, and the pair hashed to a frequent bucket on the first pass.

As a result, the sum of the counts in the second hash table should be significantly less than the sum for the first pass. The result is that, even though the second hash table has only 31/32 of the number of buckets that the first table has, we expect there to be many fewer frequent buckets in the second hash table than in the first.

After the second pass, the second hash table is also summarized as a bitmap, and that bitmap is stored in main memory. The two bitmaps together take up slightly less than 1/16th of the available main memory, so there is still plenty of space to count the candidate pairs on the third pass. A pair {i, j} is in C2 if and only if:

i and j are both frequent items.
{i, j} hashed to a frequent bucket in the first hash table.
{i, j} hashed to a frequent bucket in the second hash table.

ADD COMMENT EDIT