written 7.0 years ago by | modified 2.5 years ago by |
Mumbai University > Information Technology > Sem 8 > Big Data Analysis
Marks: 10M
This algorithm, which we call PCY after its authors (Park, Chen, and Yu), exploits the observation that there may be much unused space in main memory on the first pass. If there are a million items and gigabytes of main memory, we do not need more than 10% of the main memory for the two tables (the item-name translation table and the item counts) used on the first pass.
The PCY Algorithm uses that space for an array of integers that generalizes the idea of a Bloom filter.
The idea is shown schematically in Fig. 2. Think of this array as a hash table whose buckets hold integers, rather than sets of keys (as in an ordinary hash table) or bits (as in a Bloom filter).
Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first pass, we not only add 1 to the count for each item in the basket, but we also generate all the pairs in the basket, using a double loop.
We hash each pair and add 1 to the bucket into which that pair hashes. Note that the pair itself does not go into the bucket; the pair only affects the single integer in the bucket.
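The first pass described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name `pcy_pass1`, the choice of Python's built-in `hash` on the sorted pair, and the parameter `num_buckets` are all assumptions made for the example.

```python
from itertools import combinations

def pcy_pass1(baskets, num_buckets):
    """First pass of PCY (sketch): count individual items, and hash
    every pair in every basket into a bucket of pair counts."""
    item_counts = {}
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        # add 1 to the count of each item in the basket
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        # generate all pairs (the "double loop"); combinations on the
        # sorted items yields each unordered pair {i, j} once, with i < j
        for pair in combinations(sorted(set(basket)), 2):
            # the pair itself is not stored -- it only bumps one integer
            bucket_counts[hash(pair) % num_buckets] += 1
    return item_counts, bucket_counts
```

Note that memory use is fixed in advance: however many pairs occur, only `num_buckets` integers are kept, which is what lets the otherwise idle first-pass memory be put to work.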
At the end of the first pass, each bucket has a count, which is the sum of the counts of all the pairs that hash to that bucket. If the count of a bucket is at least as great as the support threshold s, it is called a frequent bucket.
We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent pairs from the information available to us. But if the count of the bucket is less than s (an infrequent bucket), we know no pair that hashes to this bucket can be frequent, even if the pair consists of two frequent items.
That fact gives us an advantage on the second pass. We can define the set of candidate pairs C2 to be those pairs {i, j} such that:

1. i and j are both frequent items.
2. {i, j} hashes to a frequent bucket.
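The between-pass step implied here can be sketched as well: the bucket counts are condensed to a bitmap of frequent buckets, and a pair survives into C2 only if both items are frequent and its bucket is frequent. The function name `pcy_candidates` is an assumption for the example; the pair-hashing must match whatever was used on the first pass (here, Python's `hash` on the pair with i < j).

```python
def pcy_candidates(item_counts, bucket_counts, s, num_buckets):
    """Between passes (sketch): replace bucket counts by a bitmap of
    frequent buckets, then form the candidate pairs C2."""
    # one bit per bucket: True iff the bucket's count reaches support s
    bitmap = [count >= s for count in bucket_counts]
    frequent_items = {i for i, count in item_counts.items() if count >= s}
    return {
        (i, j)
        for i in frequent_items
        for j in frequent_items
        if i < j and bitmap[hash((i, j)) % num_buckets]  # both PCY conditions
    }
```

A pair of frequent items that hashes to an infrequent bucket is dropped here without ever being counted, which is exactly the saving PCY buys over the plain A-Priori second pass; a pair in a frequent bucket may still turn out infrequent and must be counted on pass 2.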