written 2.7 years ago by | • modified 2.7 years ago |
Explain simple and randomized algorithm to find most frequent itemsets using at most two passes.
written 2.7 years ago by | • modified 2.7 years ago |
Explain simple and randomized algorithm to find most frequent itemsets using at most two passes.
written 2.7 years ago by |
Simple and randomized algorithm :-
In simple and randorimized algorithm, we pick a random subset of the baskets and pretend it is the entire dataset instead of using the entire file of baskets.
We must adjust the support threshold to reflect the smaller number of baskets.
For instance, if the support threshold for the full dataset is s, and we choose a sample of 1% of the baskets, then we should examine the sample for itemsets that appear in at least s/100 of the baskets.
The best way to pick the sample is to read the entire dataset, and for each basket, select that basket for the sample with some fixed probability p.
Suppose there are m baskets in the entire file. At the end, we shall have a sample whose size is very close to pm baskets.
However, if the baskets appear in random order in the file already, then we do not even have to read the entire file.
We can select the first pm baskets for our sample. Or, if the file is part of a distributed file system, we can pick some chunks at random to serve as the sample.
Having selected our sample of the baskets, we use part of main memory to store these baskets.
Remaining main memory is used to execute one of the algorithms such as A-Priori or PCY. However, the algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent items