written 2.5 years ago by | • modified 2.5 years ago |
Explain SON algorithm to find all or most frequent itemsets using at most two passes.
written 2.5 years ago by | • modified 2.5 years ago |
Explain SON algorithm to find all or most frequent itemsets using at most two passes.
written 2.5 years ago by |
SON Algorithm :-
The idea is to divide the input file into chunks.
Treat each chunk as a sample, and run the simple and randomized algorithm on that chunk.
We use ps as the threshold, if each chunk is fraction p of the whole file, and s is the support threshold.
Store on disk all the frequent itemsets found for each chunk.
Once all the chunks have been processed in that way, take the union of all the itemsets that have been found frequent for one or more chunks. These are the candidate itemsets.
If an itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the number of chunks is 1/p, we conclude that the total support for that itemset is less than (1/p)ps = s.
Thus, every itemset that is frequent in the whole is frequent in at least one chunk, and we can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false negatives. We have made a total of one pass through the data as we read each chunk and processed it.
In a second pass, we count all the candidate itemsets and select those that have support at least s as the frequent itemsets.