SON algorithm using MapReduce :-
The SON algorithm work well in a parallel-computing environment.
Each of the chunks can be processed in parallel, and the frequent
itemsets from each chunk combined to form the candidates.
We can distribute the candidates to many processors, have each
processor count the support for each candidate in a subset of the
baskets, and finally sum those supports to get the support for each
candidate itemset in the whole dataset.
There is a natural way of expressing each of the two passes as a
MapReduce operation
MapReduce-MapReduce sequence :-
First Map function :-
- Take the assigned subset of the baskets and find the itemsets frequent
in the subset using the simple and randomized algorithm.
- Lower the support threshold from s to ps if each Map task gets fraction
p of the total input file.
- The output is a set of key-value pairs (F, 1), where F is a frequent
itemset from the sample.
First Reduce Function :-
- Each Reduce task is assigned a set of keys, which are itemsets.
- The value is ignored, and the Reduce task simply produces those keys
(itemsets) that appear one or more times. Thus, the output of the first
Reduce function is the candidate itemsets.
Second Map function :-
- The Map tasks for the second Map function take all the output from
the first Reduce Function (the candidate itemsets) and a portion of the
input data file.
- Each Map task counts the number of occurrences of each of the
candidate itemsets among the baskets in the portion of the dataset
that it was assigned.
- The output is a set of key-value pairs (C, v), where C is one of the
candidate sets and v is the support for that itemset among the baskets
that were input to this Map task.
Second Reduce function :-
- The Reduce tasks take the itemsets they are given as keys and sum
the associated values.
- The result is the total support for each of the itemsets that the Reduce
task was assigned to handle.
- Those itemsets whose sum of values is at least s are frequent in the
whole dataset, so the Reduce task outputs these itemsets with their
counts.
- Itemsets that do not have total support at least s are not transmitted to
the output of the Reduce task.