Explain SON algorithm usng MapReduce.

116views

written 2.7 years ago by

Krishna_Agrawal • 2.7k

SON algorithm using MapReduce :-

The SON algorithm work well in a parallel-computing environment.
Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates.
We can distribute the candidates to many processors, have each processor count the support for each candidate in a subset of the baskets, and finally sum those supports to get the support for each candidate itemset in the whole dataset.
There is a natural way of expressing each of the two passes as a MapReduce operation

MapReduce-MapReduce sequence :-

First Map function :-

Take the assigned subset of the baskets and find the itemsets frequent in the subset using the simple and randomized algorithm.
Lower the support threshold from s to ps if each Map task gets fraction p of the total input file.
The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample.

First Reduce Function :-

Each Reduce task is assigned a set of keys, which are itemsets.
The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.

Second Map function :-

The Map tasks for the second Map function take all the output from the first Reduce Function (the candidate itemsets) and a portion of the input data file.
Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned.
The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the support for that itemset among the baskets that were input to this Map task.

Second Reduce function :-

The Reduce tasks take the itemsets they are given as keys and sum the associated values.
The result is the total support for each of the itemsets that the Reduce task was assigned to handle.
Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts.
Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.

ADD COMMENT EDIT