This question appears in Mumbai University > Big Data Analytics
Marks: 10 M
The SON algorithm lends itself well to a parallel-computing environment. Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates.
We can distribute the candidates to many processors, have each processor count the support for each candidate in a subset of the baskets, and finally sum those supports to get the support for each candidate itemset in the whole dataset.
This process does not have to be implemented in MapReduce, but there is a natural way of expressing each of the two passes as a MapReduce operation. We shall summarize this MapReduce-MapReduce sequence below.
First Map Function: Take the assigned subset of the baskets and find the itemsets frequent in the subset using the simple Randomized Algorithm. As described there, lower the support threshold from s to ps if each Map task gets fraction p of the total input file. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample. The value is always 1 and is irrelevant.
First Reduce Function: Each Reduce task is assigned a set of keys, which are itemsets. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.
Second Map Function: The Map tasks for the second Map function take all the output from the first Reduce Function (the candidate itemsets) and a portion of the input data file. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned. The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the support for that itemset among the baskets that were input to this Map task.
Second Reduce Function: The Reduce tasks take the itemsets they are given as keys and sum the associated values. The result is the total support for each of the itemsets that the Reduce task was assigned to handle. Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.
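As a hedged illustration only, here is a minimal Python sketch that simulates this MapReduce-MapReduce sequence in memory. The equal-sized chunking, the pair-size limit in local_frequent, and the example baskets are assumptions made for the sketch, not part of the answer above.

```python
from itertools import combinations
from collections import Counter

def local_frequent(baskets, threshold, max_size=2):
    """Pass 1 Map step: itemsets frequent within one chunk.
    Only itemsets up to max_size are counted, to keep the sketch short."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return {itemset for itemset, c in counts.items() if c >= threshold}

def son_two_pass(chunks, s):
    p = 1.0 / len(chunks)            # assume equal-sized chunks, each a fraction p

    # First Map + Reduce: frequent itemsets per chunk at threshold p*s,
    # emitted as (F, 1); the union of the keys is the candidate set.
    pass1_pairs = [(F, 1) for chunk in chunks for F in local_frequent(chunk, p * s)]
    candidates = {F for F, _ in pass1_pairs}

    # Second Map: partial support (C, v) of every candidate in every chunk.
    pass2_pairs = []
    for chunk in chunks:
        for C in candidates:
            v = sum(1 for basket in chunk if set(C) <= set(basket))
            pass2_pairs.append((C, v))

    # Second Reduce: sum the partial supports and keep itemsets with total >= s.
    totals = Counter()
    for C, v in pass2_pairs:
        totals[C] += v
    return {C: v for C, v in totals.items() if v >= s}

# Tiny illustrative run: two chunks of baskets, support threshold 3.
chunks = [[["a", "b"], ["a", "c"], ["a", "b", "c"]],
          [["a", "b"], ["b", "c"], ["a", "b", "c"]]]
print(son_two_pass(chunks, 3))
```

In a real cluster the two passes would run as separate MapReduce jobs, with the candidate itemsets produced by the first Reduce distributed to every Map task of the second pass.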
SON Algorithm:
It is an improvement over PCY for counting frequent itemsets.
The idea is to divide the input file into chunks.
Treat each chunk as a sample and find the set of frequent itemsets within that chunk.
We use ps as the threshold for each chunk, where p is the fraction of the whole file held by the chunk and s is the overall support threshold.
Store on disk all the frequent itemsets found for each chunk.
Once all the chunks have been processed in this way, take the union of all the itemsets that have been found frequent in one or more chunks. These are the candidate itemsets.
Every itemset that is frequent in the whole file is frequent in at least one chunk, so there are no false negatives (a short numeric check follows after this list).
This requires one full pass through the data, since each chunk is read and processed exactly once.
In the second pass, we count all the candidate itemsets and select those with support at least s as the frequent itemsets.
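To make the no-false-negatives point concrete, here is a small back-of-the-envelope check in Python; the values s = 100 and four equal chunks are assumptions chosen purely for illustration.

```python
# Illustrative numbers only: overall threshold s and four equal chunks.
s = 100
p = 0.25                      # each chunk holds a quarter of the baskets
chunk_threshold = p * s       # lowered per-chunk threshold = 25

# If an itemset is NOT frequent in any chunk, its count per chunk is at most 24,
# so across all four chunks it can total at most 96 < 100 = s.
worst_case_total = 4 * (chunk_threshold - 1)
assert worst_case_total < s
# Hence any itemset with total support >= s must reach the ps threshold in at
# least one chunk, so it always survives pass 1 as a candidate: no false negatives.
```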
The MR version of SON is as follows:
Pass 1:
Map(key, value)
• Find the itemsets frequent in the assigned chunk (a sample that is fraction p of the file), using the lowered support threshold ps.
• The output is a set of key-value pairs (F, 1), where F is a frequent itemset found in the chunk.
Reduce(key, values)
• The values are ignored; the Reduce task produces those keys (itemsets) that appear one or more times. These are the candidate itemsets.
Pass 2:
Map(key, value)
• Count the occurrences of each candidate itemset among the baskets in the assigned portion of the input.
• For itemset in candidate itemsets:
• Emit (itemset, support(itemset))
Reduce(key, values)
• result = 0
• For value in values:
result = result + value
• If result >= s then emit (key, result)