Association Rule Mining:
- The items or objects in relational databases, transactional databases, or other information repositories are considered for finding frequent patterns, associations, correlations, or causal structures.
- It searches for interesting relationships among items in a given data set. By examining transactions (shopping carts), we can find which items are commonly purchased together. This knowledge can be used in advertising or in the placement of goods in stores.
Association rules have the general form
$I_{1} \rightarrow I_{2}$ (where $I_{1} \cap I_{2} = \emptyset$)
where $I_{1}$ and $I_{2}$ are sets of items, for example items that can be purchased in a store.
The rule should be read as "Given that someone has bought the items in the set $I_1$, they are likely to also buy the items in the set $I_{2}$ ".
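For instance, in a hypothetical grocery data set, the rule $\{bread, butter\} \rightarrow \{milk\}$ says that customers who buy bread and butter are also likely to buy milk; here $I_1 = \{bread, butter\}$ and $I_2 = \{milk\}$.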
Large itemsets
An itemset is a set of items from the database of transactions.
If some items often occur together, they can form an association rule.
Support:
The support of an itemset is the number of transactions containing that itemset divided by the total number of transactions; in other words, it is the percentage of transactions in which the items appear.
For a rule $A \Rightarrow B$,
$\text{Support}(A \Rightarrow B)=\frac{\# \text{ tuples containing both } A \text{ and } B}{\text{total } \# \text{ of tuples}}$
The support(s) for an association rule $X \Rightarrow Y$ is the percentage of transactions in the database that contains $X \cup Y$ i.e. (X and Y together).
An itemset is considered to be a large itemset if its support is above some threshold.
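For example, with five hypothetical transactions of which two contain both milk and bread, $\text{Support}(\text{milk} \Rightarrow \text{bread}) = 2/5 = 40\%$; with a support threshold of 30%, {milk, bread} would be a large itemset.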
Confidence
The confidence or strength of an association rule $A \Rightarrow B$ is the ratio of the number of transactions that contain $A \cup B$ to the number of transactions that contain $A$; that is, it measures how often tuples containing $A$ also contain $B$:
$\text{Confidence}(A \Rightarrow B) = \frac{\# \text{ tuples containing both } A \text{ and } B}{\# \text{ tuples containing } A}$
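As a quick illustration of both measures, here is a minimal Python sketch; the transaction list and the rule $\{milk\} \Rightarrow \{bread\}$ are made-up examples, not taken from the text above:

```python
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "sugar"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """support(A union B) / support(A) for the rule A => B."""
    return support(a | b, transactions) / support(a, transactions)

print(support({"milk", "bread"}, transactions))      # 2/5 = 0.4
print(confidence({"milk"}, {"bread"}, transactions)) # 0.4 / 0.6 = 0.67
```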
Apriori Algorithm
- The Apriori algorithm solves the frequent itemset problem.
- The algorithm analyzes a data set to determine which combinations of items occur together frequently.
- The Apriori algorithm is at the core of various algorithms for data mining problems; the best-known problem is finding the association rules that hold in a basket-item relation.
Algorithm:
Input:
D: a database of transactions;
min_sup: the minimum support count threshold
Output:
L: frequent itemsets in D.
Method:
(1) $L_1$ = find_frequent_1-itemsets(D);
(2) for (k = 2; $L_{k-1} \neq \emptyset$; k++) {
(3) $\quad$ $C_k$ = apriori_gen($L_{k-1}$);
(4) $\quad$ for each transaction t $\in$ D { // scan D for counts
(5) $\quad$ $\quad$ $C_t$ = subset($C_k$, t); // get the subsets of t that are candidates
(6) $\quad$ $\quad$ for each candidate c $\in$ $C_t$
(7) $\quad$ $\quad$ $\quad$ c.count++;
(8) $\quad$ }
(9) $\quad$ $L_k$ = { c $\in C_k$ | c.count $\ge$ min_sup };
(10) }
(11) return $L = \bigcup_k L_k$;
Procedure apriori_gen($L_{k-1}$: frequent (k-1)-itemsets)
(1) for each itemset $l_1 \in L_{k-1}$
(2) $\quad$ for each itemset $l_2 \in L_{k-1}$
(3) $\quad$ $\quad$ if $(l_1[1] = l_2[1]) \wedge (l_1[2] = l_2[2]) \wedge \ldots \wedge (l_1[k-2] = l_2[k-2]) \wedge (l_1[k-1] \lt l_2[k-1])$ then {
(4) $\quad$ $\quad$ $\quad$ $c = l_1 \bowtie l_2$; // join step: generate candidates
(5) $\quad$ $\quad$ $\quad$ if has_infrequent_subset(c, $L_{k-1}$) then
(6) $\quad$ $\quad$ $\quad$ $\quad$ delete c; // prune step: remove unfruitful candidate
(7) $\quad$ $\quad$ $\quad$ else add c to $C_k$;
(8) $\quad$ $\quad$ }
(9) return $C_k$;
Procedure has_infrequent_subset(c: candidate k-itemset; $L_{k-1}$: frequent (k-1)-itemsets); // use prior knowledge
(1) for each (k-1)-subset s of c
(2) $\quad$ if s $\notin L_{k-1}$ then
(3) $\quad$ $\quad$ return TRUE;
(4) return FALSE;
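Below is a compact, runnable Python sketch of the same level-wise method (an illustrative implementation written for this answer, not part of the textbook listing). It treats min_sup as an absolute count, keeps each itemset as a sorted tuple so the join step can compare the first k-2 items exactly as apriori_gen does, and prunes any candidate with an infrequent (k-1)-subset:

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join + prune: build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = set(prev_frequent)
    candidates = set()
    for l1 in prev_frequent:
        for l2 in prev_frequent:
            # join step: first k-2 items equal, last item of l1 smaller than l2's
            if l1[:k-2] == l2[:k-2] and l1[k-2] < l2[k-2]:
                c = l1 + (l2[k-2],)
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(s in prev for s in combinations(c, k-1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_sup):
    """Return every frequent itemset (as a sorted tuple) with its support count."""
    counts = {}
    for t in transactions:                       # find frequent 1-itemsets
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    result, k = dict(L), 2
    while L:
        candidates = apriori_gen(sorted(L), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                   # scan D for counts
            for c in candidates:
                if set(c) <= set(t):
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(L)
        k += 1
    return result                                # L = union of all Lk

# Toy data (made up): min_sup = 2 means "appears in at least 2 transactions".
transactions = [{"milk", "bread"}, {"bread", "butter"},
                {"milk", "bread", "butter"}, {"milk", "butter"}]
print(apriori(transactions, min_sup=2))
```

Representing each itemset as a sorted tuple makes both the join comparison and the subset test straightforward; a production implementation would index candidates (for example, in a hash tree) rather than scanning every candidate against every transaction.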