written 8.6 years ago by | modified 3.2 years ago by |
KDD :- Knowledge Discovery in Databases. KDD, often loosely called data mining, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
Steps Involved in KDD process:-
Data cleaning :- Data cleaning is the removal of noisy and inconsistent data.
Data integration :- Data integration combines data from multiple data sources into one common data source. This step can be skipped if the data is collected from a single source.
Data selection :- Data selection is the process in which data relevant to the analysis task is retrieved from the database.
Data transformation :- The data is transformed or consolidated into a form suitable for mining, for example by reducing it from a large volume to a small volume using lossy or lossless compression. Skipping this step forces the mining algorithm to work on larger, more complex input.
Data mining :- This is the essential step in which intelligent methods are applied to extract potentially useful data patterns.
Pattern evaluation :- Check the discovered patterns against given measures and keep only those that represent knowledge.
Knowledge presentation :- Knowledge presentation uses data visualization tools to present the potentially useful patterns to the user.
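The steps above can be sketched end to end on a toy example. This is only an illustrative sketch, not a standard implementation: the two "store" sources, the function names, and the choice of item-pair counting as the mining method are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources: (customer id, items bought).
store_a = [("c1", ["milk", "bread"]), ("c2", ["milk", None, "eggs"])]
store_b = [("c3", ["Milk", "bread", "eggs"]), ("c4", ["bread", "milk"]), ("c5", [])]


def clean(records):
    """Data cleaning: drop missing items and fix inconsistent casing."""
    return [(cid, [i.lower() for i in items if i]) for cid, items in records]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [rec for src in sources for rec in src]


def select(records):
    """Data selection: keep only baskets relevant to pair analysis (2+ items)."""
    return [items for _, items in records if len(items) >= 2]


def transform(baskets):
    """Data transformation: store each basket as a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for b in baskets:
        counts.update(combinations(b, 2))
    return counts


def evaluate(counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: c / n_baskets
            for pair, c in counts.items() if c / n_baskets >= min_support}


baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))

# Knowledge presentation: show the surviving patterns, strongest first.
for pair, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(pair, round(support, 2))
```

Each function maps to one KDD step, so a step can be swapped out (for example, a different mining method) without touching the others.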
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process.
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection (2) Pre-processing (3) Transformation (4) Data Mining (5) Interpretation/Evaluation.
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
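As one small illustration of the variable reduction mentioned in the list above, the sketch below drops columns that never vary, since a constant feature cannot help distinguish records. The function name `drop_low_variance`, the column names, and the sample rows are hypothetical, and zero variance is only one of many possible reduction criteria.

```python
from statistics import pvariance


def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [k for k, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[k] for k in keep] for row in rows]
    return reduced, [names[k] for k in keep]


# Hypothetical data set: the second column is constant and carries no information.
rows = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.4], [3.0, 5.0, 0.1]]
names = ["age", "country_code", "score"]

reduced, kept = drop_low_variance(rows, names)
print(kept)
```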
DBSCAN ALGORITHM
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm. It clusters based on density (a local cluster criterion), grouping together density-connected points.
Major features:
- Handles noise
- One scan
- Discovers clusters of arbitrary shape
- Needs density parameters as termination condition
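Directly density-reachable:
A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p lies in the Eps-neighborhood of q and that neighborhood contains at least MinPts points (that is, q is a core point).
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that each p_(i+1) is directly density-reachable from p_i.
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.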
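The definitions above can be checked on a small data set. The sketch below is a brute-force illustration, assuming Euclidean distance and a neighborhood that includes the point itself; the helper names (`eps_neighborhood`, `is_core`, `directly_reachable`, `density_reachable_set`, `density_connected`) and the sample points are made up for this example.

```python
from math import dist


def eps_neighborhood(points, q, eps):
    """Indices of all points within distance eps of points[q] (q included)."""
    return [i for i, x in enumerate(points) if dist(points[q], x) <= eps]


def is_core(points, q, eps, min_pts):
    """q is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def directly_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q: q is core and p is near q."""
    return is_core(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)


def density_reachable_set(points, q, eps, min_pts):
    """All indices reachable from q through a chain of core points."""
    reached, frontier = set(), [q]
    while frontier:
        cur = frontier.pop()
        if not is_core(points, cur, eps, min_pts):
            continue  # a chain can only be extended through a core point
        for nb in eps_neighborhood(points, cur, eps):
            if nb not in reached:
                reached.add(nb)
                frontier.append(nb)
    return reached


def density_connected(points, p, q, eps, min_pts):
    """p and q are density-connected if some o reaches both of them."""
    return any({p, q} <= density_reachable_set(points, o, eps, min_pts)
               for o in range(len(points)))


# Four points in a row plus one far-away point; with eps=1 and min_pts=3
# only the two middle points (indices 1 and 2) are core points.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(directly_reachable(pts, 0, 1, 1.0, 3))  # border point 0 from core point 1
print(directly_reachable(pts, 1, 0, 1.0, 3))  # the reverse direction
print(density_connected(pts, 0, 3, 1.0, 3))   # two border points of one cluster
```

The example shows that direct density-reachability is not symmetric (a border point cannot reach a core point), while density-connectivity is symmetric, which is why clusters are defined in terms of the latter.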
ALGORITHM
Input: D : a data set containing n objects, ε : the radius parameter, and MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
mark all objects as unvisited;
do
    randomly select an unvisited object p;
    mark p as visited;
    if the ε-neighborhood of p has at least MinPts objects
        create a new cluster C, and add p to C;
        let N be the set of objects in the ε-neighborhood of p;
        for each point p' in N
            if p' is unvisited
                mark p' as visited;
                if the ε-neighborhood of p' has at least MinPts points, add those points to N;
            if p' is not yet a member of any cluster, add p' to C;
        end for
        output C;
    else mark p as noise;
until no object is unvisited;
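The method above can be turned into a short runnable sketch. This is a minimal brute-force version, assuming Euclidean distance and an O(n) neighborhood scan per point. The function name `dbscan`, the `seed` parameter, and the sample points are assumptions for illustration. One detail the pseudocode leaves implicit is made explicit here: a point first marked as noise loses that label if it is later absorbed into a cluster as a border point.

```python
import random
from math import dist


def dbscan(data, eps, min_pts, seed=0):
    """Return (clusters, noise): lists of point indices per cluster, and noise indices."""
    rng = random.Random(seed)
    n = len(data)

    def neighbors(i):
        # Brute-force eps-neighborhood of point i (includes i itself).
        return [j for j in range(n) if dist(data[i], data[j]) <= eps]

    unvisited = set(range(n))
    cluster_of = {}  # point index -> cluster id
    clusters = []
    noise = set()

    while unvisited:
        p = rng.choice(sorted(unvisited))  # randomly select an unvisited object
        unvisited.discard(p)               # mark p as visited
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            noise.add(p)                   # may be relabeled later as a border point
            continue

        cid = len(clusters)                # create a new cluster C and add p
        clusters.append([p])
        cluster_of[p] = cid
        queue, queued = list(nbrs), set(nbrs)  # N, which grows as we iterate

        k = 0
        while k < len(queue):
            q = queue[k]
            k += 1
            if q in unvisited:
                unvisited.discard(q)
                q_nbrs = neighbors(q)
                if len(q_nbrs) >= min_pts:     # q is a core point: extend N
                    for r in q_nbrs:
                        if r not in queued:
                            queued.add(r)
                            queue.append(r)
            if q not in cluster_of:            # add q to C if it has no cluster yet
                cluster_of[q] = cid
                clusters[cid].append(q)
                noise.discard(q)
    return clusters, sorted(noise)


# Two tight squares of four points each, plus one isolated outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
clusters, noise = dbscan(points, eps=1.5, min_pts=3)
print(clusters, noise)
```

The order in which clusters are discovered depends on the random selection, so only the membership of each cluster, not its position in the list, should be relied on.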
Advantages