Fig: Text Mining
- Text mining is the process of synthesizing information by analyzing relations, patterns, and rules in textual data, whether semi-structured or unstructured.
- It includes text summarization, text categorization, and text clustering.
- Text summarization automatically extracts a part of a text that reflects its whole content.
- Text categorization assigns a text to one of a set of categories predefined by users.
- Text clustering segments texts into several clusters according to how closely their content is related.
Techniques:
- Data mining
- Machine learning
- Information retrieval
- Statistics
- Natural-language understanding
- Case-based reasoning
Text Mining Approaches:
Keyword-based Association Analysis:
- Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationship among them.
- First preprocess the text data by parsing, stemming, removing stop words, etc.
- Then invoke association mining algorithms:
- Consider each document as a transaction.
- View the set of keywords in the document as the set of items in the transaction.
- Term-level association mining has two advantages:
- No human effort is needed to tag documents.
- The number of meaningless results and the execution time are greatly reduced.
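The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production miner: the stop-word list, the crude "strip a trailing s" stemmer, the function names `preprocess` and `mine_pair_rules`, and the thresholds are all assumptions made for the example, and only keyword pairs are mined rather than itemsets of any size.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to", "for"}

def preprocess(text):
    """Lowercase, tokenize on whitespace, drop stop words, apply a crude stem."""
    keywords = set()
    for token in text.lower().split():
        token = token.strip(".,;:!?()")
        if not token or token in STOP_WORDS:
            continue
        # Crude stand-in for a real stemmer: strip a trailing "s".
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        keywords.add(token)
    return keywords

def mine_pair_rules(documents, min_support=0.5, min_confidence=0.8):
    """Treat each document as a transaction and mine rules x -> y over keyword pairs."""
    transactions = [preprocess(doc) for doc in documents]
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for items in transactions:
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of documents containing both keywords
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)

if __name__ == "__main__":
    docs = [
        "Text mining finds patterns",
        "Text mining uses statistics",
        "Statistics of patterns",
        "Text mining and patterns",
    ]
    for x, y, s, c in mine_pair_rules(docs):
        print(f"{x} -> {y} (support={s:.2f}, confidence={c:.2f})")
```

On this toy corpus only the pair "mining" and "text" passes both thresholds, so the sketch reports the rules mining -> text and text -> mining, each with support 0.75 and confidence 1.00.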
Document Classification Analysis:
Automatic document classification:
- Automatically classifies the very large number of online text documents (Web pages, emails, etc.).
- Text document classification differs from the classification of relational data as document databases are not structured according to attribute-value pairs.
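To make the contrast with relational data concrete: free text has no fixed attributes, so a common first step is to map each document onto a fixed vocabulary of terms. The sketch below shows one simple way to do this with term counts; the function name `term_vector` and the vocabulary are hypothetical, chosen only for illustration.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Map free text to a fixed-length list of term counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary; in practice it would be derived from the corpus.
vocabulary = ["mining", "price", "text"]
print(term_vector("Text mining of text archives", vocabulary))  # prints [1, 0, 2]
```

Each position in the resulting list plays the role of an attribute, which is what lets classifiers designed for attribute-value data be applied to documents.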
Association-Based Document Classification:
- Extract keywords and terms by information retrieval and simple association analysis techniques.
- Obtain concept hierarchies of keywords and terms using available term classes, such as WordNet, or expert knowledge.
- Classify documents in the training set into class hierarchies.
- Apply a term association mining method to discover sets of associated terms.
- Use the terms that maximally distinguish one class of documents from the others.
- Derive a set of association rules associated with each document class.
- Order the classification rules by their occurrence frequency and discriminative power.
- Use the rules to classify new documents.
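A much simplified sketch of this procedure follows. It assumes documents are already reduced to sets of terms, skips the concept-hierarchy step entirely, and derives only single-term rules of the form "term -> class". Confidence stands in for discriminative power and within-class support for occurrence frequency; the names `learn_rules` and `classify` and the thresholds are hypothetical.

```python
from collections import Counter, defaultdict

def learn_rules(training, min_support=0.5, min_confidence=0.7):
    """training: list of (set_of_terms, class_label) pairs. Returns rules, best first."""
    class_sizes = Counter(label for _, label in training)
    term_total = Counter()
    term_in_class = defaultdict(Counter)
    for terms, label in training:
        term_total.update(terms)
        term_in_class[label].update(terms)
    rules = []
    for label, counts in term_in_class.items():
        for term, count in counts.items():
            support = count / class_sizes[label]   # how often the term occurs within the class
            confidence = count / term_total[term]  # how exclusively the term points to the class
            if support >= min_support and confidence >= min_confidence:
                rules.append((confidence, support, term, label))
    # Order by discriminative power first, then by frequency; term breaks ties.
    rules.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return rules

def classify(terms, rules, default=None):
    """Assign the class of the first (highest-ranked) rule whose term appears in the document."""
    for _confidence, _support, term, label in rules:
        if term in terms:
            return label
    return default

training = [
    ({"goal", "match", "team"}, "sports"),
    ({"match", "team", "coach"}, "sports"),
    ({"election", "vote", "team"}, "politics"),
    ({"vote", "party"}, "politics"),
]
rules = learn_rules(training)
print(classify({"vote", "team"}, rules))  # prints politics
```

In the toy training set, "team" occurs in both classes, so its confidence falls below the threshold and it yields no rule; that is the sense in which only distinguishing terms are kept.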
Document Clustering Analysis:
- Automatically group related documents based on their contents.
- Requires no training sets or predetermined taxonomies; a taxonomy is generated at runtime.
- Major steps:
- Preprocessing: remove stop words, stem, and extract features.
- Hierarchical clustering: compute similarities between documents and apply a clustering algorithm.
- Slicing: control the fan-out and flatten the tree to a configurable number of levels.
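The three steps can be illustrated with a small sketch. It assumes cosine similarity over raw term counts and average-link agglomerative clustering, neither of which is prescribed by the notes. Stemming is omitted, and "slicing" is approximated by stopping the merges once a requested number of clusters remains, rather than by flattening a full tree. The stop-word list and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def vectorize(text):
    """Preprocessing: lowercase, drop stop words, keep term counts as features."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster(documents, num_clusters):
    """Merge the two most similar clusters until num_clusters remain."""
    vectors = [vectorize(d) for d in documents]
    clusters = [[i] for i in range(len(documents))]

    def linkage(c1, c2):
        # Average-link: mean pairwise similarity between members of the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(1, num_clusters):
        a, b = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

docs = [
    "apple banana fruit",
    "banana fruit salad",
    "car engine wheel",
    "engine wheel road",
]
print(cluster(docs, 2))  # prints [[0, 1], [2, 3]]
```

No labels or taxonomy are supplied: the two groups emerge only from shared vocabulary, and changing `num_clusters` changes how coarse the resulting grouping is.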