Data Mining & Business Intelligence - May 2016
Information Technology (Semester 6)
TOTAL MARKS: 80
TOTAL TIME: 3 HOURS
(1) Question 1 is compulsory.
(2) Attempt any three from the remaining questions.
(3) Assume data if required.
(4) Figures to the right indicate full marks.
1(a) Define 'Data Mining'. Enumerate five example applications that can benefit by using Data Mining.(5 marks)
1(b) What is Data Preprocessing? Explain the different methods for the Data Cleansing phase.(5 marks)
1(c) What is hierarchical clustering? Explain any two techniques for finding distance between the clusters in hierarchical clustering.(5 marks)
1(d) Explain the concept of a decision support system with the help of an example application.(5 marks)
2(a) Partition the given data into 4 bins using the equi-depth binning method and perform smoothing according to the following methods.
Smoothing by bin mean
Smoothing by bin median
Smoothing by bin boundaries.
Data: 11,13,13,15,15,16,19,20,20,20,21,21,22,23,24,30,40,45,45,45,71,72,73,75.(10 marks)
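The binning and smoothing steps asked for in 2(a) can be sketched in Python as follows. This is a minimal illustration, not a model answer: the function names are hypothetical, the bins are assumed to hold equal counts (24 values into 4 bins of 6), and a value equally close to both bin boundaries is assumed to go to the lower boundary.

```python
from statistics import mean, median

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]

def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins holding the same number of items."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_mean(bins):
    """Replace every value in a bin by the bin mean (rounded to 2 decimals)."""
    return [[round(mean(b), 2)] * len(b) for b in bins]

def smooth_by_median(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min and max.

    Assumption: ties are resolved in favour of the lower boundary.
    """
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(data, 4)
for name, smoother in [("mean", smooth_by_mean),
                       ("median", smooth_by_median),
                       ("boundaries", smooth_by_boundaries)]:
    print(name, smoother(bins))
```

The tie rule matters for the second bin, where the value 20 lies exactly midway between the boundaries 19 and 21; a different convention would give a different smoothed bin.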
2(b) For the same set of data points as in question 2(a):
(a) Find the mean, median and mode.
(b) Show a boxplot of the data, clearly indicating the five-number summary.(10 marks)
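The statistics requested in 2(b) could be computed along the following lines. This is a sketch under one stated assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data. Other quartile conventions exist and can give slightly different Q1 and Q3 values. The helper name `five_number_summary` is hypothetical.

```python
from statistics import mean, median, multimode

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves;
    for an odd number of values the middle value is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data_mean = mean(data)
data_modes = multimode(data)  # every value tied for the highest frequency
summary = five_number_summary(data)

print("mean:", round(data_mean, 2))
print("modes:", data_modes)
print("min, Q1, median, Q3, max:", summary)
```

`multimode` is used instead of `mode` because the data may have more than one most frequent value, and a boxplot would be drawn from the five values in `summary`.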
3(a) The table below shows a sample dataset of whether a customer responds to a survey or not. 'Outcome' is the class label.
Construct a Decision Tree Classifier for the dataset. For a new example (Rural, Semi-Detached, Low, No), what will be the predicted class label?
District | House Type | Income | Previous Customers | Outcome |
Suburban | Detached | High | No | Nothing |
Suburban | Detached | High | Yes | Nothing |
Suburban | Detached | High | No | Responded |
Urban | Semi-Detached | High | No | Responded |
Urban | Semi-Detached | Low | No | Responded |
Urban | Semi-Detached | Low | No | Nothing |
Rural | Semi-Detached | Low | Yes | Responded |
Suburban | Terrace | High | No | Nothing |
Suburban | Semi-Detached | Low | No | Responded |
Urban | Terrace | Low | No | Responded |
Suburban | Terrace | Low | Yes | Responded |
Rural | Terrace | High | Yes | Responded |
Rural | Detached | Low | No | Responded |
Urban | Terrace | High | Yes | Nothing |
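One common way to choose the root of such a tree is to compare the information gain of each attribute, as in ID3. The question does not name a specific algorithm, so the choice of ID3-style entropy and information gain here is an assumption, and the names `entropy`, `information_gain` and `ROWS` are illustrative only. The rows are copied from the table above.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["District", "House Type", "Income", "Previous Customers"]

# Each row: (District, House Type, Income, Previous Customers, Outcome)
ROWS = [
    ("Suburban", "Detached", "High", "No", "Nothing"),
    ("Suburban", "Detached", "High", "Yes", "Nothing"),
    ("Suburban", "Detached", "High", "No", "Responded"),
    ("Urban", "Semi-Detached", "High", "No", "Responded"),
    ("Urban", "Semi-Detached", "Low", "No", "Responded"),
    ("Urban", "Semi-Detached", "Low", "No", "Nothing"),
    ("Rural", "Semi-Detached", "Low", "Yes", "Responded"),
    ("Suburban", "Terrace", "High", "No", "Nothing"),
    ("Suburban", "Semi-Detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "Low", "No", "Responded"),
    ("Suburban", "Terrace", "Low", "Yes", "Responded"),
    ("Rural", "Terrace", "High", "Yes", "Responded"),
    ("Rural", "Detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "High", "Yes", "Nothing"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the class label minus the weighted entropy after splitting."""
    base = entropy([r[-1] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_root = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("attribute with the highest gain:", best_root)
```

Applying the same computation recursively to each branch's subset of rows would grow the rest of the tree. Other split criteria, such as gain ratio or the Gini index, may select a different attribute.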
Min. Support = 30%, Min. Confidence = 75%
TID | Items |
01 | A, B, D, E, F |
02 | B, C, E |
03 | A, B, D, E |
04 | A, B, C, E |
05 | A, B, C, D, E, F |
06 | B, C, D |
07 | A, B, D, E |
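Assuming the minimum support and confidence thresholds above are meant for association-rule mining over this transaction table, support and confidence can be computed as in the sketch below. It enumerates all itemsets by brute force for clarity; it is not the level-wise candidate generation and pruning that Apriori uses, and the function names are hypothetical.

```python
from itertools import combinations

TRANSACTIONS = {
    "01": {"A", "B", "D", "E", "F"},
    "02": {"B", "C", "E"},
    "03": {"A", "B", "D", "E"},
    "04": {"A", "B", "C", "E"},
    "05": {"A", "B", "C", "D", "E", "F"},
    "06": {"B", "C", "D"},
    "07": {"A", "B", "D", "E"},
}
MIN_SUPPORT = 0.30
MIN_CONFIDENCE = 0.75

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    hits = sum(1 for items in TRANSACTIONS.values() if wanted <= items)
    return hits / len(TRANSACTIONS)

def frequent_itemsets():
    """Brute-force search for all itemsets meeting MIN_SUPPORT."""
    all_items = sorted(set().union(*TRANSACTIONS.values()))
    found = {}
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            s = support(combo)
            if s >= MIN_SUPPORT:
                found[frozenset(combo)] = s
    return found

def association_rules(frequent):
    """Rules lhs -> rhs from frequent itemsets that meet MIN_CONFIDENCE."""
    rules = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                confidence = s / support(lhs)
                if confidence >= MIN_CONFIDENCE:
                    rules.append((set(lhs), set(itemset) - set(lhs), confidence))
    return rules

frequent = frequent_itemsets()
rules = association_rules(frequent)
print(len(frequent), "frequent itemsets,", len(rules), "rules")
```

With 7 transactions, a 30% threshold means an itemset must appear in at least 3 of them, since 2 of 7 is about 28.6%.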
A1=(2, 10), A2=(2, 5), A3=(8, 4), A4=(5, 8),
A5=(7, 5), A6=(6, 4), A7=(1, 2), A8=(4, 9)(10 marks)
5(b) What is an outlier? Describe methods that can be used for outlier analysis.(10 marks)
6(a) Consider the following case study: An international chain of hotels wants to analyze and improve its performance using several performance indicators: quality of rooms, service facilities, check-in, breakfast, popular times of visit, duration of stay, etc.
For this case study, design a BI system, clearly explaining all steps from data collection to decision making.(10 marks)
6(b) Clearly explain the working of the DBSCAN algorithm using appropriate diagrams.(10 marks)