Random Forests : -
A random forest is an ensemble learning method in which multiple decision trees are constructed and their predictions are combined to obtain a more accurate prediction.
Algorithm : -
Here is an outline of the random forest algorithm.
The random forests algorithm generates many classification trees. Each tree is generated as follows:
(a) If the number of examples in the training set is N, take a sample of N examples at random, with replacement, from the original data. This sample will be the training set for generating the tree.
(b) If there are M input variables, a number m is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the generation of the various trees in the forest.
(c) Each tree is grown to the largest extent possible.
To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes over all the trees in the forest.
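The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not a production implementation: the dataset (iris), the number of trees (25), and the choice m = √M are arbitrary assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
N, M = X.shape            # N examples, M input variables
m = int(np.sqrt(M))       # a common (assumed) choice for m

trees = []
for _ in range(25):
    # (a) bootstrap sample: N examples drawn at random with replacement
    idx = rng.integers(0, N, size=N)
    # (b) m variables considered at each node, via max_features=m
    # (c) tree grown to the largest extent possible (no depth limit)
    tree = DecisionTreeClassifier(max_features=m, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def forest_predict(x):
    # each tree "votes"; the forest returns the majority class
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
    return int(np.bincount(votes).argmax())

print(forest_predict(X[0]))  # classify the first training example
```

In practice one would simply use a ready-made implementation such as scikit-learn's RandomForestClassifier, which performs the same bootstrap-and-vote procedure internally.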
Strengths and weaknesses :-
Strengths -
The following are some of the important strengths of random forests.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
Generated forests can be saved for future use on other data.
Prototypes are computed that give information about the relation between the variables and the classification.
The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
It offers an experimental method for detecting variable interactions.
Random forest run times are quite fast, and they are able to deal with unbalanced and missing data.
They can handle binary, categorical, and numerical features without any need for scaling.
There are lots of excellent, free, and open-source implementations of the random forest algorithm. We can find a good implementation in almost all major ML libraries and toolkits.
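As an example of two of the strengths listed above (a ready-made open-source implementation, and built-in variable-importance estimates), the following sketch uses scikit-learn's RandomForestClassifier; the dataset and n_estimators value are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ gives the per-variable importance estimates;
# the scores sum to 1 and rank the input variables.
for name, score in zip(load_iris().feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```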
Weaknesses -
A weakness of random forest algorithms is that, when used for regression, they cannot predict beyond the range of the training data, and they may over-fit data sets that are particularly noisy.
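The extrapolation limit is easy to demonstrate: a regression forest predicts by averaging training targets stored in its leaves, so its output can never exceed the largest target it was trained on. A small sketch (the linear target and query point are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [0, 10) with a simple linear target y = 2x (max ~19.8).
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = X_train.ravel() * 2.0

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Query far outside the training range: the true value would be 200,
# but the prediction is capped near the training maximum.
pred = reg.predict([[100.0]])[0]
print(pred)
```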
The sizes of the models created by random forests may be very large. It may take hundreds of megabytes of memory and may be slow to evaluate.
Random forest models are black boxes that are very hard to interpret.