
Random forest


In machine learning, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark.

Each tree is constructed using the following algorithm (a code sketch follows the list):

  1. Let the number of training cases be N, and the number of variables in the classifier be M.
  2. Fix a number m of input variables to be used to determine the decision at each node of the tree; m should be much smaller than M.
  3. Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the remaining ("out-of-bag") cases to estimate the error of the tree by predicting their classes.
  4. At each node of the tree, randomly choose m variables on which to base the decision at that node, and calculate the best split of the training set based on these m variables.
  5. Grow each tree fully; do not prune it (as may be done in constructing a normal tree classifier).
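
The steps above translate directly into code. The following is a minimal sketch, not Breiman and Cutler's implementation: it borrows scikit-learn's DecisionTreeClassifier for the individual trees (its max_features parameter plays the role of m), and the data set, the forest size of 100 trees, and the choice m = √M are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    X, y = load_iris(return_X_y=True)   # illustrative data set
    N, M = X.shape                      # N training cases, M variables
    m = int(np.sqrt(M))                 # m << M variables tried at each node
    n_trees = 100                       # illustrative forest size

    trees = []
    oob_votes = np.zeros((N, len(np.unique(y))), dtype=int)  # votes on out-of-bag cases

    for _ in range(n_trees):
        # Step 3: bootstrap sample -- draw N cases with replacement.
        idx = rng.integers(0, N, size=N)
        # Steps 4-5: grow an unpruned tree; max_features=m makes the tree
        # consider a fresh random subset of m variables at every node.
        tree = DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx])
        trees.append(tree)
        # The cases left out of the bootstrap sample ("out-of-bag") are
        # used to estimate the tree's error.
        oob = np.setdiff1d(np.arange(N), idx)
        oob_votes[oob, tree.predict(X[oob])] += 1

    # The forest outputs the modal (majority-vote) class over all trees.
    all_votes = np.stack([t.predict(X) for t in trees])
    forest_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, all_votes)

    # Out-of-bag error: each case is judged only by trees that never saw it.
    seen = oob_votes.sum(axis=1) > 0
    oob_error = (oob_votes[seen].argmax(axis=1) != y[seen]).mean()
    print(f"training accuracy: {(forest_pred == y).mean():.3f}  OOB error: {oob_error:.3f}")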

The advantages of random forests are:

  • For many data sets, it produces a highly accurate classifier.
  • It handles a very large number of input variables.
  • It estimates the importance of variables in determining classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses (illustrated in the sketch after this list).
  • It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It provides an experimental way to detect variable interactions.
  • It offers a way to balance the error in data sets with unbalanced class populations.
  • It computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data.
  • Using the above capabilities, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection, and new views of the data.
  • Learning is fast.
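
Two of these properties, the internal out-of-bag error estimate and the variable-importance measure, can be seen in a few lines using scikit-learn's RandomForestClassifier, an independent open-source implementation rather than Breiman and Cutler's own code; the data set and parameter values here are illustrative.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # oob_score=True requests the internal generalization-error estimate
    # computed from the out-of-bag cases while the forest is built.
    clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
    clf.fit(X, y)

    print("OOB accuracy estimate:", clf.oob_score_)
    print("variable importances:", clf.feature_importances_)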