Thursday, February 6, 2014

Random Forest Almighty

Random Forests are awesome. They do not overfit, they are easy to tune, they tell you about important variables, they can be used for classification and regression, they are implemented in many programming languages, and they are faster than their competitors (neural nets, boosting, support vector machines, ...).

Let us take a moment to appreciate them:




The Random Forest is my shepherd; I shall not want.
    He makes me watch the mean squared error decrease rapidly.
He leads me beside classification problems.
    He restores my soul.
He leads me in paths of the power of ensembles
    for his name's sake.

Even though I walk through the valley of the curse of dimensionality,
    I will fear no overfitting,
for you are with me;
    your bootstrap and your randomness,
    they comfort me.

You prepare a prediction before me
    in the presence of complex interactions;
you anoint me data scientist;
    my wallet overflows.
Surely goodness of fit and money shall follow me
    all the days of my life,
and I shall use Random Forests
    forever.


Joking aside: Random Forests have proven to give stable, good predictions in many settings (for example, in the See Click Predict Fix Kaggle challenge, in which the winner used the Random Forests implementation in R), but they are not the solution to every problem. In a very simple setting, where the true relationship between the response and the covariates is linear, a linear model will perform better than a random forest. You can find a good explanation of why this happens on Cross Validated.
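To make the linear case concrete, here is a minimal sketch in Python with scikit-learn. The data is simulated, and the coefficients, noise level, sample size and number of trees are arbitrary choices of mine for illustration, not taken from any of the challenges mentioned here. It fits a linear model and a random forest to a response that really is linear in the covariates and compares the test error of both.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data (assumption): the response is a linear function of the
# covariates plus Gaussian noise. All numbers below are illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 5))
coef = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ coef + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit both models on the same training data.
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    X_train, y_train
)

# Compare the mean squared error on the held-out data.
mse_linear = mean_squared_error(y_test, linear.predict(X_test))
mse_forest = mean_squared_error(y_test, forest.predict(X_test))
print(f"linear model MSE:  {mse_linear:.3f}")
print(f"random forest MSE: {mse_forest:.3f}")
```

In this setup the linear model only has to estimate a handful of coefficients, while the forest has to approximate a smooth plane with piecewise-constant trees, so its test error should come out noticeably higher.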
One thing I learned the hard way is that you should not get too attached to an algorithm for prediction. This probably applies to other areas as well. When I participated in the Observing Dark Worlds challenge, I fell into this trap by sticking to Random Forests. My model performed poorly, but instead of considering another algorithm, I kept thinking about better features. The winner of this competition used a Bayesian approach.

You can find implementations in R (randomForest package) or in Python (scikit-learn library).
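As a rough starting point for the Python side, the sketch below fits scikit-learn's `RandomForestClassifier` on a synthetic classification problem and prints the out-of-bag accuracy and the variable importances. The dataset and all parameter values are assumptions chosen for illustration: three informative features followed by five pure noise features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): with shuffle=False the first three columns
# are the informative features and the remaining five are noise.
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

# oob_score=True estimates accuracy from the bootstrap samples each tree
# did not see, so no separate validation set is needed for a first look.
forest = RandomForestClassifier(
    n_estimators=300, oob_score=True, random_state=0
)
forest.fit(X, y)

print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_
for index in np.argsort(importances)[::-1]:
    print(f"feature {index}: {importances[index]:.3f}")
```

The out-of-bag score comes for free from the bootstrap, and the importances give a quick, if rough, ranking of the variables.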

Explaining the decisions of machine learning algorithms

Being both statistician and machine learning practitioner, I have always been interested in combining the predictive power of (black box) ma...