Random Forest Almighty

Random Forests are awesome. They do not overfit, they are easy to tune, they tell you about important variables, they can be used for classification and regression, they are implemented in many programming languages, and they are faster than their competitors (neural nets, boosting, support vector machines, ...).

Let us take a moment to appreciate them:

The Random Forest is my shepherd; I shall not want.
    He makes me watch the mean squared error decrease rapidly.
He leads me beside classification problems.
    He restores my soul.
He leads me in paths of the power of ensembles
    for his name's sake. 

Even though I walk through the valley of the curse of dimensionality,
    I will fear no overfitting,
for you are with me;
    your bootstrap and your randomness,
    they comfort me.

You prepare a prediction before me
    in the presence of complex interactions;
you anoint me data scientist;
    my wallet overflows.
Surely goodness of fit and money shall follow me
    all the days of my life,
and I shall use Random Forests 

Joking aside: Random Forests have proved to give very stable and good predictions in many settings (as in the See Click Predict Fix Kaggle challenge, in which the winner used the Random Forests implementation in R), but they are not the solution to all problems. In a very simple setting, where the true relationship between the response and the covariates is linear, a linear model will perform better than a random forest. You can find a good explanation of why this happens on Cross Validated.
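
To make the linear case concrete, here is a minimal sketch in Python with scikit-learn. Everything specific in it is an assumption made up for illustration: the simulated data, the coefficients, the noise level and the function name `compare_on_linear_data`. It simulates a response that is truly linear in three covariates and compares the test-set mean squared error of a linear model with that of a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def compare_on_linear_data(n_samples=500, noise_sd=1.0, seed=0):
    """Fit both models on simulated linear data and return their test MSE."""
    rng = np.random.default_rng(seed)

    # Hypothetical data-generating process: y is linear in X plus Gaussian noise.
    X = rng.uniform(-3, 3, size=(n_samples, 3))
    coefs = np.array([2.0, -1.0, 0.5])
    y = X @ coefs + rng.normal(scale=noise_sd, size=n_samples)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    linear = LinearRegression().fit(X_train, y_train)
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X_train, y_train)

    return {
        "linear_mse": mean_squared_error(y_test, linear.predict(X_test)),
        "forest_mse": mean_squared_error(y_test, forest.predict(X_test)),
    }


if __name__ == "__main__":
    for name, mse in compare_on_linear_data().items():
        print(f"{name}: {mse:.3f}")
```

In a setup like this the linear model should land close to the noise variance, while the forest has to approximate a plane with piecewise-constant predictions and pays for it with a higher error. The exact numbers depend on the seed and the simulated data.
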
One thing I learned the hard way is that you should not get too attached to one algorithm for prediction. This probably applies to other areas as well. When I participated in the Observing Dark Worlds challenge, I fell into this trap by sticking to Random Forests. My model performed poorly, but instead of considering another algorithm I kept looking for better features. The winner of this competition used a Bayesian approach.

You can find implementations in R (randomForest package) or in Python (scikit-learn library).
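
If you want to try it out in Python, the following is a small sketch using scikit-learn's `RandomForestClassifier`. The synthetic dataset and the helper name `fit_forest` are assumptions for illustration only. It fits a forest, reads off the out-of-bag accuracy (an error estimate that comes for free from the bootstrap) and returns the variable importances.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_forest(seed=0):
    """Fit a random forest on synthetic data; return OOB accuracy and importances."""
    # Synthetic data: with shuffle=False the informative features come first,
    # so columns 0 and 1 carry the signal and the remaining four are noise.
    X, y = make_classification(
        n_samples=500,
        n_features=6,
        n_informative=2,
        n_redundant=0,
        shuffle=False,
        random_state=seed,
    )

    forest = RandomForestClassifier(
        n_estimators=300,
        oob_score=True,  # score each sample using only trees that did not see it
        random_state=seed,
    )
    forest.fit(X, y)

    return forest.oob_score_, forest.feature_importances_


if __name__ == "__main__":
    oob_accuracy, importances = fit_forest()
    print(f"OOB accuracy: {oob_accuracy:.3f}")
    for index, importance in enumerate(importances):
        print(f"feature {index}: {importance:.3f}")
```

The importances are normalized to sum to one, so they are best read as a relative ranking of the variables, not as absolute effect sizes.
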


  1. Excellent poem! Absolutely love it. However... (!) I think there's a good thread or two on the Heritage Health Prize forum where lots of folks, including myself, ran into over-fitting problems with Random Forests in R. What I think is more precise is that running more iterations of Random Forest will not cause a bad fit, but that does not mean the output of a random forest cannot be overly tied to predictions that are too specific (fitted) to the input cases (and then fail the general or predicted case), even with reasonable cross-validation. It may overfit differently than other algorithms, but it certainly does happen. Good talk here:

    1. Thanks for the comment. Yes, Random Forests can overfit, but I think they are more robust and easier to handle than other algorithms. The discussion on the Wikipedia Random Forest page was interesting.

  2. Thank you for this! Leo Breiman was my father- and I sometimes dig around for reflections of his work to best understand what made him tick- just having my own child who never met him, I collect the things that I can find on him wherever they may be- Best, Rebecca Breiman

    1. You never know who reads your blog =)
      I am glad you liked the post!

