Statistical modeling: two ways to see the world.

This is a machine-learning-vs-traditional-statistics kind of blog post, inspired by Leo Breiman's "Statistical Modeling: The Two Cultures". If you're thinking: "I've had enough of this machine learning vs. statistics discussion, BUT I would love to see beautiful beamer slides with an awesome font.", then jump to the bottom of the post for my slides on this subject plus the source code.

I prepared presentation slides about the paper for a university course. Leo Breiman essentially argued that there are two cultures of statistical modeling:
  • Data modeling culture: You assume you know the underlying data-generating process and model your data accordingly. For example, if you model your data with linear regression, you assume that the outcome y is normally distributed given the covariates x. This is the typical procedure in traditional statistics. 
  • Algorithmic modeling culture: You treat the true data-generating process as unknown and try to find a model that is good at predicting the outcome y given the covariates x. In other words, you look for a function of the covariates x that minimizes the loss between the prediction and the true outcome y. This culture is associated with machine learning. 
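The contrast between the two cultures can be sketched in a few lines of code. This is a minimal toy example, assuming scikit-learn and NumPy are available: we simulate data from a nonlinear process, fit a linear regression (the data modeling culture's assumed model, here misspecified) and a random forest (an algorithmic model judged only by held-out accuracy), and compare their test errors. The data-generating function and all parameter choices are mine, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate a nonlinear data-generating process, so the linear model is misspecified.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: assume y is linear-Gaussian in x and fit that model.
lm = LinearRegression().fit(X_train, y_train)

# Algorithmic modeling culture: treat the process as unknown, fit a flexible
# predictor, and evaluate it purely by predictive accuracy on held-out data.
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

mse_lm = mean_squared_error(y_test, lm.predict(X_test))
mse_rf = mean_squared_error(y_test, rf.predict(X_test))
print(f"linear model test MSE: {mse_lm:.3f}")
print(f"random forest test MSE: {mse_rf:.3f}")
```

On data like this the random forest should win on test MSE, while the linear model keeps its interpretable coefficients; with a truly linear process the ranking can flip.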
Breiman was a proponent of the algorithmic modeling culture and argued that in many cases it is superior to the other. I recommend reading the paper when you have a quiet hour. I consider it compulsory reading for everyone who seriously analyzes data. 

The paper is 13 years old, and I guess most people in the statistics community still live the data modeling culture (at least at the statistics department where I study). But the world is not black and white: there are also professors and research associates working in the field of the algorithmic modeling culture: extension of boosting algorithms to a more flexible model class, introduction of permutation tests to random forests, study of the effect of different cross-validation schemes, and so on. Not only can traditional statistics be enhanced by machine learning, but there is also a need to bring more statistics into machine learning. I think there is a lot of room for mutual benefit.

My personal opinion on this subject is very pragmatic. I now use predictive accuracy more often as a measure of model goodness and have added random forests, (tree) boosting and others to my toolkit. But sometimes it is okay to pay some MSE, AUC or Gini to replace a complex random forest with a GLM that has a nice interpretation. Even if you assumed the wrong data-generating process, your conclusions from the fitted model are in most cases not fatally wrong. 

So here are the slides. They contain notes, so they should be easy to follow:
If you want to know how to reproduce them, please visit my GitHub account.

