Skip to main content

Music Data Hackathon 2012 - Beginner's view

When I first heard of the existence of Hackathons (receive a data set, predict the response as good as possible, win money. All within 24 hours), I had two thoughts:
1. Wow, that sounds greats. Like a huge game for intelligent people.
2. My skills are not good enough to participate.

That was one or two years ago. Now I have finished my bachelor degree in statistics and also gained a little experience with some machine learning techniques (boosting and neural networks). So I felt confident enough to try it out. Then I read about the EMI Music Data Science Hackaton and decided to take part. The cool thing about it was, that it was hosted by kaggle, so you did not have to be in London to participate.

The week before

The next step was to find a team. As a statistics student it is easy to find other statisticians. So I started to ask people around me if they were interested. To my surprise the euphoria to be part of such a competition was huge. My first plan to spend the 24 hours of data hacking in my kitchen (which can handle up to 5 people) was soon discarded, as the team size grew to 11 people.

So we had to find a better place. The answer was the computer room of the statistics department. So I asked the the supervisor of my bachelor thesis if it would be possible to use the computer room in the statistics department for the weekend (even stay there over night from saturday to sunday). I was amazed how uncomplicated it was to get the permission.


24 hours before the first submissions of the results could be made, the data sets were made available. Our team met to discuss how we would organize everything, have a look at the data and think about  possible models The response value was a users ranking of a particular song. The data (shared as csv - files) was stored in three tables. One with demographic information about the users, one with the user, artist, track and rating information and one with informations about how much some of the user liked the artists. In a first step we merged the data and made some descriptive analysis. We also added some new features to the variables, which showed up to be very useful later on.  All of our team members used R, so it was great that we could share code.


We met very early in the morning to start modeling the response. We got very excited when we could upload our first submissions. But we got disappointed soon. At first we got very bad results, but that was due to the wrong order of cases in the submission file. I tried a boosting model but got bad results as well. One of our team members tried a very simple linear model with manual variable selection. And it was surprisingly good. Compared to the other teams we still had a poorly high RMSE, but at least it performed better than the benchmarks.  This was our best model for quite a while, which was very disappointing. Eleven statisticians could not find a better model than a very weak linear model? Why even study then? But then we had a success, when we combined a linear model with boosting results and we went some positions up in the leaderboard. We also tried other methods like random forests, mixture of regression models, gam's and simple linear models.


Most of us did not leave the university but kept programming over night.
No sleep means, more time for data hacking! But of course there were moments when everyone was tired and we worked really ineffectivly. 1:00 pm London time came closer and we were very busy getting better results. In the end we climbed up the leaderbord by about 40 positions. Our final result was 37th position of 138 teams in total. It was not enough to be among the to 25% but among the best 27%. Our best submission was the mean over the prediction mixture of a simple generalized additive model using the package mgcv and a boosted gam using the package mboost.  Our submission over time can be seen here:

What I have learned
  • It is useful to look at the data and build new features
  • Ensemble learning can be pretty useful. We did not even use sophisticated approaches but simply took the mean of predictions or chose weights manually, and it resulted in smaller RMSEs
  • Even simple methods like linear models can be useful
  • It is necessary to have some form of cross validation to test models without having to waste a submission
All in all it was a lot of fun, we had a really nerdy hacker atmosphere, because we were programming 30 hours at a time, eating only chips and pizza and drinking energy drinks. At the end we had a satisfying result and everyone is now a little bit smarter. 


  1. Congrats guys. I wish I am still a student with enough time for intensive Kaggling- I was one, 10 years ago in your department but in this time there was no Kaggle. Now I have 2 childs and a 40 hour job, many ideas for the kaggle competitions but no time to get better with R and to program them. But some time I will reduce to 30 hours (plan is next year) and maybe we will meet again at Kaggle.
    Cheers Vielauge

  2. Thank you. Those competitions are a lot of fun, even without winning them. I hope your plans will work and you will have more time to join the fun.


Post a Comment

Popular posts from this blog

My first deep learning steps with Google and Udacity

I did my first steps in deep learning by taking the deep learning course at Udacity.

Deep learning is a hot topic. Deep neural networks can classify images, describe scenes, translate text and do so much more. It's great that Google and Udacity offer this course which helped me getting started with learning about deep learning.

How does the course work? The course consists of dozens 1-2 minute videos and assignments accompanying the videos.

Well, actually it's the other way round: The assignments are the heart of the course and the videos just give you the basic understanding you need to get started building networks. There are no exams.

The course covers basic neural networks, softmax, stochastic gradient descent, backpropagation, ReLU units, hidden layers, regularization, dropout, convolutional networks, recurrent networks, LSTM cells and more. Building deep neural networks is a bit like playing Legos and the course shows you the building bricks and teaches you how to use th…

Statistical modeling: two ways to see the world.

This a machine-learning-vs-traditional-statistics kind of blog post inspired by Leo Breiman's "Statistical Modeling: The Two Cultures". If you're like: "I had enough of this machine learning vs. statistics discussion,  BUT I would love to see beautiful beamer-slides with an awesome font.", then jump to the bottom of the post and for my slides on this subject plus source code.

I prepared presentation slides about the paper for a university course. Leo Breiman basically argued, that there are two cultures of statistical modeling:
Data modeling culture: You assume to know the underlying data-generating process and model your data accordingly. For example if you choose to model your data with a linear regression model you assume that the outcome y is normally distributed given the covariates x. This is a typical procedure in traditional statistics. Algorithmic modeling culture:  You treat the true data-generating process as unkown and try to find a model that is…