Monday, July 23, 2012

Music Data Hackathon 2012 - Beginner's view

When I first heard of the existence of hackathons (receive a data set, predict the response as well as possible, win money, all within 24 hours), I had two thoughts:
1. Wow, that sounds great. Like a huge game for intelligent people.
2. My skills are not good enough to participate.

That was one or two years ago. Now I have finished my bachelor's degree in statistics and also gained a little experience with some machine learning techniques (boosting and neural networks). So I felt confident enough to try it out. Then I read about the EMI Music Data Science Hackathon and decided to take part. The cool thing about it was that it was hosted by Kaggle, so you did not have to be in London to participate.

The week before

The next step was to find a team. As a statistics student it is easy to find other statisticians, so I started asking people around me if they were interested. To my surprise, the enthusiasm for taking part in such a competition was huge. My first plan, to spend the 24 hours of data hacking in my kitchen (which can handle up to 5 people), was soon discarded as the team grew to 11 people.

So we had to find a better place. The answer was the computer room of the statistics department. I asked the supervisor of my bachelor thesis whether we could use it for the weekend (and even stay there overnight from Saturday to Sunday). I was amazed how uncomplicated it was to get permission.


The data sets were made available 24 hours before the first submissions could be made. Our team met to discuss how we would organize everything, have a look at the data and think about possible models. The response value was a user's rating of a particular song. The data (shared as CSV files) was stored in three tables: one with demographic information about the users, one with the user, artist, track and rating information, and one with information about how much some of the users liked the artists. In a first step we merged the data and did some descriptive analysis. We also derived some new features from the variables, which turned out to be very useful later on. All of our team members used R, so it was great that we could share code.
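A minimal sketch of such a merge (in Python here, not the R code from the weekend; the column names and values are made up for illustration, and the real tables had many more fields):

```python
# Toy stand-ins for the three tables. All column names are hypothetical.
users = [
    {"user": 1, "age": 25},
    {"user": 2, "age": 41},
]
ratings = [
    {"user": 1, "artist": 10, "track": 100, "rating": 60},
    {"user": 2, "artist": 10, "track": 101, "rating": 35},
    {"user": 2, "artist": 11, "track": 102, "rating": 80},
]
artist_likes = [
    {"user": 1, "artist": 10, "like_artist": 70},
    {"user": 2, "artist": 11, "like_artist": 90},
]


def merge_tables(ratings, users, artist_likes):
    """Left-join user and user/artist information onto the ratings table."""
    users_by_id = {u["user"]: u for u in users}
    likes_by_key = {(a["user"], a["artist"]): a for a in artist_likes}
    merged = []
    for r in ratings:
        row = dict(r)
        # A missing match leaves the extra column as None instead of dropping the row.
        user = users_by_id.get(r["user"], {})
        like = likes_by_key.get((r["user"], r["artist"]), {})
        row["age"] = user.get("age")
        row["like_artist"] = like.get("like_artist")
        merged.append(row)
    return merged


merged = merge_tables(ratings, users, artist_likes)
```

Keeping every rating row (a left join) matters here, because not every user has an opinion on every artist and each rating still needs a prediction.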


We met very early in the morning to start modeling the response. We got very excited when we could upload our first submissions, but we were soon disappointed. At first we got very bad results, but that was due to the wrong order of cases in the submission file. I tried a boosting model but got bad results as well. One of our team members tried a very simple linear model with manual variable selection, and it was surprisingly good. Compared to the other teams we still had a rather high RMSE, but at least it performed better than the benchmarks. This was our best model for quite a while, which was very disappointing. Eleven statisticians could not find a better model than a very weak linear model? Why even study then? But then we had a success: when we combined a linear model with boosting results, we moved up some positions on the leaderboard. We also tried other methods like random forests, mixtures of regression models, GAMs and simple linear models.
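Combining two models by averaging their predictions can be sketched like this. It is an illustration in Python with invented numbers, not our actual models or data; `rmse` and `blend` are helper names chosen for this example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted mean of two prediction vectors (weight_a = 0.5 is the plain mean)."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Invented toy values: the two models err in opposite directions.
y_true = [50.0, 20.0, 80.0, 40.0]
pred_linear = [60.0, 10.0, 70.0, 50.0]
pred_boost = [44.0, 26.0, 86.0, 34.0]

blended = blend(pred_linear, pred_boost)
print(rmse(y_true, pred_linear), rmse(y_true, pred_boost), rmse(y_true, blended))
```

The toy numbers are chosen so that the errors partly cancel. On real data the gain from blending is usually much smaller and depends on how different the two models' errors are.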


Most of us did not leave the university but kept programming overnight.
No sleep means more time for data hacking! But of course there were moments when everyone was tired and we worked really ineffectively. 1:00 pm London time came closer and we were very busy getting better results. In the end we climbed up the leaderboard by about 40 positions. Our final result was 37th place out of 138 teams. That was not enough to be among the top 25%, but we were among the best 27%. Our best submission was the mean of the predictions of a simple generalized additive model fitted with the package mgcv and a boosted GAM fitted with the package mboost. Our submissions over time can be seen here:

What I have learned
  • It is useful to look at the data and build new features
  • Ensemble learning can be pretty useful. We did not even use sophisticated approaches but simply took the mean of predictions or chose weights manually, and it resulted in smaller RMSEs
  • Even simple methods like linear models can be useful
  • It is necessary to have some form of cross validation to test models without having to waste a submission
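The cross-validation idea from the last point can be sketched as follows. This is a minimal Python illustration with a one-variable least-squares fit and toy data, not the setup we used during the hackathon; all function names are chosen for this example.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_simple_linear(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_rmse(xs, ys, fit, k=5, seed=0):
    """Estimate the out-of-sample RMSE by k-fold cross-validation."""
    squared_errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training part only, then score on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in fold)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data: a line plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]
print(cv_rmse(xs, ys, fit_simple_linear, k=5))
```

Comparing models on such a local estimate means a leaderboard submission is only spent on candidates that already look promising.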
All in all it was a lot of fun. We had a really nerdy hacker atmosphere, because we were programming for 30 hours straight, eating only chips and pizza and drinking energy drinks. In the end we had a satisfying result and everyone is now a little bit smarter.

Sunday, July 15, 2012

Getting started with neural networks - the single neuron

Neural networks can model the relationship between input variables and output variables. A neural network is built of artificial neurons which are connected. To start, it is best to look at the architecture of a single neuron.

Artificial neurons are motivated by the architecture and functionality of the neuron cells that brains are made of. The neurons in the brain can receive multiple input signals, process them and fire a signal which in turn can be input to other neurons. The output is binary: depending on the input, the signal is either fired (1) or not fired (0).

The artificial neuron has some inputs which we call \(x_1, x_2, \ldots, x_p\). There can be an additional input \(x_0\), which is always set to \(1\) and is often referred to as bias. The inputs can be weighted with weights \(w_1, w_2, \ldots, w_p\) and \(w_0\) for the bias. With the inputs \(x_{i1}, \ldots, x_{ip}\) of observation \(i\) and the weights we can calculate the activation of the neuron \[ a_i = \sum_{k = 1}^p w_k x_{ik} + w_0. \]
The output of the neuron is a function of its activation. Here we are free to choose whatever function we want to use. If our output shall be binary or in the interval \([0, 1]\), a good choice is the logistic function.

So the calculated output for the neuron and the observation \(i\) is \[ o_i = \frac{1}{1 + \exp(-a_i)} \]
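The activation and the logistic output of a single neuron can be sketched in a few lines of Python. The function names and the example weights are chosen only for this illustration.

```python
import math


def activation(x, w, w0):
    """Weighted sum of the inputs plus the bias weight: a = sum_k w_k * x_k + w_0."""
    return sum(wk * xk for wk, xk in zip(w, x)) + w0


def logistic(a):
    """Logistic function, mapping any real activation into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def neuron_output(x, w, w0):
    """Output of a single neuron with logistic output function."""
    return logistic(activation(x, w, w0))


# One observation with p = 2 inputs and arbitrary example weights.
x_i = [1.0, 2.0]
w = [0.5, -0.25]
w0 = 0.0
o_i = neuron_output(x_i, w, w0)
print(o_i)
```

Note that `math.exp` overflows for very large negative activations, so a production implementation would need a numerically safer form of the logistic function.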

Pretty straightforward, isn't it? If you know about logistic regression this might be already familiar to you.

Now you know about the basic structure. The next step is to "learn" the right weights for the input. For that you need a so-called loss function, which tells you how wrong you are with your predicted output. The loss function is a function of your calculated output \(o_i\) (which depends on your data \(x_{i1}, \ldots, x_{ip}\) and the weights) and of course of the true output \(y_i\). Your training data set is given, so the only variable part of the loss function is your weights \(w_k\). Low values of the loss function tell you that you make an accurate prediction with your neuron.

One simple loss function would be the absolute difference \(|y_i - o_i|\). A more sophisticated function is \(-\left(y_i \ln(o_i) + (1-y_i) \ln(1-o_i)\right)\), which is the negative log-likelihood of your data if you see \(o_i\) as the probability that the output is 1. So minimizing the negative log-likelihood is the same as maximizing the likelihood of your parameters given your training data set.
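The negative log-likelihood for a binary output, \(-(y_i \ln(o_i) + (1 - y_i) \ln(1 - o_i))\), can be sketched in Python like this. The clipping constant `eps` is an assumption added for this example to avoid taking the logarithm of zero.

```python
import math


def negative_log_likelihood(y, o, eps=1e-12):
    """Loss for one observation: true output y in {0, 1}, predicted probability o."""
    # Clip o away from 0 and 1 so that log() is always defined.
    o = min(max(o, eps), 1.0 - eps)
    return -(y * math.log(o) + (1 - y) * math.log(1.0 - o))


def total_loss(ys, os):
    """Sum of the per-observation losses over a training data set."""
    return sum(negative_log_likelihood(y, o) for y, o in zip(ys, os))


# A confident correct prediction gives a small loss, a confident wrong one a large loss.
print(negative_log_likelihood(1, 0.9))
print(negative_log_likelihood(1, 0.1))
```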

The first step in learning about neural networks is made! The next thing to look at is the gradient descent algorithm. This algorithm is a way to find weights that minimize the loss function.

Have fun!

What is it all about?

This blog will be about machine learning and all the stuff which comes along with it. You will read about statistics, informatics, mathematics and software.

I am not an expert in machine learning and I am not even close. Why would I write a blog about machine learning then?

One of the most useful things I have come across on the web is blogs and homepages where people write about things they've learned. They let other people be part of their learning progress, which benefits both the blogger and the reader.

Next question: Why machine learning?

I want to search for cats on the web, that's why!
