Comments on Machine Master: "Predictive analytics: Some ways to waste time"

kur — 2012-08-23 13:33
Yes, that's true. On the other hand, the imputation becomes more stable, because you have more observations of the covariates.

Anonymous — 2012-08-23 13:25
I remember reading a little debate about point 2: when you do feature engineering or imputation by combining train and test, your features will contain some information about the test set (overfitting the test set).

This is often regarded as a necessary evil in order to get a good score in a competition, but it will probably not generalize well when you implement the model in the real world.

Anonymous — 2012-08-23 11:25
Thanks for taking part in CrowdANALYTIX competitions, and for the interesting insights you made. We'll try to ensure the ID variable is avoided in future datasets. Please let me know if you have queries about CA.

Anonymous — 2012-08-18 16:56
Also, there's a more general version of point 1: be wary of any variable whose distribution in the test set is significantly different from that in the training set.

kur — 2012-08-18 10:06
That's a very good solution. Thanks for that.

Anonymous — 2012-08-18 05:12
Some advice on comment one: you can also use row.names(dat) <- dat$id; dat$id <- NULL. This way you keep the IDs but they don't affect your model.
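A minimal R sketch of the row-names trick above, using a hypothetical data frame `dat` with an `id` column (the column name and data are illustrative, not from the original post):

```r
# Hypothetical data frame: an ID column plus two modeling variables
dat <- data.frame(id = c("a1", "a2", "a3"),
                  x  = c(1.2, 3.4, 5.6),
                  y  = c(0, 1, 0))

# Move the IDs into the row names, then drop the column
row.names(dat) <- dat$id
dat$id <- NULL

# The IDs survive as row names but are no longer a column,
# so a formula like y ~ . cannot pick them up as a predictor
str(dat)
```

Because the IDs now live in `row.names(dat)` rather than in a column, they stay attached to each observation for bookkeeping while being invisible to model-fitting functions.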