
From OpenOffice noob to control freak: A love story with R, LaTeX and knitr

Lately I had to write a seminar paper for a class, and I decided to overdo it.
But let's start at the very beginning. Here is how the way I write documents evolved, and how I got from this:


to that:



School: 
OpenOffice - I guess everyone has some youthful indiscretions.
I remember how much time I spent trying to position each figure correctly and trying to make every line in the table of contents start at the same position.


1. Semester: 
I heard of this "LaTeX" thing, and there was a rumor that it might be useful throughout my whole time at university, at the latest for the bachelor thesis. So I decided to learn LaTeX from the very beginning and chose to write a formula sheet for descriptive statistics as a first project.
The result was very neat. So proud of myself.
Started to evangelize to other students.

2.- 5. Semester:
Then I continued to use LaTeX for almost every document. Mostly I made a lot of presentations with the beamer class. Most of the time went into choosing the best themes and colors and creating the perfect title slide.
I discovered Sweave, but did not want to use it and still copied R output into my documents and included figures manually.
Also switched from gedit to emacs about that time.

6. Semester:
Bachelor thesis time. The first time I needed to embed more R code into a document. I wanted my R code to look good. So I searched and found: "Hm ... what's knitr? ... want to meet my bachelor thesis?"
At the same time I learned Subversion. It felt odd to use, but I sensed there was a deeper power in revision control tools.

1. Master Semester: 
Revision Control: Level Up! Now working with git.
Facebook update: Now in a bigamous relationship with knitr and emacs.


But now back to my latest paper. I used different tools to create it. Here are the ingredients plus why I used them:


LaTeX

In my opinion, there is no alternative to LaTeX for a scientific paper that includes at least one number or one figure. There have been many discussions about LaTeX vs. Word (or Word-alikes). I like the personal summary in this blog.


Tufte book documentclass

A very fancy documentclass for LaTeX that I tried for the first time. Most LaTeX adversaries I know complain that every slide show or paper looks the same, and I have to admit that's true. Of course you can build your own documentclasses and styles from scratch, but I really don't feel like it. Fortunately there are already some other classes around, like the fancy Tufte package, which I like very much. Edward Tufte (statistician) has done a lot in the fields of information design and data visualization.
The Tufte book document class is based on the style Tufte wrote his books in.
Its most striking feature is the wide margin, which can be used for notes, references and pictures.
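
To give an idea of how the margin is used, here is a minimal sketch of a tufte-book document. It assumes the tufte-latex bundle is installed; the title, author and the grey box standing in for a plot are placeholders and are not taken from my paper.

```latex
% Minimal sketch of a tufte-book document (assumes the tufte-latex bundle is installed).
\documentclass{tufte-book}

\title{A Seminar Paper}   % placeholder title
\author{Jane Doe}         % placeholder author

\begin{document}
\maketitle

\chapter{Introduction}

% \sidenote puts a numbered note into the wide margin.
Body text stays in the main column.\sidenote{Numbered notes are set in the margin.}
% \marginnote does the same without a number.
\marginnote{Unnumbered remarks go into the margin as well.}

\begin{marginfigure}
  % Placeholder box standing in for a real plot.
  \rule{\linewidth}{2cm}
  \caption{Small figures can sit in the margin, too.}
\end{marginfigure}

\end{document}
```

Compiling this with pdflatex should give a title page followed by one chapter whose notes and small figure sit in the margin.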


R

Every analysis we do at university (I am studying statistics) is done with R. There are of course alternatives like Julia, Matlab, Python, ...
R might not be the fastest language, and in my subjective opinion it is a little messy.
But I definitely love the huge ecosystem of packages, which makes R a pleasure to use. The plotting functionality is amazing, too (I am kind of a ggplot2 fanboy).


knitr

Speaking of the ecosystem, knitr is one of the best examples of a useful add-on package. knitr makes integrating R code and output into documents simple and even fun. It builds on the ideas of Sweave, but eliminates some of its problems and extends the functionality. I also like the fact that you are not tied to LaTeX documents: you can, for example, also write Markdown files and convert them to HTML.
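
As a rough illustration of what this looks like in practice, here is a small sketch of a knitr source file, that is, a LaTeX document with embedded R chunks. The file name paper.Rnw and the chunk labels are made up for this example, and it uses the cars dataset that ships with R rather than anything from my paper.

```latex
% paper.Rnw: a hypothetical knitr source file (LaTeX with embedded R chunks)
\documentclass{article}
\begin{document}

<<speed-summary, echo=TRUE>>=
# knitr runs the R code in a chunk and typesets both the code and its output.
summary(cars$speed)
@

<<speed-plot, echo=FALSE, fig.height=3>>=
# echo=FALSE hides the code, so only the figure appears in the document.
plot(cars$speed, cars$dist)
@

% \Sexpr evaluates an R expression inline and inserts the result into the text.
The mean speed is \Sexpr{round(mean(cars$speed), 1)}.

\end{document}
```

Running knitr::knit("paper.Rnw") in R should produce paper.tex, which can then be compiled with pdflatex like any other LaTeX file.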


emacs

Personally I use emacs a lot, though I think RStudio has evolved very nicely as well. Its integration of knitr is very neat, better than in emacs. But I feel very comfortable in emacs, which is why I haven't switched to RStudio yet.


git

A revision control system. It is mostly used for coding, but this time I used it for my seminar paper. It is not necessary to use revision control for a paper, but it definitely has some benefits. First of all, it is a good feeling to be able to revert to older revisions, for example in case you accidentally deleted something. One cool side effect is that you automatically begin to split bigger tasks into small steps, because every time you commit something, you are encouraged to write a short text about what changed.


That's basically the status quo of the tools I use for papers and presentations.
My repository for the paper is public, just visit my GitHub account. Here is the PDF.
By the way, the paper I wrote is about conditional inference trees, a topic on which I also gave a presentation.







Comments

  1. Hey, thanks for sharing your thoughts and the paper over on GitHub. At the moment I am developing a LaTeX class called Open Science Paper, which might interest you. It specializes in creating scientific papers in a common paper format. It offers a lot of options to manipulate the output, and a makefile to clean and archive the document folder and build it via knitr and pdflatex. It also has nice documentation in the wiki on GitHub. You can find the project on GitHub at https://github.com/cpfaff/Open-Science-Paper

    1. This looks very interesting. I will definitely try it out for my next paper. I also like the idea of open science presentation and poster. When do you plan to work on this?

  2. That is a nice summary. Thanks for sharing :)

    1. Glad you liked it. And thanks for knitr. =)

  3. Great post! I am also one of the converted, using almost the exact same set of tools. However, there is one that you're missing that you might enjoy: org-mode! (http://orgmode.org/) Since you're already into Emacs, you're already past the "hard part." It's basically a set of elisp functions that add all kinds of great functionality to a simple text format. This recent-ish paper sums up why it might be useful to check out as part of your workflow: http://www.jstatsoft.org/v46/i03/paper

    Writing in org-mode is a joy, and you can export to LaTeX, ASCII, HTML, or even (gasp) OpenOffice/Word for those times you are forced to.

    1. I have already used org-mode and I liked it. But somehow I haven't used it lately. Thanks for reminding me.

  4. I really enjoyed this post, as well as your presentation on conditional inference trees. I was wondering if you know of any references that directly compare Random Forest with conditional inference trees and could give more information on the pros/cons of each? Nice work!

    1. Thanks. I haven't studied cforests (random forests built from conditional inference trees) yet. The variable importance of Random Forest with CART is biased towards variables with many possible split points. The cforest uses conditional importance, which avoids this problem (read more here: http://epub.ub.uni-muenchen.de/9387/1/techreport.pdf).
      I think the implementation of cforest (party package) is slower than the Random Forest implementation (randomForest package).

