
From OpenOffice noob to control freak: A love story with R, LaTeX and knitr

Recently I had to write a seminar paper for a class and I decided to overdo it.
But let's start at the very beginning. Here is the evolution of how I write documents, and how I got from this:

(screenshot of an early OpenOffice document)

to that:

(screenshot of the finished LaTeX/knitr paper)

School:
OpenOffice - I guess everyone has some youthful indiscretions.
I remember how much time I spent trying to position each figure correctly and trying to make every line in the table of contents start at the same position.


1. Semester:
I heard of this "LaTeX" thing, and there was a rumor that it might be useful throughout my whole time at university, at the latest for the bachelor thesis. So I decided to learn LaTeX from the very beginning and chose to write a formula sheet for descriptive statistics as a first project.
The result was very neat. So proud of myself.
Started to proselytize other students.

2.-5. Semester:
I continued to use LaTeX for almost every document. Mostly I made presentations with the beamer class, spending most of the creation time choosing the best themes and colors and crafting the perfect title slide.
I discovered Sweave, but did not want to use it and still copied R output into my documents and included figures manually.
Around that time I also switched from gedit to emacs.

6. Semester:
Bachelor thesis time. For the first time I needed to embed more R code into a document, and I wanted my R code to look good. So I searched and found: "Hm ... what's knitr? ... want to meet my bachelor thesis?"
At the same time I learned Subversion. It felt odd to use, but I also sensed a deeper power in revision control tools.

1. Master Semester: 
Revision Control: Level Up! Now working with git.
Facebook update: Now in a bigamous relationship with knitr and emacs.


But now back to my latest paper. I used several tools to create it. Here are the ingredients, plus why I used them:


LaTeX

In my opinion, there is no alternative to LaTeX for a scientific paper that includes at least one number or one figure. There have been many discussions about LaTeX vs. Word (or Word-alikes). I like the personal summary in this blog.


Tufte book documentclass

A very fancy document class for LaTeX that I tried for the first time. Most LaTeX adversaries I know complain that every slide show or paper looks the same, and I have to admit that's true. Of course you can build your own document classes and styles from scratch, but I really don't feel like it. Fortunately, there are already some other classes around, like the fancy Tufte package, which I like very much. Edward Tufte (a statistician) has done a lot in the fields of information design and data visualization.
The Tufte book document class is based on the style Tufte wrote his books in.
Its most striking feature is the wide margin, which can be used for notes, references and figures.
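To give an impression, a minimal tufte-book document might look like this (a sketch, assuming the tufte-latex package is installed; the chapter title is made up):

```latex
% Minimal tufte-book sketch; compile with pdflatex.
\documentclass{tufte-book}
\begin{document}
\chapter{Conditional Inference Trees}
Body text goes in the main column.\sidenote{Numbered notes,
references and small figures live in the wide margin.}
\marginnote{Unnumbered remarks go in with \texttt{\textbackslash marginnote}.}
\end{document}
```

The class handles the margin layout for you; you just decide what goes into the main column and what goes into the margin.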


R

Any analysis we do at university (I am studying statistics) is done with R. There are of course alternatives like Julia, Matlab, Python, ...
R might not be the fastest language, and subjectively it feels a little messy.
But I definitely love the huge ecosystem of packages, which makes R a pleasure to use. The plotting functionality is also amazing (I am kind of a ggplot2 fanboy).
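As a small illustration (not from the paper itself), a complete ggplot2 figure takes only a few lines:

```r
# A ggplot2 sketch: scatter plot with a smoothing line on the
# built-in mtcars data set, saved as a PDF for inclusion in LaTeX.
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

ggsave("mtcars.pdf", p, width = 5, height = 3)
```
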


knitr

Speaking of the ecosystem, knitr is one of the best examples of a useful add-on package. knitr makes integrating R code and output into documents simple and even fun. It is based on Sweave, but eliminates some of its problems and extends the functionality. I also like that you are not tied to LaTeX documents; you can, for example, also write markdown files and convert them to HTML.
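For example, a minimal .Rnw file might look like this (a sketch; the file name and chunk label are my own):

```latex
% minimal-example.Rnw -- run knitr::knit("minimal-example.Rnw") in R,
% then compile the resulting .tex file with pdflatex.
\documentclass{article}
\begin{document}
<<mtcars-fit, echo=TRUE>>=
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
@
The estimated slope is \Sexpr{round(coef(fit)[2], 2)} miles per
gallon per 1000 lbs.
\end{document}
```

The code chunk between `<< >>=` and `@` is executed when knitting, and `\Sexpr{}` lets you pull computed numbers straight into the prose, so results can never get out of sync with the analysis.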


emacs

Personally I use emacs a lot, though I think RStudio has evolved very nicely as well. Its knitr integration is very neat, better than in emacs. But I feel very comfortable in emacs, which is why I haven't switched to RStudio yet.


git

A revision control system, mostly used for coding, but this time I used it for my seminar paper. Revision control isn't strictly necessary for a paper, but it definitely has benefits. First of all, it is a good feeling to be able to revert to older revisions, for example in case you accidentally deleted something. One cool side effect is that you automatically begin to split bigger tasks into small steps, because every time you commit something, you are encouraged to write a short message about what changed.
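A typical session might look like this (a sketch; the file name and commit message are made up):

```shell
# Hypothetical session: putting a knitr paper under version control.
mkdir paper && cd paper
git init -q
echo '\documentclass{tufte-book}' > paper.Rnw
git add paper.Rnw
# Identity is passed inline here so the sketch runs without global config.
git -c user.name="Me" -c user.email="me@example.com" \
    commit -q -m "Add skeleton of the seminar paper"
# Each commit message documents one small, self-contained step.
git log --oneline
```
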


That's basically my status quo of tools for papers and presentations.
My repository for the paper is public; just visit my GitHub account. Here is the PDF.
By the way, the paper I wrote is about conditional inference trees, on which I also gave a presentation.







Comments

  1. Hey, thanks for sharing your thoughts and the paper over on GitHub. At the moment I am developing a LaTeX class called open science paper, which might interest you. It is specialized in creating scientific papers in a common paper format. It offers a lot of options to manipulate the output and a makefile to clean and archive the document folder and build it via knitr and pdflatex. It also offers nice documentation in the wiki on GitHub. You can find the project on GitHub under https://github.com/cpfaff/Open-Science-Paper

    Replies
    1. This looks very interesting. I will definitely try it out for my next paper. I also like the idea of an open science presentation and poster. When do you plan to work on that?

  2. That is a nice summary. Thanks for sharing :)

    Replies
    1. Glad you liked it. And thanks for knitr. =)

  3. Great post! I am also one of the converted, using almost the exact same set of tools. However, there is one that you're missing that you might enjoy: org-mode! (http://orgmode.org/) Since you're already into Emacs, you're already past the "hard part." It's basically a set of elisp functions that add all kinds of great functionality to a simple text format. This recent-ish paper sums up why it might be useful to check out as part of your workflow: http://www.jstatsoft.org/v46/i03/paper

    Writing in org-mode is a joy, and you can export to LaTeX, ASCII, HTML, or even (gasp) OpenOffice/Word for those times you are forced to.

    Replies
    1. I have already used org-mode and I liked it, but somehow I haven't used it lately. Thanks for reminding me.

  4. I really enjoyed this post, as well as your presentation on conditional inference trees. I was wondering if you know of any references that directly compare Random Forest with conditional inference trees and could give more information on the pros/cons of each? Nice work!

    Replies
    1. Thanks. I haven't studied cforests (Random Forests built from conditional inference trees) yet. The variable importance of a Random Forest with CART trees is biased towards variables with many possible split points. cforest uses conditional importance, which avoids this problem (read more here: http://epub.ub.uni-muenchen.de/9387/1/techreport.pdf).
      I think the implementation of cforest (party package) is slower than the Random Forest implementation (randomForest package).

