Skip to main content

dplyr workshop

dplyr workshop dplyr is my favourite package for data manipulation.

At my workplace I organized a short dplyr workshop, which I want to share in this blog post.

If you are also interested in learning dplyr and drastically improve readability of your code, you can try out the workshop! The dplyr introduction is followed by some exercises with toy data.



The dplyr package

Package for data manipulation. It provides the ‘verbs’ that can be chained together.

Motivation for dplyr

  • Fast
  • Offers a consistent “grammar”
  • Easy to read code, close to language
  • It abstracts the data: Can be data.frame, sql

Functionality

  • Five basic verbs: filter, select, mutate, arrange, summarise
  • Grouping of data
  • Joins
  • Window functions
  • Chains

Verbs

Example: filter(visits, treatment=='a')
  • First argument is always the data.frame
  • Subsequent arguments describe what to do with the data
  • Columns can be used without $
  • Return value is the changed data.frame
  • filter() to keep the rows that match given criteria
  • select() for selecting columns
  • mutate() for changing or creating variables
  • arrange() reorder rows
  • summarise() for summarising a data.frame
  • These are the most important verbs, but there are a few more

Chaining with %>%

Example: visits %>% filter(treatment=='a') %>% select(time)
  • Verbs can be chained together with %>%
  • The outcome from the left hand side of %>% is used as first argument of the function on the right hand side
  • The verbs are powerful in combination with the group_by() function

Grouping with group_by()

Example: visits %>% group_by(patient_id) %>% mutate(mean_activity = mean(disease_activity))
  • Only useful in conjunction with other verbs
  • Returns a special grouped data.frame
  • select() not affected
  • arrange() orders first by the grouping variable
  • mutate() and filter() are done within the groups, useful in conjunction with window function like lag() or mean()
  • summarise() summarises for each group
  • ungroup() can be used to revert the group_by()

Data

Toy data about patient visits, containing the id of the patients, time of visits, the treatment they receive and a measure of how active the disease is. Just copy-paste the data into an R script and get started with the tasks!
library('dplyr')
visits <- data.frame(patient_id = c(2,2,2,1,1,1,3,4,4,4,4), 
                     time = c(4,2,1,1,2,3,1,2,3,5,4), 
                     treatment = c('a', 'a', 'a', 'b', 'b', 'b', 'c', 'b', 'b', 'b', 'b'), 
                     disease_activity = c(3, 2, 10, 5, 5, 5, 1, 5, 4, 3, 3))

visits 
##    patient_id time treatment disease_activity
## 1           2    4         a                3
## 2           2    2         a                2
## 3           2    1         a               10
## 4           1    1         b                5
## 5           1    2         b                5
## 6           1    3         b                5
## 7           3    1         c                1
## 8           4    2         b                5
## 9           4    3         b                4
## 10          4    5         b                3
## 11          4    4         b                3

Tasks

Solve the following tasks using the dplyr package. The desired output is displayed. For revealing the solution, click ‘Show Code’. But try it yourself first. ;-)

Filter (=keep rows of) patients who got treatment a or b

filter(visits, treatment %in% c('a', 'b'))
##    patient_id time treatment disease_activity
## 1           2    4         a                3
## 2           2    2         a                2
## 3           2    1         a               10
## 4           1    1         b                5
## 5           1    2         b                5
## 6           1    3         b                5
## 7           4    2         b                5
## 8           4    3         b                4
## 9           4    5         b                3
## 10          4    4         b                3

Sort visits by patient id and time

arrange(visits, patient_id, time)
##    patient_id time treatment disease_activity
## 1           1    1         b                5
## 2           1    2         b                5
## 3           1    3         b                5
## 4           2    1         a               10
## 5           2    2         a                2
## 6           2    4         a                3
## 7           3    1         c                1
## 8           4    2         b                5
## 9           4    3         b                4
## 10          4    4         b                3
## 11          4    5         b                3

Add column ‘disease activity higher or equal than 4’ (TRUE / FALSE)

mutate(visits, high_disease_activity = disease_activity >= 4)
##    patient_id time treatment disease_activity high_disease_activity
## 1           2    4         a                3                 FALSE
## 2           2    2         a                2                 FALSE
## 3           2    1         a               10                  TRUE
## 4           1    1         b                5                  TRUE
## 5           1    2         b                5                  TRUE
## 6           1    3         b                5                  TRUE
## 7           3    1         c                1                 FALSE
## 8           4    2         b                5                  TRUE
## 9           4    3         b                4                  TRUE
## 10          4    5         b                3                 FALSE
## 11          4    4         b                3                 FALSE

Remove column disease_activity

select(visits, -disease_activity)
##    patient_id time treatment
## 1           2    4         a
## 2           2    2         a
## 3           2    1         a
## 4           1    1         b
## 5           1    2         b
## 6           1    3         b
## 7           3    1         c
## 8           4    2         b
## 9           4    3         b
## 10          4    5         b
## 11          4    4         b

Rename column time to year_since_inclusion

rename(visits, year_since_inclusion=time)
##    patient_id year_since_inclusion treatment disease_activity
## 1           2                    4         a                3
## 2           2                    2         a                2
## 3           2                    1         a               10
## 4           1                    1         b                5
## 5           1                    2         b                5
## 6           1                    3         b                5
## 7           3                    1         c                1
## 8           4                    2         b                5
## 9           4                    3         b                4
## 10          4                    5         b                3
## 11          4                    4         b                3

How many visits exist per time point?

visits %>%
  group_by(time) %>%
  summarise(n = n())
## Source: local data frame [5 x 2]
## 
##    time     n
##   (dbl) (int)
## 1     1     3
## 2     2     3
## 3     3     2
## 4     4     2
## 5     5     1

What is the median, minimum and maximum disease activity at the second visit? Time == 2 is not equivalent to the second visit for all patients.

visits %>%
  group_by(patient_id) %>%
  arrange(time) %>%
  filter(row_number() == 2) %>%
  ungroup() %>%
  summarise(median_activity = median(disease_activity), 
            max_activity = max(disease_activity), 
            min_activity = min(disease_activity))
## Source: local data frame [1 x 3]
## 
##   median_activity max_activity min_activity
##             (dbl)        (dbl)        (dbl)
## 1               4            5            2

Sort visits descending by disease activity

arrange(visits, desc(disease_activity))
##    patient_id time treatment disease_activity
## 1           2    1         a               10
## 2           1    1         b                5
## 3           1    2         b                5
## 4           1    3         b                5
## 5           4    2         b                5
## 6           4    3         b                4
## 7           2    4         a                3
## 8           4    5         b                3
## 9           4    4         b                3
## 10          2    2         a                2
## 11          3    1         c                1

Filter patients with treatment c and select the disease activity column

visits %>% 
  filter(treatment == 'c') %>%
  select(disease_activity)
##   disease_activity
## 1                1

Create a new variable per patient: nth visit

visits %>%
  group_by(patient_id) %>%
  mutate(nth_visit = rank(time))
## Source: local data frame [11 x 5]
## Groups: patient_id [4]
## 
##    patient_id  time treatment disease_activity nth_visit
##         (dbl) (dbl)    (fctr)            (dbl)     (dbl)
## 1           2     4         a                3         3
## 2           2     2         a                2         2
## 3           2     1         a               10         1
## 4           1     1         b                5         1
## 5           1     2         b                5         2
## 6           1     3         b                5         3
## 7           3     1         c                1         1
## 8           4     2         b                5         1
## 9           4     3         b                4         2
## 10          4     5         b                3         4
## 11          4     4         b                3         3

Filter all patients with only one visit. Hint: Use n()

visits %>% 
  group_by(patient_id) %>% 
  filter(n() == 1)
## Source: local data frame [1 x 4]
## Groups: patient_id [1]
## 
##   patient_id  time treatment disease_activity
##        (dbl) (dbl)    (fctr)            (dbl)
## 1          3     1         c                1

Calculate the change of disease activity per visit for each patient. Also the change over two visits. Hint: Use lag()

visits %>% 
  group_by(patient_id) %>% 
  arrange(time) %>%
  mutate(change = disease_activity - lag(disease_activity), 
         change_2 = disease_activity - lag(disease_activity, n=2)) 
## Source: local data frame [11 x 6]
## Groups: patient_id [4]
## 
##    patient_id  time treatment disease_activity change change_2
##         (dbl) (dbl)    (fctr)            (dbl)  (dbl)    (dbl)
## 1           1     1         b                5     NA       NA
## 2           1     2         b                5      0       NA
## 3           1     3         b                5      0        0
## 4           2     1         a               10     NA       NA
## 5           2     2         a                2     -8       NA
## 6           2     4         a                3      1       -7
## 7           3     1         c                1     NA       NA
## 8           4     2         b                5     NA       NA
## 9           4     3         b                4     -1       NA
## 10          4     4         b                3     -1       -2
## 11          4     5         b                3      0       -1

Comments

Popular posts from this blog

My first deep learning steps with Google and Udacity

I did my first steps in deep learning by taking the deep learning course at Udacity.

Deep learning is a hot topic. Deep neural networks can classify images, describe scenes, translate text and do so much more. It's great that Google and Udacity offer this course which helped me getting started with learning about deep learning.



How does the course work? The course consists of dozens 1-2 minute videos and assignments accompanying the videos.

Well, actually it's the other way round: The assignments are the heart of the course and the videos just give you the basic understanding you need to get started building networks. There are no exams.

The course covers basic neural networks, softmax, stochastic gradient descent, backpropagation, ReLU units, hidden layers, regularization, dropout, convolutional networks, recurrent networks, LSTM cells and more. Building deep neural networks is a bit like playing Legos and the course shows you the building bricks and teaches you how to use th…

Statistical modeling: two ways to see the world.

This a machine-learning-vs-traditional-statistics kind of blog post inspired by Leo Breiman's "Statistical Modeling: The Two Cultures". If you're like: "I had enough of this machine learning vs. statistics discussion,  BUT I would love to see beautiful beamer-slides with an awesome font.", then jump to the bottom of the post and for my slides on this subject plus source code.

I prepared presentation slides about the paper for a university course. Leo Breiman basically argued, that there are two cultures of statistical modeling:
Data modeling culture: You assume to know the underlying data-generating process and model your data accordingly. For example if you choose to model your data with a linear regression model you assume that the outcome y is normally distributed given the covariates x. This is a typical procedure in traditional statistics. Algorithmic modeling culture:  You treat the true data-generating process as unkown and try to find a model that is…