At my workplace I organized a short dplyr workshop, which I want to share in this blog post.
If you are also interested in learning dplyr and drastically improve readability of your code, you can try out the workshop! The dplyr introduction is followed by some exercises with toy data.
dplyr workshop
The dplyr package
Package for data manipulation. It provides the ‘verbs’ that can be chained together.Motivation for dplyr
- Fast
- Offers a consistent “grammar”
- Easy to read code, close to language
- It abstracts the data: Can be data.frame, sql
Functionality
- Five basic verbs: filter, select, mutate, arrange, summarise
- Grouping of data
- Joins
- Window functions
- Chains
Verbs
Example:filter(visits, treatment=='a')
- First argument is always the data.frame
- Subsequent arguments describe what to do with the data
- Columns can be used without $
- Return value is the changed data.frame
filter()
to keep the rows that match given criteriaselect()
for selecting columnsmutate()
for changing or creating variablesarrange()
reorder rowssummarise()
for summarising a data.frame- These are the most important verbs, but there are a few more
Chaining with %>%
Example:visits %>% filter(treatment=='a') %>% select(time)
- Verbs can be chained together with
%>%
- The outcome from the left hand side of
%>%
is used as first argument of the function on the right hand side - The verbs are powerful in combination with the
group_by()
function
Grouping with group_by()
Example: visits %>% group_by(patient_id) %>% mutate(mean_activity = mean(disease_activity))
- Only useful in conjunction with other verbs
- Returns a special grouped data.frame
select()
not affectedarrange()
orders first by the grouping variablemutate()
andfilter()
are done within the groups, useful in conjunction with window function likelag()
ormean()
summarise()
summarises for each groupungroup()
can be used to revert thegroup_by()
Data
Toy data about patient visits, containing the id of the patients, time of visits, the treatment they receive and a measure of how active the disease is. Just copy-paste the data into an R script and get started with the tasks!library('dplyr')
visits <- data.frame(patient_id = c(2,2,2,1,1,1,3,4,4,4,4),
time = c(4,2,1,1,2,3,1,2,3,5,4),
treatment = c('a', 'a', 'a', 'b', 'b', 'b', 'c', 'b', 'b', 'b', 'b'),
disease_activity = c(3, 2, 10, 5, 5, 5, 1, 5, 4, 3, 3))
visits
## patient_id time treatment disease_activity
## 1 2 4 a 3
## 2 2 2 a 2
## 3 2 1 a 10
## 4 1 1 b 5
## 5 1 2 b 5
## 6 1 3 b 5
## 7 3 1 c 1
## 8 4 2 b 5
## 9 4 3 b 4
## 10 4 5 b 3
## 11 4 4 b 3
Tasks
Solve the following tasks using the dplyr package. The desired output is displayed. For revealing the solution, click ‘Show Code’. But try it yourself first. ;-)Filter (=keep rows of) patients who got treatment a or b
filter(visits, treatment %in% c('a', 'b'))
## patient_id time treatment disease_activity
## 1 2 4 a 3
## 2 2 2 a 2
## 3 2 1 a 10
## 4 1 1 b 5
## 5 1 2 b 5
## 6 1 3 b 5
## 7 4 2 b 5
## 8 4 3 b 4
## 9 4 5 b 3
## 10 4 4 b 3
Sort visits by patient id and time
arrange(visits, patient_id, time)
## patient_id time treatment disease_activity
## 1 1 1 b 5
## 2 1 2 b 5
## 3 1 3 b 5
## 4 2 1 a 10
## 5 2 2 a 2
## 6 2 4 a 3
## 7 3 1 c 1
## 8 4 2 b 5
## 9 4 3 b 4
## 10 4 4 b 3
## 11 4 5 b 3
Add column ‘disease activity higher or equal than 4’ (TRUE / FALSE)
mutate(visits, high_disease_activity = disease_activity >= 4)
## patient_id time treatment disease_activity high_disease_activity
## 1 2 4 a 3 FALSE
## 2 2 2 a 2 FALSE
## 3 2 1 a 10 TRUE
## 4 1 1 b 5 TRUE
## 5 1 2 b 5 TRUE
## 6 1 3 b 5 TRUE
## 7 3 1 c 1 FALSE
## 8 4 2 b 5 TRUE
## 9 4 3 b 4 TRUE
## 10 4 5 b 3 FALSE
## 11 4 4 b 3 FALSE
Remove column disease_activity
select(visits, -disease_activity)
## patient_id time treatment
## 1 2 4 a
## 2 2 2 a
## 3 2 1 a
## 4 1 1 b
## 5 1 2 b
## 6 1 3 b
## 7 3 1 c
## 8 4 2 b
## 9 4 3 b
## 10 4 5 b
## 11 4 4 b
Rename column time to year_since_inclusion
rename(visits, year_since_inclusion=time)
## patient_id year_since_inclusion treatment disease_activity
## 1 2 4 a 3
## 2 2 2 a 2
## 3 2 1 a 10
## 4 1 1 b 5
## 5 1 2 b 5
## 6 1 3 b 5
## 7 3 1 c 1
## 8 4 2 b 5
## 9 4 3 b 4
## 10 4 5 b 3
## 11 4 4 b 3
How many visits exist per time point?
visits %>%
group_by(time) %>%
summarise(n = n())
## Source: local data frame [5 x 2]
##
## time n
## (dbl) (int)
## 1 1 3
## 2 2 3
## 3 3 2
## 4 4 2
## 5 5 1
What is the median, minimum and maximum disease activity at the second visit? Time == 2 is not equivalent to the second visit for all patients.
visits %>%
group_by(patient_id) %>%
arrange(time) %>%
filter(row_number() == 2) %>%
ungroup() %>%
summarise(median_activity = median(disease_activity),
max_activity = max(disease_activity),
min_activity = min(disease_activity))
## Source: local data frame [1 x 3]
##
## median_activity max_activity min_activity
## (dbl) (dbl) (dbl)
## 1 4 5 2
Sort visits descending by disease activity
arrange(visits, desc(disease_activity))
## patient_id time treatment disease_activity
## 1 2 1 a 10
## 2 1 1 b 5
## 3 1 2 b 5
## 4 1 3 b 5
## 5 4 2 b 5
## 6 4 3 b 4
## 7 2 4 a 3
## 8 4 5 b 3
## 9 4 4 b 3
## 10 2 2 a 2
## 11 3 1 c 1
Filter patients with treatment c and select the disease activity column
visits %>%
filter(treatment == 'c') %>%
select(disease_activity)
## disease_activity
## 1 1
Create a new variable per patient: nth visit
visits %>%
group_by(patient_id) %>%
mutate(nth_visit = rank(time))
## Source: local data frame [11 x 5]
## Groups: patient_id [4]
##
## patient_id time treatment disease_activity nth_visit
## (dbl) (dbl) (fctr) (dbl) (dbl)
## 1 2 4 a 3 3
## 2 2 2 a 2 2
## 3 2 1 a 10 1
## 4 1 1 b 5 1
## 5 1 2 b 5 2
## 6 1 3 b 5 3
## 7 3 1 c 1 1
## 8 4 2 b 5 1
## 9 4 3 b 4 2
## 10 4 5 b 3 4
## 11 4 4 b 3 3
Filter all patients with only one visit. Hint: Use n()
visits %>%
group_by(patient_id) %>%
filter(n() == 1)
## Source: local data frame [1 x 4]
## Groups: patient_id [1]
##
## patient_id time treatment disease_activity
## (dbl) (dbl) (fctr) (dbl)
## 1 3 1 c 1
Calculate the change of disease activity per visit for each patient. Also the change over two visits. Hint: Use lag()
visits %>%
group_by(patient_id) %>%
arrange(time) %>%
mutate(change = disease_activity - lag(disease_activity),
change_2 = disease_activity - lag(disease_activity, n=2))
## Source: local data frame [11 x 6]
## Groups: patient_id [4]
##
## patient_id time treatment disease_activity change change_2
## (dbl) (dbl) (fctr) (dbl) (dbl) (dbl)
## 1 1 1 b 5 NA NA
## 2 1 2 b 5 0 NA
## 3 1 3 b 5 0 0
## 4 2 1 a 10 NA NA
## 5 2 2 a 2 -8 NA
## 6 2 4 a 3 1 -7
## 7 3 1 c 1 NA NA
## 8 4 2 b 5 NA NA
## 9 4 3 b 4 -1 NA
## 10 4 4 b 3 -1 -2
## 11 4 5 b 3 0 -1