This package facilitates efficient handling of data frames in a unified manner. We’ll use the birthwt dataset from the MASS package to demonstrate the usage of commonly used functions in the dplyr package. This dataset is from a case-control study on risk factors for low birth weight in newborns.
library(dplyr)
data(birthwt, package = "MASS")
The birthwt dataset contains a total of 189 study subjects and 10 variables. Among them, the outcome variable ‘bwt’ represents the weight of newborns in grams, and the indicator variable ‘low’ derived from ‘bwt’ with a cutoff at 2500g. The remaining 8 variables are predictors, including mother’s age (‘age’), race (‘race’, ‘1’ = white, ‘2’ = black, ‘3’ = other), smoking status (‘smoke’, 0/1), history of hypertension (‘ht’, 0/1), etc.
Using filter() and slice() to Filter Rows
The filter() function filters a subset of the data frame based on observations, where parameters data frame name comes first, and expressions for filtering follows. For example, to filter all records of subjects with age greater than 35:
filter(birthwt, age > 35)
Multiple conditions can be separated by commas within the filter() function (this serves as logical “and”). Logical operators can also be used directly. This command selects records of subjects with age greater than 35, and birth weight less than 2500g or greater than 4000g:
head(filter(birthwt, age > 35, bwt < 2500 | bwt > 4000), 10)
The slice() function selects specified rows based on row numbers. For example, the following command selects rows 3 to 5 from the dataset:
slice(birthwt, 3:5)
Arranging Rows with arrange()
Sometimes, we may want to arrange the records of a data frame based on a specific variable. We can use arrange() function for this. The following command arranges the data frame in ascending order by ‘bwt’:
arrange(birthwt, bwt) # Default: ascending order
If we want to further sort the data frame by a second variable when 2 lines of data are equal, we can add a second variable within the arrange() function. For example, the following command arranges the data frame by ‘bwt’ in ascending order and then by ‘age’ in ascending order within each group of equal ‘bwt’ values:
arrange(birthwt, bwt, age)
To sort the data frame by a variable in descending order, we can use the desc() function:
arrange(birthwt, desc(bwt))
# Equivalent to arrange(birthwt, -bwt)
Selecting Columns with select()
The select() function is used to choose columns (variables) from a data frame. For example, the following command selects the variables ‘bwt’, ‘age’, ‘race’, and ‘smoke’ to form a new data frame:
select(birthwt, bwt, age, race, smoke)
Note that there is a function with the same name select() in the MASS package. If both the dplyr package and the MASS package are loaded simultaneously, R will default to using the function from the latter loaded package. To avoid confusion, we can specify the package explicitly using the :: symbol, for example, dplyr::select().
Adding/Changing Variables with mutate()
The mutate() function is used to create new variables or change the definition of an existing variable in a data frame. The following command creates a new variable ‘lwt.kg’ by multiplying the variable ‘lwt’ (unit: lb) in the birthwt dataset by the coefficient 0.4536 to convert lb to kg:
mutate(birthwt, lwt.kg = lwt * 0.4536)
If we want to replace an existing variable with a new one:
mutate(birthwt, lwt = lwt * 0.4536)
Computing Summary Statistics with summarise()
The summarise() function is used to calculate specified statistics for a variable in a data frame. For example, to compute the sample mean and standard deviation of the variable ‘bwt’:
summarise(birthwt, Mean.bwt = mean(bwt), Sd.bwt = sd(bwt))
Splitting Data Frames with group_by()
The group_by() function splits a data frame into multiple data frames based on one or more categorical variables. For example:
group_by(birthwt, race)
The function group_by() does not change the appearance of the data frame but changes how it interacts with other dplyr verb functions. Hence, the output may look similar to the original data frame, but it is fundamentally different. The main difference is the addition of a grouping attribute (Groups, to see details, use str()), indicating that the resulting object contains multiple data frames corresponding to the 3 categories of the variable ‘race’.
A tibble is a format similar to a data frame provided by the tidyverse package (including the dplyr package). Tibble has several advantages over traditional data frames. We can convert traditional data frames into tibbles using the as_tibble() function and vice versa using the as.data.frame() function.
as_tibble(birthwt)
Next, we’ll see how to conveniently perform grouped statistics on variables by using the group_by() and summarise() functions together.
Combining Multiple Operations with the Pipe Operator %>%
We often need to perform a series of operations on a data frame, where the input of the subsequent operation depends on the output of the previous one.
Step 1: Convert the variable ‘race’ inside the birthwt data frame into a factor and label each level, naming the new data frame as birthwt1
birthwt1 <- mutate(birthwt,
race = factor(race, labels = c("white", "black", "other")))
Step 2: Group the birthwt1 data frame by the variable ‘race’ and name the grouped object as birthwt.group
birthwt.group <- group_by(birthwt1, race)
Step 3: Calculate the mean of the variable ‘bwt’ for each group within the birthwt.group object.
summarise(birthwt.group, mean(bwt))
The major drawback of this method is the need to create a variable for each intermediate result. In many cases, such as in the example above, these intermediate variables actually serve no practical purpose. We have to name these intermediate variables, and they occupy memory space in the workspace. The pipe operator %>% passes the object before it to the function after it as the first parameter value. For example:
c(2, 4, 6, 8) %>% matrix(nrow = 2)
Because the functions in the dplyr package always take the data frame as their first argument, these functions combined with the pipe operator make it very convenient to handle data frames. Below is the rewritten command using the pipe operator:
birthwt %>%
mutate(race = factor(race, labels = c("white", "black", "other"))) %>%
group_by(race) %>%
summarise(mean(bwt))
The focus of the above code is on the verb functions rather than the parameters within the functions. When reading this sequence of code combinations, they can be considered as a series of specified actions.
Practical Application
The dataset Planning from the epiDisplay package comes from a family planning survey conducted in mid-1980s Thailand. (https://rdrr.io/cran/epiDisplay/man/Planning.html)
library(epiDisplay)
data(Planning)
print(des(Planning))
names(Planning) <- tolower(names(Planning)) # Convert variable names to lowercase
summary(Planning)
table(duplicated(Planning$id)) # Check for duplicate IDs
# FALSE TRUE
# 250 1
which(duplicated(Planning$id)) # Find the row number of the duplicate ID; replace XXXXXX with the correct code
# 216
Planning$id # Verify
Planning$id[216] <- 216 # Correct the duplicate ID
library(dplyr)
Planning <- mutate(
Planning,
relig = ifelse(relig == 9, NA, relig), # Convert 9 in variable 'relig' to NA
ped = ifelse(ped == 0 | ped == 9, NA, ped), # Convert 0 and 9 in variable 'ped' to NA
income = ifelse(income == 9, NA, income), # Convert 9 in variable 'income' to NA
am = ifelse(am == 99, NA, am), # Convert 99 in variable 'am' to NA
reason = ifelse(reason == 9, NA, reason), # Convert 9 in variable 'reason' to NA
bps = ifelse(bps == 0 | bps == 999, NA, bps), # Convert 0 and 999 in variable 'bps' to NA
bpd = ifelse(bpd == 0 | bpd == 999, NA, bpd), # Convert 0 and 999 in variable 'bpd' to NA
wt = ifelse(wt == 0 | wt > 99, NA, wt), # Convert 0 and values greater than 99 in variable 'wt' to NA
ht = ifelse(ht == 0 | ht > 300, NA, ht) # Convert 0 and values greater than 300 in variable 'ht' to NA
)