6 Formatting Data
This chapter will cover: merging data files, long and wide format, recoding varibles, changing variable types, column renaming and arrangement, and data manipulation. The intention is to briefly cover some of the most common tasks that are performed in R. There are many resources that cover these topics and more in much greater depth (and with better explanation), such as Roger D. Peng’s R Programming for Data Science and R for Data Science by Garrett Grolemund and Hadley Wickham.
The code in this chapter only works if you’re following along with the Github folder for this book (which you can download here), correctly set your working directory to the data folder (which you can learn how to do in Chapter 4), and run the code in the order it appears in this chapter.
Merging Files
Sometimes you might want to import multiple files and combine them into one dataset. This can be accomplished with the cbind()
and rbind()
functions, which are short for column bind and row bind. Let’s say you have three files that have the same column names and you want to combine them.
data1 <- read.csv("BMI_1.csv")
dim(data1)
data2 <- read.csv("BMI_2.csv")
dim(data2)
data3 <- read.csv("BMI_3.csv")
dim(data3)
newData <- rbind(data1, data2, data3)
dim(newData)
[1] 20 2
[1] 20 2
[1] 20 2
[1] 60 2
As you can see the last printout shows the dimensions are 60 rows by 2 columns, which means the three data frames were successfully combined.
Let’s say you have two files with different column names and want to combine them side-by-side. For example, you might have a data file that contains subID
, age
and sex
, and another file contains height
and weight
data. These columns can be added into one dataset with the cbind()
function.
subID
, age
, and sex
dataset:
subID age sex
1 1 30.68838 F
2 2 28.33384 M
3 3 33.65584 M
4 4 35.51969 M
5 5 32.43415 M
6 6 28.14712 F
height
and weight
dataset:
height weight
1 68.51668 154.6818
2 65.31865 173.9590
3 62.07894 138.2117
4 64.84917 165.0201
5 70.13895 148.6306
6 67.31371 196.6275
Combind with the cbind()
function:
subID age sex height weight
1 1 30.68838 F 68.51668 154.6818
2 2 28.33384 M 65.31865 173.9590
3 3 33.65584 M 62.07894 138.2117
4 4 35.51969 M 64.84917 165.0201
5 5 32.43415 M 70.13895 148.6306
6 6 28.14712 F 67.31371 196.6275
It is important to note that the rbind()
function is only useful when binding two datasets that have all of the same column names and they’re in the same order. Similarly, the cbind()
function is only useful when combining data with the same number of rows. If this is not the case, you can use the merge
function for more complex joining operations. Type merge
into the search bar under the help tab in RStudio to learn more about the function, or run the code ?base::merge
.
Long Format
Putting data in long format increases the number of rows and decreases the number of columns in a data frame. In the health sciences, this often means that one row corresponds to one observation (at one point in time). If your data is not in long format, you can change it to long format with the pivot_longer()
function, which comes from the tidyverse package.
Here’s an example of a research study where subjects had their VO2 max measured at four points in time, and it’s currently not in long format:
SubID Time1 Time2 Time3 Time4
1 1 36.72302 34.08948 41.50392 42.85447
2 2 28.09261 33.22172 39.22484 47.51721
3 3 25.32283 34.95023 38.91351 40.67717
4 4 31.87634 35.51951 38.05794 46.51975
5 5 30.98121 33.18143 41.16101 46.96758
6 6 31.83277 35.58868 33.92712 45.27417
We can convert the data_VO2.csv into long format with the pivot_longer()
function:
# Make sure the tidyverse package is loaded
data_VO2_long <- pivot_longer(data = data_VO2,
cols = c("Time1", "Time2", "Time3", "Time4"),
names_to = "Time",
values_to = "VO2_max")
head(data_VO2_long)
# A tibble: 6 x 3
SubID Time VO2_max
<int> <chr> <dbl>
1 1 Time1 36.7
2 1 Time2 34.1
3 1 Time3 41.5
4 1 Time4 42.9
5 2 Time1 28.1
6 2 Time2 33.2
Wide Format
Putting data in wide format increases the number of columns in a data frame. In the health sciences, this often means that one row will contain multiple points in time; this is how the data_VO2.csv was formatted before we converted it to long format. We can now change it back into wide format with the pivot_wider()
function, which also comes from the tidyverse
package.
# Make sure the tidyverse package is loaded
data_VO2_wide <- pivot_wider(data = data_VO2_long,
names_from = "Time",
values_from = "VO2_max")
head(data_VO2_wide)
# A tibble: 6 x 5
SubID Time1 Time2 Time3 Time4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 36.7 34.1 41.5 42.9
2 2 28.1 33.2 39.2 47.5
3 3 25.3 35.0 38.9 40.7
4 4 31.9 35.5 38.1 46.5
5 5 31.0 33.2 41.2 47.0
6 6 31.8 35.6 33.9 45.3
Recoding Variables
One of the easiest ways to recode/fix factors in R is with the factor()
function. We’ll use the SubID_Age_Sex.csv dataset that was used earlier:
subID age sex
1 1 30.68838 F
2 2 28.33384 M
3 3 33.65584 M
4 4 35.51969 M
5 5 32.43415 M
6 6 28.14712 F
For the factor()
function, the first argument, x
, is the column that should be selected, which in this case is sex
. Again, we can use the dollar sign $
to select a specific column from a dataset. The second argument, levels
, is the levels that should be included, and the third argument, labels
, is the labels that should be assigned to each level, respectively.
SubID_Age_Sex$sex <- factor(x = SubID_Age_Sex$sex,
levels = c("M","F"),
labels = c(0,1))
head(SubID_Age_Sex)
subID age sex
1 1 30.68838 1
2 2 28.33384 0
3 3 33.65584 0
4 4 35.51969 0
5 5 32.43415 0
6 6 28.14712 1
Let’s take a look at a recoding example that’s slighly more involved. We’ll use the data_VO2.csv dataset used earlier.
SubID Time VO2_max
1 1 Time1 24.75065
2 1 Time2 35.03364
3 1 Time3 42.41792
4 1 Time4 46.61252
5 2 Time1 31.29356
6 2 Time2 29.43164
And we’ll add a column of the subject’s sex:
SubID Time VO2_max Sex
1 1 Time1 24.75065 M
2 1 Time2 35.03364 M
3 1 Time3 42.41792 M
4 1 Time4 46.61252 M
5 2 Time1 31.29356 F
6 2 Time2 29.43164 F
Rather than having VO2max as continuous data, we’ll change it to categorical data: poor
, fair
, good
, excellent
. The catch, however, is that the classification will partially depend on whether the subject was a male or female (it also depends on age, but we’ll assume everyone in the study was 30 years old).
The easiest way to do this is with the ifelse()
function, and we’ll also use several functions from the tidyverse package. First, we need to filter the data so that only males or females are included in the dataset, but not both. We can do this with the filter()
function from the tidyverse package. We’ll also need to create a new column to store the data in, and one easy way to do this is with the mutate()
function from the tidyverse package.
data_VO2_F <- filter(data_VO2, Sex == "F")
data_VO2_F <-
mutate(data_VO2, VO2_max_cat = ifelse(VO2_max < 22.8, "Very Poor",
ifelse(VO2_max >= 22.8 & VO2_max <= 26.9, "Poor",
ifelse(VO2_max >= 27.0 & VO2_max <= 31.4, "Fair",
ifelse(VO2_max >= 31.5 & VO2_max <= 35.6, "Good",
ifelse(VO2_max >= 35.7 & VO2_max <= 40.0, "Excellent",
"Superior"))))))
head(data_VO2_F)
SubID Time VO2_max Sex VO2_max_cat
1 1 Time1 24.75065 M Poor
2 1 Time2 35.03364 M Good
3 1 Time3 42.41792 M Superior
4 1 Time4 46.61252 M Superior
5 2 Time1 31.29356 F Fair
6 2 Time2 29.43164 F Fair
In the code above the data were filtered to include only females. Notice that there are two equals signs, ==
, within the filter()
function. This is because two equals signs is used to represent a logical statement. The first line of code above is filtering the data to only include females: Sex == "F"
. This is a logical statement where we are saying we only want to include the data where Sex == "F"
is true. If you were to add only one equals sign here you’d get an error, and errors like this can be very difficult to find!
After filtering, the data frame was assigned to the object data_VO2_F
. Next, the mutate()
function was used to create a new column called VO2_max_cat
(cat for category). Then, the ifelse()
function was used to assign a category based on the subject’s VO2_max. The first line reads, “If the VO2_max is less than 22.8, assign it the value”Very Poor“.” The last category, Superior
, is assigned if the value does not meet any of the other conditions, which in this case is all values greater than 40.0.
Now we can repeat the same process for males:
data_VO2_M <- filter(data_VO2, Sex == "M")
data_VO2_M <-
mutate(data_VO2_M, VO2_max_cat = ifelse(VO2_max < 31.5, "Very Poor",
ifelse(VO2_max >= 31.5 & VO2_max <= 35.4, "Poor",
ifelse(VO2_max >= 35.5 & VO2_max <= 40.9, "Fair",
ifelse(VO2_max >= 41.0 & VO2_max <= 44.9, "Good",
ifelse(VO2_max >= 45.0 & VO2_max <= 49.4, "Excellent",
"Superior"))))))
head(data_VO2_M)
SubID Time VO2_max Sex VO2_max_cat
1 1 Time1 24.75065 M Very Poor
2 1 Time2 35.03364 M Poor
3 1 Time3 42.41792 M Good
4 1 Time4 46.61252 M Excellent
5 3 Time1 28.31473 M Very Poor
6 3 Time2 32.42022 M Poor
If you wanted the data frames to be combined back into one data frame that included both males and females, you could use the rbind()
function.
You might want to recode certain values as NA
if you know the observed values are impossible. Here’s an example where Age
data is collected, and values less than 0 or greater than 130 are recoded as NA
. Notice the vertical bar in the code. The vertical bar means or
. So the code below reads, “If Age
is less than 0 or greater than 130, assign the value NA
to it.”
Age
1 NA
2 43
3 23
4 56
5 NA
Changing Variable Types
As discussed briefly in Chapter 2, there are several variable types in R, and the variable type of the data will affect the operations that you can and cannot perform on that data. For example, if the data is of the character class, you cannot calculate the mean with the mean()
function.
Let’s say you have 10 values consisting of zeros and ones, where the ones are intended to represent TRUE
and zeros are intended to represent FALSE
.
Unfortunately, R has no way of knowing that you want the zeros and ones to represent FALSE
and TRUE
. R will initially interpret the values exactly as they are, as numeric zeros and ones. You can find out the class of the object with the class()
function:
[1] "numeric"
As expected, the values are numeric. We can coerce the values to the logical class, so that the zeros are FALSE
and ones are TRUE
:
[1] FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
Now let’s check the class of the object:
[1] "numeric"
Notice that the class is still the same. That’s because we didn’t overwrite the values
object, which is what we need to do in order to permanently change the object:
[1] "logical"
R can only perform a coercion that makes sense.
words <- "Can this sentence be coverted to a different class?"
as.numeric(words)
as.integer(words)
as.logical(words)
as.factor(words)
[1] NA
[1] NA
[1] NA
[1] Can this sentence be coverted to a different class?
Levels: Can this sentence be coverted to a different class?
As you can see in the example above, the sentence (which is of the character class) could not successfully be converted to a numeric, integer, or logical class. Doing so produced NAs
in each case. The sentence could, however, be converted to a factor class. Although not at all sensible in this situation, it is possible because the string of characters can be represented as a categorical variable.
Here’s one more example using the mtcars
data frame, which consists of 11 columns that are all of the numeric class:
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
It would make more sense to represent the cyl
(cylinder) as a factor in most situations, since the data are not continuous and there are only 3 possible cylinder sizes in the data frame:
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8
Rename Columns
Renaming columns is a very common task. Before renaming the columns, you might want to print all of the column names to the screen so you can look at them. This can be done with the colnames()
function:
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
Let’s say that we wanted to be abundantly clear about the column names of the mtcars data frame. Instead of mpg
as the first column name, let’s rename it to miles_per_gallon
.
[1] "miles_per_gallon" "cyl" "disp" "hp"
[5] "drat" "wt" "qsec" "vs"
[9] "am" "gear" "carb"
The brackets were used to select the first item of column names, which in this case was mpg
. If you wanted to rename multiple columns, you could do this in one step:
[1] "miles_per_gallon" "cylinder" "disp" "horsepower"
[5] "drat" "weight" "qsec" "vs"
[9] "am" "gear" "carb"
Columns 2, 4, and 6 were selected with the concatenate function (c()
) and the names cylinder
, horsepower
, and weight
were assigned to those column names, respectively.
Change Column Order
You can change the order of columns in a data frame by subsetting with brackets and then using the concatenate function to change the order, like so:
The fifth column in the iris
data frame, Species
, is now listed first instead of last.
Data Manipulation
In this section we’ll provide a few examples of data transformation using the tidyverse
package. Make sure you have the tidyverse
library loaded before using the functions listed below:
select()
: pick variables (columns) by their namesfilter()
: select observations by their valuesmutate()
: create new variables (column) with functions of existing variablesarrange()
: Reorder the rowssummarise()
: collapse many values down to a single summarygroup_by()
: change the scope of a function to operate on a group
In the Recoding Variables section above we’ve already seen how data can easily be filtered with the filter()
function from the tidyverse
package. There might be situations where you need to do more than simply filtering the data. Maybe you’d like to filter the data, group it by some grouping variable, and then summarize the data based off of the grouping and filtering. This, and much more, can be done with some of the functions in the tidyverse
package.
Let’s look at an example where we’d like to perform several transformations to the BMI_1.csv and SubID_Age_Sex.csv data frames. Earlier we combined these data frames into a new object called newData
, which is the object we’ll be manipulating. Make sure to run the code below if you haven’t already.
data1 <- read.csv("SubID_Age_Sex.csv")
data2 <- read.csv("BMI_1.csv")
newData <- cbind(data1, data2)
head(newData)
subID age sex height weight
1 1 30.68838 F 68.51668 154.6818
2 2 28.33384 M 65.31865 173.9590
3 3 33.65584 M 62.07894 138.2117
4 4 35.51969 M 64.84917 165.0201
5 5 32.43415 M 70.13895 148.6306
6 6 28.14712 F 67.31371 196.6275
Let’s say we wanted to calculate the BMI of males only, and then arrange the data so that the highest BMI values are listed first. First we’d need to filter the data to include only males, then convert height and weight to metric values, then calculate BMI, then arrange the data in descending order according to BMI. We’ll use the filter()
, mutate()
, and arrange()
functions to do this.
Good Option
Here is one option of how the desired transformation could be accomplished. The code is readable, but several objects had to be created as intermediate steps, which is not ideal.
newData1 <- filter(newData, sex == "M")
newData2 <- mutate(newData1, weight_kg = weight / 2.2,
height_m = height * .0254,
bmi = weight_kg / height_m^2)
newData3 <- arrange(newData2, desc(bmi))
newData3
subID age sex height weight weight_kg height_m bmi
1 14 31.20603 M 63.05955 173.9793 79.08148 1.601712 30.82518
2 2 28.33384 M 65.31865 173.9590 79.07225 1.659094 28.72647
3 4 35.51969 M 64.84917 165.0201 75.00913 1.647169 27.64636
4 19 29.41172 M 68.84346 184.8655 84.02978 1.748624 27.48150
5 3 33.65584 M 62.07894 138.2117 62.82348 1.576805 25.26771
6 10 25.18445 M 71.39754 177.2335 80.56069 1.813498 24.49567
7 7 30.53586 M 71.15408 166.7169 75.78042 1.807314 23.20010
8 5 32.43415 M 70.13895 148.6306 67.55937 1.781529 21.28628
This way of doing things is just fine, but it’s not very efficient. Data manipulation often involves many steps, and it would be quite the hassle to create a new object for every step in the process, and then use that new object for further manipulation. Thankfully, there’s a better way to do this.
Better Option
The code below is as readable as the last, and new objects did not have to be created in the process, making it the better option. In this code, the data was filtered to include only males, weight_kg
and height_m
columns were added, bmi
was then calculated using those two columns, and the data was arranged so that the highest BMI values were listed first.
newData %>%
filter(sex == "M") %>%
mutate(weight_kg = weight / 2.2,
height_m = height * .0254,
bmi = weight_kg / height_m^2) %>%
arrange(desc(bmi))
subID age sex height weight weight_kg height_m bmi
1 14 31.20603 M 63.05955 173.9793 79.08148 1.601712 30.82518
2 2 28.33384 M 65.31865 173.9590 79.07225 1.659094 28.72647
3 4 35.51969 M 64.84917 165.0201 75.00913 1.647169 27.64636
4 19 29.41172 M 68.84346 184.8655 84.02978 1.748624 27.48150
5 3 33.65584 M 62.07894 138.2117 62.82348 1.576805 25.26771
6 10 25.18445 M 71.39754 177.2335 80.56069 1.813498 24.49567
7 7 30.53586 M 71.15408 166.7169 75.78042 1.807314 23.20010
8 5 32.43415 M 70.13895 148.6306 67.55937 1.781529 21.28628
This was all accomplished without creating new objects thanks to the pipe operator, %>%
. Think of the pipe as the word “then”. For example, take the newData
object then filter
to include only males, then mutate
to calculate BMI values, then arrange
the data.
Here’s another example that utilizes the other three functions: select
, group_by
, and summarise
. We’ll be using the data_VO2
object created earlier. Make sure to run the code below if you haven’t already.
data_VO2 <- read.csv("data_VO2_long.csv")
data_VO2$Sex <- rep(c(rep("M", 4), rep("F", 4)), 10)
head(data_VO2)
SubID Time VO2_max Sex
1 1 Time1 24.75065 M
2 1 Time2 35.03364 M
3 1 Time3 42.41792 M
4 1 Time4 46.61252 M
5 2 Time1 31.29356 F
6 2 Time2 29.43164 F
In this example, we want to select all columns except for the SubID
column, group the observations by Time
and Sex
, perform a summary statistic where we calculate the average VO2_max
based on that grouping, and count the number of observations in each grouping.
data_VO2 %>%
select(!SubID) %>%
group_by(Time, Sex) %>%
summarise(VO2_max_avg = mean(VO2_max),
count = n())
# A tibble: 8 x 4
# Groups: Time [4]
Time Sex VO2_max_avg count
<chr> <chr> <dbl> <int>
1 Time1 F 30.7 10
2 Time1 M 28.4 10
3 Time2 F 32.0 10
4 Time2 M 34.4 10
5 Time3 F 40.0 10
6 Time3 M 41.2 10
7 Time4 F 44.1 10
8 Time4 M 45.0 10
In the select()
function the SubID
column was not selected, which is indicated by the exclamation point, !
. The data was then grouped by Time
and Sex
, and the summarise()
function was used to create a new column called VO2_max_avg
, which was computed as the average VO2_max for each of the unique groups. A count
column was also created with the n()
function, which counted the observations in each grouping.
We’ve shown that the arrange()
function can be used to reorder rows of numeric data by descending (or ascending) values; it can also be used to arrange factors, like the Sex
column. Here’s the same code as above, except that the data is arranged by Sex
instead of Time
:
data_VO2 %>%
select(!SubID) %>%
group_by(Time, Sex) %>%
summarise(VO2_max_avg = mean(VO2_max)) %>%
arrange(Sex)
# A tibble: 8 x 3
# Groups: Time [4]
Time Sex VO2_max_avg
<chr> <chr> <dbl>
1 Time1 F 30.7
2 Time2 F 32.0
3 Time3 F 40.0
4 Time4 F 44.1
5 Time1 M 28.4
6 Time2 M 34.4
7 Time3 M 41.2
8 Time4 M 45.0
These data manipulation functions are far more comprehensive than what’s shown here. To see more examples of these functions and in much greater detail, check out the Data Transformation chapter in R for Data Science, which you can read here.