6 Formatting Data

This chapter will cover: merging data files, long and wide format, recoding varibles, changing variable types, column renaming and arrangement, and data manipulation. The intention is to briefly cover some of the most common tasks that are performed in R. There are many resources that cover these topics and more in much greater depth (and with better explanation), such as Roger D. Peng’s R Programming for Data Science and R for Data Science by Garrett Grolemund and Hadley Wickham.

The code in this chapter only works if you’re following along with the Github folder for this book (which you can download here), correctly set your working directory to the data folder (which you can learn how to do in Chapter 4), and run the code in the order it appears in this chapter.

Merging Files

Sometimes you might want to import multiple files and combine them into one dataset. This can be accomplished with the cbind() and rbind() functions, which are short for column bind and row bind. Let’s say you have three files that have the same column names and you want to combine them.

data1 <- read.csv("BMI_1.csv")
dim(data1)

data2 <- read.csv("BMI_2.csv")
dim(data2)

data3 <- read.csv("BMI_3.csv")
dim(data3)

newData <- rbind(data1, data2, data3)
dim(newData)

[1] 20  2
[1] 20  2
[1] 20  2
[1] 60  2

As you can see the last printout shows the dimensions are 60 rows by 2 columns, which means the three data frames were successfully combined.

Let’s say you have two files with different column names and want to combine them side-by-side. For example, you might have a data file that contains subID, age and sex, and another file contains height and weight data. These columns can be added into one dataset with the cbind() function.

subID, age, and sex dataset:

data1 <- read.csv("SubID_Age_Sex.csv")
head(data1)

  subID      age sex
1     1 30.68838   F
2     2 28.33384   M
3     3 33.65584   M
4     4 35.51969   M
5     5 32.43415   M
6     6 28.14712   F

height and weight dataset:

data2 <- read.csv("BMI_1.csv")
head(data2)

    height   weight
1 68.51668 154.6818
2 65.31865 173.9590
3 62.07894 138.2117
4 64.84917 165.0201
5 70.13895 148.6306
6 67.31371 196.6275

Combind with the cbind() function:

newData <- cbind(data1, data2)
head(newData)

  subID      age sex   height   weight
1     1 30.68838   F 68.51668 154.6818
2     2 28.33384   M 65.31865 173.9590
3     3 33.65584   M 62.07894 138.2117
4     4 35.51969   M 64.84917 165.0201
5     5 32.43415   M 70.13895 148.6306
6     6 28.14712   F 67.31371 196.6275

It is important to note that the rbind() function is only useful when binding two datasets that have all of the same column names and they’re in the same order. Similarly, the cbind() function is only useful when combining data with the same number of rows. If this is not the case, you can use the merge function for more complex joining operations. Type merge into the search bar under the help tab in RStudio to learn more about the function, or run the code ?base::merge.

Long Format

Putting data in long format increases the number of rows and decreases the number of columns in a data frame. In the health sciences, this often means that one row corresponds to one observation (at one point in time). If your data is not in long format, you can change it to long format with the pivot_longer() function, which comes from the tidyverse package.

Here’s an example of a research study where subjects had their VO2 max measured at four points in time, and it’s currently not in long format:

data_VO2 <- read.csv("data_VO2.csv")
head(data_VO2)

  SubID    Time1    Time2    Time3    Time4
1     1 36.72302 34.08948 41.50392 42.85447
2     2 28.09261 33.22172 39.22484 47.51721
3     3 25.32283 34.95023 38.91351 40.67717
4     4 31.87634 35.51951 38.05794 46.51975
5     5 30.98121 33.18143 41.16101 46.96758
6     6 31.83277 35.58868 33.92712 45.27417

We can convert the data_VO2.csv into long format with the pivot_longer() function:

# Make sure the tidyverse package is loaded 
data_VO2_long <- pivot_longer(data = data_VO2, 
                              cols = c("Time1", "Time2", "Time3", "Time4"),
                              names_to = "Time", 
                              values_to = "VO2_max")
head(data_VO2_long)

# A tibble: 6 x 3
  SubID Time  VO2_max
  <int> <chr>   <dbl>
1     1 Time1    36.7
2     1 Time2    34.1
3     1 Time3    41.5
4     1 Time4    42.9
5     2 Time1    28.1
6     2 Time2    33.2

Wide Format

Putting data in wide format increases the number of columns in a data frame. In the health sciences, this often means that one row will contain multiple points in time; this is how the data_VO2.csv was formatted before we converted it to long format. We can now change it back into wide format with the pivot_wider() function, which also comes from the tidyverse package.

# Make sure the tidyverse package is loaded 
data_VO2_wide <- pivot_wider(data = data_VO2_long,
                             names_from = "Time", 
                             values_from = "VO2_max")
head(data_VO2_wide)

# A tibble: 6 x 5
  SubID Time1 Time2 Time3 Time4
  <int> <dbl> <dbl> <dbl> <dbl>
1     1  36.7  34.1  41.5  42.9
2     2  28.1  33.2  39.2  47.5
3     3  25.3  35.0  38.9  40.7
4     4  31.9  35.5  38.1  46.5
5     5  31.0  33.2  41.2  47.0
6     6  31.8  35.6  33.9  45.3

Recoding Variables

One of the easiest ways to recode/fix factors in R is with the factor() function. We’ll use the SubID_Age_Sex.csv dataset that was used earlier:

SubID_Age_Sex <- read.csv("SubID_Age_Sex.csv")
head(SubID_Age_Sex)

  subID      age sex
1     1 30.68838   F
2     2 28.33384   M
3     3 33.65584   M
4     4 35.51969   M
5     5 32.43415   M
6     6 28.14712   F

For the factor() function, the first argument, x, is the column that should be selected, which in this case is sex. Again, we can use the dollar sign $ to select a specific column from a dataset. The second argument, levels, is the levels that should be included, and the third argument, labels, is the labels that should be assigned to each level, respectively.

SubID_Age_Sex$sex <- factor(x = SubID_Age_Sex$sex, 
                            levels = c("M","F"), 
                            labels = c(0,1))
head(SubID_Age_Sex)

  subID      age sex
1     1 30.68838   1
2     2 28.33384   0
3     3 33.65584   0
4     4 35.51969   0
5     5 32.43415   0
6     6 28.14712   1

Let’s take a look at a recoding example that’s slighly more involved. We’ll use the data_VO2.csv dataset used earlier.

data_VO2 <- read.csv("data_VO2_long.csv")
head(data_VO2)

  SubID  Time  VO2_max
1     1 Time1 24.75065
2     1 Time2 35.03364
3     1 Time3 42.41792
4     1 Time4 46.61252
5     2 Time1 31.29356
6     2 Time2 29.43164

And we’ll add a column of the subject’s sex:

data_VO2$Sex <- rep(c(rep("M", 4), rep("F", 4)), 10)
head(data_VO2)

  SubID  Time  VO2_max Sex
1     1 Time1 24.75065   M
2     1 Time2 35.03364   M
3     1 Time3 42.41792   M
4     1 Time4 46.61252   M
5     2 Time1 31.29356   F
6     2 Time2 29.43164   F

Rather than having VO2max as continuous data, we’ll change it to categorical data: poor, fair, good, excellent. The catch, however, is that the classification will partially depend on whether the subject was a male or female (it also depends on age, but we’ll assume everyone in the study was 30 years old).

The easiest way to do this is with the ifelse() function, and we’ll also use several functions from the tidyverse package. First, we need to filter the data so that only males or females are included in the dataset, but not both. We can do this with the filter() function from the tidyverse package. We’ll also need to create a new column to store the data in, and one easy way to do this is with the mutate() function from the tidyverse package.

data_VO2_F <- filter(data_VO2, Sex == "F")

data_VO2_F <- 
mutate(data_VO2, VO2_max_cat = ifelse(VO2_max < 22.8, "Very Poor",
                               ifelse(VO2_max >= 22.8 & VO2_max <= 26.9, "Poor", 
                               ifelse(VO2_max >= 27.0 & VO2_max <= 31.4, "Fair",
                               ifelse(VO2_max >= 31.5 & VO2_max <= 35.6, "Good",
                               ifelse(VO2_max >= 35.7 & VO2_max <= 40.0, "Excellent",
                                     "Superior"))))))
head(data_VO2_F)

  SubID  Time  VO2_max Sex VO2_max_cat
1     1 Time1 24.75065   M        Poor
2     1 Time2 35.03364   M        Good
3     1 Time3 42.41792   M    Superior
4     1 Time4 46.61252   M    Superior
5     2 Time1 31.29356   F        Fair
6     2 Time2 29.43164   F        Fair

In the code above the data were filtered to include only females. Notice that there are two equals signs, ==, within the filter() function. This is because two equals signs is used to represent a logical statement. The first line of code above is filtering the data to only include females: Sex == "F". This is a logical statement where we are saying we only want to include the data where Sex == "F" is true. If you were to add only one equals sign here you’d get an error, and errors like this can be very difficult to find!

After filtering, the data frame was assigned to the object data_VO2_F. Next, the mutate() function was used to create a new column called VO2_max_cat (cat for category). Then, the ifelse() function was used to assign a category based on the subject’s VO2_max. The first line reads, “If the VO2_max is less than 22.8, assign it the value”Very Poor“.” The last category, Superior, is assigned if the value does not meet any of the other conditions, which in this case is all values greater than 40.0.

Now we can repeat the same process for males:

data_VO2_M <- filter(data_VO2, Sex == "M") 

data_VO2_M <- 
mutate(data_VO2_M, VO2_max_cat = ifelse(VO2_max < 31.5, "Very Poor",
                             ifelse(VO2_max >= 31.5 & VO2_max <= 35.4, "Poor", 
                             ifelse(VO2_max >= 35.5 & VO2_max <= 40.9, "Fair",
                             ifelse(VO2_max >= 41.0 & VO2_max <= 44.9, "Good",
                             ifelse(VO2_max >= 45.0 & VO2_max <= 49.4, "Excellent",
                                    "Superior"))))))

head(data_VO2_M)

  SubID  Time  VO2_max Sex VO2_max_cat
1     1 Time1 24.75065   M   Very Poor
2     1 Time2 35.03364   M        Poor
3     1 Time3 42.41792   M        Good
4     1 Time4 46.61252   M   Excellent
5     3 Time1 28.31473   M   Very Poor
6     3 Time2 32.42022   M        Poor

If you wanted the data frames to be combined back into one data frame that included both males and females, you could use the rbind() function.

You might want to recode certain values as NA if you know the observed values are impossible. Here’s an example where Age data is collected, and values less than 0 or greater than 130 are recoded as NA. Notice the vertical bar in the code. The vertical bar means or. So the code below reads, “If Age is less than 0 or greater than 130, assign the value NA to it.”

df <- data.frame(Age = c(1546, 43, 23, 56, -64))
df$Age[df$Age < 0 | df$Age > 130] <- NA
df

Changing Variable Types

As discussed briefly in Chapter 2, there are several variable types in R, and the variable type of the data will affect the operations that you can and cannot perform on that data. For example, if the data is of the character class, you cannot calculate the mean with the mean() function.

Let’s say you have 10 values consisting of zeros and ones, where the ones are intended to represent TRUE and zeros are intended to represent FALSE.

values <- c(0, 0, 1, 0, 1, 1, 1, 1, 0, 0)

Unfortunately, R has no way of knowing that you want the zeros and ones to represent FALSE and TRUE. R will initially interpret the values exactly as they are, as numeric zeros and ones. You can find out the class of the object with the class() function:

class(values)

[1] "numeric"

As expected, the values are numeric. We can coerce the values to the logical class, so that the zeros are FALSE and ones are TRUE:

as.logical(values)

 [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Now let’s check the class of the object:

class(values)

[1] "numeric"

Notice that the class is still the same. That’s because we didn’t overwrite the values object, which is what we need to do in order to permanently change the object:

values <- as.logical(values)
class(values)

[1] "logical"

R can only perform a coercion that makes sense.

words <- "Can this sentence be coverted to a different class?"
as.numeric(words)
as.integer(words)
as.logical(words)
as.factor(words)

[1] NA
[1] NA
[1] NA
[1] Can this sentence be coverted to a different class?
Levels: Can this sentence be coverted to a different class?

As you can see in the example above, the sentence (which is of the character class) could not successfully be converted to a numeric, integer, or logical class. Doing so produced NAs in each case. The sentence could, however, be converted to a factor class. Although not at all sensible in this situation, it is possible because the string of characters can be represented as a categorical variable.

Here’s one more example using the mtcars data frame, which consists of 11 columns that are all of the numeric class:

str(mtcars)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

It would make more sense to represent the cyl (cylinder) as a factor in most situations, since the data are not continuous and there are only 3 possible cylinder sizes in the data frame:

mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$cyl

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8

Rename Columns

Renaming columns is a very common task. Before renaming the columns, you might want to print all of the column names to the screen so you can look at them. This can be done with the colnames() function:

colnames(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Let’s say that we wanted to be abundantly clear about the column names of the mtcars data frame. Instead of mpg as the first column name, let’s rename it to miles_per_gallon.

colnames(mtcars)[1] <- "miles_per_gallon" 
colnames(mtcars)

 [1] "miles_per_gallon" "cyl"              "disp"             "hp"              
 [5] "drat"             "wt"               "qsec"             "vs"              
 [9] "am"               "gear"             "carb"

The brackets were used to select the first item of column names, which in this case was mpg. If you wanted to rename multiple columns, you could do this in one step:

colnames(mtcars)[c(2, 4, 6)] <- c("cylinder", "horsepower", "weight")
colnames(mtcars)

 [1] "miles_per_gallon" "cylinder"         "disp"             "horsepower"      
 [5] "drat"             "weight"           "qsec"             "vs"              
 [9] "am"               "gear"             "carb"

Columns 2, 4, and 6 were selected with the concatenate function (c()) and the names cylinder, horsepower, and weight were assigned to those column names, respectively.

Change Column Order

You can change the order of columns in a data frame by subsetting with brackets and then using the concatenate function to change the order, like so:

iris <- iris[c(5, 1:4)]
iris

The fifth column in the iris data frame, Species, is now listed first instead of last.

Data Manipulation

In this section we’ll provide a few examples of data transformation using the tidyverse package. Make sure you have the tidyverse library loaded before using the functions listed below:

select(): pick variables (columns) by their names
filter(): select observations by their values
mutate(): create new variables (column) with functions of existing variables
arrange(): Reorder the rows
summarise(): collapse many values down to a single summary
group_by(): change the scope of a function to operate on a group

In the Recoding Variables section above we’ve already seen how data can easily be filtered with the filter() function from the tidyverse package. There might be situations where you need to do more than simply filtering the data. Maybe you’d like to filter the data, group it by some grouping variable, and then summarize the data based off of the grouping and filtering. This, and much more, can be done with some of the functions in the tidyverse package.

Let’s look at an example where we’d like to perform several transformations to the BMI_1.csv and SubID_Age_Sex.csv data frames. Earlier we combined these data frames into a new object called newData, which is the object we’ll be manipulating. Make sure to run the code below if you haven’t already.

data1 <- read.csv("SubID_Age_Sex.csv")
data2 <- read.csv("BMI_1.csv")
newData <- cbind(data1, data2)
head(newData)

  subID      age sex   height   weight
1     1 30.68838   F 68.51668 154.6818
2     2 28.33384   M 65.31865 173.9590
3     3 33.65584   M 62.07894 138.2117
4     4 35.51969   M 64.84917 165.0201
5     5 32.43415   M 70.13895 148.6306
6     6 28.14712   F 67.31371 196.6275

Let’s say we wanted to calculate the BMI of males only, and then arrange the data so that the highest BMI values are listed first. First we’d need to filter the data to include only males, then convert height and weight to metric values, then calculate BMI, then arrange the data in descending order according to BMI. We’ll use the filter(), mutate(), and arrange() functions to do this.

Good Option

Here is one option of how the desired transformation could be accomplished. The code is readable, but several objects had to be created as intermediate steps, which is not ideal.

newData1 <- filter(newData, sex == "M")
newData2 <- mutate(newData1, weight_kg = weight / 2.2,
                             height_m = height * .0254,
                             bmi = weight_kg / height_m^2)
newData3 <- arrange(newData2, desc(bmi))
newData3

  subID      age sex   height   weight weight_kg height_m      bmi
1    14 31.20603   M 63.05955 173.9793  79.08148 1.601712 30.82518
2     2 28.33384   M 65.31865 173.9590  79.07225 1.659094 28.72647
3     4 35.51969   M 64.84917 165.0201  75.00913 1.647169 27.64636
4    19 29.41172   M 68.84346 184.8655  84.02978 1.748624 27.48150
5     3 33.65584   M 62.07894 138.2117  62.82348 1.576805 25.26771
6    10 25.18445   M 71.39754 177.2335  80.56069 1.813498 24.49567
7     7 30.53586   M 71.15408 166.7169  75.78042 1.807314 23.20010
8     5 32.43415   M 70.13895 148.6306  67.55937 1.781529 21.28628

This way of doing things is just fine, but it’s not very efficient. Data manipulation often involves many steps, and it would be quite the hassle to create a new object for every step in the process, and then use that new object for further manipulation. Thankfully, there’s a better way to do this.

Better Option

The code below is as readable as the last, and new objects did not have to be created in the process, making it the better option. In this code, the data was filtered to include only males, weight_kg and height_m columns were added, bmi was then calculated using those two columns, and the data was arranged so that the highest BMI values were listed first.

newData %>% 
    filter(sex == "M") %>% 
    mutate(weight_kg = weight / 2.2,
           height_m = height * .0254,
           bmi = weight_kg / height_m^2) %>% 
    arrange(desc(bmi))

  subID      age sex   height   weight weight_kg height_m      bmi
1    14 31.20603   M 63.05955 173.9793  79.08148 1.601712 30.82518
2     2 28.33384   M 65.31865 173.9590  79.07225 1.659094 28.72647
3     4 35.51969   M 64.84917 165.0201  75.00913 1.647169 27.64636
4    19 29.41172   M 68.84346 184.8655  84.02978 1.748624 27.48150
5     3 33.65584   M 62.07894 138.2117  62.82348 1.576805 25.26771
6    10 25.18445   M 71.39754 177.2335  80.56069 1.813498 24.49567
7     7 30.53586   M 71.15408 166.7169  75.78042 1.807314 23.20010
8     5 32.43415   M 70.13895 148.6306  67.55937 1.781529 21.28628

This was all accomplished without creating new objects thanks to the pipe operator, %>%. Think of the pipe as the word “then”. For example, take the newData object then filter to include only males, then mutate to calculate BMI values, then arrange the data.

Here’s another example that utilizes the other three functions: select, group_by, and summarise. We’ll be using the data_VO2 object created earlier. Make sure to run the code below if you haven’t already.

data_VO2 <- read.csv("data_VO2_long.csv")
data_VO2$Sex <- rep(c(rep("M", 4), rep("F", 4)), 10)
head(data_VO2)

  SubID  Time  VO2_max Sex
1     1 Time1 24.75065   M
2     1 Time2 35.03364   M
3     1 Time3 42.41792   M
4     1 Time4 46.61252   M
5     2 Time1 31.29356   F
6     2 Time2 29.43164   F

In this example, we want to select all columns except for the SubID column, group the observations by Time and Sex, perform a summary statistic where we calculate the average VO2_max based on that grouping, and count the number of observations in each grouping.

data_VO2 %>% 
    select(!SubID) %>% 
    group_by(Time, Sex) %>% 
    summarise(VO2_max_avg = mean(VO2_max),
              count = n())

# A tibble: 8 x 4
# Groups:   Time [4]
  Time  Sex   VO2_max_avg count
  <chr> <chr>       <dbl> <int>
1 Time1 F            30.7    10
2 Time1 M            28.4    10
3 Time2 F            32.0    10
4 Time2 M            34.4    10
5 Time3 F            40.0    10
6 Time3 M            41.2    10
7 Time4 F            44.1    10
8 Time4 M            45.0    10

In the select() function the SubID column was not selected, which is indicated by the exclamation point, !. The data was then grouped by Time and Sex, and the summarise() function was used to create a new column called VO2_max_avg, which was computed as the average VO2_max for each of the unique groups. A count column was also created with the n() function, which counted the observations in each grouping.

We’ve shown that the arrange() function can be used to reorder rows of numeric data by descending (or ascending) values; it can also be used to arrange factors, like the Sex column. Here’s the same code as above, except that the data is arranged by Sex instead of Time:

data_VO2 %>% 
    select(!SubID) %>% 
    group_by(Time, Sex) %>% 
    summarise(VO2_max_avg = mean(VO2_max)) %>% 
    arrange(Sex)

# A tibble: 8 x 3
# Groups:   Time [4]
  Time  Sex   VO2_max_avg
  <chr> <chr>       <dbl>
1 Time1 F            30.7
2 Time2 F            32.0
3 Time3 F            40.0
4 Time4 F            44.1
5 Time1 M            28.4
6 Time2 M            34.4
7 Time3 M            41.2
8 Time4 M            45.0

These data manipulation functions are far more comprehensive than what’s shown here. To see more examples of these functions and in much greater detail, check out the Data Transformation chapter in R for Data Science, which you can read here.