
5 Working with Data

5.1 Objects in R

The first thing to keep in mind when working in R is that, by default, objects you assign at the console or at the top level of a script are stored in the global environment. An object stored in the global environment is accessible from anywhere in your R session once assigned.

These objects can be any of the data types or structures we saw a couple of sections ago.

Note that we can also assign objects with the equal sign, like so: y = 10. However, using the equal sign for assignment is generally not preferred because it is also used to pass values to function arguments.

To begin, though, we’ll assign single values to several variables using <-, known as the assignment operator. One important thing to keep in mind when working in R is that all names are case-sensitive, meaning that R considers myvariable and myVariable to be two different objects.

Can you name the data types for each object we just assigned? Write 3 more lines and wrap each object name in a class() function.

x1 <- 10
x2 <- "Welcome to Brown!"
x3 <- TRUE

To view what’s stored in those variables, we print them to the console.

# note that in some cases, you may want to use the print() function
x1
[1] 10
x2
[1] "Welcome to Brown!"
x3
[1] TRUE

We can also perform manipulations with them:

x1 * 10
[1] 100
paste(x2, "We can do many things with the paste() function!")
[1] "Welcome to Brown! We can do many things with the paste() function!"

With x3, we can do a couple of things because Booleans can also be treated as numerics, in a sense, where TRUE equals 1 and FALSE equals 0.

Recall the prior discussion of R as a functional language. Here, we pass the object x3 to the function isTRUE.

# test whether x3 is TRUE
isTRUE(x3)
[1] TRUE
# test whether x3 is FALSE
isFALSE(x3)
[1] FALSE
# multiply TRUE (1) times 5
x3 * 5
[1] 5
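Because TRUE counts as 1 and FALSE as 0, summing a logical vector counts its TRUE values, and averaging it gives the proportion of TRUE values. A minimal sketch, using a hypothetical vector named flags that is not part of the examples above:

```r
# hypothetical logical vector, for illustration only
flags <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# number of TRUE values (each TRUE contributes 1 to the sum)
n_true <- sum(flags)

# proportion of TRUE values (the mean of the 0/1 values)
prop_true <- mean(flags)

n_true     # 2
prop_true  # 0.4
```

This pattern comes in handy for counting how many observations meet a condition.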

These are very simple examples, of course, but hopefully they provide some inkling as to how the different data types can be combined and manipulated to achieve various ends. Checking variable values and classes will be a regular task when manipulating data and debugging code.

For instance, if we needed to confirm that x1 was a numeric variable, we could check it:

is.numeric(x1)
[1] TRUE

Or we could make sure it took the value we expected:

x1 == 10
[1] TRUE

Note that to avoid reassigning x1, we used ==, which is a logical, or comparison, operator.
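The three symbols =, <-, and == are easy to mix up, so it may help to see them side by side. The sketch below assumes a fresh session in which no object called value exists yet; the function square() and the object value are made up for this illustration.

```r
# hypothetical helper function, defined only for this illustration
square <- function(value) value^2

# `=` inside a call only names the argument; no object is created for us
square(value = 4)
created_by_equals <- exists("value", inherits = FALSE)
created_by_equals   # FALSE, assuming `value` was not defined beforehand

# `<-` inside a call creates the object AND passes its value to the function
square(value <- 4)
created_by_arrow <- exists("value", inherits = FALSE)
created_by_arrow    # TRUE

# `==` only compares; `value` keeps its current contents
value == 5
value
```

Assigning inside a function call is rarely what you want, which is one reason to keep = for arguments and <- for assignment.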

We’ll revisit logical operations in a bit after we review how to manipulate and work with some different types of objects.

5.2 More Complex Objects

When we reviewed data types previously, we saw that R is capable of handling objects such as vectors, matrices, lists, and data frames. Let’s have a look at each of these objects in turn, and get a feel for how we can work with them.

5.2.1 Vectors

You can think of a vector as a sequence of values. In fact, the objects we assigned in the prior section were simply vectors with a length of one.

Creating longer vectors is straightforward but requires a concatenation function called c(). As we did last time, let’s create a few vectors.

Concatenation: merging or joining multiple objects into one.

x1_vec <- c(50, 21, 455, 89, 09)
x2_vec <- c("Hi!", "Welcome", "to", "Brown", "SPH")
x3_vec <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

x1_vec
[1]  50  21 455  89   9
x2_vec
[1] "Hi!"     "Welcome" "to"      "Brown"   "SPH"    
x3_vec
[1]  TRUE  TRUE FALSE FALSE FALSE

Now we have 3 more vectors for which we can retrieve some information. Let’s figure out the classes and lengths for each vector.

Much of the code in these blocks is included to make the output display a little neater. Ignore most of it for now, and simply recognize that we’ve passed each of the objects above into the class() and length() functions.

cat("x1_vec attributes",
    paste0(rep("-", 30), collapse = ""),
    paste("class:", class(x1_vec)),
    paste("length:", length(x1_vec)),
    sep = "\n"
  )
x1_vec attributes
------------------------------
class: numeric
length: 5
cat("x2_vec attributes",
    paste0(rep("-", 30), collapse = ""),
    paste("class:", class(x2_vec)),
    paste("length:", length(x2_vec)),
    sep = "\n"
  )
x2_vec attributes
------------------------------
class: character
length: 5
cat("x3_vec attributes",
    paste0(rep("-", 30), collapse = ""),
    paste("class:", class(x3_vec)),
    paste("length:", length(x3_vec)),
    sep = "\n"
  )
x3_vec attributes
------------------------------
class: logical
length: 5

However, a single vector can contain values of only one class. In fact, if we try to mix classes, R will convert all elements of the vector into a single class. In the example below, R converts the numerical value 5 into a character because it doesn’t know how to convert the string we provided into a number.

This process of changing value classes is known as coercion.

x4_vec <- c(5, "So-called 'Arthur King'!")

# R coerces the numeric value into a character
c("class" = class(x4_vec[1]),
  "numeric?" = is.numeric(x4_vec[1]))
      class    numeric? 
"character"     "FALSE" 

You can check specific class types with a large number of functions. A few examples to demonstrate their general form: is.numeric(), is.character(), is.logical(), is.vector().

We see the vector’s class is character. If we print the first element of the vector, we’ll see R has wrapped it in quotation marks, indicating that the number 5 has been saved as a string (a series of characters).

x4_vec[1]
[1] "5"
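Coercion follows a fixed ordering: when types are mixed in c(), R converts everything to the most flexible type present (logical, then integer, then double, then character). A short sketch of that ordering; the L suffix marks an integer value, and the example values are arbitrary:

```r
# logical + integer -> integer
cls_lgl_int <- class(c(TRUE, 2L))

# integer + double -> numeric
cls_int_dbl <- class(c(2L, 3.5))

# double + character -> character
cls_dbl_chr <- class(c(3.5, "text"))

cls_lgl_int
cls_int_dbl
cls_dbl_chr

# even a logical value is turned into a string when a character is present
c(TRUE, "text")
```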

The implication here is important. Say you wanted to perform a logical or arithmetic operation on the first element. What would be the result?

# arithmetic operation
x4_vec[1] + 5
Error in x4_vec[1] + 5: non-numeric argument to binary operator
# equivalence with another numeric
x4_vec[1] == 5
[1] TRUE
# equivalence with another string
x4_vec[1] == "5"
[1] TRUE

Something interesting happened… the arithmetic operation failed because the first element of x4_vec was a character string, but the other two operations succeeded!

In the last case, we might not be all that surprised: after all, we compared two strings we knew were identical. In the second example, however, R implicitly coerced the numeric value 5 into the character string “5” and then compared the two strings, which is why the comparison returned TRUE.
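When a character value is compared with a number, R converts the number to a string and compares the two as text. Relying on this can give surprising results, so explicit conversion with as.numeric() is usually the safer route. A minimal sketch, using a hypothetical vector named mixed that mirrors x4_vec:

```r
# hypothetical vector mirroring x4_vec above
mixed <- c(5, "So-called 'Arthur King'!")

# the number 5 is converted to the string "5" before comparing
mixed[1] == 5

# "5.0" and "5" are different strings, so this comparison is FALSE
"5.0" == 5

# explicit conversion makes arithmetic possible
as.numeric(mixed[1]) + 5

# strings that do not look like numbers become NA (normally with a warning,
# silenced here to keep the output short)
suppressWarnings(as.numeric(mixed[2]))
```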

Familiarize yourself with how R (or any programming language you happen to be using) handles different classes and objects. All languages have their quirks in this regard, and recognizing these special cases early on will save you a lot of grief in the future.

5.2.2 Matrices

In most epidemiologic data analysis, you might not end up dealing with matrices all that often. Because most of the data sets epidemiologists use contain mixtures of numeric, string, and other variable formats, we tend to use data frames.

However, you should still be familiar with matrices as, depending on your line of work, they may come in handy.

For instance, you might actually need to do linear algebra, in which case, you may need to use a matrix. In addition, for very computationally demanding tasks, R can operate on matrices much faster than on data frames. We’ll largely ignore these issues for now, but we’ll look at a couple of instances in which matrices can be used to generate nice figure layouts for data visualization.

In R, a matrix always has two dimensions, rows and columns (objects with more dimensions are called arrays). For the sake of our emotional stability, we’ll stick to two dimensions here.

A matrix with a single column is essentially a vector:

mat1 <- as.matrix(c(6, 7, 8))
mat1
     [,1]
[1,]    6
[2,]    7
[3,]    8

However, because the elements are now indexed by both row and column, we need to refer to each element by its specific coordinates.

For example, if we want to retrieve 7 from the matrix, we need to tell R that we want the element in row 2, column 1, which we can do as follows:

mat1[2, 1]
[1] 7

If we wanted the third row:

mat1[3, ]
[1] 8

If we wanted the first column:

mat1[, 1]
[1] 6 7 8
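Notice that the row and column requests above came back as plain vectors rather than matrices. By default, R drops dimensions that have length one. If the matrix structure needs to be kept, the drop = FALSE argument can be supplied. A small sketch, rebuilding the same single-column matrix under the hypothetical name m:

```r
# same single-column matrix as mat1 above
m <- as.matrix(c(6, 7, 8))

# default behaviour: the result is simplified to a plain vector
col_vec <- m[, 1]
is.matrix(col_vec)   # FALSE
dim(col_vec)         # NULL, vectors have no dimensions

# drop = FALSE keeps the result as a 3 x 1 matrix
col_mat <- m[, 1, drop = FALSE]
is.matrix(col_mat)   # TRUE
dim(col_mat)
```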

With such a small matrix, these operations might seem a bit daft, so let’s consider a larger numerical matrix.

n1 <- 1:100
n2 <- 901:1000

mat2 <- matrix(c(n1, n2), nrow = 100)

# show the first 10 rows of the matrix
mat2[1:10, ]
      [,1] [,2]
 [1,]    1  901
 [2,]    2  902
 [3,]    3  903
 [4,]    4  904
 [5,]    5  905
 [6,]    6  906
 [7,]    7  907
 [8,]    8  908
 [9,]    9  909
[10,]   10  910
# print total number of rows and columns
dim(mat2)
[1] 100   2

Be careful with matrices, though:

mat3 <- matrix(c(n1, n2), nrow = 50)

mat3[1:10, ]
      [,1] [,2] [,3] [,4]
 [1,]    1   51  901  951
 [2,]    2   52  902  952
 [3,]    3   53  903  953
 [4,]    4   54  904  954
 [5,]    5   55  905  955
 [6,]    6   56  906  956
 [7,]    7   57  907  957
 [8,]    8   58  908  958
 [9,]    9   59  909  959
[10,]   10   60  910  960
dim(mat3)
[1] 50  4

Because we told R we wanted a matrix with 50 rows, the first vector we specified was distributed across the first two columns, while the second vector was distributed across the third and fourth columns.

That’s fine if it’s what we wanted. Let’s say, though, that we wanted a matrix with 50 rows and 4 columns but wanted it filled out row-by-row:

mat4 <- matrix(c(n1, n2), nrow = 50, ncol = 4, byrow = TRUE)
mat4[1:10, ]
      [,1] [,2] [,3] [,4]
 [1,]    1    2    3    4
 [2,]    5    6    7    8
 [3,]    9   10   11   12
 [4,]   13   14   15   16
 [5,]   17   18   19   20
 [6,]   21   22   23   24
 [7,]   25   26   27   28
 [8,]   29   30   31   32
 [9,]   33   34   35   36
[10,]   37   38   39   40
# NOTE
# Here, we return rows 41-50 from mat4, but the output will label them 1-10.
# That's because R returns our request to us as its own matrix.
mat4[41:50, ]
      [,1] [,2] [,3] [,4]
 [1,]  961  962  963  964
 [2,]  965  966  967  968
 [3,]  969  970  971  972
 [4,]  973  974  975  976
 [5,]  977  978  979  980
 [6,]  981  982  983  984
 [7,]  985  986  987  988
 [8,]  989  990  991  992
 [9,]  993  994  995  996
[10,]  997  998  999 1000

In general, specifying the features you want in as much detail as possible will be the safest route, and may save you a good deal of grief in the future.
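As one illustration of being explicit, the sketch below states both dimensions, labels the columns, and checks the resulting shape. The column names low and high and the object name mat_named are made up for this example.

```r
n1 <- 1:100
n2 <- 901:1000

# spell out both dimensions and label the columns
# (the names "low" and "high" are hypothetical)
mat_named <- matrix(c(n1, n2),
                    nrow = 100,
                    ncol = 2,
                    dimnames = list(NULL, c("low", "high")))

head(mat_named, 3)

# columns can now be requested by name instead of position
mat_named[1:3, "high"]

# a quick sanity check that the shape is what we intended;
# stopifnot() raises an error if the condition is not TRUE
stopifnot(identical(dim(mat_named), c(100L, 2L)))
```

A check like the last line costs almost nothing and would have flagged the unexpected four-column result right away.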

Say we wanted to get the column and row sums for this matrix. Easy!

colSums(mat4)
[1] 24950 25000 25050 25100
rowSums(mat4)
 [1]   10   26   42   58   74   90  106  122  138  154  170  186  202  218  234
[16]  250  266  282  298  314  330  346  362  378  394 3610 3626 3642 3658 3674
[31] 3690 3706 3722 3738 3754 3770 3786 3802 3818 3834 3850 3866 3882 3898 3914
[46] 3930 3946 3962 3978 3994
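colSums() and rowSums() are special cases of a more general tool, apply(), which runs any function over the rows or columns of a matrix. apply() has not been introduced in this chapter, so treat the following as a preview; the matrix is rebuilt under the hypothetical name m so the block runs on its own.

```r
# rebuild mat4 so the example is self-contained
m <- matrix(c(1:100, 901:1000), nrow = 50, ncol = 4, byrow = TRUE)

# apply(X, MARGIN, FUN): MARGIN = 1 works over rows, MARGIN = 2 over columns
row_totals <- apply(m, 1, sum)
col_totals <- apply(m, 2, sum)

# both agree with the dedicated functions
all.equal(row_totals, rowSums(m))
all.equal(col_totals, colSums(m))

# any function that accepts a vector can be used, e.g. the column maxima
col_max <- apply(m, 2, max)
col_max
```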

Say we wanted to multiply every number in the matrix by 123456. Easy!

mat4 * 123456
           [,1]      [,2]      [,3]      [,4]
 [1,]    123456    246912    370368    493824
 [2,]    617280    740736    864192    987648
 [3,]   1111104   1234560   1358016   1481472
 [4,]   1604928   1728384   1851840   1975296
 [5,]   2098752   2222208   2345664   2469120
 [6,]   2592576   2716032   2839488   2962944
 [7,]   3086400   3209856   3333312   3456768
 [8,]   3580224   3703680   3827136   3950592
 [9,]   4074048   4197504   4320960   4444416
[10,]   4567872   4691328   4814784   4938240
[11,]   5061696   5185152   5308608   5432064
[12,]   5555520   5678976   5802432   5925888
[13,]   6049344   6172800   6296256   6419712
[14,]   6543168   6666624   6790080   6913536
[15,]   7036992   7160448   7283904   7407360
[16,]   7530816   7654272   7777728   7901184
[17,]   8024640   8148096   8271552   8395008
[18,]   8518464   8641920   8765376   8888832
[19,]   9012288   9135744   9259200   9382656
[20,]   9506112   9629568   9753024   9876480
[21,]   9999936  10123392  10246848  10370304
[22,]  10493760  10617216  10740672  10864128
[23,]  10987584  11111040  11234496  11357952
[24,]  11481408  11604864  11728320  11851776
[25,]  11975232  12098688  12222144  12345600
[26,] 111233856 111357312 111480768 111604224
[27,] 111727680 111851136 111974592 112098048
[28,] 112221504 112344960 112468416 112591872
[29,] 112715328 112838784 112962240 113085696
[30,] 113209152 113332608 113456064 113579520
[31,] 113702976 113826432 113949888 114073344
[32,] 114196800 114320256 114443712 114567168
[33,] 114690624 114814080 114937536 115060992
[34,] 115184448 115307904 115431360 115554816
[35,] 115678272 115801728 115925184 116048640
[36,] 116172096 116295552 116419008 116542464
[37,] 116665920 116789376 116912832 117036288
[38,] 117159744 117283200 117406656 117530112
[39,] 117653568 117777024 117900480 118023936
[40,] 118147392 118270848 118394304 118517760
[41,] 118641216 118764672 118888128 119011584
[42,] 119135040 119258496 119381952 119505408
[43,] 119628864 119752320 119875776 119999232
[44,] 120122688 120246144 120369600 120493056
[45,] 120616512 120739968 120863424 120986880
[46,] 121110336 121233792 121357248 121480704
[47,] 121604160 121727616 121851072 121974528
[48,] 122097984 122221440 122344896 122468352
[49,] 122591808 122715264 122838720 122962176
[50,] 123085632 123209088 123332544 123456000
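The * operator works element by element. For the linear algebra mentioned at the start of this section, matrix multiplication uses a separate operator, %*%. A small sketch with two hypothetical 2 x 2 matrices that are easier to check by hand than mat4:

```r
# small hypothetical matrices (filled column by column)
a <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
b <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

# element-wise product: each entry times the entry in the same position
elementwise <- a * b
elementwise

# matrix product: rows of a combined with columns of b
product <- a %*% b
product

# transpose, which swaps rows and columns
t(a)
```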

5.2.3 Lists

Lists are very flexible objects in R. Think of them like a chest of drawers into which we can store basically any other type of object. This feature of lists makes them useful for storing all sorts of data. In fact, when we get to fitting regression models, we’ll see that the fitted model objects R returns to us are essentially stored as lists containing the estimated models, the data used to fit the model, regression diagnostics, and more.

A quick demonstration of storing multiple objects in a list:

# Object 1: Vector
my_vector <- 4:10

# Object 2: Matrix
my_matrix <- matrix(1:20, ncol = 4, byrow = TRUE)

# Object 3: Dataframe (first 15 rows of iris)
my_df <- datasets::iris[1:15, ]

# Object 4: List
my_list <- list(vector = my_vector,
                matrix = my_matrix,
                dataframe = my_df)

my_list
$vector
[1]  4  5  6  7  8  9 10

$matrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20

$dataframe
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa

Now, let’s interrogate the list so we can get familiar with its features.

First, we saved each of the three objects into its own space within the list. The objects are still separate from one another. In other words, we have not merged them. Each of these spaces possesses a name, which we assigned in the code above.

names(my_list)
[1] "vector"    "matrix"    "dataframe"

We can query each of the objects individually:

my_list$vector
[1]  4  5  6  7  8  9 10
my_list$matrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20
head(my_list$dataframe)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Alternatively, we could have accessed each object using a few different methods, each with slightly different effects. Let’s focus on the dataframe to demonstrate.

Access using single brackets and name of the list element:

my_list["dataframe"]
$dataframe
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa

Access using double brackets and name of the list element:

my_list[["dataframe"]]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa

The difference between what each of these calls returned is subtle but important. Recall the chest of drawers metaphor we applied to lists.

The list is the chest, each element is a drawer, and each drawer is filled with particular contents.

  1. When we use single brackets, we pull the drawer out of the chest. However, we do not access the contents of the drawer directly.

  2. When we use double brackets, we pull the contents out of the drawer, which means we can access them directly and act upon them as if they were not in the list at all.

To see the implications of how we query the list, we can try to perform some operations on the dataframe using each method of accessing it.

# access the list element (pull out the drawer)
head(my_list["dataframe"])
$dataframe
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
# access the dataframe itself (pull out the contents of the drawer)
head(my_list[["dataframe"]])
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

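In the first case, head() did run, but it was applied to a one-element list rather than to the data frame, so it returned that single element with all 15 rows intact. We did not access the contents of the drawer we wanted.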
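One way to check which kind of object each access method returns is to ask for its class and size. A minimal sketch with a small hypothetical list, demo_list, that holds the same 15 rows of iris:

```r
# small hypothetical list holding one data frame
demo_list <- list(dataframe = datasets::iris[1:15, ])

# single brackets: still a list, with one element
class(demo_list["dataframe"])     # "list"
length(demo_list["dataframe"])    # 1

# double brackets: the data frame itself
class(demo_list[["dataframe"]])   # "data.frame"
nrow(demo_list[["dataframe"]])    # 15

# head() keeps the first 6 elements of a list, or the first 6 rows of a data frame
length(head(demo_list["dataframe"]))   # 1: the list had only one element
nrow(head(demo_list[["dataframe"]]))   # 6
```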

The same bracket principle applies if we access elements and their contents using the numerical index of the list.

head(my_list[3])
$dataframe
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
head(my_list[[3]])
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Table 5.1: List access summary

Access method            Example                   Returns
list["elementName"]      my_list["dataframe"]      Element
list[elementIndex]       my_list[3]                Element
list[["elementName"]]    my_list[["dataframe"]]    Element contents
list[[elementIndex]]     my_list[[3]]              Element contents
list$elementName         my_list$dataframe         Element contents
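The rows of this table can be checked with identical(), which tests whether two objects are exactly the same. The sketch below uses a small hypothetical list named demo, and also shows one caveat: $ accepts an unambiguous partial name, while double brackets require the exact name by default.

```r
# small hypothetical list with named elements
demo <- list(vector = 4:10, dataframe = datasets::iris[1:3, ])

# the three "element contents" forms return the same object
identical(demo[["dataframe"]], demo[[2]])
identical(demo[["dataframe"]], demo$dataframe)

# the two "element" forms return the same one-element list
identical(demo["dataframe"], demo[2])

# caveat: $ matches "data" to "dataframe"; [[ ]] returns NULL for an inexact name
partial_dollar  <- demo$data
partial_bracket <- demo[["data"]]
class(partial_dollar)
is.null(partial_bracket)
```

Spelling names out in full avoids relying on partial matching.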

In addition, we can place chests inside drawers: in other words, we can nest lists within lists.

Nesting: placing objects of the same or similar type within one another.

Figure 5.1: Think of [Matryoshka (Russian nesting) dolls](https://en.wikipedia.org/wiki/Matryoshka_doll). © BrokenSphere / [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Floral_matryoshka_set_2_smallest_doll_nested.JPG) (cropped)

# select first n rows from each example dataset
n1 <- 15
n2 <- 30
n3 <- 50

# create a subset of each dataset
iris_sub  <- datasets::iris[1:n1, ]
airq_sub  <- datasets::airquality[1:n2, ]
quake_sub <- datasets::quakes[1:n3, ]

# store some info about the iris dataset
# info: ?datasets::iris
nl1 <- list(data = iris_sub,
            nrows = nrow(iris_sub),
            ncols = length(iris_sub))

# store some info about the airquality dataset
# info: ?datasets::airquality
nl2 <- list(data = airq_sub,
            nrows = nrow(airq_sub),
            ncols = length(airq_sub))

# store some info about the quakes dataset
# info: ?datasets::quakes
nl3 <- list(data = quake_sub,
            nrows = nrow(quake_sub),
            ncols = length(quake_sub))

# store dataset summaries to list
datasummary_list <- list(irisdata = nl1,
                         airquality_data = nl2,
                         quake_data = nl3)

Before we look at datasummary_list, sit with the code above for a minute and think about what you expect datasummary_list to look like…

We saved three lists, each containing a dataset along with some basic information about that dataset.

Next, we took these three lists and put them all into another list.

datasummary_list
$irisdata
$irisdata$data
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa

$irisdata$nrows
[1] 15

$irisdata$ncols
[1] 5


$airquality_data
$airquality_data$data
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19
20    11      44  9.7   62     5  20
21     1       8  9.7   59     5  21
22    11     320 16.6   73     5  22
23     4      25  9.7   61     5  23
24    32      92 12.0   61     5  24
25    NA      66 16.6   57     5  25
26    NA     266 14.9   58     5  26
27    NA      NA  8.0   57     5  27
28    23      13 12.0   67     5  28
29    45     252 14.9   81     5  29
30   115     223  5.7   79     5  30

$airquality_data$nrows
[1] 30

$airquality_data$ncols
[1] 6


$quake_data
$quake_data$data
      lat   long depth mag stations
1  -20.42 181.62   562 4.8       41
2  -20.62 181.03   650 4.2       15
3  -26.00 184.10    42 5.4       43
4  -17.97 181.66   626 4.1       19
5  -20.42 181.96   649 4.0       11
6  -19.68 184.31   195 4.0       12
7  -11.70 166.10    82 4.8       43
8  -28.11 181.93   194 4.4       15
9  -28.74 181.74   211 4.7       35
10 -17.47 179.59   622 4.3       19
11 -21.44 180.69   583 4.4       13
12 -12.26 167.00   249 4.6       16
13 -18.54 182.11   554 4.4       19
14 -21.00 181.66   600 4.4       10
15 -20.70 169.92   139 6.1       94
16 -15.94 184.95   306 4.3       11
17 -13.64 165.96    50 6.0       83
18 -17.83 181.50   590 4.5       21
19 -23.50 179.78   570 4.4       13
20 -22.63 180.31   598 4.4       18
21 -20.84 181.16   576 4.5       17
22 -10.98 166.32   211 4.2       12
23 -23.30 180.16   512 4.4       18
24 -30.20 182.00   125 4.7       22
25 -19.66 180.28   431 5.4       57
26 -17.94 181.49   537 4.0       15
27 -14.72 167.51   155 4.6       18
28 -16.46 180.79   498 5.2       79
29 -20.97 181.47   582 4.5       25
30 -19.84 182.37   328 4.4       17
31 -22.58 179.24   553 4.6       21
32 -16.32 166.74    50 4.7       30
33 -15.55 185.05   292 4.8       42
34 -23.55 180.80   349 4.0       10
35 -16.30 186.00    48 4.5       10
36 -25.82 179.33   600 4.3       13
37 -18.73 169.23   206 4.5       17
38 -17.64 181.28   574 4.6       17
39 -17.66 181.40   585 4.1       17
40 -18.82 169.33   230 4.4       11
41 -37.37 176.78   263 4.7       34
42 -15.31 186.10    96 4.6       32
43 -24.97 179.82   511 4.4       23
44 -15.49 186.04    94 4.3       26
45 -19.23 169.41   246 4.6       27
46 -30.10 182.30    56 4.9       34
47 -26.40 181.70   329 4.5       24
48 -11.77 166.32    70 4.4       18
49 -24.12 180.08   493 4.3       21
50 -18.97 185.25   129 5.1       73

$quake_data$nrows
[1] 50

$quake_data$ncols
[1] 5

There’s the whole thing, but we’ll take a minute to pick apart the elements of datasummary_list.

First, we named each of its elements:

names(datasummary_list)
[1] "irisdata"        "airquality_data" "quake_data"     

We can look at the first element, in which we stored a subset of the iris data:

names(datasummary_list$irisdata)
[1] "data"  "nrows" "ncols"

Did we have to access this list directly? Have a look back at the list access table above.

If we wanted to see the iris data and the information we saved about it:

datasummary_list$irisdata$data
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
datasummary_list$irisdata$nrows
[1] 15
datasummary_list$irisdata$ncols
[1] 5

Note, too, that you can mix access methods, depending on your needs:

Which dataset are we accessing?

head(datasummary_list[[3]]$data)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12
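Nested lists with a consistent layout are convenient because the same piece can be pulled out of every inner list in one step. The sketch below uses sapply(), a base R function not covered so far, on a smaller hypothetical list named nested that is built the same way as datasummary_list.

```r
# rebuild a smaller version of the nested list so this runs on its own
iris_sub  <- datasets::iris[1:15, ]
quake_sub <- datasets::quakes[1:50, ]

nested <- list(
  irisdata   = list(data = iris_sub,
                    nrows = nrow(iris_sub),
                    ncols = length(iris_sub)),
  quake_data = list(data = quake_sub,
                    nrows = nrow(quake_sub),
                    ncols = length(quake_sub))
)

# visit each inner list and pull out its "nrows" element
row_counts <- sapply(nested, function(el) el$nrows)
row_counts

# the same idea for the column counts, passing [[ ]] itself as the function
col_counts <- sapply(nested, `[[`, "ncols")
col_counts
```

The result is a named vector, with one entry per inner list.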

We’ll see some lists later on in a few worked examples, as we’ve only scratched the surface. But even with what we’ve learned so far, hopefully your brain is starting to think about different ways we might use lists to store, manipulate, and analyze data. Data analysis and programming are in part acts of creativity!

s1 <- "Okay"
s2 <- "that's all about lists"
s3 <- "for now!"

totally_unnecessary_list <- list(firstbit = s1,
                                 secondbit = list(s2, s3))

cat(paste0(totally_unnecessary_list[[1]], ","),
           totally_unnecessary_list[[2]][[1]],
           totally_unnecessary_list[[2]][[2]], sep = " ")
Okay, that's all about lists for now!

5.2.4 Data Frames

As public health researchers using R, you may be spending most of your time working with data frames.

Data frames are a rectangular data format, that is, a format which generally stores data as a series of labeled columns (variables) and rows (observations).

We’ve already seen a couple of data frames. Here are a few more that R contains by default.

class(chickwts)
[1] "data.frame"
head(chickwts)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean
class(mtcars)
[1] "data.frame"
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Data frames lend themselves well to common epidemiologic analyses, but first, we should get a feel for how they behave.

We’ve seen that head(dataframe) will return the first handful of rows in the data frame (as a data frame, it turns out).

We might also want to inspect some other properties of a dataset.

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The str() function gives us a nice, concise report on the number of rows and variables, in addition to the value class of each variable.

Sometimes, we may want to query some of these attributes directly:

nrow(mtcars)
[1] 32
length(mtcars)
[1] 11
names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Here, nrow() and length() returned the number of rows and columns, respectively, while names() returned the column names.
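A brief sketch of some related helpers: ncol() and dim() are base R functions that report the same dimensions more directly. We use the built-in iris data here so that the counts do not depend on any columns added to mtcars elsewhere in this section; the object names are just illustrative.

```r
n_rows <- nrow(iris)   # number of rows
n_cols <- ncol(iris)   # number of columns
dims   <- dim(iris)    # rows and columns together

# a data frame is a list of columns, so length() counts columns
length(iris) == n_cols
dims
```

Using ncol() rather than length() tends to make the intent clearer to someone reading the code later.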

We can extract columns from the dataset as vectors if we want to operate on them directly:

class(mtcars$qsec)
[1] "numeric"
summary(mtcars$qsec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.50   16.89   17.71   17.85   18.90   22.90 
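The $ operator is one of several ways to pull a column out as a vector. As a sketch, the following base R alternatives return the same numeric vector; the names qsec_a, qsec_b, and qsec_c are only for illustration.

```r
qsec_a <- mtcars$qsec        # by name with $
qsec_b <- mtcars[["qsec"]]   # by name with [[ ]]
qsec_c <- mtcars[, "qsec"]   # by name with [row, column] indexing

identical(qsec_a, qsec_b)
identical(qsec_b, qsec_c)

# single brackets without a comma keep the data frame structure
class(mtcars["qsec"])
```

The last line is worth remembering: mtcars["qsec"] is still a data frame with one column, not a vector.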

We could also summarize all the variables at the same time, simply by passing the data frame to summary().

summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Look carefully at the output from our summary. Does anything look off?

I would say a few of the variables deserve a closer look: despite being stored as numeric, they appear to be categorical.

Let’s check the number of unique values stored in each variable:

sapply(mtcars, function(x) length(unique(x)))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
  25    3   27   22   22   29   30    2    2    3    6 

It looks like we might want to treat cyl, vs, am, gear, and carb as categorical.

To view more information about the mtcars dataset, type ?mtcars into the console.

Specifically, we’ll treat cyl, gear, and carb as ordinal, and we’ll treat vs and am as binary.

Remember, categorical variables assign an observation to a particular group. For example, if we had a column for the car’s manufacturer, we might have categories such as Chevy, Honda, or Ford. Ordinal variables are special cases of categorical variables in which the levels have some sort of natural ordering. For example, finishing place in a race: 1, 2, 3.

# create new variables in mtcars by converting the
# numeric ordinals into factors
mtcars$cyl_fct <- as.factor(mtcars$cyl)
mtcars$gear_fct <- as.factor(mtcars$gear)
mtcars$carb_fct <- as.factor(mtcars$carb)

mtcars$vs_fct <- as.factor(ifelse(mtcars$vs == 0, "v-shaped", "straight"))
mtcars$am_fct <- as.factor(ifelse(mtcars$am == 0, "automatic", "manual"))

str(mtcars)
'data.frame':   32 obs. of  16 variables:
 $ mpg     : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl     : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp    : num  160 160 108 258 360 ...
 $ hp      : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat    : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt      : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec    : num  16.5 17 18.6 19.4 17 ...
 $ vs      : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am      : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear    : num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb    : num  4 4 1 1 2 1 4 2 2 4 ...
 $ cyl_fct : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ gear_fct: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
 $ carb_fct: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
 $ vs_fct  : Factor w/ 2 levels "straight","v-shaped": 2 2 1 1 2 1 2 1 1 1 ...
 $ am_fct  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...

Take a minute to look at the output of str(). We have 5 new factor variables.
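Note that as.factor() creates unordered factors. If we wanted R to record the ordering of an ordinal variable explicitly, one option is factor() with ordered = TRUE. This is a sketch rather than a step the rest of the section relies on, and cyl_ord is a hypothetical name.

```r
# store cylinder count as an ordered factor: 4 < 6 < 8
cyl_ord <- factor(mtcars$cyl, levels = c(4, 6, 8), ordered = TRUE)

is.ordered(cyl_ord)
levels(cyl_ord)

# ordered factors support comparisons between levels
sum(cyl_ord > "4")   # number of cars with more than 4 cylinders
```

Whether the ordering matters depends on the analysis; many summaries treat ordered and unordered factors the same way.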

Technically, we could have analyzed the variables without converting them, but let’s summarize the new ones anyhow.

# we index mtcars so as to inspect only the variables we just created
summary(mtcars[, 12:16])
 cyl_fct gear_fct carb_fct      vs_fct         am_fct  
 4:11    3:15     1: 7     straight:14   automatic:19  
 6: 7    4:12     2:10     v-shaped:18   manual   :13  
 8:14    5: 5     3: 3                                 
                  4:10                                 
                  6: 1                                 
                  8: 1                                 

Not the prettiest summary, but note how R gives us the number of observations within each category for each factor. We’ll look at more readable ways to summarize factors in a future section.
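In the meantime, base R's table() is one quick way to count observations per category, including cross-tabulations of two variables. A small sketch, using the original numeric columns so that it runs on an unmodified mtcars (the object names are illustrative):

```r
# counts of cars by number of cylinders
cyl_counts <- table(mtcars$cyl)
cyl_counts

# cross-tabulate cylinders (rows) by transmission type (columns)
cyl_by_am <- table(cyl = mtcars$cyl, am = mtcars$am)
cyl_by_am
```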

5.2.5 Other Object Classes

Various subfields of epidemiology and statistics provide an array of object classes that are beyond the scope of this tutorial. For instance, an object with the class “graph” describes the components and structures of a graph. The information stored in this object would typically allow one to manipulate, analyze, and visualize the graph.

5.3 Conducting Operations on Objects

5.3.1 Arithmetic Operations

It may come as no surprise that you’ll be doing a fair bit of arithmetic while coding, whether you’re doing quick calculations in the console or creating variables derived from some arithmetic combination of other variables.

R respects grouping operations and handles mathematical operations as one might expect.

For instance, we can do some quick scratch calculations and print them directly to the console.

In fact, R can be a handy calculator if you run it from the command line!

5 + 10
[1] 15
0.25 * 9 + 0.75 * 3
[1] 4.5
(5 ^ 2 + 10 ^ 3) / 50
[1] 20.5
8 %% 3
[1] 2
8 %/% 3
[1] 2

Table 5.2: Arithmetic operators

Operator Function
+ Add
- Subtract
^ Power
* Multiply
/ Divide
%% Modulus (remainder)
%/% Integer division (quotient)
%*% Matrix multiplication

For more information, refer to R’s documentation on arithmetic operators.
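To make the difference between %/% and %% concrete, here is a small sketch showing that integer division and the modulus together recover the original number, followed by a simple matrix product with %*%. The object names are illustrative.

```r
x <- 8
y <- 3

quotient  <- x %/% y   # integer division: how many whole times y fits in x
remainder <- x %% y    # modulus: what is left over

quotient * y + remainder == x

# matrix multiplication: a 2 x 2 matrix times the identity returns itself
m <- matrix(1:4, nrow = 2)
m %*% diag(2)
```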

5.3.2 Logical Operations

We already saw one example of a logical operation, when we checked to see if x1 was equal to 10. Statistical and epidemiologic analysis relies heavily on implementing logical operations and checks for truth, and so it is a good idea to master these operators early—and to use them often.

Table 5.3: Logical operators

Operator Function
== Check equality of two values
< Less than
<= Less than or equal to
> Greater than
>= Greater than or equal to
& (or &&) AND
| (or ||) OR
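
The single and double forms of AND and OR behave differently, which the table does not show. As a sketch: & and | work element by element on vectors, whereas && and || are meant for single TRUE/FALSE values, such as the condition in an if statement. The vectors below are made up for illustration.

```r
ages   <- c(25, 40, 67)
smoker <- c(TRUE, FALSE, TRUE)

# element-wise: one result per observation
over_30_and_smoker <- ages > 30 & smoker
over_30_or_smoker  <- ages > 30 | smoker

# single-value comparison, as one might use inside if()
both_true <- (length(ages) == 3) && all(ages > 18)

over_30_and_smoker
over_30_or_smoker
both_true
```

When filtering rows of a dataset, the element-wise forms are almost always the ones we want.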

When combining various operators, some will take precedence over others, and it’s important to get a feel for this hierarchy to avoid unexpected results while programming.
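
A few short examples may help build that feel. This sketch shows three common surprises; the full ordering is documented in R under ?Syntax, and the object names here are only illustrative.

```r
# exponentiation binds more tightly than unary minus
neg_sq <- -2^2     # evaluated as -(2^2)
pos_sq <- (-2)^2

# the colon operator binds more tightly than subtraction
n <- 3
seq_a <- 1:n - 1   # evaluated as (1:n) - 1
seq_b <- 1:(n - 1)

# & is evaluated before |
or_and  <- TRUE | FALSE & FALSE    # evaluated as TRUE | (FALSE & FALSE)
grouped <- (TRUE | FALSE) & FALSE

neg_sq; pos_sq
seq_a; seq_b
or_and; grouped
```

When in doubt, adding parentheses costs nothing and makes the intended grouping explicit.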

5.3.3 Recycling

In short, recycling refers to R's reuse of the elements of a shorter vector so that it matches the length of a longer vector in an element-by-element operation.

Recycling is a very important concept in R. First, recycling can make your code more efficient. Second, if you forget that R recycles, you might unwittingly perform an operation that spits out a bunch of wrong answers.

That’s a pretty abstract description, so let’s look at a few examples to develop our intuition a bit.

First, recall the operators we covered in the prior two sections, as well as vectors.

Say we wanted to add 5 to each of 3 numbers. Rather than specifying the arithmetic operation separately for each number, we could store those numbers to a vector and tell R to add 5 to the vector:

numvec <- c(10, 40, 50)
numvec + 5
[1] 15 45 55

See what happened? R added 5 to each vector element separately.

If we wanted to perform separate additions on each number, we could do so easily, as long as the longer vector’s length is a multiple of the shorter vector’s length:

numvec + c(5, 10, 90)
[1]  15  50 140

Did you notice the difference, though? Because the vector lengths were equal, R did not need to recycle anything. Rather than adding each of 5, 10, and 90 to every item in numvec, it added 5 to numvec[1], 10 to numvec[2], and 90 to numvec[3].
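
For a case where recycling does happen silently, consider a longer vector whose length is an exact multiple of the shorter one. In this sketch (longvec is a hypothetical name), the two-element vector is reused twice and R gives no warning.

```r
longvec <- c(10, 40, 50, 70)

# c(5, 10) is recycled to c(5, 10, 5, 10)
recycled_sum <- longvec + c(5, 10)
recycled_sum
```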

However, attempting to execute the following operation results in a warning, because the length of numvec (3) is not a multiple of the length of the shorter vector (2):

numvec + c(5, 10)
Warning in numvec + c(5, 10): longer object length is not a multiple of shorter
object length
[1] 15 50 55

Notice, however, that the second, unnamed vector was still recycled in this case. Looking at the output, we can see that R executed the following operations: 10 + 5, 40 + 10, and 50 + 5. In other words, R recycled the shorter vector until it was finished operating on numvec.
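
One way to see the recycling explicitly is to build the recycled vector ourselves with rep_len(), a base R function that repeats a vector out to a given length. This sketch reproduces the result above without triggering the warning.

```r
numvec <- c(10, 40, 50)

# c(5, 10) stretched to length 3 becomes c(5, 10, 5)
recycled <- rep_len(c(5, 10), length.out = length(numvec))
recycled

numvec + recycled
```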

This result is probably not what we were after if we wanted R to add 5 to each element in numvec, then add 10 to each element in numvec, and return the results for both sets of operations.

When we get to for loops and functions, we’ll see how we can get R to do these sorts of tasks for us.
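
As a brief preview, and only one of several possible approaches, base R's outer() applies an operation to every pairing of elements from two vectors, which yields both sets of sums at once. The name all_sums is just for illustration.

```r
numvec <- c(10, 40, 50)

# rows correspond to numvec, columns to the values being added (5 and 10)
all_sums <- outer(numvec, c(5, 10), FUN = "+")
all_sums
```

The first column holds numvec + 5 and the second holds numvec + 10.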