The first thing to keep in mind when working in R is that, by default, all objects are stored in the global environment when you assign them. An object stored to the global environment will be accessible from anywhere in a project once assigned.
These objects can be any of the data types or structures we saw a couple of sections ago.
Note that we can also assign objects like so: y = 10. However, using the equal sign is generally not preferred because it is also used to assign values to function arguments.
To begin, though, we’ll assign single values to several variables using <-, known as the assignment operator. One important thing to keep in mind when working in R is that all names are case-sensitive, meaning that R considers myvariable and myVariable to be two different objects.
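A brief sketch of both points, using throwaway names (my_value, my_Value, scores, avg) that are not part of this tutorial's examples:

```r
# Names are case-sensitive: these are two separate objects.
my_value <- 1
my_Value <- 2
my_value == my_Value   # FALSE

# At the top level, `=` also assigns ...
scores = c(2, 4, 6)

# ... but inside a function call, `=` matches an argument by name.
# Here `x` is mean()'s argument; no object called `x` is created.
avg <- mean(x = scores)
avg                    # 4
```

Sticking to <- for assignment keeps these two uses of the equal sign visually distinct.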
Can you name the data types for each object we just assigned? Write 3 more lines and wrap each object name in a class() function.
x1 <- 10
x2 <- "Welcome to Brown!"
x3 <- TRUE
To view what’s stored in those variables, we print them to the console.
# note that in some cases, you may want to use the print() function
x1
[1] 10
x2
[1] "Welcome to Brown!"
x3
[1] TRUE
We can also perform manipulations with them:
x1 * 10
[1] 100
paste(x2, "We can do many things with the paste() function!")
[1] "Welcome to Brown! We can do many things with the paste() function!"
With x3, we can do a couple of things because Booleans can also be treated as numerics, in a sense, where TRUE equals 1 and FALSE equals 0.
Recall the prior discussion of R as a functional language. Here, we pass the object x3 to the function isTRUE().
# test whether x3 is TRUE
isTRUE(x3)
[1] TRUE
# test whether x3 is FALSE
isFALSE(x3)
[1] FALSE
# multiply TRUE (1) times 5
x3 * 5
[1] 5
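Treating TRUE as 1 and FALSE as 0 is handy for counting. A minimal sketch, assuming a made-up logical vector (symptomatic) that does not appear elsewhere in this tutorial:

```r
# Hypothetical indicator: did each of six participants report symptoms?
symptomatic <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

n_symptomatic    <- sum(symptomatic)   # counts the TRUEs: 3
prop_symptomatic <- mean(symptomatic)  # proportion of TRUEs: 0.5
```

Summing a logical vector gives a count, and averaging it gives a proportion, which is often all we need for a quick tabulation.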
These are very simple examples, of course, but hopefully they provide some inkling as to how the different data types can be combined and manipulated to achieve various ends. Checking variable values and classes will be a regular task when manipulating data and debugging code.
For instance, if we needed to confirm that x1 was a numeric variable, we could check it:
is.numeric(x1)
[1] TRUE
Or we could make sure it took the value we expected:
x1 == 10
[1] TRUE
Note that, to avoid reassigning x1, we used ==, which is a logical, or comparison, operator.
We’ll revisit logical operations in a bit after we review how to manipulate and work with some different types of objects.
When we reviewed data types previously, we saw that R is capable of handling objects such as vectors, matrices, lists, and data frames. Let’s have a look at each of these objects in turn, and get a feel for how we can work with them.
You can think of a vector as a sequence of values. In fact, the objects we assigned in the prior section were simply vectors with a length of one.
Creating longer vectors is straightforward but requires a concatenation function called c(). As we did last time, let’s create a few vectors.
Concatenation: Merging or joining multiple objects into one.
x1_vec <- c(50, 21, 455, 89, 09)
x2_vec <- c("Hi!", "Welcome", "to", "Brown", "SPH")
x3_vec <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
x1_vec
[1] 50 21 455 89 9
x2_vec
[1] "Hi!" "Welcome" "to" "Brown" "SPH"
x3_vec
[1] TRUE TRUE FALSE FALSE FALSE
Now we have 3 more vectors for which we can retrieve some information. Let’s figure out the classes and lengths for each vector.
Much of the code in these blocks is included to make the output display a little neater. Ignore most of it for now, and simply recognize that we’ve passed each of the objects above into the class() and length() functions.
cat("x1_vec attributes",
paste0(rep("-", 30), collapse = ""),
paste("class:", class(x1_vec)),
paste("length:", length(x1_vec)),
sep = "\n"
)
x1_vec attributes
------------------------------
class: numeric
length: 5
cat("x2_vec attributes",
paste0(rep("-", 30), collapse = ""),
paste("class:", class(x2_vec)),
paste("length:", length(x2_vec)),
sep = "\n"
)
x2_vec attributes
------------------------------
class: character
length: 5
cat("x3_vec attributes",
paste0(rep("-", 30), collapse = ""),
paste("class:", class(x3_vec)),
paste("length:", length(x3_vec)),
sep = "\n"
)
x3_vec attributes
------------------------------
class: logical
length: 5
However, a single vector can contain values of only one class. In fact, if we try to mix classes, R will convert all elements of the vector into a single class. In the example below, R converts the numerical value 5 into a character because it doesn’t know how to convert the string we provided into a number.
This process of changing value classes is known as coercion.
x4_vec <- c(5, "So-called 'Arthur King'!")
# R coerces the numeric value into a character
c("class" = class(x4_vec[1]),
"numeric?" = is.numeric(x4_vec[1]))
class numeric?
"character" "FALSE"
You can check specific class types with a large number of functions. A few examples to demonstrate their general form: is.numeric(), is.character(), is.logical(), is.vector().
We see the vector’s class is character. If we print the first element of the vector, we’ll see R has wrapped it in quotation marks, indicating that the number 5 has been saved as a string (a series of characters).
x4_vec[1]
[1] "5"
The implication here is important. Say you wanted to perform a logical or arithmetic operation on the first element. What would be the result?
# arithmetic operation
x4_vec[1] + 5
Error in x4_vec[1] + 5: non-numeric argument to binary operator
# equivalence with another numeric
x4_vec[1] == 5
[1] TRUE
# equivalence with another string
x4_vec[1] == "5"
[1] TRUE
Something interesting happened… the arithmetic operation failed because the first element of x4_vec was a character string, but the other two operations succeeded!
In the last case, we might not be all that surprised: after all, we compared two strings we knew were identical. However, in the second example R implicitly coerced the numeric value 5 into the character “5” and was then able to compare the two strings successfully.
Familiarize yourself with how R (or any programming language you happen to be using) handles different classes and objects. All languages have their quirks in this regard, and recognizing these special cases early on will save you a lot of grief in the future.
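When coercion has turned a number into a string, we can convert it back explicitly before doing arithmetic. A minimal sketch that rebuilds x4_vec so it runs on its own; first_num and second_num are illustrative names:

```r
x4_vec <- c(5, "So-called 'Arthur King'!")

# Convert the first element back to a number before doing arithmetic.
first_num <- as.numeric(x4_vec[1])
first_num + 5                        # 10

# A string that does not look like a number becomes NA (with a warning,
# silenced here so the example runs quietly).
second_num <- suppressWarnings(as.numeric(x4_vec[2]))
is.na(second_num)                    # TRUE
```

Checking for NA after an explicit conversion is a simple way to catch values that were never numeric to begin with.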
In most epidemiologic data analysis, you might not end up dealing with matrices all that often. Because most of the data sets epidemiologists use contain mixtures of numeric, string, and other variable formats, we tend to use data frames.
However, you should still be familiar with matrices as, depending on your line of work, they may come in handy.
For instance, you might actually need to do linear algebra, in which case, you may need to use a matrix. In addition, for very computationally demanding tasks, R can operate on matrices much faster than on data frames. We’ll largely ignore these issues for now, but we’ll look at a couple of instances in which matrices can be used to generate nice figure layouts for data visualization.
Strictly speaking, a matrix in R has two dimensions (rows and columns), and objects with three or more dimensions are called arrays. For the sake of our emotional stability, we’ll stick to two dimensions here.
A matrix with a single column is essentially a vector:
mat1 <- as.matrix(c(6, 7, 8))
mat1
[,1]
[1,] 6
[2,] 7
[3,] 8
However, because the elements are now indexed by both row and column, we need to refer to each element by its specific coordinates.
For example, if we want to retrieve 7 from the matrix, we need to tell R that we want the element in row 2, column 1, which we can do as follows:
mat1[2, 1]
[1] 7
If we wanted the third row:
mat1[3, ]
[1] 8
If we wanted the first column:
mat1[, 1]
[1] 6 7 8
With such a small matrix, these operations might seem a bit daft, so let’s consider a larger numerical matrix.
n1 <- 1:100
n2 <- 901:1000
mat2 <- matrix(c(n1, n2), nrow = 100)
# show the first 10 rows of the matrix
mat2[1:10, ]
[,1] [,2]
[1,] 1 901
[2,] 2 902
[3,] 3 903
[4,] 4 904
[5,] 5 905
[6,] 6 906
[7,] 7 907
[8,] 8 908
[9,] 9 909
[10,] 10 910
# print total number of rows and columns
dim(mat2)
[1] 100 2
Be careful with matrices, though:
mat3 <- matrix(c(n1, n2), nrow = 50)
mat3[1:10, ]
[,1] [,2] [,3] [,4]
[1,] 1 51 901 951
[2,] 2 52 902 952
[3,] 3 53 903 953
[4,] 4 54 904 954
[5,] 5 55 905 955
[6,] 6 56 906 956
[7,] 7 57 907 957
[8,] 8 58 908 958
[9,] 9 59 909 959
[10,] 10 60 910 960
dim(mat3)
[1] 50 4
Because we told R we wanted a matrix with 50 rows, the first vector we specified was distributed across the first two columns, while the second vector was distributed across the third and fourth columns.
That’s fine if it’s what we wanted. Let’s say, though, that we wanted a matrix with 50 rows and 4 columns but wanted it filled out row-by-row:
mat4 <- matrix(c(n1, n2), nrow = 50, ncol = 4, byrow = TRUE)
mat4[1:10, ]
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
[6,] 21 22 23 24
[7,] 25 26 27 28
[8,] 29 30 31 32
[9,] 33 34 35 36
[10,] 37 38 39 40
# NOTE
# Here, we return rows 41-50 from mat4, but the output will label them 1-10.
# That's because R returns our request to us as its own matrix.
mat4[41:50, ]
[,1] [,2] [,3] [,4]
[1,] 961 962 963 964
[2,] 965 966 967 968
[3,] 969 970 971 972
[4,] 973 974 975 976
[5,] 977 978 979 980
[6,] 981 982 983 984
[7,] 985 986 987 988
[8,] 989 990 991 992
[9,] 993 994 995 996
[10,] 997 998 999 1000
In general, specifying the features you want in as much detail as possible will be the safest route, and may save you a good deal of grief in the future.
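The difference between the two fill orders is easier to see on a tiny matrix. A minimal sketch with illustrative names (vals, by_col, by_row):

```r
vals <- 1:6

# Default: fill column by column.
by_col <- matrix(vals, nrow = 2, ncol = 3)
# byrow = TRUE: fill row by row.
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
```

Both matrices have the same dimensions and the same values; only the placement of those values differs, which is why spelling out nrow, ncol, and byrow is worth the extra typing.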
Say we wanted to get the column and row sums for this matrix. Easy!
colSums(mat4)
[1] 24950 25000 25050 25100
rowSums(mat4)
[1] 10 26 42 58 74 90 106 122 138 154 170 186 202 218 234
[16] 250 266 282 298 314 330 346 362 378 394 3610 3626 3642 3658 3674
[31] 3690 3706 3722 3738 3754 3770 3786 3802 3818 3834 3850 3866 3882 3898 3914
[46] 3930 3946 3962 3978 3994
Say we wanted to multiply every number in the matrix by 123456. Easy!
mat4 * 123456
[,1] [,2] [,3] [,4]
[1,] 123456 246912 370368 493824
[2,] 617280 740736 864192 987648
[3,] 1111104 1234560 1358016 1481472
[4,] 1604928 1728384 1851840 1975296
[5,] 2098752 2222208 2345664 2469120
[6,] 2592576 2716032 2839488 2962944
[7,] 3086400 3209856 3333312 3456768
[8,] 3580224 3703680 3827136 3950592
[9,] 4074048 4197504 4320960 4444416
[10,] 4567872 4691328 4814784 4938240
[11,] 5061696 5185152 5308608 5432064
[12,] 5555520 5678976 5802432 5925888
[13,] 6049344 6172800 6296256 6419712
[14,] 6543168 6666624 6790080 6913536
[15,] 7036992 7160448 7283904 7407360
[16,] 7530816 7654272 7777728 7901184
[17,] 8024640 8148096 8271552 8395008
[18,] 8518464 8641920 8765376 8888832
[19,] 9012288 9135744 9259200 9382656
[20,] 9506112 9629568 9753024 9876480
[21,] 9999936 10123392 10246848 10370304
[22,] 10493760 10617216 10740672 10864128
[23,] 10987584 11111040 11234496 11357952
[24,] 11481408 11604864 11728320 11851776
[25,] 11975232 12098688 12222144 12345600
[26,] 111233856 111357312 111480768 111604224
[27,] 111727680 111851136 111974592 112098048
[28,] 112221504 112344960 112468416 112591872
[29,] 112715328 112838784 112962240 113085696
[30,] 113209152 113332608 113456064 113579520
[31,] 113702976 113826432 113949888 114073344
[32,] 114196800 114320256 114443712 114567168
[33,] 114690624 114814080 114937536 115060992
[34,] 115184448 115307904 115431360 115554816
[35,] 115678272 115801728 115925184 116048640
[36,] 116172096 116295552 116419008 116542464
[37,] 116665920 116789376 116912832 117036288
[38,] 117159744 117283200 117406656 117530112
[39,] 117653568 117777024 117900480 118023936
[40,] 118147392 118270848 118394304 118517760
[41,] 118641216 118764672 118888128 119011584
[42,] 119135040 119258496 119381952 119505408
[43,] 119628864 119752320 119875776 119999232
[44,] 120122688 120246144 120369600 120493056
[45,] 120616512 120739968 120863424 120986880
[46,] 121110336 121233792 121357248 121480704
[47,] 121604160 121727616 121851072 121974528
[48,] 122097984 122221440 122344896 122468352
[49,] 122591808 122715264 122838720 122962176
[50,] 123085632 123209088 123332544 123456000
Lists are very flexible objects in R. Think of them like a chest of drawers into which we can store basically any other type of object. This feature of lists makes them useful for storing all sorts of data. In fact, when we get to fitting regression models, we’ll see that the fitted model objects R returns to us are essentially stored as lists containing the estimated models, the data used to fit the model, regression diagnostics, and more.
A quick demonstration of storing multiple objects in a list:
# Object 1: Vector
my_vector <- 4:10
# Object 2: Matrix
my_matrix <- matrix(1:20, ncol = 4, byrow = TRUE)
# Object 3: Dataframe (first 15 rows of iris)
my_df <- datasets::iris[1:15, ]
# Object 4: List
my_list <- list(vector = my_vector,
                matrix = my_matrix,
                dataframe = my_df)
my_list
$vector
[1] 4 5 6 7 8 9 10
$matrix
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
$dataframe
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
Now, let’s interrogate the list so we can get familiar with its features.
First, we saved each of the three objects into its own space within the list. The objects are still separate from one another. In other words, we have not merged them. Each of these spaces possesses a name, which we assigned in the code above.
names(my_list)
[1] "vector" "matrix" "dataframe"
We can query each of the objects individually:
my_list$vector
[1] 4 5 6 7 8 9 10
my_list$matrix
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
head(my_list$dataframe)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Alternatively, we could have accessed each object using a few different methods, each with slightly different effects. Let’s focus on the dataframe to demonstrate.
Access using single brackets and name of the list element:
my_list["dataframe"]
$dataframe
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
Access using double brackets and name of the list element:
my_list[["dataframe"]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
The difference between what each of these calls returned is subtle but important. Recall the chest of drawers metaphor we applied to lists.
The list is the chest, each element is a drawer, and each drawer is filled with particular contents.
When we use single brackets, we pull the drawer out of the chest. However, we do not access the contents of the drawer directly.
When we use double brackets, we pull the contents out of the drawer, which means we can access them directly and act upon them as if they were not in the list at all.
To see the implications of how we query the list, we can try to perform some operations on the dataframe using each method of accessing it.
# access the list element (pull out the drawer)
head(my_list["dataframe"])
$dataframe
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
# access the dataframe itself (pull out the contents of the drawer)
head(my_list[["dataframe"]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
In the first case, head() did not trim the data frame to six rows because we passed it the drawer (a one-element list) rather than the contents of the drawer we wanted.
The same bracket principle applies if we access elements and their contents using the numerical index of the list.
head(my_list[3])
$dataframe
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
head(my_list[[3]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Table 5.1: List access summary
Access method | Example | Returns |
---|---|---|
list["elementName"] | my_list["dataframe"] | Element |
list[elementIndex] | my_list[3] | Element |
list[["elementName"]] | my_list[["dataframe"]] | Element contents |
list[[elementIndex]] | my_list[[3]] | Element contents |
list$elementName | my_list$dataframe | Element contents |
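We can confirm the "Returns" column of the table by checking the class of each result. The sketch below uses a small stand-in list (drawers, a hypothetical name) rather than my_list so that it runs on its own:

```r
# A small stand-in list (hypothetical names, not the tutorial's my_list).
drawers <- list(numbers = 1:3,
                cars    = head(datasets::mtcars, 3))

class(drawers["cars"])     # "list": still a drawer
class(drawers[2])          # "list"
class(drawers[["cars"]])   # "data.frame": the contents
class(drawers[[2]])        # "data.frame"
class(drawers$cars)        # "data.frame"

# Only the contents can be used like a data frame.
nrow(drawers[["cars"]])    # 3
nrow(drawers["cars"])      # NULL: a list has no rows
```

When a function behaves unexpectedly on a list element, checking class() in this way is usually the fastest way to see whether we grabbed the drawer or its contents.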
In addition, we can place chests within drawers: in other words, we can nest lists within lists.
Nesting: Placing objects of the same or similar type within one another.
Figure 5.1: Think of Matryoshka (Russian nesting) dolls. © BrokenSphere / Wikimedia Commons (cropped)
# select first n rows from each example dataset
n1 <- 15
n2 <- 30
n3 <- 50
# create a subset of each dataset
iris_sub <- datasets::iris[1:n1, ]
airq_sub <- datasets::airquality[1:n2, ]
quake_sub <- datasets::quakes[1:n3, ]
# store some info about the iris dataset
# info: ?datasets::iris
nl1 <- list(data = iris_sub,
            nrows = nrow(iris_sub),
            ncols = length(iris_sub))
# store some info about the airquality dataset
# info: ?datasets::airquality
nl2 <- list(data = airq_sub,
            nrows = nrow(airq_sub),
            ncols = length(airq_sub))
# store some info about the quakes dataset
# info: ?datasets::quakes
nl3 <- list(data = quake_sub,
            nrows = nrow(quake_sub),
            ncols = length(quake_sub))
# store dataset summaries to list
datasummary_list <- list(irisdata = nl1,
                         airquality_data = nl2,
                         quake_data = nl3)
Before we look at datasummary_list, sit with the code above for a minute and think about what you expect datasummary_list to look like…
We saved three lists, each containing a dataset along with some basic information about that dataset.
Next, we took these three lists and put them all into another list.
datasummary_list
$irisdata
$irisdata$data
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
$irisdata$nrows
[1] 15
$irisdata$ncols
[1] 5
$airquality_data
$airquality_data$data
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20
21 1 8 9.7 59 5 21
22 11 320 16.6 73 5 22
23 4 25 9.7 61 5 23
24 32 92 12.0 61 5 24
25 NA 66 16.6 57 5 25
26 NA 266 14.9 58 5 26
27 NA NA 8.0 57 5 27
28 23 13 12.0 67 5 28
29 45 252 14.9 81 5 29
30 115 223 5.7 79 5 30
$airquality_data$nrows
[1] 30
$airquality_data$ncols
[1] 6
$quake_data
$quake_data$data
lat long depth mag stations
1 -20.42 181.62 562 4.8 41
2 -20.62 181.03 650 4.2 15
3 -26.00 184.10 42 5.4 43
4 -17.97 181.66 626 4.1 19
5 -20.42 181.96 649 4.0 11
6 -19.68 184.31 195 4.0 12
7 -11.70 166.10 82 4.8 43
8 -28.11 181.93 194 4.4 15
9 -28.74 181.74 211 4.7 35
10 -17.47 179.59 622 4.3 19
11 -21.44 180.69 583 4.4 13
12 -12.26 167.00 249 4.6 16
13 -18.54 182.11 554 4.4 19
14 -21.00 181.66 600 4.4 10
15 -20.70 169.92 139 6.1 94
16 -15.94 184.95 306 4.3 11
17 -13.64 165.96 50 6.0 83
18 -17.83 181.50 590 4.5 21
19 -23.50 179.78 570 4.4 13
20 -22.63 180.31 598 4.4 18
21 -20.84 181.16 576 4.5 17
22 -10.98 166.32 211 4.2 12
23 -23.30 180.16 512 4.4 18
24 -30.20 182.00 125 4.7 22
25 -19.66 180.28 431 5.4 57
26 -17.94 181.49 537 4.0 15
27 -14.72 167.51 155 4.6 18
28 -16.46 180.79 498 5.2 79
29 -20.97 181.47 582 4.5 25
30 -19.84 182.37 328 4.4 17
31 -22.58 179.24 553 4.6 21
32 -16.32 166.74 50 4.7 30
33 -15.55 185.05 292 4.8 42
34 -23.55 180.80 349 4.0 10
35 -16.30 186.00 48 4.5 10
36 -25.82 179.33 600 4.3 13
37 -18.73 169.23 206 4.5 17
38 -17.64 181.28 574 4.6 17
39 -17.66 181.40 585 4.1 17
40 -18.82 169.33 230 4.4 11
41 -37.37 176.78 263 4.7 34
42 -15.31 186.10 96 4.6 32
43 -24.97 179.82 511 4.4 23
44 -15.49 186.04 94 4.3 26
45 -19.23 169.41 246 4.6 27
46 -30.10 182.30 56 4.9 34
47 -26.40 181.70 329 4.5 24
48 -11.77 166.32 70 4.4 18
49 -24.12 180.08 493 4.3 21
50 -18.97 185.25 129 5.1 73
$quake_data$nrows
[1] 50
$quake_data$ncols
[1] 5
There’s the whole thing, but we’ll take a minute to pick apart the elements of datasummary_list.
First, we named each of its elements:
names(datasummary_list)
[1] "irisdata" "airquality_data" "quake_data"
We can look at the first element, in which we stored a subset of the iris data:
names(datasummary_list$irisdata)
[1] "data" "nrows" "ncols"
Did we have to access this list directly? Have a look back at the list access table above.
If we wanted to see the iris data and the information we saved about it:
datasummary_list$irisdata$data
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
datasummary_list$irisdata$nrows
[1] 15
datasummary_list$irisdata$ncols
[1] 5
Note, too, that you can mix access methods, depending on your needs:
Which dataset are we accessing?
head(datasummary_list[[3]]$data)
lat long depth mag stations
1 -20.42 181.62 562 4.8 41
2 -20.62 181.03 650 4.2 15
3 -26.00 184.10 42 5.4 43
4 -17.97 181.66 626 4.1 19
5 -20.42 181.96 649 4.0 11
6 -19.68 184.31 195 4.0 12
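To see that the access methods really are interchangeable in a nested list, here is a minimal sketch with a small made-up list (nested, inner_a, and inner_b are illustrative names, not objects defined above):

```r
# A small nested list with made-up contents.
inner_a <- list(data = head(datasets::iris, 3), nrows = 3)
inner_b <- list(data = head(datasets::quakes, 4), nrows = 4)
nested  <- list(first = inner_a, second = inner_b)

# These three expressions reach the same value in different ways.
by_name  <- nested$second$nrows
by_index <- nested[[2]][[2]]
mixed    <- nested[["second"]]$nrows
```

All three return 4. Access by name tends to be easier to read later, while access by index is convenient when looping over elements.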
We’ll see some lists later on in a few worked examples, as we’ve only scratched the surface. But even with what we’ve learned so far, hopefully your brain is starting to think about different ways we might use lists to store, manipulate, and analyze data. Data analysis and programming is in part an act of creativity!
s1 <- "Okay"
s2 <- "that's all about lists"
s3 <- "for now!"
totally_unnecessary_list <- list(firstbit = s1,
                                 secondbit = list(s2, s3))
cat(paste0(totally_unnecessary_list[[1]], ","),
    totally_unnecessary_list[[2]][[1]],
    totally_unnecessary_list[[2]][[2]], sep = " ")
Okay, that's all about lists for now!
As public health researchers using R, you may be spending most of your time working with data frames.
Data frames are a rectangular data format, that is, a format which generally stores data as a series of labeled columns (variables) and rows (observations).
We’ve already seen a couple of data frames. Here are a few more that R contains by default.
class(chickwts)
[1] "data.frame"
head(chickwts)
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
class(mtcars)
[1] "data.frame"
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Data frames lend themselves well to common epidemiologic analyses, but first, we should get a feel for how they behave.
We’ve seen that head(dataframe) will return the first handful of rows in the data frame (as a data frame, it turns out).
We might also want to inspect some other properties of a dataset.
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The str() function gives us a nice, concise report on the number of rows and variables, in addition to the value class of each variable.
Sometimes, we may want to query some of these attributes directly:
nrow(mtcars)
[1] 32
length(mtcars)
[1] 11
names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
The outputs from nrow() and length() returned to us the number of rows and columns, respectively, while names() returned the column names for us.
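If we want both dimensions at once, dim() works on data frames just as it did on matrices. A minimal sketch using the packaged copy of the data (datasets::mtcars), with dims as an illustrative name:

```r
dims <- dim(datasets::mtcars)    # rows, then columns
dims                             # 32 11

# For a data frame, length() counts columns, so it matches ncol().
ncol(datasets::mtcars) == length(datasets::mtcars)   # TRUE
```

Using ncol() rather than length() can make the intent a little clearer to someone reading the code later.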
We can extract columns from the dataset as vectors if we want to operate on them directly:
class(mtcars$qsec)
[1] "numeric"
summary(mtcars$qsec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.50 16.89 17.71 17.85 18.90 22.90
We could also summarize all the variables at the same time, simply by passing the data frame to summary().
summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
Look carefully at the output from our summary. Does anything look off?
I would say we should take a look at a few of the variables, which, despite having been stored as numeric, appear to be categorical variables.
Let’s check the number of unique values stored in each variable:
sapply(mtcars, function(x) length(unique(x)))
mpg cyl disp hp drat wt qsec vs am gear carb
25 3 27 22 22 29 30 2 2 3 6
It looks like we might want to treat cyl, vs, am, gear, and carb as categorical.
To view more information about the mtcars dataset, type ?mtcars into the console.
Specifically, we’ll treat cyl, gear, and carb as ordinal, and we’ll treat vs and am as binary.
Remember, categorical variables assign an observation to a particular group. For example, if we had a column for the car’s manufacturer, we might have categories such as Chevy, Honda, or Ford. Ordinal variables are special cases of categorical variables in which the levels have some sort of natural ordering. For example, finishing place in a race: 1, 2, 3.
# create new variables in mtcars by converting the
# numeric ordinals into factors
mtcars$cyl_fct <- as.factor(mtcars$cyl)
mtcars$gear_fct <- as.factor(mtcars$gear)
mtcars$carb_fct <- as.factor(mtcars$carb)
mtcars$vs_fct <- as.factor(ifelse(mtcars$vs == 0, "v-shaped", "straight"))
mtcars$am_fct <- as.factor(ifelse(mtcars$am == 0, "automatic", "manual"))
str(mtcars)
'data.frame': 32 obs. of 16 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp : num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec : num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear : num 4 4 4 3 3 3 3 4 4 4 ...
$ carb : num 4 4 1 1 2 1 4 2 2 4 ...
$ cyl_fct : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ gear_fct: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
$ carb_fct: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
$ vs_fct : Factor w/ 2 levels "straight","v-shaped": 2 2 1 1 2 1 2 1 1 1 ...
$ am_fct : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
Take a minute to look at the output of str(). We have 5 new factor variables.
Technically, we could have analyzed the variables without converting them, but let’s summarize the new ones anyhow.
# we index mtcars so as to inspect only the variables we just created
summary(mtcars[, 12:16])
cyl_fct gear_fct carb_fct vs_fct am_fct
4:11 3:15 1: 7 straight:14 automatic:19
6: 7 4:12 2:10 v-shaped:18 manual :13
8:14 5: 5 3: 3
4:10
6: 1
8: 1
Not the prettiest summary, but note how R gives us the number of observations within each category for each factor. We’ll look at more readable ways to summarize factors in a future section.
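One caveat: as.factor() creates unordered factors, so R does not yet know that 4 cylinders is "less than" 8. If we wanted R to record the ordering explicitly, one option is factor() with ordered = TRUE. A minimal sketch on a copy of the data; cars_copy and cyl_ord are hypothetical names, not part of the tutorial's code:

```r
# Work on a copy of the packaged data so the tutorial's mtcars is untouched.
cars_copy <- datasets::mtcars

# Hypothetical column name: cyl_ord.
cars_copy$cyl_ord <- factor(cars_copy$cyl,
                            levels  = c(4, 6, 8),
                            ordered = TRUE)

is.ordered(cars_copy$cyl_ord)          # TRUE
levels(cars_copy$cyl_ord)              # "4" "6" "8"

# Ordered factors support comparisons between levels.
sum(cars_copy$cyl_ord > "4")           # 21 cars with more than 4 cylinders
```

Whether the ordering matters depends on the analysis; for simple counts like those in the summary, an unordered factor works just as well.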
Various subfields of epidemiology and statistics provide an array of object classes that are beyond the scope of this tutorial. For instance, an object with the class “graph” describes the components and structures of a graph. The information stored in this object would typically allow one to manipulate, analyze, and visualize the graph.
It may come as no surprise that you’ll be doing a fair bit of arithmetic while coding, whether you’re doing quick calculations in the console or creating variables derived from some arithmetic combination of other variables.
R respects grouping operations and handles mathematical operations as one might expect.
For instance, we can do some quick scratch calculations and print them directly to the console.
In fact, R can be a handy calculator if you run it from the command line!
5 + 10
[1] 15
0.25 * 9 + 0.75 * 3
[1] 4.5
(5 ^ 2 + 10 ^ 3) / 50
[1] 20.5
8 %% 3
[1] 2
8 %/% 3
[1] 2
Table 5.2: Arithmetic operators
Operator | Function |
---|---|
+ | Add |
- | Subtract |
^ | Power |
* | Multiply |
/ | Divide |
%% | Modulus (remainder) |
%/% | Integer division (quotient) |
%*% | Matrix multiplication |
For more information, refer to R’s documentation on arithmetic operators.
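A short sketch of how the less familiar operators in the table relate to one another; quotient, remainder, m, elementwise, and matprod are illustrative names:

```r
# Integer division and modulus together recover the original number.
quotient  <- 8 %/% 3      # 2
remainder <- 8 %% 3       # 2
3 * quotient + remainder  # 8

# `*` multiplies element by element; `%*%` is matrix multiplication.
m <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
elementwise <- m * m         # rows: (1, 9) and (4, 16)
matprod     <- m %*% m       # rows: (7, 15) and (10, 22)
```

Mixing up * and %*% is an easy mistake to make with matrices, since both run without error whenever the dimensions happen to allow it.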
We already saw one example of a logical operation, when we checked to see if x1 was equal to 10. Statistical and epidemiologic analysis relies heavily on implementing logical operations and checks for truth, and so it is a good idea to master these operators early, and to use them often.
Table 5.3: Logical operators
Operator | Function |
---|---|
== | Check equality of two values |
< | Less than |
<= | Less than or equal to |
> | Greater than |
>= | Greater than or equal to |
& (or &&) | AND |
| (or ||) | OR |
When combining various operators, some will take precedence over others, and it’s important to get a feel for this hierarchy to avoid unexpected results while programming.
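A few quick console checks illustrate the hierarchy. This is only a sketch of some common cases; the full ordering is listed in R's help page ?Syntax:

```r
2 + 3 * 4          # 14: `*` before `+`
(2 + 3) * 4        # 20
-2 ^ 2             # -4: `^` binds tighter than the unary minus
(-2) ^ 2           # 4

# Comparisons are evaluated before `&`, and `&` before `|`.
TRUE | FALSE & FALSE      # TRUE, read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE    # FALSE
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.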
In short, recycling refers to R’s reuse of the elements of a shorter object (usually a vector) so that an operation can be carried out across every element of a longer one.
Recycling is a very important concept in R. First, recycling can make your code more efficient. Second, if you forget that R recycles, you might unwittingly perform an operation that spits out a bunch of wrong answers.
That’s a pretty abstract description, so let’s look at a few examples to develop our intuition a bit.
First, recall the operators we covered in the prior two sections, as well as vectors.
Say we wanted to add 5 to each of 3 numbers. Rather than specifying the arithmetic operation separately for each number, we could store those numbers to a vector and tell R to add 5 to the vector:
numvec <- c(10, 40, 50)
numvec + 5
[1] 15 45 55
See what happened? R added 5 to each vector element separately.
If we wanted to perform separate additions on each number, we could do so easily, as long as the longer vector’s length is a multiple of the shorter vector’s length:
numvec + c(5, 10, 90)
[1] 15 50 140
Did you notice the difference, though? Because the vector lengths were equal, R did not recycle 5, 10, and 90 to add each to each item in numvec. Rather, it added 5 to numvec[1], 10 to numvec[2], and 90 to numvec[3].
However, attempting to execute the following operation results in a warning, because 3 cannot be divided evenly by 2:
numvec + c(5, 10)
Warning in numvec + c(5, 10): longer object length is not a multiple of shorter
object length
[1] 15 50 55
Notice, however, that the second, unnamed vector was recycled in this case. Looking at the output, we can see that R executed the following operations: 10 + 5, 40 + 10, and 50 + 5. In other words, R recycled the shorter vector until it was finished operating on numvec.
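We can make the recycling visible by building the recycled vector ourselves. A minimal sketch, redefining numvec so it runs on its own; recycled is an illustrative name:

```r
numvec <- c(10, 40, 50)

# What R effectively does with the shorter vector: repeat it until it is
# as long as numvec, truncating the final repetition.
recycled <- rep(c(5, 10), length.out = length(numvec))
recycled            # 5 10 5

numvec + recycled   # 15 50 55, the same values, but without the warning
```

Writing the repetition out with rep() makes the intent explicit, which can be helpful when the recycling is deliberate.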
You might imagine this result was not really what we were looking for if we wanted R to add 5 to each element in numvec and then add 10 to each element in numvec and return the results for both sets of operations.
When we get to for loops and functions, we’ll see how we can get R to do these sorts of tasks for us.
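In the meantime, here is one possible base R approach to the "add 5 to everything, then add 10 to everything" task, using outer(). This is only a sketch with illustrative names (addends, all_sums), and not necessarily the approach we'll develop with loops and functions:

```r
numvec  <- c(10, 40, 50)
addends <- c(5, 10)

# outer() applies "+" to every pairing of numvec and addends.
# Rows follow numvec; columns follow addends.
all_sums <- outer(numvec, addends, "+")
all_sums[, 1]   # 15 45 55  (each element + 5)
all_sums[, 2]   # 20 50 60  (each element + 10)
```

The result is a matrix with one row per element of numvec and one column per addend, so both sets of sums are returned together.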