
6 Statistical Analysis

6.1 Setup

In this section, we’ll provide some examples in R of basic statistical analyses you might need to implement at some point.

The analyses you conduct as part of real research will almost certainly be more complicated than the code shown here. However, the hope is that you’ll get enough of a feel for the mechanics of R to begin putting the pieces together as demanded by a given project.

6.1.1 Data

If you’d like to try any of these examples on your own, we’re using the NHANES I Epidemiologic Followup Study (NHEFS) data set freely available at the website for Causal Inference: What If by Miguel A. Hernán and James M. Robins.

For both SAS and R, we’ll use the CSV file they provide.

The website itself contains extensive code examples in SAS, Stata, R, and Python—everything you could want!—so those who will be taking advanced epidemiologic or causal inference coursework should refer to those resources for guidance on how to implement relevant analyses. We’ll cover a little bit here, but only briefly.

6.1.2 NHEFS Codebook

Variable name Description
active IN YOUR USUAL DAY, HOW ACTIVE ARE YOU? IN 1971, 0:very active, 1:moderately active, 2:inactive
age AGE IN 1971
alcoholfreq HOW OFTEN DO YOU DRINK? IN 1971 0: Almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: No alcohol last year, 5: Unknown
alcoholhowmuch WHEN YOU DRINK, HOW MUCH DO YOU DRINK? IN 1971
alcoholpy HAVE YOU HAD 1 DRINK PAST YEAR? IN 1971, 1:EVER, 0:NEVER; 2:MISSING
alcoholtype WHICH DO YOU MOST FREQUENTLY DRINK? IN 1971 1: BEER, 2: WINE, 3: LIQUOR, 4: OTHER/UNKNOWN
allergies USE ALLERGIES MEDICATION IN 1971, 1:EVER, 0:NEVER
asthma DX ASTHMA IN 1971, 1:EVER, 0:NEVER
birthcontrol BIRTH CONTROL PILLS PAST 6 MONTHS? IN 1971 1:YES, 0:NO, 2:MISSING
birthplace CHECK STATE CODE - SECOND PAGE
boweltrouble USE BOWEL TROUBLE MEDICATION IN 1971, 1:EVER, 0:NEVER, 2:MISSING
bronch DX CHRONIC BRONCHITIS/EMPHYSEMA IN 1971, 1:EVER, 0:NEVER
cholesterol SERUM CHOLESTEROL (MG/100ML) IN 1971
chroniccough DX CHRONIC COUGH IN 1971, 1:EVER, 0:NEVER
colitis DX COLITIS IN 1971, 1:EVER, 0:NEVER
dadth DAY OF DEATH
dbp DIASTOLIC BLOOD PRESSURE IN 1982
death DEATH BY 1992, 1:YES, 0:NO
diabetes DX DIABETES IN 1971, 1:EVER, 0:NEVER, 2:MISSING
education AMOUNT OF EDUCATION BY 1971: 1: 8TH GRADE OR LESS, 2: HS DROPOUT, 3: HS, 4:COLLEGE DROPOUT, 5: COLLEGE OR MORE
exercise IN RECREATION, HOW MUCH EXERCISE? IN 1971, 0:much exercise,1:moderate exercise,2:little or no exercise
hayfever DX HAY FEVER IN 1971, 1:EVER, 0:NEVER
hbp DX HIGH BLOOD PRESSURE IN 1971, 1:EVER, 0:NEVER, 2:MISSING
hbpmed USE HIGH BLOOD PRESSURE MEDICATION IN 1971, 1:EVER, 0:NEVER, 2:MISSING
headache USE HEADACHE MEDICATION IN 1971, 1:EVER, 0:NEVER
hepatitis DX HEPATITIS IN 1971, 1:EVER, 0:NEVER
hf DX HEART FAILURE IN 1971, 1:EVER, 0:NEVER
hightax82 LIVING IN A HIGHLY TAXED STATE IN 1982, High taxed state of residence=1, 0 otherwise
ht HEIGHT IN CENTIMETERS IN 1971
income TOTAL FAMILY INCOME IN 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+
infection USE INFECTION MEDICATION IN 1971, 1:EVER, 0:NEVER
lackpep USE LACK OF PEP MEDICATION IN 1971, 1:EVER, 0:NEVER
marital MARITAL STATUS IN 1971 1: Under 17, 2: Married, 3: Widowed, 4: Never married, 5: Divorced, 6: Separated, 8: Unknown
modth MONTH OF DEATH
nerves USE NERVES MEDICATION IN 1971, 1:EVER, 0:NEVER
nervousbreak DX NERVOUS BREAKDOWN IN 1971, 1:EVER, 0:NEVER
otherpain USE OTHER PAINS MEDICATION IN 1971, 1:EVER, 0:NEVER
pepticulcer DX PEPTIC ULCER IN 1971, 1:EVER, 0:NEVER
pica DO YOU EAT DIRT OR CLAY, STARCH OR OTHER NON STANDARD FOOD? IN 1971 1:EVER, 0:NEVER; 2:MISSING
polio DX POLIO IN 1971, 1:EVER, 0:NEVER
pregnancies TOTAL NUMBER OF PREGNANCIES? IN 1971
price71 AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971 (US$2008)
price71_82 DIFFERENCE IN AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971-1982 (US$2008)
price82 AVG TOBACCO PRICE IN STATE OF RESIDENCE 1982 (US$2008)
qsmk QUIT SMOKING BETWEEN 1ST QUESTIONNAIRE AND 1982, 1:YES, 0:NO
race 0: WHITE 1: BLACK OR OTHER IN 1971
sbp SYSTOLIC BLOOD PRESSURE IN 1982
school HIGHEST GRADE OF REGULAR SCHOOL EVER IN 1971
seqn UNIQUE PERSONAL IDENTIFIER
sex 0: MALE 1: FEMALE
smokeintensity NUMBER OF CIGARETTES SMOKED PER DAY IN 1971
smkintensity82_71 INCREASE IN NUMBER OF CIGARETTES/DAY BETWEEN 1971 AND 1982
smokeyrs YEARS OF SMOKING
tax71 TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)
tax71_82 DIFFERENCE IN TOBACCO TAX IN STATE OF RESIDENCE 1971-1982 (US$2008)
tax82 TOBACCO TAX IN STATE OF RESIDENCE 1982 (US$2008)
tb DX TUBERCULOSIS IN 1971, 1:EVER, 0:NEVER
tumor DX MALIGNANT TUMOR/GROWTH IN 1971, 1:EVER, 0:NEVER
weakheart USE WEAK HEART MEDICATION IN 1971, 1:EVER, 0:NEVER
wt71 WEIGHT IN KILOGRAMS IN 1971
wt82 WEIGHT IN KILOGRAMS IN 1982
wt82_71 WEIGHT CHANGE IN KILOGRAMS
wtloss USE WEIGHT LOSS MEDICATION IN 1971, 1:EVER, 0:NEVER
yrdth YEAR OF DEATH

6.2 Exploratory Data Analysis

First, let’s read in our data and glance at it using a couple of different views.

nhefs <- readr::read_csv("data/nhefs.csv")
head(nhefs)
# A tibble: 6 × 64
   seqn  qsmk death yrdth modth dadth   sbp   dbp   sex   age  race income
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1   233     0     0    NA    NA    NA   175    96     0    42     1     19
2   235     0     0    NA    NA    NA   123    80     0    36     0     18
3   244     0     0    NA    NA    NA   115    75     1    56     1     15
4   245     0     1    85     2    14   148    78     0    68     1     15
5   252     0     0    NA    NA    NA   118    77     0    40     0     18
6   257     0     0    NA    NA    NA   141    83     1    43     1     11
# … with 52 more variables: marital <dbl>, school <dbl>, education <dbl>,
#   ht <dbl>, wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
#   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
#   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
#   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
#   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
#   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …
tibble::glimpse(nhefs)
Rows: 1,629
Columns: 64
$ seqn              <dbl> 233, 235, 244, 245, 252, 257, 262, 266, 419, 420, 42…
$ qsmk              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1…
$ death             <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0…
$ yrdth             <dbl> NA, NA, NA, 85, NA, NA, NA, NA, 84, 86, NA, 92, NA, …
$ modth             <dbl> NA, NA, NA, 2, NA, NA, NA, NA, 10, 10, NA, 11, NA, N…
$ dadth             <dbl> NA, NA, NA, 14, NA, NA, NA, NA, 13, 17, NA, 28, NA, …
$ sbp               <dbl> 175, 123, 115, 148, 118, 141, 132, 100, 163, 184, 13…
$ dbp               <dbl> 96, 80, 75, 78, 77, 83, 69, 53, 79, 106, 89, 69, 80,…
$ sex               <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1…
$ age               <dbl> 42, 36, 56, 68, 40, 43, 56, 29, 51, 43, 43, 34, 54, …
$ race              <dbl> 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ income            <dbl> 19, 18, 15, 15, 18, 11, 19, 22, 18, 16, 19, 18, 16, …
$ marital           <dbl> 2, 2, 3, 3, 2, 4, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2…
$ school            <dbl> 7, 9, 11, 5, 11, 9, 12, 12, 10, 11, 12, 12, 8, 12, 0…
$ education         <dbl> 1, 2, 2, 1, 2, 2, 3, 3, 2, 2, 3, 3, 1, 3, 1, 3, 3, 3…
$ ht                <dbl> 174.1875, 159.3750, 168.5000, 170.1875, 181.8750, 16…
$ wt71              <dbl> 79.04, 58.63, 56.81, 59.42, 87.09, 99.00, 63.05, 58.…
$ wt82              <dbl> 68.94604, 61.23497, 66.22449, 64.41012, 92.07925, 10…
$ wt82_71           <dbl> -10.0939598, 2.6049700, 9.4144860, 4.9901165, 4.9892…
$ birthplace        <dbl> 47, 42, 51, 37, 42, 34, NA, NA, 42, 42, 42, 42, NA, …
$ smokeintensity    <dbl> 30, 20, 20, 3, 20, 10, 20, 2, 25, 20, 30, 40, 20, 10…
$ smkintensity82_71 <dbl> -10, -10, -14, 4, 0, 10, 0, 1, -10, -20, -30, -10, -…
$ smokeyrs          <dbl> 29, 24, 26, 53, 19, 21, 39, 9, 37, 25, 24, 20, 19, 3…
$ asthma            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ bronch            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0…
$ tb                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ hf                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hbp               <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ pepticulcer       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ colitis           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hepatitis         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ chroniccough      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ hayfever          <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ diabetes          <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ polio             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ tumor             <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ nervousbreak      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ alcoholpy         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1…
$ alcoholfreq       <dbl> 1, 0, 3, 2, 2, 3, 1, 0, 1, 0, 3, 0, 0, 4, 4, 3, 2, 1…
$ alcoholtype       <dbl> 3, 1, 4, 3, 1, 2, 3, 2, 3, 1, 3, 1, 1, 4, 4, 3, 3, 2…
$ alcoholhowmuch    <dbl> 7, 4, NA, 4, 2, 1, 4, 1, 2, 6, 2, 6, 1, NA, NA, 3, 2…
$ pica              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ headache          <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1…
$ otherpain         <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ weakheart         <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ allergies         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ nerves            <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ lackpep           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hbpmed            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ boweltrouble      <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ wtloss            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ infection         <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ active            <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 1, 0, 1, 1, 1, 0, 0, 1…
$ exercise          <dbl> 2, 0, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 0, 2, 1…
$ birthcontrol      <dbl> 2, 2, 0, 2, 2, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0…
$ pregnancies       <dbl> NA, NA, 2, NA, NA, 1, 1, 2, NA, NA, NA, NA, 2, NA, 1…
$ cholesterol       <dbl> 197, 301, 157, 174, 216, 212, 205, 166, 337, 279, 17…
$ hightax82         <dbl> 0, 0, 0, 0, 0, 1, NA, NA, 0, 0, 0, 0, NA, 0, NA, 0, …
$ price71           <dbl> 2.183594, 2.346680, 1.569580, 1.506592, 2.346680, 2.…
$ price82           <dbl> 1.739990, 1.797363, 1.513428, 1.451904, 1.797363, 2.…
$ tax71             <dbl> 1.1022949, 1.3649902, 0.5512695, 0.5249023, 1.364990…
$ tax82             <dbl> 0.4619751, 0.5718994, 0.2309875, 0.2199707, 0.571899…
$ price71_82        <dbl> 0.44378662, 0.54931641, 0.05619812, 0.05479431, 0.54…
$ tax71_82          <dbl> 0.6403809, 0.7929688, 0.3202515, 0.3049927, 0.792968…

For now, let’s focus on observations (rows) that don’t have any missing data. To do this, let’s get a summary of how much missingness we have across the 1,629 observations.

allvars <- names(nhefs)

df_miss <- data.frame(variable = allvars)

df_miss <- df_miss %>%
  dplyr::rowwise() %>%
  dplyr::mutate(n_miss = sum(is.na(nhefs[[variable]])))

print(df_miss, n = nrow(df_miss))
# A tibble: 64 × 2
# Rowwise: 
   variable          n_miss
   <chr>              <int>
 1 seqn                   0
 2 qsmk                   0
 3 death                  0
 4 yrdth               1311
 5 modth               1307
 6 dadth               1307
 7 sbp                   77
 8 dbp                   81
 9 sex                    0
10 age                    0
11 race                   0
12 income                62
13 marital                0
14 school                 0
15 education              0
16 ht                     0
17 wt71                   0
18 wt82                  63
19 wt82_71               63
20 birthplace            92
21 smokeintensity         0
22 smkintensity82_71      0
23 smokeyrs               0
24 asthma                 0
25 bronch                 0
26 tb                     0
27 hf                     0
28 hbp                    0
29 pepticulcer            0
30 colitis                0
31 hepatitis              0
32 chroniccough           0
33 hayfever               0
34 diabetes               0
35 polio                  0
36 tumor                  0
37 nervousbreak           0
38 alcoholpy              0
39 alcoholfreq            0
40 alcoholtype            0
41 alcoholhowmuch       417
42 pica                   0
43 headache               0
44 otherpain              0
45 weakheart              0
46 allergies              0
47 nerves                 0
48 lackpep                0
49 hbpmed                 0
50 boweltrouble           0
51 wtloss                 0
52 infection              0
53 active                 0
54 exercise               0
55 birthcontrol           0
56 pregnancies          903
57 cholesterol           16
58 hightax82             92
59 price71               92
60 price82               92
61 tax71                 92
62 tax82                 92
63 price71_82            92
64 tax71_82              92

That looks a little hairy. All we did here was create a data.frame, store the NHEFS variable names in its first column, and then, for each variable, count the number of missing values in the NHEFS data set.

However, we could have gotten this summary in a number of different ways.

lapply(nhefs, function(x) table(is.na(x)))
$seqn

FALSE 
 1629 

$qsmk

FALSE 
 1629 

$death

FALSE 
 1629 

$yrdth

FALSE  TRUE 
  318  1311 

$modth

FALSE  TRUE 
  322  1307 

$dadth

FALSE  TRUE 
  322  1307 

$sbp

FALSE  TRUE 
 1552    77 

$dbp

FALSE  TRUE 
 1548    81 

$sex

FALSE 
 1629 

$age

FALSE 
 1629 

$race

FALSE 
 1629 

$income

FALSE  TRUE 
 1567    62 

$marital

FALSE 
 1629 

$school

FALSE 
 1629 

$education

FALSE 
 1629 

$ht

FALSE 
 1629 

$wt71

FALSE 
 1629 

$wt82

FALSE  TRUE 
 1566    63 

$wt82_71

FALSE  TRUE 
 1566    63 

$birthplace

FALSE  TRUE 
 1537    92 

$smokeintensity

FALSE 
 1629 

$smkintensity82_71

FALSE 
 1629 

$smokeyrs

FALSE 
 1629 

$asthma

FALSE 
 1629 

$bronch

FALSE 
 1629 

$tb

FALSE 
 1629 

$hf

FALSE 
 1629 

$hbp

FALSE 
 1629 

$pepticulcer

FALSE 
 1629 

$colitis

FALSE 
 1629 

$hepatitis

FALSE 
 1629 

$chroniccough

FALSE 
 1629 

$hayfever

FALSE 
 1629 

$diabetes

FALSE 
 1629 

$polio

FALSE 
 1629 

$tumor

FALSE 
 1629 

$nervousbreak

FALSE 
 1629 

$alcoholpy

FALSE 
 1629 

$alcoholfreq

FALSE 
 1629 

$alcoholtype

FALSE 
 1629 

$alcoholhowmuch

FALSE  TRUE 
 1212   417 

$pica

FALSE 
 1629 

$headache

FALSE 
 1629 

$otherpain

FALSE 
 1629 

$weakheart

FALSE 
 1629 

$allergies

FALSE 
 1629 

$nerves

FALSE 
 1629 

$lackpep

FALSE 
 1629 

$hbpmed

FALSE 
 1629 

$boweltrouble

FALSE 
 1629 

$wtloss

FALSE 
 1629 

$infection

FALSE 
 1629 

$active

FALSE 
 1629 

$exercise

FALSE 
 1629 

$birthcontrol

FALSE 
 1629 

$pregnancies

FALSE  TRUE 
  726   903 

$cholesterol

FALSE  TRUE 
 1613    16 

$hightax82

FALSE  TRUE 
 1537    92 

$price71

FALSE  TRUE 
 1537    92 

$price82

FALSE  TRUE 
 1537    92 

$tax71

FALSE  TRUE 
 1537    92 

$tax82

FALSE  TRUE 
 1537    92 

$price71_82

FALSE  TRUE 
 1537    92 

$tax71_82

FALSE  TRUE 
 1537    92 
sapply(nhefs, function(x) sum(is.na(x)))
             seqn              qsmk             death             yrdth 
                0                 0                 0              1311 
            modth             dadth               sbp               dbp 
             1307              1307                77                81 
              sex               age              race            income 
                0                 0                 0                62 
          marital            school         education                ht 
                0                 0                 0                 0 
             wt71              wt82           wt82_71        birthplace 
                0                63                63                92 
   smokeintensity smkintensity82_71          smokeyrs            asthma 
                0                 0                 0                 0 
           bronch                tb                hf               hbp 
                0                 0                 0                 0 
      pepticulcer           colitis         hepatitis      chroniccough 
                0                 0                 0                 0 
         hayfever          diabetes             polio             tumor 
                0                 0                 0                 0 
     nervousbreak         alcoholpy       alcoholfreq       alcoholtype 
                0                 0                 0                 0 
   alcoholhowmuch              pica          headache         otherpain 
              417                 0                 0                 0 
        weakheart         allergies            nerves           lackpep 
                0                 0                 0                 0 
           hbpmed      boweltrouble            wtloss         infection 
                0                 0                 0                 0 
           active          exercise      birthcontrol       pregnancies 
                0                 0                 0               903 
      cholesterol         hightax82           price71           price82 
               16                92                92                92 
            tax71             tax82        price71_82          tax71_82 
               92                92                92                92 
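As one more alternative (a sketch we're adding here, not in the original text), base R's colSums() applied to is.na() returns the same per-column counts in a single call. A small toy data frame with hypothetical values stands in for the NHEFS data:

```r
# Toy data frame standing in for the NHEFS data (hypothetical values)
toy <- data.frame(
  sbp = c(120, NA, 135, NA),
  age = c(42, 36, 56, 68)
)

# colSums() treats TRUE as 1 and FALSE as 0, so summing is.na() column-wise
# counts the missing values in each variable
colSums(is.na(toy))
```

Run against the real data, `colSums(is.na(nhefs))` would reproduce the sapply() output above.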
nhefs[complete.cases(nhefs), ]
# A tibble: 44 × 64
    seqn  qsmk death yrdth modth dadth   sbp   dbp   sex   age  race income
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1  1096     1     1    84    10    17   127    56     1    72     0     12
 2  1539     0     1    92     3    23   125    70     1    56     0     19
 3  1762     0     1    87     8    29   176   108     1    64     0     19
 4  1776     0     1    90    11    24   155    70     1    32     1     15
 5  2286     0     1    87     8    24   132    88     1    45     1     12
 6  4556     1     1    86     9     5   109    85     1    54     0     19
 7  5415     0     1    84     7    28   123    81     1    32     1     11
 8  6222     0     1    84     1    18   168    92     1    46     0     13
 9  6933     0     1    92     4    25   168    88     1    62     0     19
10  6944     1     1    88     7    15   138    75     1    56     0     19
# … with 34 more rows, and 52 more variables: marital <dbl>, school <dbl>,
#   education <dbl>, ht <dbl>, wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>,
#   birthplace <dbl>, smokeintensity <dbl>, smkintensity82_71 <dbl>,
#   smokeyrs <dbl>, asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
#   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
#   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
#   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …

Let’s drop every row with a missing value for any variable.

nhefs2 <- na.omit(nhefs)
tibble::glimpse(nhefs2)
Rows: 44
Columns: 64
$ seqn              <dbl> 1096, 1539, 1762, 1776, 2286, 4556, 5415, 6222, 6933…
$ qsmk              <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1…
$ death             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ yrdth             <dbl> 84, 92, 87, 90, 87, 86, 84, 84, 92, 88, 90, 88, 88, …
$ modth             <dbl> 10, 3, 8, 11, 8, 9, 7, 1, 4, 7, 1, 3, 6, 8, 11, 7, 6…
$ dadth             <dbl> 17, 23, 29, 24, 24, 5, 28, 18, 25, 15, 20, 8, 3, 11,…
$ sbp               <dbl> 127, 125, 176, 155, 132, 109, 123, 168, 168, 138, 12…
$ dbp               <dbl> 56, 70, 108, 70, 88, 85, 81, 92, 88, 75, 64, 71, 71,…
$ sex               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ age               <dbl> 72, 56, 64, 32, 45, 54, 32, 46, 62, 56, 39, 72, 67, …
$ race              <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ income            <dbl> 12, 19, 19, 15, 12, 19, 11, 13, 19, 19, 19, 19, 15, …
$ marital           <dbl> 3, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 3, 2, 2, 5, 3, 3…
$ school            <dbl> 10, 8, 15, 9, 12, 12, 6, 11, 15, 12, 12, 12, 8, 12, …
$ education         <dbl> 2, 1, 4, 2, 3, 3, 1, 2, 4, 3, 3, 3, 1, 3, 3, 1, 2, 1…
$ ht                <dbl> 165.5000, 153.8750, 155.0938, 164.7812, 161.5938, 16…
$ wt71              <dbl> 71.21, 49.90, 48.53, 77.00, 58.29, 66.56, 73.03, 61.…
$ wt82              <dbl> 66.67808, 58.96701, 43.09128, 92.53284, 58.51342, 74…
$ wt82_71           <dbl> -4.5319216, 9.0670081, -5.4387248, 15.5328435, 0.223…
$ birthplace        <dbl> 25, 26, 50, 37, 26, 19, 1, 22, 37, 6, 30, 29, 47, 17…
$ smokeintensity    <dbl> 20, 40, 10, 10, 8, 5, 20, 30, 20, 13, 20, 15, 3, 6, …
$ smkintensity82_71 <dbl> -20, 0, 2, 0, 2, -5, -2, -15, 5, -13, 0, 5, 3, -6, -…
$ smokeyrs          <dbl> 40, 34, 38, 14, 26, 8, 14, 26, 12, 40, 23, 41, 36, 3…
$ asthma            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ bronch            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ tb                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hf                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hbp               <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0…
$ pepticulcer       <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0…
$ colitis           <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hepatitis         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ chroniccough      <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hayfever          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ diabetes          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ polio             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ tumor             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ nervousbreak      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ alcoholpy         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ alcoholfreq       <dbl> 1, 0, 0, 2, 2, 1, 0, 3, 2, 0, 1, 0, 2, 1, 3, 1, 3, 3…
$ alcoholtype       <dbl> 3, 1, 3, 1, 1, 3, 1, 3, 3, 3, 1, 3, 1, 3, 3, 1, 2, 3…
$ alcoholhowmuch    <dbl> 2, 1, 1, 9, 2, 2, 1, 1, 2, 2, 3, 1, 1, 2, 1, 2, 1, 1…
$ pica              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ headache          <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ otherpain         <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1…
$ weakheart         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ allergies         <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ nerves            <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0…
$ lackpep           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ hbpmed            <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0…
$ boweltrouble      <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
$ wtloss            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ infection         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1…
$ active            <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
$ exercise          <dbl> 2, 1, 1, 2, 0, 0, 2, 2, 1, 1, 0, 2, 2, 0, 1, 1, 1, 2…
$ birthcontrol      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pregnancies       <dbl> 4, 3, 3, 3, 4, 3, 7, 3, 3, 1, 6, 4, 1, 3, 2, 5, 6, 7…
$ cholesterol       <dbl> 260, 181, 306, 126, 191, 377, 137, 258, 204, 200, 27…
$ hightax82         <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ price71           <dbl> 2.414551, 2.099609, 2.099609, 1.506592, 2.099609, 2.…
$ price82           <dbl> 1.951172, 1.929199, 1.693848, 1.451904, 1.929199, 1.…
$ tax71             <dbl> 1.2597656, 0.9974365, 1.0498047, 0.5249023, 0.997436…
$ tax82             <dbl> 0.6379395, 0.6379395, 0.4399414, 0.2199707, 0.637939…
$ price71_82        <dbl> 0.46356201, 0.17059326, 0.40594482, 0.05479431, 0.17…
$ tax71_82          <dbl> 0.6219482, 0.3594971, 0.6099854, 0.3049927, 0.359497…

Oops, we have only 44 rows left in the data set! Perhaps we should reconsider and pick several variables that will be of interest to us for our current purposes.

selectvars <- c(
  "age", "alcoholfreq", "asthma", "wt71",
  "smokeintensity", "sbp", "allergies",
  "weakheart"
)

nhefs3 <- na.omit(nhefs[, selectvars])
tibble::glimpse(nhefs3)
Rows: 1,552
Columns: 8
$ age            <dbl> 42, 36, 56, 68, 40, 43, 56, 29, 51, 43, 43, 34, 54, 51,…
$ alcoholfreq    <dbl> 1, 0, 3, 2, 2, 3, 1, 0, 1, 0, 3, 0, 0, 4, 4, 3, 2, 1, 0…
$ asthma         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ wt71           <dbl> 79.04, 58.63, 56.81, 59.42, 87.09, 99.00, 63.05, 58.74,…
$ smokeintensity <dbl> 30, 20, 20, 3, 20, 10, 20, 2, 25, 20, 30, 40, 20, 10, 4…
$ sbp            <dbl> 175, 123, 115, 148, 118, 141, 132, 100, 163, 184, 135, …
$ allergies      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ weakheart      <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

That’s better. We’ve discarded 77 rows and focused on a subset of variables of interest.

Look back at the output we got from glimpse(). Are the variables in the formats we expect/need? I see one that we may need to update, depending on how we plan to use it. (Hint: Look back at the codebook for the variable summaries)

It seems that alcoholfreq is currently being treated as a numeric variable, but it’s really a categorical variable with discrete levels 0 through 5, where level 5 is Unknown and should probably be treated as missing. We’ll convert alcoholfreq to a factor.

We could leave alcoholfreq as a numeric variable and wrap it in the factor() function only when we needed to. In some cases that approach might be preferable; in others, not.
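To illustrate the wrap-as-needed approach (a small sketch using a toy vector rather than the NHEFS data), factor() can be applied on the fly wherever a categorical treatment is required:

```r
# Toy numeric vector with the same coding idea as alcoholfreq (hypothetical)
alc <- c(0, 1, 2, 3, 4, 2, 1)

# Treat it as categorical only where needed, e.g., in a frequency table
table(factor(alc))

# ...or in a model formula, where factor() expands to indicator terms:
# lm(outcome ~ factor(alc), data = some_data)  # sketch only
```

The underlying column stays numeric; only the call site decides how to treat it.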

nhefs3$alcfreq <- as.factor(nhefs3$alcoholfreq)
table(nhefs3$alcfreq)

  0   1   2   3   4   5 
320 217 489 329 192   5 

Let’s recode those 5 Unknown values (level 5) as NA and drop the corresponding rows from our data set. Let’s also get rid of the original alcoholfreq variable and just hang on to the factor variable we made.

# In words: where alcfreq in nhefs3 equals 5, assign NA
nhefs3$alcfreq[nhefs3$alcfreq == 5] <- NA

nhefs4 <- na.omit(nhefs3)
nhefs4$alcoholfreq <- NULL

tibble::glimpse(nhefs4)
Rows: 1,547
Columns: 8
$ age            <dbl> 42, 36, 56, 68, 40, 43, 56, 29, 51, 43, 43, 34, 54, 51,…
$ asthma         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ wt71           <dbl> 79.04, 58.63, 56.81, 59.42, 87.09, 99.00, 63.05, 58.74,…
$ smokeintensity <dbl> 30, 20, 20, 3, 20, 10, 20, 2, 25, 20, 30, 40, 20, 10, 4…
$ sbp            <dbl> 175, 123, 115, 148, 118, 141, 132, 100, 163, 184, 135, …
$ allergies      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ weakheart      <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ alcfreq        <fct> 1, 0, 3, 2, 2, 3, 1, 0, 1, 0, 3, 0, 0, 4, 4, 3, 2, 1, 0…

Let’s summarize our dataset:

# Continuous variables
contvars <- c("age", "wt71", "smokeintensity", "sbp")
summary(nhefs4[, contvars])
      age             wt71        smokeintensity       sbp       
 Min.   :25.00   Min.   : 39.58   Min.   : 1.00   Min.   : 87.0  
 1st Qu.:33.00   1st Qu.: 59.65   1st Qu.:10.00   1st Qu.:116.0  
 Median :43.00   Median : 69.29   Median :20.00   Median :126.0  
 Mean   :43.65   Mean   : 70.90   Mean   :20.54   Mean   :128.7  
 3rd Qu.:53.00   3rd Qu.: 79.95   3rd Qu.:30.00   3rd Qu.:139.5  
 Max.   :74.00   Max.   :151.73   Max.   :80.00   Max.   :229.0  
# Categorical variables
catvars <- c("asthma", "allergies", "alcfreq", "weakheart")
lapply(
  X = catvars,
  FUN = function(x) {
    summarytools::freq(nhefs4[x])
  })
Frequencies  
asthma  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   1474     95.28          95.28     95.28          95.28
          1     73      4.72         100.00      4.72         100.00
       <NA>      0                               0.00         100.00
      Total   1547    100.00         100.00    100.00         100.00

allergies  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   1448     93.60          93.60     93.60          93.60
          1     99      6.40         100.00      6.40         100.00
       <NA>      0                               0.00         100.00
      Total   1547    100.00         100.00    100.00         100.00

alcfreq  
Type: Factor  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    320     20.69          20.69     20.69          20.69
          1    217     14.03          34.71     14.03          34.71
          2    489     31.61          66.32     31.61          66.32
          3    329     21.27          87.59     21.27          87.59
          4    192     12.41         100.00     12.41         100.00
          5      0      0.00         100.00      0.00         100.00
       <NA>      0                               0.00         100.00
      Total   1547    100.00         100.00    100.00         100.00

weakheart  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   1512     97.74          97.74     97.74          97.74
          1     35      2.26         100.00      2.26         100.00
       <NA>      0                               0.00         100.00
      Total   1547    100.00         100.00    100.00         100.00

The lapply() call above is a tidy way to apply summarytools::freq() to each categorical variable in turn.

A few plots may help us, too:

library(ggplot2)
library(ggthemes)

nhefs4 %>%
  dplyr::select(all_of(contvars)) %>%
  tidyr::gather() %>%
  ggplot(aes(x = value)) +
  facet_wrap(~key, scales = "free") +
  geom_density() +
  theme_tufte(base_size = 16)

nhefs4 %>%
  dplyr::select(all_of(catvars)) %>%
  tidyr::gather() %>%
  ggplot(aes(x = value)) +
  facet_wrap(~key, nrow = 2, scales = "free") +
  geom_bar(stat = "count", width = 0.5, fill = "lightgray") +
  theme_tufte(base_size = 16)

Don’t worry just now about the plot code. We’ll be looking in some detail at the ggplot2 package later.

6.3 Tabular Analysis

Soon you will be no stranger to two-by-two tables and contingency tables. Introductory statistics and epidemiologic methods courses will require you to analyze such tables, either to measure the strength of an exposure’s effect on some outcome or to establish a statistical association between two variables.

Going back to our NHEFS data, let’s say we’re interested in a possible association between quitting smoking and high blood pressure. The qsmk variable is a binary indicator of whether or not an individual quit smoking between 1971 and 1982, while the sbp variable measures an individual’s systolic blood pressure in 1982.

Typically, we would not want to dichotomize a continuous variable like sbp, but for the sake of this example, let’s begin by creating a binary indicator of whether a subject had a high (SBP > 140) or low (SBP <= 140) systolic blood pressure in 1982.

# check whether any values of sbp are missing
any(is.na(nhefs$sbp))
[1] TRUE
# count the number of missing observations
# is.na() will return TRUE if an observation is missing and FALSE if it is not
# we can use sum() to add up the number of TRUE values because R will treat a
# logical variable as numeric for this purpose, where TRUE = 1 and FALSE = 0
sum(is.na(nhefs$sbp))
[1] 77
# create the new variable, called sbp_hi
nhefs$sbp_hi <- NA #initialize with NA
nhefs$sbp_hi[nhefs$sbp > 140] <- 1
nhefs$sbp_hi[nhefs$sbp <= 140] <- 0

# make sure we coded the new variable correctly
summary(nhefs$sbp[nhefs$sbp_hi == 1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  141.0   145.0   150.0   155.3   162.0   229.0      77 
summary(nhefs$sbp[nhefs$sbp_hi == 0])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   87.0   113.0   122.0   120.7   129.0   140.0      77 

The summary statistics we generated above should comply with our expectations: everyone we coded as sbp_hi = 1 has a systolic blood pressure above 140, while everyone we coded as sbp_hi = 0 has a value of 140 or below.

Note, too, the 77 NA’s in each summary above: subsetting by sbp_hi == 1 (or == 0) returns an NA element wherever sbp_hi itself is missing, so the 77 missing values in the original outcome carry through to our new variable.
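As an aside, the same recode can be written in one line with ifelse(), which propagates missing values automatically, so no separate NA initialization step is needed. A tiny self-contained sketch:

```r
# ifelse() keeps NA as NA: no need to initialize the new variable first
sbp <- c(175, 123, NA, 148)
ifelse(sbp > 140, 1, 0)
#> [1]  1  0 NA  1
```

With the NHEFS data, the equivalent call would be nhefs$sbp_hi <- ifelse(nhefs$sbp > 140, 1, 0).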

with(nhefs, table(sbp_hi, qsmk, exclude = NULL))
      qsmk
sbp_hi   0   1
  0    908 284
  1    251 109
  <NA>  42  35

The option exclude = NULL above includes missing values in the table output. We’ll ignore them for simplicity.

After doing so, we’re left with the following two-by-two table for analysis.

nhefs %>%
  count(qsmk, sbp_hi) %>%
  na.omit() %>%
  pivot_wider(
    names_from = sbp_hi,
    values_from = n,
    names_prefix = "sbp"
  )
# A tibble: 2 × 3
   qsmk  sbp0  sbp1
  <dbl> <int> <int>
1     0   908   251
2     1   284   109

A natural choice here would be to conduct a two-sample test for the equality of proportions, where we compare the proportion of subjects in the quit smoking group (0.277) who had high systolic blood pressure in 1982 versus the proportion in the non-quit smoking group (0.217).
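Those two proportions come straight from the counts in the two-by-two table above:

```r
# proportion with high SBP among those who quit (qsmk = 1)
# and those who did not quit (qsmk = 0)
p_quit   <- 109 / (109 + 284)
p_noquit <- 251 / (251 + 908)
round(c(quit = p_quit, noquit = p_noquit), 3)
```

This returns 0.277 and 0.217, matching the sample estimates printed by prop.test() below.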

ptest <- prop.test(c(109, 251), c(109 + 284, 251 + 908))
ptest

    2-sample test for equality of proportions with continuity correction

data:  c(109, 251) out of c(109 + 284, 251 + 908)
X-squared = 5.7508, df = 1, p-value = 0.01648
alternative hypothesis: two.sided
95 percent confidence interval:
 0.008869684 0.112705685
sample estimates:
   prop 1    prop 2 
0.2773537 0.2165660 

The output above shows us the group proportions, the \(\chi^2\) test statistic, and the corresponding p-value. We also get a 95% confidence interval for the difference in proportions.
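To see where that interval comes from, here is a minimal sketch of the usual Wald interval for a difference in proportions. Note that prop.test() also applies a continuity correction, so its limits are slightly wider than this uncorrected version:

```r
# Wald 95% CI for the difference in proportions (no continuity correction)
p1 <- 109 / 393    # quit smoking group
p2 <- 251 / 1159   # did not quit
se <- sqrt(p1 * (1 - p1) / 393 + p2 * (1 - p2) / 1159)
(p1 - p2) + c(-1.96, 1.96) * se
```

This gives roughly 0.011 to 0.111; adding the continuity correction term, (1/393 + 1/1159)/2, to each side recovers prop.test()’s interval of 0.0089 to 0.1127.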

To get a tidier output as a tibble, the broom package often proves useful:

broom::tidy(ptest)
# A tibble: 1 × 9
  estimate1 estimate2 statistic p.value parameter conf.low conf.high method     
      <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>      
1     0.277     0.217      5.75  0.0165         1  0.00887     0.113 2-sample t…
# … with 1 more variable: alternative <chr>

We also could have given the table directly to the prop.test() function:

with(nhefs, prop.test(table(qsmk, sbp_hi)))

    2-sample test for equality of proportions with continuity correction

data:  table(qsmk, sbp_hi)
X-squared = 5.7508, df = 1, p-value = 0.01648
alternative hypothesis: two.sided
95 percent confidence interval:
 0.008869684 0.112705685
sample estimates:
   prop 1    prop 2 
0.7834340 0.7226463 

Know Your Output Notice that the proportions reported by this second prop.test() differ from the first, even though the \(\chi^2\) statistic, the p-value, and (here) even the confidence interval are the same. Given a table, prop.test() treats the rows (qsmk = 0, then 1) as the groups and the first column (sbp_hi = 0) as the “success,” so it compares the proportions with low blood pressure, in the opposite group order. Those two reversals happen to cancel in the interval for the difference, but the proportions themselves no longer answer our question, and reporting them as-is would be incorrect. Always be sure you understand functions’ outputs!

6.4 Regression Models

In this section we will fit some regression models that you may encounter in classes or in your own research. We will leave the theoretical background to your statistics and methods courses but aim to provide enough information so that you understand the basic idea behind each approach and can refer back to this page in the future.

All upcoming examples use variables from the NHEFS dataset that we’ve referred back to several times by now.

Note that you can pass all of the fit objects below to broom::tidy() in order to retrieve a data frame containing the regression information. Doing so will often make exporting tables and results much easier for you. However, we will focus on the default summary output for now.

6.4.1 Linear regression

Exposure of interest: quitting smoking between 1971 and 1982 (qsmk) Outcome: systolic blood pressure in 1982 (continuous, sbp)

First, we can run an unadjusted model with systolic blood pressure as the outcome and quitting smoking as the predictor variable. The symbol ~ indicates that the statement sbp ~ qsmk is a formula: model sbp as a function of qsmk.

fitlm1 <- lm(sbp ~ qsmk, data = nhefs)
summary(fitlm1)

Call:
lm(formula = sbp ~ qsmk, data = nhefs)

Residuals:
   Min     1Q Median     3Q    Max 
-40.70 -12.70  -2.69  10.30 101.30 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 127.6989     0.5575 229.070  < 2e-16 ***
qsmk          3.9907     1.1078   3.602 0.000325 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.98 on 1550 degrees of freedom
  (77 observations deleted due to missingness)
Multiple R-squared:  0.008302,  Adjusted R-squared:  0.007663 
F-statistic: 12.98 on 1 and 1550 DF,  p-value: 0.0003254

You can think of the fitlm1 object that we saved as a list containing information about the model we fit:

names(fitlm1)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"        

We can output coefficients,

fitlm1$coefficients
(Intercept)        qsmk 
 127.698878    3.990689 

residuals,

head(fitlm1$residuals)
         1          2          3          4          5          6 
 47.301122  -4.698878 -12.698878  20.301122  -9.698878  13.301122 

and the data on which R fitted the model:

head(fitlm1$model)
  sbp qsmk
1 175    0
2 123    0
3 115    0
4 148    0
5 118    0
6 141    0

Detour

Why did we bother to note that the model object saves the “data on which R fitted the model”? Didn’t R fit the model on the data we told it to?

To answer this question, let’s take a brief detour and check that the data frame saved in fitlm1 has the same number of rows as the nhefs data frame we fed to the lm() function.

nrow(fitlm1$model)
[1] 1552
nrow(nhefs)
[1] 1629

Interesting! The row counts of the two data sets don’t match, but why?

Recall that earlier on this page, we found 77 missing values of the outcome sbp, and if we subtract 1552 from 1629, we get 77. That’s because R’s modeling functions omit rows with missing data for any variable included in the model formula. The model object saves the row numbers of dropped observations in fitlm1$na.action, and we can verify by extracting these rows from the original nhefs data set:

# extract row numbers that were dropped from the fitlm1 model
# keep only sbp column and see that they're all missing
nhefs[fitlm1$na.action, c("sbp")] %>%
  count(sbp)
# A tibble: 1 × 2
    sbp     n
  <dbl> <int>
1    NA    77

Not to belabor the point, but let’s convince ourselves a bit more by identifying the rows with missing values for sbp and comparing this to the list R made:

# identify rows with missing values for sbp
# vector of numeric row indices
missindex <- which(is.na(nhefs$sbp))

## sort vectors in numerical order, generate a cross table
## NOTE sorting is unnecessary in this case, but it's good
##      practice to ensure elements are ordered properly
table(sort(missindex) == sort(fitlm1$na.action))

TRUE 
  77 

The index vector we created manually matches the one R created and saved to fitlm1.

R also provides functions designed to extract specific information from model fits, though we’ll suppress the output since we already printed out some of this information above:

# extract model coefficients
coef(fitlm1)

# extract model residuals
residuals(fitlm1)

Let’s look at some plot diagnostics to make sure our linear regression complies with the model’s assumptions:

par(mfrow = c(2, 2)) tells R that we want a 2x2 plot grid. If you ran plot(fitlm1) by itself, R would output four images in succession.

par(mfrow = c(2, 2))
plot(fitlm1)

Based on the Q-Q plot, we may have some violations of the normality assumption (i.e., residuals are not normally distributed), as evidenced by the points’ departure from the diagonal line. In a real analysis we would need to address this issue, possibly by transforming the outcome. However, our focus here is on implementing models in code, so we will proceed, leaving issues of statistical practice to your relevant coursework.
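As a sketch of one such remedy (and not necessarily the right one for these data), transforming the outcome only changes the model formula; with the NHEFS data the call would be something like lm(log(sbp) ~ qsmk, data = nhefs). A self-contained toy version, using a simulated right-skewed outcome:

```r
set.seed(123)
x <- rbinom(300, 1, 0.5)
# right-skewed outcome: exponentiating a normal error induces the skew
y <- exp(4.8 + 0.03 * x + rnorm(300, sd = 0.15))

fit_raw <- lm(y ~ x)        # residuals inherit the right skew
fit_log <- lm(log(y) ~ x)   # residuals are normal on the log scale by construction
```

After refitting, you would rerun plot(fit_log) and recheck the Q-Q plot, keeping in mind that coefficients from a log-scale model are interpreted multiplicatively.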

Our unadjusted estimate, therefore, suggests that those who quit smoking between 1971 and 1982 had, on average, a 3.99 mmHg higher systolic blood pressure in 1982 than those who did not quit during that timeframe.

The unadjusted linear regression we just ran is actually equivalent to the two-sample t-test (assuming equal variances between the quit smoking groups), which you’ll meet early on in your stats classes. Compare the output of the code below to that of the linear regression.

Comparing Outputs What are the t statistics?

What are the p-values for the difference in systolic blood pressure between quit smoking groups?

What are the degrees of freedom?

When you take the difference of “mean of x” and “mean of y” from the output of t.test(), do you find that value in the linear regression summary?

with(nhefs, t.test(sbp[qsmk == 1], sbp[qsmk == 0], var.equal = TRUE))

    Two Sample t-test

data:  sbp[qsmk == 1] and sbp[qsmk == 0]
t = 3.6023, df = 1550, p-value = 0.0003254
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.817711 6.163668
sample estimates:
mean of x mean of y 
 131.6896  127.6989 

The code below fits a linear model for the effect of qsmk on sbp adjusted for years of smoking (smokeyrs), intensity of smoking (smokeintensity), diabetes, sex, and age. Note that we allow for an “interaction” between sex and age, denoted by sex * age.

fitlm2 <- lm(
  sbp ~ qsmk + smokeyrs + smokeintensity + diabetes + sex * age,
  data = nhefs
)

summary(fitlm2)

Call:
lm(formula = sbp ~ qsmk + smokeyrs + smokeintensity + diabetes + 
    sex * age, data = nhefs)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.309 -11.188  -1.752   9.099  94.590 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    113.85479    2.95509  38.528  < 2e-16 ***
qsmk             1.51803    1.03207   1.471 0.141533    
smokeyrs         0.06761    0.08043   0.841 0.400721    
smokeintensity  -0.06613    0.03933  -1.681 0.092922 .  
diabetes        -0.62827    0.44678  -1.406 0.159859    
sex            -16.63281    3.36583  -4.942 8.59e-07 ***
age              0.39023    0.09439   4.134 3.76e-05 ***
sex:age          0.28503    0.07558   3.771 0.000169 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.35 on 1544 degrees of freedom
  (77 observations deleted due to missingness)
Multiple R-squared:  0.1743,    Adjusted R-squared:  0.1706 
F-statistic: 46.56 on 7 and 1544 DF,  p-value: < 2.2e-16

Our adjusted estimate suggests that quitting smoking was associated with a 1.52 mmHg higher average systolic blood pressure in 1982, conditional on the other variables we included in the second linear model.

Almost always, we will want to report the standard error estimates and/or confidence intervals alongside our measures of effect. The broom::tidy() function makes it easy to extract 95% confidence intervals from regression fit objects (e.g., objects like fitlm1 or fitlm2):

broom::tidy(fitlm2, conf.int = TRUE) %>%
  dplyr::filter(term == "qsmk")
# A tibble: 1 × 7
  term  estimate std.error statistic p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 qsmk      1.52      1.03      1.47   0.142   -0.506      3.54

Alternatively, we can calculate the same confidence interval manually:

# point estimate
est <- unname(coef(fitlm2)["qsmk"])

# standard error of point estimate
se <- unname(summary(fitlm2)$coefficients[, c("Std. Error")]["qsmk"])

# lower and upper bounds of 95% confidence limits
ll <- est - 1.96 * se
ul <- est + 1.96 * se

c(estimate = est, ll95 = ll, ul95 = ul)
  estimate       ll95       ul95 
 1.5180349 -0.5048284  3.5408982 
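Note that the 1.96 multiplier is a normal approximation; R’s built-in confint() instead uses the t distribution with the model’s residual degrees of freedom, so its limits differ very slightly. With the NHEFS data you would simply call confint(fitlm2, "qsmk"). A self-contained sketch with simulated data:

```r
set.seed(42)
d <- data.frame(x = rbinom(200, 1, 0.5))
d$y <- 120 + 4 * d$x + rnorm(200, sd = 18)

fit <- lm(y ~ x, data = d)
confint(fit, "x", level = 0.95)   # t-based 95% CI for the x coefficient

# equivalent by hand: estimate +/- t-quantile * standard error
est <- coef(fit)["x"]
se  <- summary(fit)$coefficients["x", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se
```

With 1,500+ observations, as in our models above, the t quantile is so close to 1.96 that the two intervals agree to a few decimal places.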

6.4.2 Logistic regression

Let’s revisit our tabular analysis, in which we were interested in whether quitting smoking affected the likelihood of a subject’s having high systolic blood pressure in 1982.

We can answer the same question with a logistic regression model, using the glm() function.

fitlog1 <- glm(
  sbp_hi ~ qsmk,
  data = nhefs,
  family = binomial()
)

summary(fitlog1)

Call:
glm(formula = sbp_hi ~ qsmk, family = binomial(), data = nhefs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8060  -0.6987  -0.6987  -0.6987   1.7492  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.28579    0.07131 -18.031   <2e-16 ***
qsmk         0.32817    0.13334   2.461   0.0139 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1681.2  on 1551  degrees of freedom
Residual deviance: 1675.3  on 1550  degrees of freedom
  (77 observations deleted due to missingness)
AIC: 1679.3

Number of Fisher Scoring iterations: 4

From this regression, we see that quitting smoking was associated with a 0.328 increase in the log-odds of having high systolic blood pressure in 1982.

Usually, you will be asked to report associations from logistic regressions as odds ratios, in which case you would simply exponentiate the coefficient of interest. Here we would run exp(0.32817) and get an odds ratio of 1.39. That is, those who quit smoking had 1.39 times the odds of high systolic blood pressure in 1982, compared to those who did not quit smoking.
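A confidence interval for the odds ratio comes from exponentiating the limits computed on the log-odds scale. Using the qsmk coefficient and standard error from the summary above:

```r
# odds ratio and Wald 95% CI, exponentiated from the log-odds scale
b  <- 0.32817   # qsmk coefficient from the logistic fit
se <- 0.13334   # its standard error
round(exp(c(OR = b, ll95 = b - 1.96 * se, ul95 = b + 1.96 * se)), 2)
#>   OR ll95 ul95 
#> 1.39 1.07 1.80 
```

The same numbers fall out of broom::tidy(fitlog1, conf.int = TRUE, exponentiate = TRUE), though tidy() uses profile-likelihood rather than Wald limits for GLMs, so its interval can differ slightly.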

As in the linear regression example, you could run a model adjusting for other factors, in which case you would modify the model formula accordingly. We will forego running an adjusted model here to avoid redundancy.

6.4.3 Log-binomial regression

While the logistic regression remains the most frequently used generalized linear model (GLM) for binary outcomes, epidemiologists are often interested in risk differences and risk ratios.

Luckily, we can use the log-binomial model to estimate contrasts in risks.

fitlbin <- glm(
  sbp_hi ~ qsmk,
  data = nhefs,
  family = binomial(link = "log")
)

summary(fitlbin)

Call:
glm(formula = sbp_hi ~ qsmk, family = binomial(link = "log"), 
    data = nhefs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8060  -0.6987  -0.6987  -0.6987   1.7492  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.52986    0.05587 -27.383   <2e-16 ***
qsmk         0.24740    0.09875   2.505   0.0122 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1681.2  on 1551  degrees of freedom
Residual deviance: 1675.3  on 1550  degrees of freedom
  (77 observations deleted due to missingness)
AIC: 1679.3

Number of Fisher Scoring iterations: 6

To get the risk ratio for the effect of quitting smoking on high systolic blood pressure, we again exponentiate the beta coefficient of interest and, in doing so, estimate that those who quit smoking were 1.281 times as likely to have high systolic blood pressure versus those who did not quit.

Try this Calculate the risk ratio manually and convince yourself that the log-binomial model gave you the correct answer.

6.4.4 Modified Poisson regression

Another way to get a risk ratio using GLMs involves a procedure called modified Poisson regression. “Modified” here refers to a necessary modification of the standard error estimates when using a Poisson regression model to estimate a risk ratio. Again, save the statistical details for later. All we want to know for the moment are the nuts-and-bolts: how to fit the model and how to modify the standard errors.

fitmodpois <- glm(
  sbp_hi ~ qsmk,
  data = nhefs,
  family = poisson()
)

summary(fitmodpois)

Call:
glm(formula = sbp_hi ~ qsmk, family = poisson(), data = nhefs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7448  -0.6581  -0.6581  -0.6581   1.2218  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.52986    0.06311 -24.240   <2e-16 ***
qsmk         0.24740    0.11470   2.157    0.031 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052.1  on 1551  degrees of freedom
Residual deviance: 1047.6  on 1550  degrees of freedom
  (77 observations deleted due to missingness)
AIC: 1771.6

Number of Fisher Scoring iterations: 5

Let’s compare the qsmk coefficient estimate to the one we got using the log-binomial model:

modcompare <- rbind(
  broom::tidy(fitlbin) %>% mutate(model = "Log-binomial"),
  broom::tidy(fitmodpois) %>% mutate(model = "Modified Poisson")
  ) %>%
  filter(term == "qsmk") %>%
  select(model, estimate, std.error)

modcompare %>%
  knitr::kable()
|model            |  estimate| std.error|
|:----------------|---------:|---------:|
|Log-binomial     | 0.2473982| 0.0987473|
|Modified Poisson | 0.2473982| 0.1147026|

Almost exactly the same point estimate, but the standard error from the modified Poisson model is considerably larger than the log-binomial model’s.

To get a correct standard error for the modified Poisson model, we need to use a “robust” estimate of this error, also called the sandwich estimator:

# sandwich() provided by the sandwich package
# first, get the variance-covariance matrix
swvar_modpois <- sandwich::sandwich(fitmodpois)

# extract the variance of the qsmk coefficient and
# take the square root to get the standard error
sqrt(swvar_modpois[2, 2])
[1] 0.09874732

Compare this robust standard error to the log-binomial model’s: they now match. The sandwich estimator has corrected the overly conservative standard error from the naive Poisson fit.
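If you’re curious what sandwich::sandwich() is doing under the hood, here is a base-R sketch on simulated binary data (the NHEFS fit works the same way); for a Poisson GLM with dispersion 1 this reproduces the default sandwich estimate:

```r
# base-R sketch of the "robust" (sandwich) variance for a Poisson GLM
set.seed(7)
x <- rbinom(1000, 1, 0.25)
y <- rbinom(1000, 1, ifelse(x == 1, 0.28, 0.22))
fit <- glm(y ~ x, family = poisson())

X  <- model.matrix(fit)
mu <- fitted(fit)
A  <- t(X) %*% (mu * X)              # "bread": Fisher information
U  <- X * (y - mu)                   # per-observation score contributions
B  <- t(U) %*% U                     # "meat"
V  <- solve(A) %*% B %*% solve(A)    # sandwich variance matrix
sqrt(diag(V))                        # robust standard errors
```

Because the outcome is binary, its variance is mu(1 - mu) rather than the Poisson’s mu, so the robust standard errors come out smaller than the naive Poisson ones, exactly as we saw above.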

For more complex models, the log-binomial and modified Poisson models may provide slightly different answers. Each has other benefits and drawbacks that you will learn about if and when you take a class that introduces these methods.