In this section, we’ll provide some examples in SAS of basic statistical analyses you might need to implement at some point.
The analyses you conduct as part of real research will almost certainly be more complicated than the code shown here. However, the hope is that you’ll get enough of a feel for the mechanics of SAS to begin putting the pieces together as demanded by a given project.
If you’d like to try any of these examples on your own, we’re using the NHANES I Epidemiologic Followup Study (NHEFS) data set freely available at the website for Causal Inference: What If by Miguel A. Hernán and James M. Robins.
For both SAS and R, we’ll use the CSV file they provide.
The website itself contains extensive code examples in SAS, Stata, R, and Python—everything you could want!—so those who will be taking advanced epidemiologic or causal inference coursework should refer to those resources for guidance on how to implement relevant analyses. We’ll cover a little bit here, but only briefly.
Variable name | Description |
---|---|
active | IN YOUR USUAL DAY, HOW ACTIVE ARE YOU? IN 1971, 0:very active, 1:moderately active, 2:inactive |
age | AGE IN 1971 |
alcoholfreq | HOW OFTEN DO YOU DRINK? IN 1971 0: Almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: No alcohol last year, 5: Unknown |
alcoholhowmuch | WHEN YOU DRINK, HOW MUCH DO YOU DRINK? IN 1971 |
alcoholpy | HAVE YOU HAD 1 DRINK PAST YEAR? IN 1971, 1:EVER, 0:NEVER; 2:MISSING |
alcoholtype | WHICH DO YOU MOST FREQUENTLY DRINK? IN 1971 1: BEER, 2: WINE, 3: LIQUOR, 4: OTHER/UNKNOWN |
allergies | USE ALLERGIES MEDICATION IN 1971, 1:EVER, 0:NEVER |
asthma | DX ASTHMA IN 1971, 1:EVER, 0:NEVER |
bithcontrol | BIRTH CONTROL PILLS PAST 6 MONTHS? IN 1971 1:YES, 0:NO, 2:MISSING |
birthplace | CHECK STATE CODE - SECOND PAGE |
boweltrouble | USE BOWEL TROUBLE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING |
bronch | DX CHRONIC BRONCHITIS/EMPHYSEMA IN 1971, 1:EVER, 0:NEVER |
cholesterol | SERUM CHOLESTEROL (MG/100ML) IN 1971 |
chroniccough | DX CHRONIC COUGH IN 1971, 1:EVER, 0:NEVER |
colitis | DX COLITIS IN 1971, 1:EVER, 0:NEVER |
dadth | DAY OF DEATH |
dbp | DIASTOLIC BLOOD PRESSURE IN 1982 |
death | DEATH BY 1992, 1:YES, 0:NO |
diabetes | DX DIABETES IN 1971, 1:EVER, 0:NEVER, 2:MISSING |
education | AMOUNT OF EDUCATION BY 1971: 1: 8TH GRADE OR LESS, 2: HS DROPOUT, 3: HS, 4:COLLEGE DROPOUT, 5: COLLEGE OR MORE |
exercise | IN RECREATION, HOW MUCH EXERCISE? IN 1971, 0:much exercise,1:moderate exercise,2:little or no exercise |
hayfever | DX HAY FEVER IN 1971, 1:EVER, 0:NEVER |
hbp | DX HIGH BLOOD PRESSURE IN 1971, 1:EVER, 0:NEVER, 2:MISSING |
hbpmed | USE HIGH BLOOD PRESSURE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING |
headache | USE HEADACHE MEDICATION IN 1971, 1:EVER, 0:NEVER |
hepatitis | DX HEPATITIS IN 1971, 1:EVER, 0:NEVER |
hf | DX HEART FAILURE IN 1971, 1:EVER, 0:NEVER |
hightax82 | LIVING IN A HIGHLY TAXED STATE IN 1982, High taxed state of residence=1, 0 otherwise |
ht | HEIGHT IN CENTIMETERS IN 1971 |
income | TOTAL FAMILY INCOME IN 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+ |
infection | USE INFECTION MEDICATION IN 1971, 1:EVER, 0:NEVER |
lackpep | USELACK OF PEP MEDICATION IN 1971, 1:EVER, 0:NEVER |
marital | MARITAL STATUS IN 1971 1: Under 17, 2: Married, 3: Widowed, 4: Never married, 5: Divorced, 6: Separated, 8: Unknown |
modth | MONTH OF DEATH |
nerves | USE NERVES MEDICATION IN 1971, 1:EVER, 0:NEVER |
nervousbreak | DX NERVOUS BREAKDOWN IN 1971, 1:EVER, 0:NEVER |
otherpain | USE OTHER PAINS MEDICATION IN 1971, 1:EVER, 0:NEVER |
pepticulcer | DX PEPTIC ULCER IN 1971, 1:EVER, 0:NEVER |
pica | DO YOU EAT DIRT OR CLAY, STARCH OR OTHER NON STANDARD FOOD? IN 1971 1:EVER, 0:NEVER; 2:MISSING |
polio | DX POLIO IN 1971, 1:EVER, 0:NEVER |
pregnancies | TOTAL NUMBER OF PREGNANCIES? IN 1971 |
price71 | AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971 (US$2008) |
price71_82 | DIFFERENCE IN AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971-1982 (US$2008) |
price82 | AVG TOBACCO PRICE IN STATE OF RESIDENCE 1982 (US$2008) |
qsmk | QUIT SMOKING BETWEEN 1ST QUESTIONNAIRE AND 1982, 1:YES, 0:NO |
race | 0: WHITE 1: BLACK OR OTHER IN 1971 |
sbp | SYSTOLIC BLOOD PRESSURE IN 1982 |
school | HIGHEST GRADE OF REGULAR SCHOOL EVER IN 1971 |
seqn | UNIQUE PERSONAL IDENTIFIER |
sex | 0: MALE 1: FEMALE |
smokeintensity | NUMBER OF CIGARETTES SMOKED PER DAY IN 1971 |
smkintensity 82_71 | INCREASE IN NUMBER OF CIGARETTES/DAY BETWEEN 1971 and 1982 |
smokeyrs | YEARS OF SMOKING |
tax71 | TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008) |
tax71_82 | DIFFERENCE IN TOBACCO TAX IN STATE OF RESIDENCE 1971-1982 (US$2008) |
tax82 | TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008) |
tb | DX TUBERCULOSIS IN 1971, 1:EVER, 0:NEVER |
tumor | DX MALIGNANT TUMOR/GROWTH IN 1971, 1:EVER, 0:NEVER |
weakheart | USE WEAK HEART MEDICATION IN 1971, 1:EVER, 0:NEVER |
wt71 | WEIGHT IN KILOGRAMS IN 1971 |
wt82 | WEIGHT IN KILOGRAMS IN 1982 |
wt82_71 | WEIGHT CHANGE IN KILOGRAMS |
wtloss | USE WEIGHT LOSS MEDICATION IN 1971, 1:EVER, 0:NEVER |
yrdth | YEAR OF DEATH |
First, let’s read in our data and glance at it using a couple of different views.
SAS Documentation: PROC IMPORT
proc import datafile="data/nhefs.csv"
out=nhefs
dbms=csv
replace;
run;
* information about the variables in NHEFS;
proc contents data=nhefs; run;
Data Set Name | WORK.NHEFS | Observations | 1629 |
---|---|---|---|
Member Type | DATA | Variables | 64 |
Engine | V9 | Indexes | 0 |
Created | 12/15/2022 20:38:42 | Observation Length | 512 |
Last Modified | 12/15/2022 20:38:42 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | latin1 Western (ISO) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 13 |
First Data Page | 1 |
Max Obs per Page | 127 |
Obs in First Data Page | 107 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_work6AC9000060C8_jrgant-AW/nhefs.sas7bdat |
Release Created | 9.0401M7 |
Host Created | Linux |
Inode Number | 7998379 |
Access Permission | rw-rw-r– |
Owner Name | jrgant |
File Size | 896KB |
File Size (bytes) | 917504 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
53 | active | Num | 8 | BEST12. | BEST32. |
10 | age | Num | 8 | BEST12. | BEST32. |
39 | alcoholfreq | Num | 8 | BEST12. | BEST32. |
41 | alcoholhowmuch | Num | 8 | BEST12. | BEST32. |
38 | alcoholpy | Num | 8 | BEST12. | BEST32. |
40 | alcoholtype | Num | 8 | BEST12. | BEST32. |
46 | allergies | Num | 8 | BEST12. | BEST32. |
24 | asthma | Num | 8 | BEST12. | BEST32. |
55 | birthcontrol | Num | 8 | BEST12. | BEST32. |
20 | birthplace | Num | 8 | BEST12. | BEST32. |
50 | boweltrouble | Num | 8 | BEST12. | BEST32. |
25 | bronch | Num | 8 | BEST12. | BEST32. |
57 | cholesterol | Num | 8 | BEST12. | BEST32. |
32 | chroniccough | Num | 8 | BEST12. | BEST32. |
30 | colitis | Num | 8 | BEST12. | BEST32. |
6 | dadth | Num | 8 | BEST12. | BEST32. |
8 | dbp | Num | 8 | BEST12. | BEST32. |
3 | death | Num | 8 | BEST12. | BEST32. |
34 | diabetes | Num | 8 | BEST12. | BEST32. |
15 | education | Num | 8 | BEST12. | BEST32. |
54 | exercise | Num | 8 | BEST12. | BEST32. |
33 | hayfever | Num | 8 | BEST12. | BEST32. |
28 | hbp | Num | 8 | BEST12. | BEST32. |
49 | hbpmed | Num | 8 | BEST12. | BEST32. |
43 | headache | Num | 8 | BEST12. | BEST32. |
31 | hepatitis | Num | 8 | BEST12. | BEST32. |
27 | hf | Num | 8 | BEST12. | BEST32. |
58 | hightax82 | Num | 8 | BEST12. | BEST32. |
16 | ht | Num | 8 | BEST12. | BEST32. |
12 | income | Num | 8 | BEST12. | BEST32. |
52 | infection | Num | 8 | BEST12. | BEST32. |
48 | lackpep | Num | 8 | BEST12. | BEST32. |
13 | marital | Num | 8 | BEST12. | BEST32. |
5 | modth | Num | 8 | BEST12. | BEST32. |
47 | nerves | Num | 8 | BEST12. | BEST32. |
37 | nervousbreak | Num | 8 | BEST12. | BEST32. |
44 | otherpain | Num | 8 | BEST12. | BEST32. |
29 | pepticulcer | Num | 8 | BEST12. | BEST32. |
42 | pica | Num | 8 | BEST12. | BEST32. |
35 | polio | Num | 8 | BEST12. | BEST32. |
56 | pregnancies | Num | 8 | BEST12. | BEST32. |
59 | price71 | Num | 8 | BEST12. | BEST32. |
60 | price82 | Num | 8 | BEST12. | BEST32. |
63 | price71_82 | Num | 8 | BEST12. | BEST32. |
2 | qsmk | Num | 8 | BEST12. | BEST32. |
11 | race | Num | 8 | BEST12. | BEST32. |
7 | sbp | Num | 8 | BEST12. | BEST32. |
14 | school | Num | 8 | BEST12. | BEST32. |
1 | seqn | Num | 8 | BEST12. | BEST32. |
9 | sex | Num | 8 | BEST12. | BEST32. |
22 | smkintensity82_71 | Num | 8 | BEST12. | BEST32. |
21 | smokeintensity | Num | 8 | BEST12. | BEST32. |
23 | smokeyrs | Num | 8 | BEST12. | BEST32. |
61 | tax71 | Num | 8 | BEST12. | BEST32. |
62 | tax82 | Num | 8 | BEST12. | BEST32. |
64 | tax71_82 | Num | 8 | BEST12. | BEST32. |
26 | tb | Num | 8 | BEST12. | BEST32. |
36 | tumor | Num | 8 | BEST12. | BEST32. |
45 | weakheart | Num | 8 | BEST12. | BEST32. |
17 | wt71 | Num | 8 | BEST12. | BEST32. |
18 | wt82 | Num | 8 | BEST12. | BEST32. |
19 | wt82_71 | Num | 8 | BEST12. | BEST32. |
51 | wtloss | Num | 8 | BEST12. | BEST32. |
4 | yrdth | Num | 8 | BEST12. | BEST32. |
* print first five rows of dataset;
proc print data=nhefs(obs=5); run;
Obs | seqn | qsmk | death | yrdth | modth | dadth | sbp | dbp | sex | age | race | income | marital | school | education | ht | wt71 | wt82 | wt82_71 | birthplace | smokeintensity | smkintensity82_71 | smokeyrs | asthma | bronch | tb | hf | hbp | pepticulcer | colitis | hepatitis | chroniccough | hayfever | diabetes | polio | tumor | nervousbreak | alcoholpy | alcoholfreq | alcoholtype | alcoholhowmuch | pica | headache | otherpain | weakheart | allergies | nerves | lackpep | hbpmed | boweltrouble | wtloss | infection | active | exercise | birthcontrol | pregnancies | cholesterol | hightax82 | price71 | price82 | tax71 | tax82 | price71_82 | tax71_82 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 233 | 0 | 0 | . | . | . | 175 | 96 | 0 | 42 | 1 | 19 | 2 | 7 | 1 | 174.1875 | 79.04 | 68.94604024 | -10.09395976 | 47 | 30 | -10 | 29 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 3 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | . | 197 | 0 | 2.18359375 | 1.7399902344 | 1.1022949219 | 0.4619750977 | 0.4437866211 | 0.6403808594 |
2 | 235 | 0 | 0 | . | . | . | 123 | 80 | 0 | 36 | 0 | 18 | 2 | 9 | 2 | 159.375 | 58.63 | 61.23496995 | 2.60496995 | 42 | 20 | -10 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | . | 301 | 0 | 2.3466796875 | 1.7973632813 | 1.3649902344 | 0.5718994141 | 0.5493164063 | 0.79296875 |
3 | 244 | 0 | 0 | . | . | . | 115 | 75 | 1 | 56 | 1 | 15 | 3 | 11 | 2 | 168.5 | 56.81 | 66.22448602 | 9.41448602 | 51 | 20 | -14 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 4 | . | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 157 | 0 | 1.5695800781 | 1.5134277344 | 0.5512695313 | 0.2309875488 | 0.0561981201 | 0.3202514648 |
4 | 245 | 0 | 1 | 85 | 2 | 14 | 148 | 78 | 0 | 68 | 1 | 15 | 3 | 5 | 1 | 170.1875 | 59.42 | 64.41011654 | 4.99011654 | 37 | 3 | 4 | 53 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 2 | . | 174 | 0 | 1.5065917969 | 1.4519042969 | 0.5249023438 | 0.2199707031 | 0.0547943115 | 0.3049926758 |
5 | 252 | 0 | 0 | . | . | . | 118 | 77 | 0 | 40 | 0 | 18 | 2 | 11 | 2 | 181.875 | 87.09 | 92.07925111 | 4.98925111 | 42 | 20 | 0 | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 2 | . | 216 | 0 | 2.3466796875 | 1.7973632813 | 1.3649902344 | 0.5718994141 | 0.5493164063 | 0.79296875 |
For now, let’s focus on observations (rows) that don’t have any missing data. To do this, let’s get a summary of how much missingness we have across all observations.
proc means data=nhefs nmiss STACKODSOUTPUT;
ods output summary=nhefs_miss;
run;
Variable | N Miss |
---|---|
seqn | 0 |
qsmk | 0 |
death | 0 |
yrdth | 1311 |
modth | 1307 |
dadth | 1307 |
sbp | 77 |
dbp | 81 |
sex | 0 |
age | 0 |
race | 0 |
income | 62 |
marital | 0 |
school | 0 |
education | 0 |
ht | 0 |
wt71 | 0 |
wt82 | 63 |
wt82_71 | 63 |
birthplace | 92 |
smokeintensity | 0 |
smkintensity82_71 | 0 |
smokeyrs | 0 |
asthma | 0 |
bronch | 0 |
tb | 0 |
hf | 0 |
hbp | 0 |
pepticulcer | 0 |
colitis | 0 |
hepatitis | 0 |
chroniccough | 0 |
hayfever | 0 |
diabetes | 0 |
polio | 0 |
tumor | 0 |
nervousbreak | 0 |
alcoholpy | 0 |
alcoholfreq | 0 |
alcoholtype | 0 |
alcoholhowmuch | 417 |
pica | 0 |
headache | 0 |
otherpain | 0 |
weakheart | 0 |
allergies | 0 |
nerves | 0 |
lackpep | 0 |
hbpmed | 0 |
boweltrouble | 0 |
wtloss | 0 |
infection | 0 |
active | 0 |
exercise | 0 |
birthcontrol | 0 |
pregnancies | 903 |
cholesterol | 16 |
hightax82 | 92 |
price71 | 92 |
price82 | 92 |
tax71 | 92 |
tax82 | 92 |
price71_82 | 92 |
tax71_82 | 92 |
proc print data=nhefs_miss; run;
Obs | Variable | NMiss |
---|---|---|
1 | seqn | 0 |
2 | qsmk | 0 |
3 | death | 0 |
4 | yrdth | 1311 |
5 | modth | 1307 |
6 | dadth | 1307 |
7 | sbp | 77 |
8 | dbp | 81 |
9 | sex | 0 |
10 | age | 0 |
11 | race | 0 |
12 | income | 62 |
13 | marital | 0 |
14 | school | 0 |
15 | education | 0 |
16 | ht | 0 |
17 | wt71 | 0 |
18 | wt82 | 63 |
19 | wt82_71 | 63 |
20 | birthplace | 92 |
21 | smokeintensity | 0 |
22 | smkintensity82_71 | 0 |
23 | smokeyrs | 0 |
24 | asthma | 0 |
25 | bronch | 0 |
26 | tb | 0 |
27 | hf | 0 |
28 | hbp | 0 |
29 | pepticulcer | 0 |
30 | colitis | 0 |
31 | hepatitis | 0 |
32 | chroniccough | 0 |
33 | hayfever | 0 |
34 | diabetes | 0 |
35 | polio | 0 |
36 | tumor | 0 |
37 | nervousbreak | 0 |
38 | alcoholpy | 0 |
39 | alcoholfreq | 0 |
40 | alcoholtype | 0 |
41 | alcoholhowmuch | 417 |
42 | pica | 0 |
43 | headache | 0 |
44 | otherpain | 0 |
45 | weakheart | 0 |
46 | allergies | 0 |
47 | nerves | 0 |
48 | lackpep | 0 |
49 | hbpmed | 0 |
50 | boweltrouble | 0 |
51 | wtloss | 0 |
52 | infection | 0 |
53 | active | 0 |
54 | exercise | 0 |
55 | birthcontrol | 0 |
56 | pregnancies | 903 |
57 | cholesterol | 16 |
58 | hightax82 | 92 |
59 | price71 | 92 |
60 | price82 | 92 |
61 | tax71 | 92 |
62 | tax82 | 92 |
63 | price71_82 | 92 |
64 | tax71_82 | 92 |
Let’s drop every row with a missing value for any variable:
Complete case code adapted from an example on SAS blogs.
/* h/t https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */
data nhefs2;
set nhefs;
if cmiss(of _ALL_)=0;
run;
proc contents data=nhefs2; run;
Data Set Name | WORK.NHEFS2 | Observations | 44 |
---|---|---|---|
Member Type | DATA | Variables | 64 |
Engine | V9 | Indexes | 0 |
Created | 12/15/2022 20:38:45 | Observation Length | 512 |
Last Modified | 12/15/2022 20:38:45 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | latin1 Western (ISO) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 127 |
Obs in First Data Page | 44 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_work1D400000634E_jrgant-AW/nhefs2.sas7bdat |
Release Created | 9.0401M7 |
Host Created | Linux |
Inode Number | 7998383 |
Access Permission | rw-rw-r– |
Owner Name | jrgant |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
53 | active | Num | 8 | BEST12. | BEST32. |
10 | age | Num | 8 | BEST12. | BEST32. |
39 | alcoholfreq | Num | 8 | BEST12. | BEST32. |
41 | alcoholhowmuch | Num | 8 | BEST12. | BEST32. |
38 | alcoholpy | Num | 8 | BEST12. | BEST32. |
40 | alcoholtype | Num | 8 | BEST12. | BEST32. |
46 | allergies | Num | 8 | BEST12. | BEST32. |
24 | asthma | Num | 8 | BEST12. | BEST32. |
55 | birthcontrol | Num | 8 | BEST12. | BEST32. |
20 | birthplace | Num | 8 | BEST12. | BEST32. |
50 | boweltrouble | Num | 8 | BEST12. | BEST32. |
25 | bronch | Num | 8 | BEST12. | BEST32. |
57 | cholesterol | Num | 8 | BEST12. | BEST32. |
32 | chroniccough | Num | 8 | BEST12. | BEST32. |
30 | colitis | Num | 8 | BEST12. | BEST32. |
6 | dadth | Num | 8 | BEST12. | BEST32. |
8 | dbp | Num | 8 | BEST12. | BEST32. |
3 | death | Num | 8 | BEST12. | BEST32. |
34 | diabetes | Num | 8 | BEST12. | BEST32. |
15 | education | Num | 8 | BEST12. | BEST32. |
54 | exercise | Num | 8 | BEST12. | BEST32. |
33 | hayfever | Num | 8 | BEST12. | BEST32. |
28 | hbp | Num | 8 | BEST12. | BEST32. |
49 | hbpmed | Num | 8 | BEST12. | BEST32. |
43 | headache | Num | 8 | BEST12. | BEST32. |
31 | hepatitis | Num | 8 | BEST12. | BEST32. |
27 | hf | Num | 8 | BEST12. | BEST32. |
58 | hightax82 | Num | 8 | BEST12. | BEST32. |
16 | ht | Num | 8 | BEST12. | BEST32. |
12 | income | Num | 8 | BEST12. | BEST32. |
52 | infection | Num | 8 | BEST12. | BEST32. |
48 | lackpep | Num | 8 | BEST12. | BEST32. |
13 | marital | Num | 8 | BEST12. | BEST32. |
5 | modth | Num | 8 | BEST12. | BEST32. |
47 | nerves | Num | 8 | BEST12. | BEST32. |
37 | nervousbreak | Num | 8 | BEST12. | BEST32. |
44 | otherpain | Num | 8 | BEST12. | BEST32. |
29 | pepticulcer | Num | 8 | BEST12. | BEST32. |
42 | pica | Num | 8 | BEST12. | BEST32. |
35 | polio | Num | 8 | BEST12. | BEST32. |
56 | pregnancies | Num | 8 | BEST12. | BEST32. |
59 | price71 | Num | 8 | BEST12. | BEST32. |
60 | price82 | Num | 8 | BEST12. | BEST32. |
63 | price71_82 | Num | 8 | BEST12. | BEST32. |
2 | qsmk | Num | 8 | BEST12. | BEST32. |
11 | race | Num | 8 | BEST12. | BEST32. |
7 | sbp | Num | 8 | BEST12. | BEST32. |
14 | school | Num | 8 | BEST12. | BEST32. |
1 | seqn | Num | 8 | BEST12. | BEST32. |
9 | sex | Num | 8 | BEST12. | BEST32. |
22 | smkintensity82_71 | Num | 8 | BEST12. | BEST32. |
21 | smokeintensity | Num | 8 | BEST12. | BEST32. |
23 | smokeyrs | Num | 8 | BEST12. | BEST32. |
61 | tax71 | Num | 8 | BEST12. | BEST32. |
62 | tax82 | Num | 8 | BEST12. | BEST32. |
64 | tax71_82 | Num | 8 | BEST12. | BEST32. |
26 | tb | Num | 8 | BEST12. | BEST32. |
36 | tumor | Num | 8 | BEST12. | BEST32. |
45 | weakheart | Num | 8 | BEST12. | BEST32. |
17 | wt71 | Num | 8 | BEST12. | BEST32. |
18 | wt82 | Num | 8 | BEST12. | BEST32. |
19 | wt82_71 | Num | 8 | BEST12. | BEST32. |
51 | wtloss | Num | 8 | BEST12. | BEST32. |
4 | yrdth | Num | 8 | BEST12. | BEST32. |
Oops, we have only 44 rows left in the data set! Perhaps we should reconsider and pick several variables that will be of interest to us for our current purposes.
%let selectvars = age alcoholfreq asthma wt71 smokeintensity sbp allergies weakheart;
data nhefs3;
set nhefs(keep=&selectvars);
if cmiss(of _ALL_)=0;
run;
proc contents data=nhefs3; run;
Data Set Name | WORK.NHEFS3 | Observations | 1552 |
---|---|---|---|
Member Type | DATA | Variables | 8 |
Engine | V9 | Indexes | 0 |
Created | 12/15/2022 20:38:45 | Observation Length | 64 |
Last Modified | 12/15/2022 20:38:45 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | latin1 Western (ISO) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 2 |
First Data Page | 1 |
Max Obs per Page | 1021 |
Obs in First Data Page | 977 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_work4619000063F6_jrgant-AW/nhefs3.sas7bdat |
Release Created | 9.0401M7 |
Host Created | Linux |
Inode Number | 7998383 |
Access Permission | rw-rw-r– |
Owner Name | jrgant |
File Size | 192KB |
File Size (bytes) | 196608 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | age | Num | 8 | BEST12. | BEST32. |
6 | alcoholfreq | Num | 8 | BEST12. | BEST32. |
8 | allergies | Num | 8 | BEST12. | BEST32. |
5 | asthma | Num | 8 | BEST12. | BEST32. |
1 | sbp | Num | 8 | BEST12. | BEST32. |
4 | smokeintensity | Num | 8 | BEST12. | BEST32. |
7 | weakheart | Num | 8 | BEST12. | BEST32. |
3 | wt71 | Num | 8 | BEST12. | BEST32. |
That’s better. We’ve discarded 77 rows and focused on a subset of variables of interest.
Before we do some exploratory data analysis, let’s recall that category 5 for one of the variables we selected (alcoholfreq
) indicated that alcohol frequency for that individual was unknown. This is missing data, and we should recode the alcoholfreq
variable so that SAS recognizes these observations as such.
In addition, SAS currently believes that alcoholfreq
is a numeric variable, though it is a categorical variable.
We can add a new variable to our nhefs3
data set that recodes alcoholfreq
to meet our needs. We’ll name our new variable alcfreqcat
CHARACTER VARIABLES How did we get alcfreqcat
to be a character variable when we based it on the numeric variable alcoholfreq
? Notice in the data step that we initialized alcfreqcat
using '.'
instead of .
. Wrapping the dot in quotes indicated to SAS that we wanted it to treat alfreqcat
as a character variable.
data nhefs3;
set nhefs3;
if alcoholfreq = 5 then alcfreqcat = '.';
else alcfreqcat = alcoholfreq;
run;
proc freq data=nhefs3;
table alcfreqcat*alcoholfreq;
run;
|
|
proc contents data=nhefs3; run;
Data Set Name | WORK.NHEFS3 | Observations | 1552 |
---|---|---|---|
Member Type | DATA | Variables | 9 |
Engine | V9 | Indexes | 0 |
Created | 12/15/2022 20:38:46 | Observation Length | 72 |
Last Modified | 12/15/2022 20:38:46 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | latin1 Western (ISO) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 2 |
First Data Page | 1 |
Max Obs per Page | 908 |
Obs in First Data Page | 866 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_work71760000654C_jrgant-AW/nhefs3.sas7bdat |
Release Created | 9.0401M7 |
Host Created | Linux |
Inode Number | 7998381 |
Access Permission | rw-rw-r– |
Owner Name | jrgant |
File Size | 192KB |
File Size (bytes) | 196608 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | age | Num | 8 | BEST12. | BEST32. |
9 | alcfreqcat | Char | 1 | ||
6 | alcoholfreq | Num | 8 | BEST12. | BEST32. |
8 | allergies | Num | 8 | BEST12. | BEST32. |
5 | asthma | Num | 8 | BEST12. | BEST32. |
1 | sbp | Num | 8 | BEST12. | BEST32. |
4 | smokeintensity | Num | 8 | BEST12. | BEST32. |
7 | weakheart | Num | 8 | BEST12. | BEST32. |
3 | wt71 | Num | 8 | BEST12. | BEST32. |
Although SAS will automatically throw out observations with missing data when running regressions and other procedures, it will do so based on which variables are included in model formulas. So that we conduct all the analyses below on the same subset of the NHEFS data, let’s drop any rows from our data set with missing values of alcfreqcat
and drop the original alcoholfreq
variable.
data nhefs4;
set nhefs3(where=(alcfreqcat ne '.') drop=alcoholfreq);
run;
proc print data=nhefs4(obs=5); run;
Obs | sbp | age | wt71 | smokeintensity | asthma | weakheart | allergies | alcfreqcat |
---|---|---|---|---|---|---|---|---|
1 | 175 | 42 | 79.04 | 30 | 0 | 0 | 0 | 1 |
2 | 123 | 36 | 58.63 | 20 | 0 | 0 | 0 | 0 |
3 | 115 | 56 | 56.81 | 20 | 0 | 0 | 0 | 3 |
4 | 148 | 68 | 59.42 | 3 | 0 | 1 | 0 | 2 |
5 | 118 | 40 | 87.09 | 20 | 0 | 0 | 0 | 2 |
Let’s summarize our dataset:
* save list of continuous variables and summarize;
%let contvars = age wt71 smokeintensity sbp;
proc means data=nhefs4;
var &contvars;
run;
Variable | N | Mean | Std Dev | Minimum | Maximum | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
* save list of categorical variables and summarize;
%let catvars = asthma allergies alcfreqcat weakheart;
proc freq data=nhefs4;
tables &catvars;
run;
asthma | Frequency | Percent |
Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
0 | 1474 | 95.28 | 1474 | 95.28 |
1 | 73 | 4.72 | 1547 | 100.00 |
allergies | Frequency | Percent |
Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
0 | 1448 | 93.60 | 1448 | 93.60 |
1 | 99 | 6.40 | 1547 | 100.00 |
alcfreqcat | Frequency | Percent |
Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
0 | 320 | 20.69 | 320 | 20.69 |
1 | 217 | 14.03 | 537 | 34.71 |
2 | 489 | 31.61 | 1026 | 66.32 |
3 | 329 | 21.27 | 1355 | 87.59 |
4 | 192 | 12.41 | 1547 | 100.00 |
weakheart | Frequency | Percent |
Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
0 | 1512 | 97.74 | 1512 | 97.74 |
1 | 35 | 2.26 | 1547 | 100.00 |
Soon you will be no stranger to two-by-two tables and contingency tables. Introductory statistics and epidemiologic methods courses will require you to analyze such tables, either to measure the strength of an exposure’s effect on some outcome or to establish a statistical association between two variables.
Going back to our NHEFS data, let’s say we’re interested in a possible association between quitting smoking and weight loss. The qsmk
variable is a binary indicator of whether or not an individual quit smoking between 1971 and 1982, while the sbp
variable measures an individual’s change in weight (kilograms) between 1971 and 1982.
Typically, we would not want to dichotomize a continuous variable like sbp
, but for the sake of this example, let’s begin by creating a binary indicator of whether a subject gained weight during this timeframe.
data nhefs;
set nhefs;
if sbp > 140 then sbp_hi = 1;
else if sbp <= 140 then sbp_hi = 0;
if sbp = . then sbp_hi = .;
run;
proc freq data=nhefs;
table sbp_hi;
run;
sbp_hi | Frequency | Percent |
Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
0 | 1192 | 76.80 | 1192 | 76.80 |
1 | 360 | 23.20 | 1552 | 100.00 |
Frequency Missing = 77 |
The summary statistics we generated above should comply with our expectations: those we marked as having a high systolic blood pressure have values of sbp
above 140 mm/Hg, while those we marked as not having high systolic blood pressure all have values of sbp
less than or equal to 140 mm/Hg.
Note, too, that we have 63 missing values in our new variable, just as we saw in the original outcome.
proc freq data=nhefs;
table sbp_hi * qsmk /chisq;
run;
|
|
Statistics for Table of sbp_hi by qsmk |
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 1 | 6.0872 | 0.0136 |
Likelihood Ratio Chi-Square | 1 | 5.9256 | 0.0149 |
Continuity Adj. Chi-Square | 1 | 5.7508 | 0.0165 |
Mantel-Haenszel Chi-Square | 1 | 6.0833 | 0.0136 |
Phi Coefficient | 0.0626 | ||
Contingency Coefficient | 0.0625 | ||
Cramer's V | 0.0626 |
Fisher's Exact Test | |
---|---|
Cell (1,1) Frequency (F) | 908 |
Left-sided Pr <= F | 0.9939 |
Right-sided Pr >= F | 0.0088 |
Table Probability (P) | 0.0028 |
Two-sided Pr <= P | 0.0155 |
Sample Size = 1552 Frequency Missing = 77 |
A natural choice here would be to conduct a two-sample test for the equality of proportions, where we compare the proportion of subjects in the quit smoking group with high systolic blood pressure versus the same proportion in the non-quit smoking group.
The output above shows us the group proportions, the \(\chi^2\) test statistic (elicited by specifying the table
option chisq
), and the corresponding p-value. We also get a 95% confidence interval for the difference in proportions.
In this section we will fit some regression models that you may encounter in classes or in your own research. We will leave the theoretical background to your statistics and methods courses but aim to provide enough information so that you understand the basic idea behind each approach and can refer back to this page in the future.
All upcoming examples use variables from the NHEFS dataset that we’ve referred back to several times by now.
Exposure of interest: quitting smoking between 1971 and 1982 (qsmk
)
Outcome: change in weight between 1971 and 1982 (continuous, sbp
)
First, we can run an unadjusted model with weight change as the outcome and quitting smoking as the predictor variable.
ods graphics off;
proc reg data=nhefs;
model sbp = qsmk /clb; * /clb adds confidence limits to output table;
quit;
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Number of Observations with Missing Values | 77 |
Analysis of Variance | |||||
---|---|---|---|---|---|
Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
Model | 1 | 4673.90689 | 4673.90689 | 12.98 | 0.0003 |
Error | 1550 | 558280 | 360.18067 | ||
Corrected Total | 1551 | 562954 |
Root MSE | 18.97843 | R-Square | 0.0083 |
---|---|---|---|
Dependent Mean | 128.70941 | Adj R-Sq | 0.0077 |
Coeff Var | 14.74517 |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Variable | DF |
Parameter Estimate |
Standard Error |
t Value | Pr > |t| | 95% Confidence Limits | |
Intercept | 1 | 127.69888 | 0.55747 | 229.07 | <.0001 | 126.60541 | 128.79235 |
qsmk | 1 | 3.99069 | 1.10782 | 3.60 | 0.0003 | 1.81771 | 6.16367 |
SAS’s PROC REG
does not handle product terms or categorical variables within the model
statement itself. Let’s say we wanted to fit an adjusted model and include a product term to test an “interaction” between sex
and age
. First, we have to add the interaction variable to our nhefs
dataset, as shown below, and include the new variable the model formula.
The code below fits a linear model for the effect of qsmk
on sbp
adjusted for years of smoking (smokeyrs
), intensity of smoking (smokeintensity
), diabetes, sex, age, and the “interaction” between sex and age (sex_age
).
data nhefs;
set nhefs;
sex_age = sex * age;
run;
ods graphics off;
proc reg data=nhefs;
model sbp = qsmk smokeyrs smokeintensity diabetes sex age sex_age /clb;
quit;
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Number of Observations with Missing Values | 77 |
Analysis of Variance | |||||
---|---|---|---|---|---|
Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
Model | 7 | 98121 | 14017 | 46.56 | <.0001 |
Error | 1544 | 464833 | 301.05758 | ||
Corrected Total | 1551 | 562954 |
Root MSE | 17.35101 | R-Square | 0.1743 |
---|---|---|---|
Dependent Mean | 128.70941 | Adj R-Sq | 0.1706 |
Coeff Var | 13.48076 |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Variable | DF |
Parameter Estimate |
Standard Error |
t Value | Pr > |t| | 95% Confidence Limits | |
Intercept | 1 | 113.85479 | 2.95509 | 38.53 | <.0001 | 108.05838 | 119.65120 |
qsmk | 1 | 1.51803 | 1.03207 | 1.47 | 0.1415 | -0.50638 | 3.54245 |
smokeyrs | 1 | 0.06761 | 0.08043 | 0.84 | 0.4007 | -0.09016 | 0.22537 |
smokeintensity | 1 | -0.06613 | 0.03933 | -1.68 | 0.0929 | -0.14328 | 0.01102 |
diabetes | 1 | -0.62827 | 0.44678 | -1.41 | 0.1599 | -1.50463 | 0.24809 |
sex | 1 | -16.63281 | 3.36583 | -4.94 | <.0001 | -23.23489 | -10.03073 |
age | 1 | 0.39023 | 0.09439 | 4.13 | <.0001 | 0.20507 | 0.57538 |
sex_age | 1 | 0.28503 | 0.07558 | 3.77 | 0.0002 | 0.13678 | 0.43329 |
SAS offers two additional ways to fit a linear regression, both of which handle categorical variables and interaction terms directly, unlike PROC REG
. These options are PROC GLM
and PROC GENMOD
. We’ll focus on PROC GENMOD
here.
proc genmod data=nhefs;
model sbp = qsmk /link=identity dist=normal;
run;
Model Information | |
---|---|
Data Set | WORK.NHEFS |
Distribution | Normal |
Link Function | Identity |
Dependent Variable | sbp |
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Missing Values | 77 |
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 1550 | 558280.0358 | 360.1807 |
Scaled Deviance | 1550 | 1552.0000 | 1.0013 |
Pearson Chi-Square | 1550 | 558280.0358 | 360.1807 |
Scaled Pearson X2 | 1550 | 1552.0000 | 1.0013 |
Log Likelihood | -6769.1980 | ||
Full Log Likelihood | -6769.1980 | ||
AIC (smaller is better) | 13544.3961 | ||
AICC (smaller is better) | 13544.4116 | ||
BIC (smaller is better) | 13560.4380 |
Algorithm converged. |
Analysis Of Maximum Likelihood Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate |
Standard Error |
Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | |
Intercept | 1 | 127.6989 | 0.5571 | 126.6070 | 128.7908 | 52540.9 | <.0001 |
qsmk | 1 | 3.9907 | 1.1071 | 1.8208 | 6.1606 | 12.99 | 0.0003 |
Scale | 1 | 18.9662 | 0.3404 | 18.3106 | 19.6453 |
Note: | The scale parameter was estimated by maximum likelihood. |
proc genmod data=nhefs;
model sbp = qsmk smokeyrs smokeintensity diabetes sex:age /link=identity dist=normal;
run;
Model Information | |
---|---|
Data Set | WORK.NHEFS |
Distribution | Normal |
Link Function | Identity |
Dependent Variable | sbp |
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Missing Values | 77 |
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 1544 | 464832.8969 | 301.0576 |
Scaled Deviance | 1544 | 1552.0000 | 1.0052 |
Pearson Chi-Square | 1544 | 464832.8969 | 301.0576 |
Scaled Pearson X2 | 1544 | 1552.0000 | 1.0052 |
Log Likelihood | -6627.0482 | ||
Full Log Likelihood | -6627.0482 | ||
AIC (smaller is better) | 13272.0965 | ||
AICC (smaller is better) | 13272.2132 | ||
BIC (smaller is better) | 13320.2222 |
Algorithm converged. |
Analysis Of Maximum Likelihood Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate |
Standard Error |
Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | |
Intercept | 1 | 113.8548 | 2.9475 | 108.0779 | 119.6317 | 1492.13 | <.0001 |
qsmk | 1 | 1.5180 | 1.0294 | -0.4996 | 3.5356 | 2.17 | 0.1403 |
smokeyrs | 1 | 0.0676 | 0.0802 | -0.0896 | 0.2248 | 0.71 | 0.3994 |
smokeintensity | 1 | -0.0661 | 0.0392 | -0.1430 | 0.0108 | 2.84 | 0.0919 |
diabetes | 1 | -0.6283 | 0.4456 | -1.5017 | 0.2451 | 1.99 | 0.1586 |
sex | 1 | -16.6328 | 3.3571 | -23.2127 | -10.0529 | 24.55 | <.0001 |
sex_age | 1 | 0.2850 | 0.0754 | 0.1373 | 0.4328 | 14.29 | 0.0002 |
age | 1 | 0.3902 | 0.0942 | 0.2057 | 0.5748 | 17.18 | <.0001 |
Scale | 1 | 17.3062 | 0.3106 | 16.7080 | 17.9259 |
Note: | The scale parameter was estimated by maximum likelihood. |
Let’s revisit our tabular analysis, in which we were interested in whether quitting smoking affected the likelihood of a subject’s gaining weight between 1971 and 1982.
We can answer the same question using a logistic regression model, using the glm()
function.
proc logistic data=nhefs;
model sbp_hi(event='1') = qsmk;
estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;
Model Information | |
---|---|
Data Set | WORK.NHEFS |
Response Variable | sbp_hi |
Number of Response Levels | 2 |
Model | binary logit |
Optimization Technique | Fisher's scoring |
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Response Profile | ||
---|---|---|
Ordered Value |
sbp_hi |
Total Frequency |
1 | 0 | 1192 |
2 | 1 | 360 |
Probability modeled is sbp_hi=1. |
Note: | 77 observations were deleted due to missing values for the response or explanatory variables. |
Model Convergence Status |
---|
Convergence criterion (GCONV=1E-8) satisfied. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only |
Intercept and Covariates |
AIC | 1683.227 | 1679.301 |
SC | 1688.574 | 1689.996 |
-2 Log L | 1681.227 | 1675.301 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 5.9256 | 1 | 0.0149 |
Score | 6.0872 | 1 | 0.0136 |
Wald | 6.0586 | 1 | 0.0138 |
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate |
Standard Error |
Wald Chi-Square |
Pr > ChiSq |
Intercept | 1 | -1.2858 | 0.0713 | 325.0998 | <.0001 |
qsmk | 1 | 0.3282 | 0.1333 | 6.0586 | 0.0138 |
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate |
95% Wald Confidence Limits |
|
qsmk | 1.388 | 1.069 | 1.803 |
Association of Predicted Probabilities and Observed Responses |
|||
---|---|---|---|
Percent Concordant | 23.1 | Somers' D | 0.065 |
Percent Discordant | 16.6 | Gamma | 0.163 |
Percent Tied | 60.3 | Tau-a | 0.023 |
Pairs | 429120 | c | 0.532 |
Estimate | |||||
---|---|---|---|---|---|
Label | Estimate | Standard Error | z Value | Pr > |z| | Exponentiated |
qsmk (1 vs. 0) | 0.3282 | 0.1333 | 2.46 | 0.0138 | 1.3885 |
proc genmod data=nhefs;
model sbp_hi(ref='0') = qsmk /link=logit dist=bin;
estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;
Model Information | |
---|---|
Data Set | WORK.NHEFS |
Distribution | Binomial |
Link Function | Logit |
Dependent Variable | sbp_hi |
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Number of Events | 360 |
Number of Trials | 1552 |
Missing Values | 77 |
Response Profile | ||
---|---|---|
Ordered Value |
sbp_hi |
Total Frequency |
1 | 1 | 360 |
2 | 0 | 1192 |
PROC GENMOD is modeling the probability that sbp_hi=‘1’. One way to change this to model the probability that sbp_hi=‘0’ is to specify the DESCENDING option in the PROC statement. |
Parameter Information | |
---|---|
Parameter | Effect |
Prm1 | Intercept |
Prm2 | qsmk |
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Log Likelihood | -837.6506 | ||
Full Log Likelihood | -837.6506 | ||
AIC (smaller is better) | 1679.3012 | ||
AICC (smaller is better) | 1679.3089 | ||
BIC (smaller is better) | 1689.9958 |
Algorithm converged. |
Analysis Of Maximum Likelihood Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate |
Standard Error |
Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | |
Intercept | 1 | -1.2858 | 0.0713 | -1.4256 | -1.1460 | 325.10 | <.0001 |
qsmk | 1 | 0.3282 | 0.1333 | 0.0668 | 0.5895 | 6.06 | 0.0139 |
Scale | 0 | 1.0000 | 0.0000 | 1.0000 | 1.0000 |
Note: | The scale parameter was held fixed. |
Contrast Estimate Results | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Label | Mean Estimate | Mean | L'Beta Estimate |
Standard Error |
Alpha | L'Beta | Chi-Square | Pr > ChiSq | ||
Confidence Limits | Confidence Limits | |||||||||
qsmk (1 vs. 0) | 0.5813 | 0.5167 | 0.6433 | 0.3282 | 0.1333 | 0.05 | 0.0668 | 0.5895 | 6.06 | 0.0139 |
Exp(qsmk (1 vs. 0)) | 1.3884 | 0.1851 | 0.05 | 1.0691 | 1.8031 |
As in the linear regression example, you could run a model adjusting for other factors, in which case you would modify the model formula accordingly. We will forego running an adjusted model here to avoid redundancy.
While the logistic regression remains the most frequently used generalized linear model (GLM) for binary outcomes, epidemiologists are often interested in risk differences and risk ratios.
Luckily, we can use the log-binomial model to estimate contrasts in risks.
Try this Calculate the risk ratio manually and convince yourself that the log-binomial model gave you the correct answer.
proc genmod data=nhefs;
model sbp_hi(ref='0') = qsmk /link=log dist=bin;
estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;
Model Information | |
---|---|
Data Set | WORK.NHEFS |
Distribution | Binomial |
Link Function | Log |
Dependent Variable | sbp_hi |
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Number of Events | 360 |
Number of Trials | 1552 |
Missing Values | 77 |
Response Profile | ||
---|---|---|
Ordered Value |
sbp_hi |
Total Frequency |
1 | 1 | 360 |
2 | 0 | 1192 |
PROC GENMOD is modeling the probability that sbp_hi=‘1’. One way to change this to model the probability that sbp_hi=‘0’ is to specify the DESCENDING option in the PROC statement. |
Parameter Information | |
---|---|
Parameter | Effect |
Prm1 | Intercept |
Prm2 | qsmk |
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Log Likelihood | -837.6506 | ||
Full Log Likelihood | -837.6506 | ||
AIC (smaller is better) | 1679.3012 | ||
AICC (smaller is better) | 1679.3089 | ||
BIC (smaller is better) | 1689.9958 |
Algorithm converged. |
Analysis Of Maximum Likelihood Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate |
Standard Error |
Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | |
Intercept | 1 | -1.5299 | 0.0559 | -1.6394 | -1.4204 | 749.85 | <.0001 |
qsmk | 1 | 0.2474 | 0.0987 | 0.0539 | 0.4409 | 6.28 | 0.0122 |
Scale | 0 | 1.0000 | 0.0000 | 1.0000 | 1.0000 |
Note: | The scale parameter was held fixed. |
Contrast Estimate Results | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Label | Mean Estimate | Mean | L'Beta Estimate |
Standard Error |
Alpha | L'Beta | Chi-Square | Pr > ChiSq | ||
Confidence Limits | Confidence Limits | |||||||||
qsmk (1 vs. 0) | 1.2807 | 1.0553 | 1.5542 | 0.2474 | 0.0987 | 0.05 | 0.0539 | 0.4409 | 6.28 | 0.0122 |
Exp(qsmk (1 vs. 0)) | 1.2807 | 0.1265 | 0.05 | 1.0553 | 1.5542 |
To get the risk ratio for the effect of quitting smoking on weight gain, we again exponentiate the beta coefficient of interest: those who quit smoking were 1.28 times as likely to have high systolic blood pressure as those who did not quit smoking.
Another way to get a risk ratio using GLMs involves a procedure called modified Poisson regression. “Modified” here refers to a necessary modification of the standard error estimates when using a Poisson regression model to estimate a risk ratio. Again, save the statistical details for later. All we want to know for the moment are the nuts-and-bolts: how to fit the model and how to modify the standard errors.
proc genmod data=nhefs;
class seqn;
model sbp_hi(ref='0') = qsmk /link=log dist=poisson;
repeated subject=seqn;
estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;
Model Information | |
---|---|
Data Set | WORK.NHEFS |
Distribution | Poisson |
Link Function | Log |
Dependent Variable | sbp_hi |
Number of Observations Read | 1629 |
---|---|
Number of Observations Used | 1552 |
Missing Values | 77 |
Class Level Information | ||
---|---|---|
Class | Levels | Values |
seqn | 1629 | 233 235 244 245 252 257 262 266 419 420 428 431 434 443 446 455 457 596 603 604 605 616 618 619 620 804 806 813 816 818 825 828 831 1094 1096 1101 1103 1104 1106 1109 1116 1120 1124 1126 1127 1129 1133 1135 1148 1476 1480 1498 1505 1513 1515 1519 1523 … |
Parameter Information | |
---|---|
Parameter | Effect |
Prm1 | Intercept |
Prm2 | qsmk |
Algorithm converged. |
GEE Model Information | |
---|---|
Correlation Structure | Independent |
Subject Effect | seqn (1629 levels) |
Number of Clusters | 1629 |
Clusters With Missing Values | 77 |
Correlation Matrix Dimension | 1 |
Maximum Cluster Size | 1 |
Minimum Cluster Size | 0 |
Algorithm converged. |
GEE Fit Criteria | |
---|---|
QIC | 2302.3462 |
QICu | 2302.4294 |
Analysis Of GEE Parameter Estimates | ||||||
---|---|---|---|---|---|---|
Empirical Standard Error Estimates | ||||||
Parameter | Estimate |
Standard Error |
95% Confidence Limits | Z | Pr > |Z| | |
Intercept | -1.5299 | 0.0559 | -1.6394 | -1.4204 | -27.38 | <.0001 |
qsmk | 0.2474 | 0.0987 | 0.0539 | 0.4409 | 2.51 | 0.0122 |
Contrast Estimate Results | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Label | Mean Estimate | Mean | L'Beta Estimate |
Standard Error |
Alpha | L'Beta | Chi-Square | Pr > ChiSq | ||
Confidence Limits | Confidence Limits | |||||||||
qsmk (1 vs. 0) | 1.2807 | 1.0553 | 1.5542 | 0.2474 | 0.0987 | 0.05 | 0.0539 | 0.4409 | 6.28 | 0.0122 |
Exp(qsmk (1 vs. 0)) | 1.2807 | 0.1265 | 0.05 | 1.0553 | 1.5542 |
In the prior chapter, which showcased the log-binomial and modified Poisson model fitting procedures in R, we noted that the “modified” in modified Poisson refers to the adjusted of the standard error. In order to fit the modified Poisson model correctly in SAS, you must specify the repeated
option as above, where the argument passed to subject
is the unique observation identifier.