7 Statistical Analysis (SAS)

7.1 Setup

In this section, we’ll provide some examples in SAS of basic statistical analyses you might need to implement at some point.

The analyses you conduct as part of real research will almost certainly be more complicated than the code shown here. However, the hope is that you’ll get enough of a feel for the mechanics of SAS to begin putting the pieces together as demanded by a given project.

7.1.1 Data

If you’d like to try any of these examples on your own, we’re using the NHANES I Epidemiologic Followup Study (NHEFS) data set freely available at the website for Causal Inference: What If by Miguel A. Hernán and James M. Robins.

For both SAS and R, we’ll use the CSV file they provide.

The website itself contains extensive code examples in SAS, Stata, R, and Python—everything you could want!—so those who will be taking advanced epidemiologic or causal inference coursework should refer to those resources for guidance on how to implement relevant analyses. We’ll cover a little bit here, but only briefly.

7.1.2 NHEFS Codebook

Variable name	Description
active	IN YOUR USUAL DAY, HOW ACTIVE ARE YOU? IN 1971, 0:very active, 1:moderately active, 2:inactive
age	AGE IN 1971
alcoholfreq	HOW OFTEN DO YOU DRINK? IN 1971 0: Almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: No alcohol last year, 5: Unknown
alcoholhowmuch	WHEN YOU DRINK, HOW MUCH DO YOU DRINK? IN 1971
alcoholpy	HAVE YOU HAD 1 DRINK PAST YEAR? IN 1971, 1:EVER, 0:NEVER; 2:MISSING
alcoholtype	WHICH DO YOU MOST FREQUENTLY DRINK? IN 1971 1: BEER, 2: WINE, 3: LIQUOR, 4: OTHER/UNKNOWN
allergies	USE ALLERGIES MEDICATION IN 1971, 1:EVER, 0:NEVER
asthma	DX ASTHMA IN 1971, 1:EVER, 0:NEVER
bithcontrol	BIRTH CONTROL PILLS PAST 6 MONTHS? IN 1971 1:YES, 0:NO, 2:MISSING
birthplace	CHECK STATE CODE - SECOND PAGE
boweltrouble	USE BOWEL TROUBLE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING
bronch	DX CHRONIC BRONCHITIS/EMPHYSEMA IN 1971, 1:EVER, 0:NEVER
cholesterol	SERUM CHOLESTEROL (MG/100ML) IN 1971
chroniccough	DX CHRONIC COUGH IN 1971, 1:EVER, 0:NEVER
colitis	DX COLITIS IN 1971, 1:EVER, 0:NEVER
dadth	DAY OF DEATH
dbp	DIASTOLIC BLOOD PRESSURE IN 1982
death	DEATH BY 1992, 1:YES, 0:NO
diabetes	DX DIABETES IN 1971, 1:EVER, 0:NEVER, 2:MISSING
education	AMOUNT OF EDUCATION BY 1971: 1: 8TH GRADE OR LESS, 2: HS DROPOUT, 3: HS, 4:COLLEGE DROPOUT, 5: COLLEGE OR MORE
exercise	IN RECREATION, HOW MUCH EXERCISE? IN 1971, 0:much exercise,1:moderate exercise,2:little or no exercise
hayfever	DX HAY FEVER IN 1971, 1:EVER, 0:NEVER
hbp	DX HIGH BLOOD PRESSURE IN 1971, 1:EVER, 0:NEVER, 2:MISSING
hbpmed	USE HIGH BLOOD PRESSURE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING
headache	USE HEADACHE MEDICATION IN 1971, 1:EVER, 0:NEVER
hepatitis	DX HEPATITIS IN 1971, 1:EVER, 0:NEVER
hf	DX HEART FAILURE IN 1971, 1:EVER, 0:NEVER
hightax82	LIVING IN A HIGHLY TAXED STATE IN 1982, High taxed state of residence=1, 0 otherwise
ht	HEIGHT IN CENTIMETERS IN 1971
income	TOTAL FAMILY INCOME IN 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+
infection	USE INFECTION MEDICATION IN 1971, 1:EVER, 0:NEVER
lackpep	USELACK OF PEP MEDICATION IN 1971, 1:EVER, 0:NEVER
marital	MARITAL STATUS IN 1971 1: Under 17, 2: Married, 3: Widowed, 4: Never married, 5: Divorced, 6: Separated, 8: Unknown
modth	MONTH OF DEATH
nerves	USE NERVES MEDICATION IN 1971, 1:EVER, 0:NEVER
nervousbreak	DX NERVOUS BREAKDOWN IN 1971, 1:EVER, 0:NEVER
otherpain	USE OTHER PAINS MEDICATION IN 1971, 1:EVER, 0:NEVER
pepticulcer	DX PEPTIC ULCER IN 1971, 1:EVER, 0:NEVER
pica	DO YOU EAT DIRT OR CLAY, STARCH OR OTHER NON STANDARD FOOD? IN 1971 1:EVER, 0:NEVER; 2:MISSING
polio	DX POLIO IN 1971, 1:EVER, 0:NEVER
pregnancies	TOTAL NUMBER OF PREGNANCIES? IN 1971
price71	AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971 (US$2008)
price71_82	DIFFERENCE IN AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971-1982 (US$2008)
price82	AVG TOBACCO PRICE IN STATE OF RESIDENCE 1982 (US$2008)
qsmk	QUIT SMOKING BETWEEN 1ST QUESTIONNAIRE AND 1982, 1:YES, 0:NO
race	0: WHITE 1: BLACK OR OTHER IN 1971
sbp	SYSTOLIC BLOOD PRESSURE IN 1982
school	HIGHEST GRADE OF REGULAR SCHOOL EVER IN 1971
seqn	UNIQUE PERSONAL IDENTIFIER
sex	0: MALE 1: FEMALE
smokeintensity	NUMBER OF CIGARETTES SMOKED PER DAY IN 1971
smkintensity 82_71	INCREASE IN NUMBER OF CIGARETTES/DAY BETWEEN 1971 and 1982
smokeyrs	YEARS OF SMOKING
tax71	TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)
tax71_82	DIFFERENCE IN TOBACCO TAX IN STATE OF RESIDENCE 1971-1982 (US$2008)
tax82	TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)
tb	DX TUBERCULOSIS IN 1971, 1:EVER, 0:NEVER
tumor	DX MALIGNANT TUMOR/GROWTH IN 1971, 1:EVER, 0:NEVER
weakheart	USE WEAK HEART MEDICATION IN 1971, 1:EVER, 0:NEVER
wt71	WEIGHT IN KILOGRAMS IN 1971
wt82	WEIGHT IN KILOGRAMS IN 1982
wt82_71	WEIGHT CHANGE IN KILOGRAMS
wtloss	USE WEIGHT LOSS MEDICATION IN 1971, 1:EVER, 0:NEVER
yrdth	YEAR OF DEATH

7.2 Exploratory Data Analysis

First, let’s read in our data and glance at it using a couple of different views.

SAS Documentation: PROC IMPORT

proc import datafile="data/nhefs.csv"
            out=nhefs
            dbms=csv
            replace;
run;

* information about the variables in NHEFS;
proc contents data=nhefs; run;

Data Set Name	WORK.NHEFS	Observations	1629
Member Type	DATA	Variables	64
Engine	V9	Indexes	0
Created	12/15/2022 20:38:42	Observation Length	512
Last Modified	12/15/2022 20:38:42	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	latin1 Western (ISO)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	13
First Data Page	1
Max Obs per Page	127
Obs in First Data Page	107
Number of Data Set Repairs	0
Filename	/tmp/SAS_work6AC9000060C8_jrgant-AW/nhefs.sas7bdat
Release Created	9.0401M7
Host Created	Linux
Inode Number	7998379
Access Permission	rw-rw-r–
Owner Name	jrgant
File Size	896KB
File Size (bytes)	917504

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
53	active	Num	8	BEST12.	BEST32.
10	age	Num	8	BEST12.	BEST32.
39	alcoholfreq	Num	8	BEST12.	BEST32.
41	alcoholhowmuch	Num	8	BEST12.	BEST32.
38	alcoholpy	Num	8	BEST12.	BEST32.
40	alcoholtype	Num	8	BEST12.	BEST32.
46	allergies	Num	8	BEST12.	BEST32.
24	asthma	Num	8	BEST12.	BEST32.
55	birthcontrol	Num	8	BEST12.	BEST32.
20	birthplace	Num	8	BEST12.	BEST32.
50	boweltrouble	Num	8	BEST12.	BEST32.
25	bronch	Num	8	BEST12.	BEST32.
57	cholesterol	Num	8	BEST12.	BEST32.
32	chroniccough	Num	8	BEST12.	BEST32.
30	colitis	Num	8	BEST12.	BEST32.
6	dadth	Num	8	BEST12.	BEST32.
8	dbp	Num	8	BEST12.	BEST32.
3	death	Num	8	BEST12.	BEST32.
34	diabetes	Num	8	BEST12.	BEST32.
15	education	Num	8	BEST12.	BEST32.
54	exercise	Num	8	BEST12.	BEST32.
33	hayfever	Num	8	BEST12.	BEST32.
28	hbp	Num	8	BEST12.	BEST32.
49	hbpmed	Num	8	BEST12.	BEST32.
43	headache	Num	8	BEST12.	BEST32.
31	hepatitis	Num	8	BEST12.	BEST32.
27	hf	Num	8	BEST12.	BEST32.
58	hightax82	Num	8	BEST12.	BEST32.
16	ht	Num	8	BEST12.	BEST32.
12	income	Num	8	BEST12.	BEST32.
52	infection	Num	8	BEST12.	BEST32.
48	lackpep	Num	8	BEST12.	BEST32.
13	marital	Num	8	BEST12.	BEST32.
5	modth	Num	8	BEST12.	BEST32.
47	nerves	Num	8	BEST12.	BEST32.
37	nervousbreak	Num	8	BEST12.	BEST32.
44	otherpain	Num	8	BEST12.	BEST32.
29	pepticulcer	Num	8	BEST12.	BEST32.
42	pica	Num	8	BEST12.	BEST32.
35	polio	Num	8	BEST12.	BEST32.
56	pregnancies	Num	8	BEST12.	BEST32.
59	price71	Num	8	BEST12.	BEST32.
60	price82	Num	8	BEST12.	BEST32.
63	price71_82	Num	8	BEST12.	BEST32.
2	qsmk	Num	8	BEST12.	BEST32.
11	race	Num	8	BEST12.	BEST32.
7	sbp	Num	8	BEST12.	BEST32.
14	school	Num	8	BEST12.	BEST32.
1	seqn	Num	8	BEST12.	BEST32.
9	sex	Num	8	BEST12.	BEST32.
22	smkintensity82_71	Num	8	BEST12.	BEST32.
21	smokeintensity	Num	8	BEST12.	BEST32.
23	smokeyrs	Num	8	BEST12.	BEST32.
61	tax71	Num	8	BEST12.	BEST32.
62	tax82	Num	8	BEST12.	BEST32.
64	tax71_82	Num	8	BEST12.	BEST32.
26	tb	Num	8	BEST12.	BEST32.
36	tumor	Num	8	BEST12.	BEST32.
45	weakheart	Num	8	BEST12.	BEST32.
17	wt71	Num	8	BEST12.	BEST32.
18	wt82	Num	8	BEST12.	BEST32.
19	wt82_71	Num	8	BEST12.	BEST32.
51	wtloss	Num	8	BEST12.	BEST32.
4	yrdth	Num	8	BEST12.	BEST32.

* print first five rows of dataset;
proc print data=nhefs(obs=5); run;

Obs	seqn	death	yrdth	modth	dadth	sbp	dbp	sex	age	race	income	marital	school	education	ht	wt71	wt82	wt82_71	birthplace	smokeintensity	smkintensity82_71	smokeyrs	hbp	pepticulcer	hayfever	diabetes	tumor	alcoholpy	alcoholfreq	alcoholtype	alcoholhowmuch	headache	otherpain	weakheart	nerves	hbpmed	boweltrouble	infection	active	exercise	birthcontrol	pregnancies	cholesterol	price71	price82	tax71	tax82	price71_82	tax71_82
1	233	0	.	.	.	175	96	0	42	1	19	2	7	1	174.1875	79.04	68.94604024	-10.09395976	47	30	-10	29	1	1	0	1	0	1	1	3	7	1	0	0	0	1	0	0	0	2	2	.	197	2.18359375	1.7399902344	1.1022949219	0.4619750977	0.4437866211	0.6403808594
2	235	0	.	.	.	123	80	0	36	0	18	2	9	2	159.375	58.63	61.23496995	2.60496995	42	20	-10	24	0	0	0	0	0	1	0	1	4	1	0	0	0	0	0	1	0	0	2	.	301	2.3466796875	1.7973632813	1.3649902344	0.5718994141	0.5493164063	0.79296875
3	244	0	.	.	.	115	75	1	56	1	15	3	11	2	168.5	56.81	66.22448602	9.41448602	51	20	-14	26	0	0	1	0	1	1	3	4	.	1	1	0	1	0	0	0	0	2	0	2	157	1.5695800781	1.5134277344	0.5512695313	0.2309875488	0.0561981201	0.3202514648
4	245	1	85	2	14	148	78	0	68	1	15	3	5	1	170.1875	59.42	64.41011654	4.99011654	37	3	4	53	1	0	0	0	0	1	2	3	4	0	1	1	0	0	0	0	1	2	2	.	174	1.5065917969	1.4519042969	0.5249023438	0.2199707031	0.0547943115	0.3049926758
5	252	0	.	.	.	118	77	0	40	0	18	2	11	2	181.875	87.09	92.07925111	4.98925111	42	20	0	19	0	0	0	0	0	1	2	1	2	1	0	0	0	0	1	0	1	1	2	.	216	2.3466796875	1.7973632813	1.3649902344	0.5718994141	0.5493164063	0.79296875

For now, let’s focus on observations (rows) that don’t have any missing data. To do this, let’s get a summary of how much missingness we have across all observations.

proc means data=nhefs nmiss STACKODSOUTPUT;
    ods output summary=nhefs_miss;
run;

Variable	N Miss
seqn	0
qsmk	0
death	0
yrdth	1311
modth	1307
dadth	1307
sbp	77
dbp	81
sex	0
age	0
race	0
income	62
marital	0
school	0
education	0
ht	0
wt71	0
wt82	63
wt82_71	63
birthplace	92
smokeintensity	0
smkintensity82_71	0
smokeyrs	0
asthma	0
bronch	0
tb	0
hf	0
hbp	0
pepticulcer	0
colitis	0
hepatitis	0
chroniccough	0
hayfever	0
diabetes	0
polio	0
tumor	0
nervousbreak	0
alcoholpy	0
alcoholfreq	0
alcoholtype	0
alcoholhowmuch	417
pica	0
headache	0
otherpain	0
weakheart	0
allergies	0
nerves	0
lackpep	0
hbpmed	0
boweltrouble	0
wtloss	0
infection	0
active	0
exercise	0
birthcontrol	0
pregnancies	903
cholesterol	16
hightax82	92
price71	92
price82	92
tax71	92
tax82	92
price71_82	92
tax71_82	92

proc print data=nhefs_miss; run;

Obs	Variable	NMiss
1	seqn	0
2	qsmk	0
3	death	0
4	yrdth	1311
5	modth	1307
6	dadth	1307
7	sbp	77
8	dbp	81
9	sex	0
10	age	0
11	race	0
12	income	62
13	marital	0
14	school	0
15	education	0
16	ht	0
17	wt71	0
18	wt82	63
19	wt82_71	63
20	birthplace	92
21	smokeintensity	0
22	smkintensity82_71	0
23	smokeyrs	0
24	asthma	0
25	bronch	0
26	tb	0
27	hf	0
28	hbp	0
29	pepticulcer	0
30	colitis	0
31	hepatitis	0
32	chroniccough	0
33	hayfever	0
34	diabetes	0
35	polio	0
36	tumor	0
37	nervousbreak	0
38	alcoholpy	0
39	alcoholfreq	0
40	alcoholtype	0
41	alcoholhowmuch	417
42	pica	0
43	headache	0
44	otherpain	0
45	weakheart	0
46	allergies	0
47	nerves	0
48	lackpep	0
49	hbpmed	0
50	boweltrouble	0
51	wtloss	0
52	infection	0
53	active	0
54	exercise	0
55	birthcontrol	0
56	pregnancies	903
57	cholesterol	16
58	hightax82	92
59	price71	92
60	price82	92
61	tax71	92
62	tax82	92
63	price71_82	92
64	tax71_82	92

Let’s drop every row with a missing value for any variable:

Complete case code adapted from an example on SAS blogs.

/* h/t https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */

data nhefs2;
    set nhefs;
    if cmiss(of _ALL_)=0;
run;

proc contents data=nhefs2; run;

Data Set Name	WORK.NHEFS2	Observations	44
Member Type	DATA	Variables	64
Engine	V9	Indexes	0
Created	12/15/2022 20:38:45	Observation Length	512
Last Modified	12/15/2022 20:38:45	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	latin1 Western (ISO)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	127
Obs in First Data Page	44
Number of Data Set Repairs	0
Filename	/tmp/SAS_work1D400000634E_jrgant-AW/nhefs2.sas7bdat
Release Created	9.0401M7
Host Created	Linux
Inode Number	7998383
Access Permission	rw-rw-r–
Owner Name	jrgant
File Size	128KB
File Size (bytes)	131072

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
53	active	Num	8	BEST12.	BEST32.
10	age	Num	8	BEST12.	BEST32.
39	alcoholfreq	Num	8	BEST12.	BEST32.
41	alcoholhowmuch	Num	8	BEST12.	BEST32.
38	alcoholpy	Num	8	BEST12.	BEST32.
40	alcoholtype	Num	8	BEST12.	BEST32.
46	allergies	Num	8	BEST12.	BEST32.
24	asthma	Num	8	BEST12.	BEST32.
55	birthcontrol	Num	8	BEST12.	BEST32.
20	birthplace	Num	8	BEST12.	BEST32.
50	boweltrouble	Num	8	BEST12.	BEST32.
25	bronch	Num	8	BEST12.	BEST32.
57	cholesterol	Num	8	BEST12.	BEST32.
32	chroniccough	Num	8	BEST12.	BEST32.
30	colitis	Num	8	BEST12.	BEST32.
6	dadth	Num	8	BEST12.	BEST32.
8	dbp	Num	8	BEST12.	BEST32.
3	death	Num	8	BEST12.	BEST32.
34	diabetes	Num	8	BEST12.	BEST32.
15	education	Num	8	BEST12.	BEST32.
54	exercise	Num	8	BEST12.	BEST32.
33	hayfever	Num	8	BEST12.	BEST32.
28	hbp	Num	8	BEST12.	BEST32.
49	hbpmed	Num	8	BEST12.	BEST32.
43	headache	Num	8	BEST12.	BEST32.
31	hepatitis	Num	8	BEST12.	BEST32.
27	hf	Num	8	BEST12.	BEST32.
58	hightax82	Num	8	BEST12.	BEST32.
16	ht	Num	8	BEST12.	BEST32.
12	income	Num	8	BEST12.	BEST32.
52	infection	Num	8	BEST12.	BEST32.
48	lackpep	Num	8	BEST12.	BEST32.
13	marital	Num	8	BEST12.	BEST32.
5	modth	Num	8	BEST12.	BEST32.
47	nerves	Num	8	BEST12.	BEST32.
37	nervousbreak	Num	8	BEST12.	BEST32.
44	otherpain	Num	8	BEST12.	BEST32.
29	pepticulcer	Num	8	BEST12.	BEST32.
42	pica	Num	8	BEST12.	BEST32.
35	polio	Num	8	BEST12.	BEST32.
56	pregnancies	Num	8	BEST12.	BEST32.
59	price71	Num	8	BEST12.	BEST32.
60	price82	Num	8	BEST12.	BEST32.
63	price71_82	Num	8	BEST12.	BEST32.
2	qsmk	Num	8	BEST12.	BEST32.
11	race	Num	8	BEST12.	BEST32.
7	sbp	Num	8	BEST12.	BEST32.
14	school	Num	8	BEST12.	BEST32.
1	seqn	Num	8	BEST12.	BEST32.
9	sex	Num	8	BEST12.	BEST32.
22	smkintensity82_71	Num	8	BEST12.	BEST32.
21	smokeintensity	Num	8	BEST12.	BEST32.
23	smokeyrs	Num	8	BEST12.	BEST32.
61	tax71	Num	8	BEST12.	BEST32.
62	tax82	Num	8	BEST12.	BEST32.
64	tax71_82	Num	8	BEST12.	BEST32.
26	tb	Num	8	BEST12.	BEST32.
36	tumor	Num	8	BEST12.	BEST32.
45	weakheart	Num	8	BEST12.	BEST32.
17	wt71	Num	8	BEST12.	BEST32.
18	wt82	Num	8	BEST12.	BEST32.
19	wt82_71	Num	8	BEST12.	BEST32.
51	wtloss	Num	8	BEST12.	BEST32.
4	yrdth	Num	8	BEST12.	BEST32.

Oops, we have only 44 rows left in the data set! Perhaps we should reconsider and pick several variables that will be of interest to us for our current purposes.

%let selectvars = age alcoholfreq asthma wt71 smokeintensity sbp allergies weakheart;

data nhefs3;
    set nhefs(keep=&selectvars);
    if cmiss(of _ALL_)=0;
run;

proc contents data=nhefs3; run;

Data Set Name	WORK.NHEFS3	Observations	1552
Member Type	DATA	Variables	8
Engine	V9	Indexes	0
Created	12/15/2022 20:38:45	Observation Length	64
Last Modified	12/15/2022 20:38:45	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	latin1 Western (ISO)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	2
First Data Page	1
Max Obs per Page	1021
Obs in First Data Page	977
Number of Data Set Repairs	0
Filename	/tmp/SAS_work4619000063F6_jrgant-AW/nhefs3.sas7bdat
Release Created	9.0401M7
Host Created	Linux
Inode Number	7998383
Access Permission	rw-rw-r–
Owner Name	jrgant
File Size	192KB
File Size (bytes)	196608

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	age	Num	8	BEST12.	BEST32.
6	alcoholfreq	Num	8	BEST12.	BEST32.
8	allergies	Num	8	BEST12.	BEST32.
5	asthma	Num	8	BEST12.	BEST32.
1	sbp	Num	8	BEST12.	BEST32.
4	smokeintensity	Num	8	BEST12.	BEST32.
7	weakheart	Num	8	BEST12.	BEST32.
3	wt71	Num	8	BEST12.	BEST32.

That’s better. We’ve discarded 77 rows and focused on a subset of variables of interest.

Before we do some exploratory data analysis, let’s recall that category 5 for one of the variables we selected (alcoholfreq) indicated that alcohol frequency for that individual was unknown. This is missing data, and we should recode the alcoholfreq variable so that SAS recognizes these observations as such.

In addition, SAS currently believes that alcoholfreq is a numeric variable, though it is a categorical variable.

We can add a new variable to our nhefs3 data set that recodes alcoholfreq to meet our needs. We’ll name our new variable alcfreqcat

CHARACTER VARIABLES How did we get alcfreqcat to be a character variable when we based it on the numeric variable alcoholfreq? Notice in the data step that we initialized alcfreqcat using '.' instead of .. Wrapping the dot in quotes indicated to SAS that we wanted it to treat alfreqcat as a character variable.

data nhefs3;
    set nhefs3;
    if alcoholfreq = 5 then alcfreqcat = '.';
    else alcfreqcat = alcoholfreq;
run;

proc freq data=nhefs3;
    table alcfreqcat*alcoholfreq;
run;

0.00

0.32

100.00

0.32

320

20.62

100.00

0.00

320

20.62

0.00

217

13.98

100.00

0.00

217

13.98

0.00

489

31.51

100.00

0.00

489

31.51

0.00

329

21.20

100.00

0.00

329

21.20

0.00

192

12.37

100.00

0.00

192

12.37

Total

320

20.62

217

13.98

489

31.51

329

21.20

192

12.37

0.32

1552

100.00

proc contents data=nhefs3; run;

Data Set Name	WORK.NHEFS3	Observations	1552
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	12/15/2022 20:38:46	Observation Length	72
Last Modified	12/15/2022 20:38:46	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	latin1 Western (ISO)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	2
First Data Page	1
Max Obs per Page	908
Obs in First Data Page	866
Number of Data Set Repairs	0
Filename	/tmp/SAS_work71760000654C_jrgant-AW/nhefs3.sas7bdat
Release Created	9.0401M7
Host Created	Linux
Inode Number	7998381
Access Permission	rw-rw-r–
Owner Name	jrgant
File Size	192KB
File Size (bytes)	196608

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	age	Num	8	BEST12.	BEST32.
9	alcfreqcat	Char	1
6	alcoholfreq	Num	8	BEST12.	BEST32.
8	allergies	Num	8	BEST12.	BEST32.
5	asthma	Num	8	BEST12.	BEST32.
1	sbp	Num	8	BEST12.	BEST32.
4	smokeintensity	Num	8	BEST12.	BEST32.
7	weakheart	Num	8	BEST12.	BEST32.
3	wt71	Num	8	BEST12.	BEST32.

Although SAS will automatically throw out observations with missing data when running regressions and other procedures, it will do so based on which variables are included in model formulas. So that we conduct all the analyses below on the same subset of the NHEFS data, let’s drop any rows from our data set with missing values of alcfreqcat and drop the original alcoholfreq variable.

data nhefs4;
    set nhefs3(where=(alcfreqcat ne '.') drop=alcoholfreq);
run;

proc print data=nhefs4(obs=5); run;

Obs	sbp	age	wt71	smokeintensity	weakheart	alcfreqcat
1	175	42	79.04	30	0	1
2	123	36	58.63	20	0	0
3	115	56	56.81	20	0	3
4	148	68	59.42	3	1	2
5	118	40	87.09	20	0	2

Let’s summarize our dataset:

* save list of continuous variables and summarize;
%let contvars = age wt71 smokeintensity sbp;
proc means data=nhefs4;
    var &contvars;
run;

age
wt71
smokeintensity
sbp

1547

43.6528765

70.9031157

20.5416936

128.7039431

12.0298947

15.3891998

11.7258480

19.0608817

25.0000000

39.5800000

1.0000000

87.0000000

74.0000000

151.7300000

80.0000000

229.0000000

* save list of categorical variables and summarize;
%let catvars = asthma allergies alcfreqcat weakheart;
proc freq data=nhefs4;
    tables &catvars;
run;

asthma	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	1474	95.28	1474	95.28
1	73	4.72	1547	100.00

allergies	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	1448	93.60	1448	93.60
1	99	6.40	1547	100.00

alcfreqcat	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	320	20.69	320	20.69
1	217	14.03	537	34.71
2	489	31.61	1026	66.32
3	329	21.27	1355	87.59
4	192	12.41	1547	100.00

weakheart	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	1512	97.74	1512	97.74
1	35	2.26	1547	100.00

7.3 Tabular Analysis

Soon you will be no stranger to two-by-two tables and contingency tables. Introductory statistics and epidemiologic methods courses will require you to analyze such tables, either to measure the strength of an exposure’s effect on some outcome or to establish a statistical association between two variables.

Going back to our NHEFS data, let’s say we’re interested in a possible association between quitting smoking and weight loss. The qsmk variable is a binary indicator of whether or not an individual quit smoking between 1971 and 1982, while the sbp variable measures an individual’s change in weight (kilograms) between 1971 and 1982.

Typically, we would not want to dichotomize a continuous variable like sbp, but for the sake of this example, let’s begin by creating a binary indicator of whether a subject gained weight during this timeframe.

data nhefs;
    set nhefs;
    if sbp > 140 then sbp_hi = 1;
    else if sbp <= 140 then sbp_hi = 0;
    if sbp = . then sbp_hi = .;
run;

proc freq data=nhefs;
    table sbp_hi;
run;

sbp_hi	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	1192	76.80	1192	76.80
1	360	23.20	1552	100.00
Frequency Missing = 77

The summary statistics we generated above should comply with our expectations: those we marked as having a high systolic blood pressure have values of sbp above 140 mm/Hg, while those we marked as not having high systolic blood pressure all have values of sbp less than or equal to 140 mm/Hg.

Note, too, that we have 63 missing values in our new variable, just as we saw in the original outcome.

proc freq data=nhefs;
    table sbp_hi * qsmk /chisq;
run;

908

58.51

76.17

78.34

284

18.30

23.83

72.26

1192

76.80

251

16.17

69.72

21.66

109

7.02

30.28

27.74

360

23.20

Total

1159

74.68

393

25.32

1552

100.00

Statistics for Table of sbp_hi by qsmk

Statistic	DF	Value	Prob
Chi-Square	1	6.0872	0.0136
Likelihood Ratio Chi-Square	1	5.9256	0.0149
Continuity Adj. Chi-Square	1	5.7508	0.0165
Mantel-Haenszel Chi-Square	1	6.0833	0.0136
Phi Coefficient		0.0626
Contingency Coefficient		0.0625
Cramer's V		0.0626

Fisher's Exact Test
Cell (1,1) Frequency (F)	908
Left-sided Pr <= F	0.9939
Right-sided Pr >= F	0.0088

Table Probability (P)	0.0028
Two-sided Pr <= P	0.0155

Sample Size = 1552
Frequency Missing = 77

A natural choice here would be to conduct a two-sample test for the equality of proportions, where we compare the proportion of subjects in the quit smoking group with high systolic blood pressure versus the same proportion in the non-quit smoking group.

The output above shows us the group proportions, the $\chi^2$ test statistic (elicited by specifying the table option chisq), and the corresponding p-value. We also get a 95% confidence interval for the difference in proportions.

7.4 Regression Models

In this section we will fit some regression models that you may encounter in classes or in your own research. We will leave the theoretical background to your statistics and methods courses but aim to provide enough information so that you understand the basic idea behind each approach and can refer back to this page in the future.

All upcoming examples use variables from the NHEFS dataset that we’ve referred back to several times by now.

7.4.1 Linear regression

Exposure of interest: quitting smoking between 1971 and 1982 (qsmk)
Outcome: change in weight between 1971 and 1982 (continuous, sbp)

First, we can run an unadjusted model with weight change as the outcome and quitting smoking as the predictor variable.

ods graphics off;
proc reg data=nhefs;
    model sbp = qsmk /clb; * /clb adds confidence limits to output table;
quit;

Model: MODEL1

Dependent Variable: sbp

Number of Observations Read	1629
Number of Observations Used	1552
Number of Observations with Missing Values	77

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	4673.90689	4673.90689	12.98	0.0003
Error	1550	558280	360.18067
Corrected Total	1551	562954

Root MSE	18.97843	R-Square	0.0083
Dependent Mean	128.70941	Adj R-Sq	0.0077
Coeff Var	14.74517

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|	95% Confidence Limits
Intercept	1	127.69888	0.55747	229.07	<.0001	126.60541	128.79235
qsmk	1	3.99069	1.10782	3.60	0.0003	1.81771	6.16367

SAS’s PROC REG does not handle product terms or categorical variables within the model statement itself. Let’s say we wanted to fit an adjusted model and include a product term to test an “interaction” between sex and age. First, we have to add the interaction variable to our nhefs dataset, as shown below, and include the new variable the model formula.

The code below fits a linear model for the effect of qsmk on sbp adjusted for years of smoking (smokeyrs), intensity of smoking (smokeintensity), diabetes, sex, age, and the “interaction” between sex and age (sex_age).

data nhefs;
    set nhefs;
    sex_age = sex * age;
run;

ods graphics off;
proc reg data=nhefs;
    model sbp = qsmk smokeyrs smokeintensity diabetes sex age sex_age /clb;
quit;

Model: MODEL1

Dependent Variable: sbp

Number of Observations Read	1629
Number of Observations Used	1552
Number of Observations with Missing Values	77

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	7	98121	14017	46.56	<.0001
Error	1544	464833	301.05758
Corrected Total	1551	562954

Root MSE	17.35101	R-Square	0.1743
Dependent Mean	128.70941	Adj R-Sq	0.1706
Coeff Var	13.48076

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|	95% Confidence Limits
Intercept	1	113.85479	2.95509	38.53	<.0001	108.05838	119.65120
qsmk	1	1.51803	1.03207	1.47	0.1415	-0.50638	3.54245
smokeyrs	1	0.06761	0.08043	0.84	0.4007	-0.09016	0.22537
smokeintensity	1	-0.06613	0.03933	-1.68	0.0929	-0.14328	0.01102
diabetes	1	-0.62827	0.44678	-1.41	0.1599	-1.50463	0.24809
sex	1	-16.63281	3.36583	-4.94	<.0001	-23.23489	-10.03073
age	1	0.39023	0.09439	4.13	<.0001	0.20507	0.57538
sex_age	1	0.28503	0.07558	3.77	0.0002	0.13678	0.43329

SAS offers two additional ways to fit a linear regression, both of which handle categorical variables and interaction terms directly, unlike PROC REG. These options are PROC GLM and PROC GENMOD. We’ll focus on PROC GENMOD here.

proc genmod data=nhefs;
    model sbp = qsmk /link=identity dist=normal;
run;

Model Information
Data Set	WORK.NHEFS
Distribution	Normal
Link Function	Identity
Dependent Variable	sbp

Number of Observations Read	1629
Number of Observations Used	1552
Missing Values	77

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	1550	558280.0358	360.1807
Scaled Deviance	1550	1552.0000	1.0013
Pearson Chi-Square	1550	558280.0358	360.1807
Scaled Pearson X2	1550	1552.0000	1.0013
Log Likelihood		-6769.1980
Full Log Likelihood		-6769.1980
AIC (smaller is better)		13544.3961
AICC (smaller is better)		13544.4116
BIC (smaller is better)		13560.4380

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	127.6989	0.5571	126.6070	128.7908	52540.9	<.0001
qsmk	1	3.9907	1.1071	1.8208	6.1606	12.99	0.0003
Scale	1	18.9662	0.3404	18.3106	19.6453

Note:

The scale parameter was estimated by maximum likelihood.

proc genmod data=nhefs;
    model sbp = qsmk smokeyrs smokeintensity diabetes sex:age /link=identity dist=normal;
run;

Model Information
Data Set	WORK.NHEFS
Distribution	Normal
Link Function	Identity
Dependent Variable	sbp

Number of Observations Read	1629
Number of Observations Used	1552
Missing Values	77

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	1544	464832.8969	301.0576
Scaled Deviance	1544	1552.0000	1.0052
Pearson Chi-Square	1544	464832.8969	301.0576
Scaled Pearson X2	1544	1552.0000	1.0052
Log Likelihood		-6627.0482
Full Log Likelihood		-6627.0482
AIC (smaller is better)		13272.0965
AICC (smaller is better)		13272.2132
BIC (smaller is better)		13320.2222

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	113.8548	2.9475	108.0779	119.6317	1492.13	<.0001
qsmk	1	1.5180	1.0294	-0.4996	3.5356	2.17	0.1403
smokeyrs	1	0.0676	0.0802	-0.0896	0.2248	0.71	0.3994
smokeintensity	1	-0.0661	0.0392	-0.1430	0.0108	2.84	0.0919
diabetes	1	-0.6283	0.4456	-1.5017	0.2451	1.99	0.1586
sex	1	-16.6328	3.3571	-23.2127	-10.0529	24.55	<.0001
sex_age	1	0.2850	0.0754	0.1373	0.4328	14.29	0.0002
age	1	0.3902	0.0942	0.2057	0.5748	17.18	<.0001
Scale	1	17.3062	0.3106	16.7080	17.9259

Note:

The scale parameter was estimated by maximum likelihood.

7.4.2 Logistic regression

Let’s revisit our tabular analysis, in which we were interested in whether quitting smoking affected the likelihood of a subject’s gaining weight between 1971 and 1982.

We can answer the same question using a logistic regression model, using the glm() function.

proc logistic data=nhefs;
    model sbp_hi(event='1') = qsmk;
    estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;

Model Information
Data Set	WORK.NHEFS
Response Variable	sbp_hi
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring

Number of Observations Read	1629
Number of Observations Used	1552

Response Profile
Ordered Value	sbp_hi	Total Frequency
1	0	1192
2	1	360

Probability modeled is sbp_hi=1.

Note:

77 observations were deleted due to missing values for the response or explanatory variables.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	1683.227	1679.301
SC	1688.574	1689.996
-2 Log L	1681.227	1675.301

Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	5.9256	1	0.0149
Score	6.0872	1	0.0136
Wald	6.0586	1	0.0138

Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	1	-1.2858	0.0713	325.0998	<.0001
qsmk	1	0.3282	0.1333	6.0586	0.0138

Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
qsmk	1.388	1.069	1.803

Association of Predicted Probabilities and Observed Responses
Percent Concordant	23.1	Somers' D	0.065
Percent Discordant	16.6	Gamma	0.163
Percent Tied	60.3	Tau-a	0.023
Pairs	429120	c	0.532

Estimate
Label	Estimate	Standard Error	z Value	Pr > \|z\|	Exponentiated
qsmk (1 vs. 0)	0.3282	0.1333	2.46	0.0138	1.3885

proc genmod data=nhefs;
    model sbp_hi(ref='0') = qsmk /link=logit dist=bin;
    estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;

Model Information
Data Set	WORK.NHEFS
Distribution	Binomial
Link Function	Logit
Dependent Variable	sbp_hi

Number of Observations Read	1629
Number of Observations Used	1552
Number of Events	360
Number of Trials	1552
Missing Values	77

Response Profile
Ordered Value	sbp_hi	Total Frequency
1	1	360
2	0	1192

PROC GENMOD is modeling the probability that sbp_hi=‘1’. One way to change this to model the probability that sbp_hi=‘0’ is to specify the DESCENDING option in the PROC statement.

Parameter Information
Parameter	Effect
Prm1	Intercept
Prm2	qsmk

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Log Likelihood		-837.6506
Full Log Likelihood		-837.6506
AIC (smaller is better)		1679.3012
AICC (smaller is better)		1679.3089
BIC (smaller is better)		1689.9958

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	-1.2858	0.0713	-1.4256	-1.1460	325.10	<.0001
qsmk	1	0.3282	0.1333	0.0668	0.5895	6.06	0.0139
Scale	0	1.0000	0.0000	1.0000	1.0000

Note:

The scale parameter was held fixed.

Contrast Estimate Results
Label	Mean Estimate	Mean		L'Beta Estimate	Standard Error	Alpha	L'Beta		Chi-Square	Pr > ChiSq
Label	Mean Estimate	Confidence Limits		L'Beta Estimate	Standard Error	Alpha	Confidence Limits		Chi-Square	Pr > ChiSq
qsmk (1 vs. 0)	0.5813	0.5167	0.6433	0.3282	0.1333	0.05	0.0668	0.5895	6.06	0.0139
Exp(qsmk (1 vs. 0))				1.3884	0.1851	0.05	1.0691	1.8031

As in the linear regression example, you could run a model adjusting for other factors, in which case you would modify the model formula accordingly. We will forego running an adjusted model here to avoid redundancy.

7.4.3 Log-binomial regression

While the logistic regression remains the most frequently used generalized linear model (GLM) for binary outcomes, epidemiologists are often interested in risk differences and risk ratios.

Luckily, we can use the log-binomial model to estimate contrasts in risks.

Try this Calculate the risk ratio manually and convince yourself that the log-binomial model gave you the correct answer.

proc genmod data=nhefs;
    model sbp_hi(ref='0') = qsmk /link=log dist=bin;
    estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;

Model Information
Data Set	WORK.NHEFS
Distribution	Binomial
Link Function	Log
Dependent Variable	sbp_hi

Number of Observations Read	1629
Number of Observations Used	1552
Number of Events	360
Number of Trials	1552
Missing Values	77

Response Profile
Ordered Value	sbp_hi	Total Frequency
1	1	360
2	0	1192

PROC GENMOD is modeling the probability that sbp_hi=‘1’. One way to change this to model the probability that sbp_hi=‘0’ is to specify the DESCENDING option in the PROC statement.

Parameter Information
Parameter	Effect
Prm1	Intercept
Prm2	qsmk

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Log Likelihood		-837.6506
Full Log Likelihood		-837.6506
AIC (smaller is better)		1679.3012
AICC (smaller is better)		1679.3089
BIC (smaller is better)		1689.9958

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	-1.5299	0.0559	-1.6394	-1.4204	749.85	<.0001
qsmk	1	0.2474	0.0987	0.0539	0.4409	6.28	0.0122
Scale	0	1.0000	0.0000	1.0000	1.0000

Note:

The scale parameter was held fixed.

Contrast Estimate Results
Label	Mean Estimate	Mean		L'Beta Estimate	Standard Error	Alpha	L'Beta		Chi-Square	Pr > ChiSq
Label	Mean Estimate	Confidence Limits		L'Beta Estimate	Standard Error	Alpha	Confidence Limits		Chi-Square	Pr > ChiSq
qsmk (1 vs. 0)	1.2807	1.0553	1.5542	0.2474	0.0987	0.05	0.0539	0.4409	6.28	0.0122
Exp(qsmk (1 vs. 0))				1.2807	0.1265	0.05	1.0553	1.5542

To get the risk ratio for the effect of quitting smoking on weight gain, we again exponentiate the beta coefficient of interest: those who quit smoking were 1.28 times as likely to have high systolic blood pressure as those who did not quit smoking.

7.4.4 Modified Poisson regression

Another way to get a risk ratio using GLMs involves a procedure called modified Poisson regression. “Modified” here refers to a necessary modification of the standard error estimates when using a Poisson regression model to estimate a risk ratio. Again, save the statistical details for later. All we want to know for the moment are the nuts-and-bolts: how to fit the model and how to modify the standard errors.

proc genmod data=nhefs;
    class seqn;
    model sbp_hi(ref='0') = qsmk /link=log dist=poisson;
    repeated subject=seqn;
    estimate 'qsmk (1 vs. 0)' qsmk 1 /exp;
run;

Model Information
Data Set	WORK.NHEFS
Distribution	Poisson
Link Function	Log
Dependent Variable	sbp_hi

Number of Observations Read	1629
Number of Observations Used	1552
Missing Values	77

Class Level Information
Class	Levels	Values
seqn	1629	233 235 244 245 252 257 262 266 419 420 428 431 434 443 446 455 457 596 603 604 605 616 618 619 620 804 806 813 816 818 825 828 831 1094 1096 1101 1103 1104 1106 1109 1116 1120 1124 1126 1127 1129 1133 1135 1148 1476 1480 1498 1505 1513 1515 1519 1523 …

Parameter Information
Parameter	Effect
Prm1	Intercept
Prm2	qsmk

Algorithm converged.

GEE Model Information
Correlation Structure	Independent
Subject Effect	seqn (1629 levels)
Number of Clusters	1629
Clusters With Missing Values	77
Correlation Matrix Dimension	1
Maximum Cluster Size	1
Minimum Cluster Size	0

Algorithm converged.

GEE Fit Criteria
QIC	2302.3462
QICu	2302.4294

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Parameter	Estimate	Standard Error	95% Confidence Limits		Z	Pr > \|Z\|
Intercept	-1.5299	0.0559	-1.6394	-1.4204	-27.38	<.0001
qsmk	0.2474	0.0987	0.0539	0.4409	2.51	0.0122

Contrast Estimate Results
Label	Mean Estimate	Mean		L'Beta Estimate	Standard Error	Alpha	L'Beta		Chi-Square	Pr > ChiSq
Label	Mean Estimate	Confidence Limits		L'Beta Estimate	Standard Error	Alpha	Confidence Limits		Chi-Square	Pr > ChiSq
qsmk (1 vs. 0)	1.2807	1.0553	1.5542	0.2474	0.0987	0.05	0.0539	0.4409	6.28	0.0122
Exp(qsmk (1 vs. 0))				1.2807	0.1265	0.05	1.0553	1.5542

In the prior chapter, which showcased the log-binomial and modified Poisson model fitting procedures in R, we noted that the “modified” in modified Poisson refers to the adjusted of the standard error. In order to fit the modified Poisson model correctly in SAS, you must specify the repeated option as above, where the argument passed to subject is the unique observation identifier.