# Research Project (3): Data Management for SAS

I. Introduction/ II. Research question (revised)/III. Variables

IV. Data Management

1. Code out missing data

2. Code in valid data

3. Create secondary variables

4. Group values

V. CODE/ VI. RESULTS

I. Introduction

This post is for the week 3 assignment of the Coursera course Data Management and Visualization by Wesleyan University.

Instructions for this assignment:

STEP 1: Make and implement data management decisions for the variables you selected.

Data management includes such things as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. Not everyone does all of these, but some is required.

STEP 2: Run frequency distributions for your chosen variables and select columns, and possibly rows.

Your output should be interpretable (i.e. organized and labeled).

II. Research question (revised)

Among young adults aged 18-26 who have met symptom criteria for major depression and have sought help, is help-seeking behavior associated with history of suicide attempts?

*“Help-seeking behavior” includes two aspects: (1) the time interval between the age at first episode and the age when help was first sought; (2) whether help was sought once or more than once. Those who did not seek help at all are not included in my study.

I revised my research question for two reasons.

1. Only participants whose worst period met symptom criteria for major depression were asked the “age of seeking help for the first time” question, so I need to state that the chosen population is limited to young adults with major depression.
2. If I look only at the age of seeking help for the first time, I cannot tell whether help was sought right after the first major depressive episode or after a long struggle with depression. It makes more sense to examine whether participants got timely help after the first episode and how many times they sought help. Therefore, I will create one secondary variable to represent the time interval between the age at first episode and the age when help was first sought, and another secondary variable to represent how many times they sought help.

III. Variables

NESARC Code book:

SECTION 4A: MAJOR DEPRESSION (LOW MOOD I)

IV. Data Management

1. Code out missing data

IF S4AQ4A16=9 THEN S4AQ4A16=.;
IF S4AQ6A=99 THEN S4AQ6A=.;
IF S4AQ19A=99 THEN S4AQ19A=.;
IF S4AQ19B=99 THEN S4AQ19B=.;

So observations with the value 9 or 99 on these variables are recoded as missing (.).
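As a minimal sketch, the same recode can be tried on a few invented rows before running it on the real data. The data set name `demo_missing`, the loop index `i`, the array name `ages`, and the input values are all assumptions made for illustration; they are not NESARC records. The array is just one compact way to apply the 99 rule to the three age variables.

```sas
/* Toy rows only: these values are invented to exercise the recode. */
DATA demo_missing;
  INPUT S4AQ4A16 S4AQ6A S4AQ19A S4AQ19B;

  /* 9 is recoded to missing for S4AQ4A16 */
  IF S4AQ4A16=9 THEN S4AQ4A16=.;

  /* 99 is recoded to missing for each of the three age variables */
  ARRAY ages{3} S4AQ6A S4AQ19A S4AQ19B;
  DO i=1 TO 3;
    IF ages{i}=99 THEN ages{i}=.;
  END;
  DROP i;

  DATALINES;
1 17 18 20
9 99 19 19
2 15 99 99
;
RUN;

PROC PRINT DATA=demo_missing;
RUN;
```

In the printed output, the second row should show missing values for S4AQ4A16 and S4AQ6A, and the third row should show missing values for S4AQ19A and S4AQ19B.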

2. Code in valid data

I had hoped to recover missing data for variable S4CQ17A (“Age at First Time Sought Help”) in order to identify those who did not seek help at all. But then I realized it would be hard to distinguish them from those who did not answer for other reasons. That is why I limited my study sample to those who had sought help at least once.

3. Create secondary variables

(1) INTERVAL

INTERVAL=S4AQ19A-S4AQ6A;

IF INTERVAL < 0 THEN INTERVAL=.;

INTERVAL is a secondary variable I created to represent the time interval between the age at first episode and the age when help was first sought.

A negative INTERVAL is not meaningful, so it is set to missing.
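A small sketch of how these two statements behave, again on invented rows (the data set name `demo_interval` and the values are assumptions for illustration). It also shows two SAS behaviors worth knowing here: subtracting with a missing operand gives a missing result, and a missing value compares as smaller than any number, so the `< 0` test is also true for an INTERVAL that is already missing.

```sas
/* Toy rows only: invented ages to show each case. */
DATA demo_interval;
  INPUT S4AQ6A S4AQ19A;

  /* Missing if either age is missing */
  INTERVAL=S4AQ19A-S4AQ6A;

  /* Negative intervals become missing; a missing INTERVAL also
     satisfies this test and simply stays missing */
  IF INTERVAL < 0 THEN INTERVAL=.;

  DATALINES;
17 17
15 20
22 19
. 18
;
RUN;

PROC PRINT DATA=demo_interval;
RUN;
```

The four rows should give INTERVAL values of 0, 5, missing, and missing.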

(2) NUMSH

IF S4AQ19A=. OR S4AQ19B=. THEN NUMSH=.;
ELSE IF S4AQ19A=S4AQ19B THEN NUMSH=1;
ELSE IF S4AQ19B-S4AQ19A > 0 THEN NUMSH=2;
ELSE NUMSH=.; /*S4AQ19B-S4AQ19A < 0, NUMSH=.*/

/*NUMSH=1,sought help only once*/
/*NUMSH=2,sought help twice or more*/
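To see each branch of the NUMSH logic, here is a minimal sketch with one invented row per case. It assumes, as the logic above implies, that S4AQ19B is the age at the most recent time help was sought; the data set name `demo_numsh` and the values are made up for illustration.

```sas
/* Toy rows only: one invented row per branch of the logic. */
DATA demo_numsh;
  INPUT S4AQ19A S4AQ19B;

  IF S4AQ19A=. OR S4AQ19B=. THEN NUMSH=.;    /* either age unknown */
  ELSE IF S4AQ19A=S4AQ19B THEN NUMSH=1;      /* first and latest age match */
  ELSE IF S4AQ19B-S4AQ19A > 0 THEN NUMSH=2;  /* latest age is later */
  ELSE NUMSH=.;                              /* latest age earlier than first */

  DATALINES;
18 18
18 22
22 18
. 20
;
RUN;

PROC PRINT DATA=demo_numsh;
RUN;
```

The four rows should give NUMSH values of 1, 2, missing, and missing.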

4. Group values

The value of INTERVAL ranges from 0 to 18, so I will divide the sample into three groups based on the time interval: interval=0, interval=[1, 2], interval=[3, 18]. Syntax:

IF INTERVAL=. THEN INTERVALGROUP=.;
ELSE IF INTERVAL=0 THEN INTERVALGROUP=1;
ELSE IF INTERVAL LE 2 THEN INTERVALGROUP=2;
ELSE INTERVALGROUP=3;
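One way to make the grouped output easier to read, and to check the grouping at the same time, is to attach value labels and cross-tabulate INTERVAL against INTERVALGROUP. The sketch below is only an illustration on invented values: the format name `intgrp`, the label wording, and the data set name `demo_group` are my assumptions.

```sas
/* Value labels for the three groups (label text is illustrative). */
PROC FORMAT;
  VALUE intgrp 1='Same year'
               2='1-2 years'
               3='3 or more years';
RUN;

/* Toy values only: boundaries of each group plus one missing. */
DATA demo_group;
  INPUT INTERVAL @@;
  IF INTERVAL=. THEN INTERVALGROUP=.;
  ELSE IF INTERVAL=0 THEN INTERVALGROUP=1;
  ELSE IF INTERVAL LE 2 THEN INTERVALGROUP=2;
  ELSE INTERVALGROUP=3;
  DATALINES;
0 1 2 3 18 .
;
RUN;

/* LIST shows each INTERVAL value next to its group;
   MISSING keeps the missing row in the table. */
PROC FREQ DATA=demo_group;
  TABLES INTERVAL*INTERVALGROUP / LIST MISSING;
  FORMAT INTERVALGROUP intgrp.;
RUN;
```

Each INTERVAL value should appear on exactly one line of the list, paired with the group it was assigned to, which makes a misplaced boundary easy to spot.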

V. CODE

LOG: No errors. No warnings.

`NOTE: There were 6466 observations read from the data set WORK.NEW.`

VI. RESULTS

First, below is part of the table produced by PROC PRINT. (I won’t put the whole table here, because it’s too long.)

From this table, I can check whether the logic statements worked for each observation (row). Yes, they did. For example, INTERVAL indeed equals S4AQ19A-S4AQ6A.

Second, below are the frequency tables of the five chosen variables:

From the frequency table of AGE, we can see that each age from 18 to 26 accounts for roughly 10% of the sample. The participants appear to be evenly distributed across these ages.

From the frequency table of Attempted Suicide, we can see that 12% (N=236) of these young adults attempted suicide, 88% (N=1769) never attempted suicide, and 4461 had missing data.

According to the frequency table of Time Interval, 5823 observations had missing data and 643 had valid data. The interval ranges from 0 to 18 years, and 59% of the valid observations had the value 0, indicating that these participants sought help for the first time in the same year they first had an episode of major depression. The frequency tends to decrease as the interval increases, and it is quite rare for someone to first seek help more than 9 years after their first episode.

As we can see from the frequency table of the three time-interval groups, 5823 observations had missing data. 380 participants in group 1 (59%) sought help for the first time within the same year as their first episode of major depression, 133 participants in group 2 (21%) sought help for the first time 1-2 years after their first episode, and 130 participants in group 3 (20%) waited 3 years or more after their first episode to seek help.

According to the frequency table for the number of times help was sought (NUMSH), 5821 observations had missing data, 369 (57%) participants sought help for major depression only once, and 276 (43%) sought help two or more times.