During data analysis frequently we have to manipulate data to generate, recode, or rename variables. Data management is an essential component of data analysis plan. In this blog post, I will show how to import a dataset, generate new variables, recode and rename variables.
Ideally, data management and data analysis are done using a do-file. Instead of typing commands in Stata, you can create a text file containing commands and instruct Stata to execute the commands stored in that file using the command do. Do-files are important because they allow you to replicate your results. I usually first run my analysis interactively and then copy my commands to a do-file.
I will create a separate blog post on creating do-files. For now, I will show how to perform data management steps interactively.
Importing health.csv File
The Health dataset is a small dataset based on a class survey I did few years ago. There are no personal identifiers in the dataset.
In the Stata command window, type:
insheet using health.csv, clear
You can also use the menu bar File > Import > Text Data (delimited, “.csv,..)
In the pop-up window browse to the folder where health.csv is located. Make sure variable case is selected as lower. Click OK.
The data is now imported into Stata. Click on the browse tab to browse the data. There should be 11 columns of data. Since we did not named any variable, Stata, by default, label variables as v1, v2,v3… etc.
Let’s name the 11 variables before we import data from health.csv. We will use the option clear (specified after a comma) to erase data from Stata memory. Refer to the health codesheet for variable names. Remember to use lowercase for all variable names. I never use uppercase for variable names.
insheet id age gender height weight health allergy liberal race alcohol eversmok using health.csv, clear
You will notice that v1, v2, … are replaced with actual variable names.
Now, save the dataset using the following command
save health.dta, replace
The option replace will overwrite any previous version of health.dta.
Before we do any type of data management let’s describe the data. In the command window type
and press the ENTER key.
As you can see the dataset contains 76 observations and 11 variables. All variables except Gender are numeric (byte, integer).
Gender is a string variable. Although, Stata will let you do some basic frequency distribution on string variables, you will need to convert it to numeric to run other analysis.
Renaming and labeling variables, and variable values
Renaming allows you to change the name of the variable. Labels allow you to document the variables with more detailed descriptions. I will go over how to label variables using the command line. Alternatively, you can also use the menu bar Data > Data Utilities but in my experience students find it easier to use Stata commands.
Rename a variable
If the variable name is not intuitive then it is helpful to change the variable name to something more descriptive. Although you can use long names, but I advise you to 1) limit the name to no more than 8 characters, 2) there must not be any space between letters, and 3) always use lowercase letters.
label variable id “Unique number for each participant”
label variable gender “Respondents’ sex”
Label rest of the variables in the dataset. Use the description provided in the codesheet.
Converting a string variable to a numeric variable
As stated earlier Stata will let you do some basic frequency distribution on string variables but you will need to convert it to numeric to run other types of analyses.
Let’s convert the variable gender to numeric format and name it sex. Then describe and tabulate the variable sex.
encode gender,generate (sex)
tabulate sex, nolabel
The variable sex is now a numeric variable (long format).
When you use the encode command Stata automatically generate value label.
A value label superimpose text on numeric values. In this case Stata generates a new numeric variable – sex – code it as 1 and 2 and then creates a value label – sex – with 1 as Female and 2 as Male.
As you can see it is often useful to give values of a variable a descriptive label.
What if you want to assign a value label to a numeric variable? This is a two step process:
- Define a label
- Attach the defined label to the variable
The variable ALLERGY in the dataset is a binary variable, 0=No, 1=Yes.
To attach No to 0 and Yes to 1 we will first define a label using the following Stata command.
label define yesno 0 “No” 1 “Yes”
“yesno” is the name of the label. You can use any name of your choice. Hit the ENTER key on the keyboard.
Next, you will attach the yesno label to 0 and 1 in the variable ALLERGY by typing the following command.
label values allergy yesno
You can tabulate the frequency distribution of ALLERGY to confirm if the value label successfully attached to the values.
Generating new variables
Generate creates a new variable. The values of the variable are specified by = expression. You can find out more about allowable expressions by looking up “functions” in the help menu.
When generating a new variable it is common to use if option with relational operators.
The if option uses logical expressions which are created with the following relational operators:
- == is equal to (double equal sign)
- ~= not equal to (can also use != or <.)
- > is greater than
- < is less than
- >= is greater than or equal to
- <= is less than or equal to
- & specifies AND
- | specifies OR
- ~ specifies NOT
You can create simple logical expressions, such as
age < 30 (i.e., age less than 30)
or more complex expressions such as
(age < 30 & sex==1) | (age<30 & sex==0)
age less than 30 and sex equals 1 OR age less than 30 and sex equals 0
For complicated expressions use parentheses to make sure expressions are evaluated in the order that you want.
Earlier you used generate command with encode. Now, you will use generate along with another stata command- replace.
Generate a new variable “age25” if age is greater than or equal to 25 years. First type in the command window:
Press the ENTER key on the keyboard. Next, type:
replace age25=1 if age>=25 & age<.
Remember to specify age <. (age not equal missing) otherwise all missing values of age will be coded as 1
To look at the new “age25” variable tabulate age25
Now label and value label the newly generated variable.
Generating a variable from two or more existing variables
We can also generate a variable using data from two (or more) existing variables. Let’s generate a new variable “bmi” using data from two existing variables “height” and “weight.” The body mass index formula I am using is 703.03 * weight in lbs. / height in inches2
Because the newly generated variable BMI is quantitative, we will use Stata command summarize.
Recode changes the values of a numeric variable and is often used to code variables into categories.
One of the cardinal rule of data management is NEVER to change your original variable. Always generate another variable and make any changes to that copy.
Let’s change the original variable sex (female=1 and male=0) to (female=0 and male=1) and save it in a new variable sex2.
recode sex (1=0) (0=1),generate(sex2)
label define sex2 0 “Female” 1 “Male”
label val sex2 sex2
You can also use recode, generate, and label values in a single command
recode sex (1=0 “Female”) (0=1 “Male”), generate(sex2) label (sex2)
To drop a variable e.g., sex2 from the dataset type
- Label all variables in the dataset
- Recode and label values of the variable HEALTH
2=VeryGood or Good
- Recode and label values of thevariable LIBERAL
1=very or somewhat liberal
3=very or somewhat conservative
- Recode and label values of the variable RACE
- Recode and label values of the variable ALCOHOL
1=1 or more times