Data are the raw material of statistics. Descriptive statistics are used to describe data from a population or a sample. We measure characteristics of study subjects using variables. Age, gender, race, income, systolic blood pressure, serum cholesterol, and blood group are examples of variables.
To demonstrate descriptive statistics, I will use a subset of the data (n=3,000) collected in the Framingham Heart Study. The Framingham Heart Study is a longitudinal study to assess risk factors for cardiovascular disease. Details of the study, its design, data, and other information can be found at www.framingham.com/heart.
The following table (codesheet) shows variable names, as they appear in the Stata dataset, along with brief descriptions and coding details for each variable.
Table 1. Codesheet
|Variable||Description||Values or range|
|ID||Random, unique number for each participant||1-3000|
|AGE||Age at exam, in years||32-70|
|MALE||Male sex||1=male, 0=female|
|TOTAL CHOL||Total Cholesterol, mg/dL||113-696|
|SBP||Systolic blood pressure, mmHg||83.5-295|
|DBP||Diastolic blood pressure, mmHg||48-141|
|BP MEDS||Anti-hypertensive medications||0=no, 1=yes|
|BMI||Body mass index, kg/m²||15.54-51.28|
|CURRENT SMOKER||Currently smoking cigarettes||0=no, 1=yes|
|CIGS PER DAY||Number of cigarettes smoked per day||0-70|
|GLUCOSE||Serum glucose mg/dL||40-394|
|HEART RATE||Heart rate, beats/minute||45-143|
|DEATH||Death from any cause over 24-year follow-up||0=no, 1=yes|
|STROKE||Stroke over 24-year follow-up||0=no, 1=yes|
|CVD||Cardiovascular disease over 24-year follow-up||0=no, 1=yes|
|HYPERTENSION||Hypertension over 24-year follow-up||0=no, 1=yes|
|Blood pressure, 4 categories||1=Pre-hypertension, 2=Stage 1 hypertension, 3=Stage 2 hypertension|
Now open the Framingham dataset. First, start a new Stata session. Next, change the working directory (to link the Stata session to a folder) and create a log file named framingham.log. To open the Framingham dataset (framingham.dta), select File > Open from the menu bar, browse to the data folder, click framingham.dta, and click Open. The dataset is now loaded in Stata. Because you have already linked the Stata session to the data folder, you can also load the dataset by typing
use framingham.dta
in the Command Window and pressing the Enter or Return key to execute (run) the command. If you had not changed the working directory, you would have to include the complete folder path in the command.
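The setup steps above (working directory, log file, loading the data) can also be typed as commands. This is a minimal sketch: the folder path is a hypothetical placeholder, so substitute the folder where you saved framingham.dta.

```stata
* Hypothetical folder path: replace with your own data folder
cd "C:/data/framingham"

* Start a plain-text log; replace overwrites an existing file of the same name
log using framingham.log, replace text

* Load the dataset; clear drops any data already in memory
use framingham.dta, clear
```

Typing the commands rather than clicking leaves a record of each step in the log, which makes the session easier to repeat later.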
Once the dataset is loaded, all variables in the dataset will appear in the Variables Window.
Now, browse the complete dataset by typing browse in the Command Window. Press the Enter or Return key to execute the command.
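Besides browsing the full dataset, a few commands give a quicker overview of what it contains. The sketch below uses only age and current_smoker, the two variable names shown elsewhere in this tutorial; other names in your copy of the dataset may differ from the codesheet labels.

```stata
* Variable names, storage types, and labels for the whole dataset
describe

* Range, number of missing values, and distinct values for selected variables
codebook age current_smoker

* Open only the selected columns in the Data Browser
browse age current_smoker

* Print the first five observations in the Results Window
list age current_smoker in 1/5
```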
Let’s run some descriptive statistics now. There are three Stata commands that are frequently used to describe data: summarize and tabstat to describe quantitative (continuous or discrete) variables and tabulate to describe qualitative/categorical (nominal or ordinal) data.
Look at the codesheet in Table 1 and identify the type of variables: quantitative-continuous, quantitative-discrete, qualitative-nominal, or qualitative-ordinal. Download the answer key.
Suppose we want to know the mean age of study participants. In the Command Window, type
summarize age
and press the Enter or Return key to execute the command. The results will appear in the Results Window.
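Note: Stata is case-sensitive. All Stata commands are lowercase, and variable names must be typed exactly as they appear in the Variables Window.
The mean age of participants is 49.9 years with a standard deviation of 8.6 years. The minimum age is 32 years and the maximum is 70 years.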
What if we want to know the median age of participants? You can get an expanded output by typing:
summarize age, detail
You can also use Stata’s graphical user interface (GUI), also known as point-and-click, to execute commands. All statistics commands can be accessed from the Statistics menu at the top of the screen.
Descriptive statistics for quantitative variables can be computed by clicking Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics. Select age from the drop-down menu. Select Display additional statistics. Click OK.
The output will appear in the Results Window. The median (the 50th percentile, labeled 50% in the output) age is 49 years. The interquartile range (IQR = Q3 - Q1) is 57 - 43 = 14 years. The variance (SD²) is 73.9.
The mean, median, and mode are called measures of central location, whereas the standard deviation and IQR are called measures of variability or dispersion.
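The median, IQR, and variance can also be pulled from the results that summarize stores after it runs, instead of being read off the table by hand. This is a minimal sketch using the r() results that summarize with the detail option leaves behind.

```stata
* Run the detailed summary without printing the table
quietly summarize age, detail

* Show every result stored by the last command
return list

* Median (50th percentile)
display "Median   = " r(p50)

* Interquartile range: 75th percentile minus 25th percentile
display "IQR      = " r(p75) - r(p25)

* Variance, and the squared standard deviation as a check
display "Variance = " r(Var)
display "SD^2     = " r(sd)^2
```

The last two lines should agree, since the variance is the square of the standard deviation.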
tabstat – Compact table of summary statistics
We can also use tabstat to summarize quantitative variables in a single table. This Stata command is especially useful for stratified analysis. Let’s say we want to know the mean, median, standard deviation, and IQR of the age of participants stratified by gender. Using the point-and-click menu:
Statistics > Summaries, tables, and tests > Other tables > Compact table of summary statistics
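The same stratified table can be requested from the Command Window. The sketch below assumes the sex variable is stored under the lowercase name male, which the codesheet suggests but does not confirm; check the Variables Window for the exact name.

```stata
* Mean, median, standard deviation, and IQR of age, one row per sex
* (male is an assumed variable name: 1=male, 0=female)
tabstat age, statistics(mean median sd iqr) by(male)
```

The by() option produces one row for each value of the stratifying variable, plus a Total row for all participants.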
Qualitative variables can be described using Stata’s tabulate command. In the Framingham dataset there are eight nominal (0/1, binary) variables and one ordinal variable. To compute the frequency distribution of CURRENT SMOKER, click Statistics > Summaries, tables, and tests > Frequency tables > One-way table. Alternatively, type
tab current_smoker
in the Command Window. Of the participants, 49% are current smokers. To see the frequency and percentage of missing values, use the missing option (abbreviated miss), e.g.,
tab current_smoker, miss
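For comparison, here is a short sketch that runs the one-way table with and without the missing option, followed by a direct count of missing values. Only commands and the variable name already used in this tutorial appear, apart from count, which is a standard Stata command.

```stata
* Frequency table: missing values are left out of the percentages
tabulate current_smoker

* Same table with missing values shown as their own category
tabulate current_smoker, missing

* Number of observations with a missing value for current_smoker
count if missing(current_smoker)
```

If the two tables look identical, the variable has no missing values and the count will be zero.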
Lastly, calculate the frequency and percentage of CURRENT SMOKER by sex: follow the steps outlined above, then click the by/if/in tab and repeat the command by the variable MALE. Among males, 60% are current smokers; among females, 41% are.
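The by/if/in steps have a command-line counterpart. The sketch below shows two ways to get smoking status by sex; male is an assumed variable name taken from the codesheet and may be spelled differently in your dataset.

```stata
* One one-way table per sex, matching the by/if/in dialog
* (bysort sorts the data by male before repeating the command)
bysort male: tabulate current_smoker

* Single two-way table; row percentages give the share of
* current smokers within each sex
tabulate male current_smoker, row
```

In the two-way table, the choice between the row and column options determines the denominator, so check which variable defines the rows before interpreting a percentage.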
- Descriptive statistics are used to describe data
- To summarize quantitative (continuous or discrete) variables, use Stata’s summarize or tabstat commands
- To describe qualitative/categorical (nominal or ordinal) variables, use Stata’s tabulate command