Descriptive Statistics

 

Data are the raw material of statistics. Descriptive statistics are used to describe data from a population or a sample. We measure characteristics of study subjects using variables. Age, gender, race, income, systolic blood pressure, serum cholesterol, and blood group are examples of variables.

To demonstrate descriptive statistics, I will use a subset of the data (n=3,000) collected in the Framingham Heart Study.  The Framingham Heart Study is a longitudinal study to assess risk factors for cardiovascular disease.  Details of the study, its design, data, and other information can be found at www.framingham.com/heart.

The Stata dataset is named framingham.dta. All Stata datasets have the .dta extension. Download the dataset and save it to the data folder.

The following table (codesheet) shows variable names, as they appear in the Stata dataset, along with brief descriptions and coding details for each variable.

Table 1. Codesheet

Variable Name    Description                                    Coding
ID               Random, unique number for each participant     1-3000
AGE              Age at exam, in years                          32-70
MALE             Male sex                                       1=male, 0=female
TOTAL CHOL       Total cholesterol, mg/dL                       113-696
SBP              Systolic blood pressure, mmHg                  83.5-295
DBP              Diastolic blood pressure, mmHg                 48-141
BP MEDS          Anti-hypertensive medications                  0=no, 1=yes
BMI              Body mass index, kg/m²                         15.54-51.28
CURRENT SMOKER   Currently smoking cigarettes                   0=no, 1=yes
CIGS PER DAY     Number of cigarettes smoked per day            0-70
GLUCOSE          Serum glucose, mg/dL                           40-394
DIABETES         Diabetic                                       0=no, 1=yes
HEART RATE       Heart rate, beats/minute                       45-143
DEATH            Death from any cause over 24-year follow-up    0=no, 1=yes
STROKE           Stroke over 24-year follow-up                  0=no, 1=yes
CVD              Cardiovascular disease over 24-year follow-up  0=no, 1=yes
HYPERTENSION     Hypertension over 24-year follow-up            0=no, 1=yes
BP4              Blood pressure, 4 categories                   0=Normal, 1=Pre-hypertension, 2=Stage 1 hypertension, 3=Stage 2 hypertension

 

Now you are going to open the Framingham dataset. First, start a new Stata session. Next, change the working directory (to link the Stata session to a folder) and create a log file. Name the log file framingham.log. To open the Framingham dataset (framingham.dta), select File > Open from the menu bar. Browse to the data folder, click on framingham.dta, and click Open. The dataset is now loaded in Stata. Since you have already linked the Stata session to the data folder, you can also load the dataset by typing

use framingham.dta

in the Command Window and pressing the Enter or Return key to execute (run) the command. If you had not changed the working directory, you would have to type the complete path to the file as part of the command.
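The setup steps above can also be typed as commands. This is a minimal sketch, not the only way to do it: the folder path is a hypothetical placeholder (replace it with the location of your own data folder), and the text and replace options are assumptions about how you want the log handled.

```stata
* Change the working directory (hypothetical path; use your own data folder)
cd "C:/mycourse/data"

* Start a plain-text log named framingham.log
* (replace overwrites an existing log of the same name)
log using framingham.log, text replace

* Load the dataset from the working directory
* (clear drops any data currently in memory)
use framingham.dta, clear
```

When you finish the session, typing log close stops the logging and closes the log file.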

Once the dataset is loaded, all variables in the dataset will appear in the Variables Window.

Now, browse the complete dataset by typing browse in the Command Window.  Press the Enter or Return key to execute the command.

Let’s run some descriptive statistics now. There are three Stata commands that are frequently used to describe data: summarize and tabstat to describe quantitative (continuous or discrete) variables and tabulate to describe qualitative/categorical (nominal or ordinal) data.

Look at the codesheet in Table 1 and identify the type of each variable: quantitative-continuous, quantitative-discrete, qualitative-nominal, or qualitative-ordinal. Download the answer key.

Suppose we want to know the mean age of study participants. In the Command Window type

summarize age

and press the Enter or Return key to execute the command. The results of executing the command will appear in the Results Window.

Note: Stata is case-sensitive, and all Stata commands are lowercase.

The mean age of participants is 49.9 years, with a standard deviation of 8.6 years. The minimum age is 32 years and the maximum is 70 years.

What if we want to know the median age of participants? You can get expanded output, including percentiles, by typing:

summarize age, detail

You can also use Stata’s graphical user interface (GUI), a.k.a. point-and-click, to execute commands. All statistics commands can be accessed from the Statistics menu at the top of the screen.

Descriptive statistics for quantitative variables can be computed by clicking Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics. Select age from the drop-down menu. Select Display additional statistics. Click OK.

 

The output will appear in the Results Window. The median (50th percentile) age is 49 years. The interquartile range (IQR = Q3 - Q1) is 57 - 43 = 14 years. The variance (SD²) is 73.9.
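Rather than reading these values off the output and doing the arithmetic by hand, you can ask Stata to display them. A minimal sketch, using the results that summarize with the detail option stores in r():

```stata
* Run the detailed summary; Stata keeps the results in r()
summarize age, detail

* Median (50th percentile)
display "Median   = " r(p50)

* Interquartile range: Q3 (75th percentile) minus Q1 (25th percentile)
display "IQR      = " (r(p75) - r(p25))

* Variance, which equals the standard deviation squared
display "Variance = " r(Var)
display "SD^2     = " (r(sd)^2)
```

Typing return list after summarize shows every stored result that is available.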

The mean, median, and mode are called measures of central location, whereas the standard deviation and IQR are called measures of variability or measures of dispersion.


tabstat – Compact table of summary statistics

We can also use tabstat to summarize quantitative variables in a single table. This Stata command is especially useful for stratified analysis. Let’s say we want to know the mean, median, standard deviation, and IQR of the age of participants stratified by gender. Using the point-and-click menu:

Statistics > Summaries, tables, and tests > Other tables > Compact table of summary statistics
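The same stratified table can be requested from the Command Window. This is a sketch that assumes the sex variable is stored in lowercase as male (coded 1=male, 0=female in the codesheet), in the same way that age is typed in lowercase above:

```stata
* Mean, median, standard deviation, and IQR of age,
* reported separately for each value of male (0=female, 1=male)
tabstat age, statistics(mean median sd iqr) by(male)
```

By default, tabstat also adds a Total row for all participants combined.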

 


 

Qualitative variables can be described using Stata’s tabulate command. In the Framingham dataset there are eight nominal (0/1, binary) variables and one ordinal variable. To compute the frequency distribution of CURRENT SMOKER, click Statistics > Summaries, tables, and tests > Frequency tables > One-way table. Alternatively, type

tabulate current_smoker

or

tab current_smoker

in the Command Window. In this dataset, 49% of participants are current smokers. To see the frequency and percentage of missing values, use the missing option (abbreviated miss here), e.g.,

tab current_smoker, miss

 

Lastly, calculate the frequency and percent of CURRENT SMOKER by gender: follow the steps outlined above, then click by/if/in and repeat the command by groups, selecting the variable MALE. 60% of males and 41% of females are current smokers.
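The by/if/in steps correspond to prefixing the command with by. A minimal sketch, again assuming the sex variable is stored as male:

```stata
* Repeat the one-way table separately for females (male=0) and males (male=1)
bysort male: tabulate current_smoker

* Alternative: one two-way table with row percentages,
* so each row (sex) shows its own percent of current smokers
tabulate male current_smoker, row
```

The two-way table is often easier to read because both groups appear side by side in a single table.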

 

Now, download and complete Mock Table 2 and Mock Table 3.

 

SUMMARY

  1. Descriptive statistics are used to describe data
  2. To summarize quantitative (continuous or discrete) variables, use Stata’s summarize or tabstat commands
  3. To describe qualitative/categorical (nominal or ordinal) variables, use Stata’s tabulate command

 
