Liz's Stata Guide

First Ten Stata Commands

(Note: this page assumes that you know a little basic statistics.)

  1. Run Stata. The first step is to open a dataset to work with. If you are using Stata for Windows or Mac OS, the easiest way is to use File -> Open. If you are using Stata on Unix, though, or want to write a .do file for your analysis, use the command use, followed by the location of the dataset you want to open.
    use "C:\Documents and Settings\EFoster\My Documents\stata guide\nps_example.dta"
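    In a .do file it is common to add the clear option, which tells use to discard any dataset already in memory; without it, Stata refuses to load a new dataset over unsaved changes. A minimal sketch with the same path (yours will differ):

    * load the example data, replacing whatever is currently in memory
    use "C:\Documents and Settings\EFoster\My Documents\stata guide\nps_example.dta", clear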
  2. Next, use the command describe to see what kind of data you have.
    . describe
    
    Contains data from C:\Documents and Settings\EFoster\nps_example.dta
      obs:         1,095                          Sierra Leone 2005 National
                                                    Public Services Survey
     vars:            30                          23 Nov 2007 14:43
     size:        59,130 (99.9% of memory free)   (_dta has notes)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    province        byte   %9.0g       provinces
                                                  province
    district        byte   %18.0g      districts
                                                  district
    localcouncil    int    %21.0g      localcouncils
                                                  Local Council Area
    ea_code         long   %12.0f                 enumeration area code
    hh_no           byte   %9.0g                  household number within EA
    stratum         byte   %8.0g       rural_urban
                                                  urban or rural
    ...
    srno            float  %9.0g                  
    -------------------------------------------------------------------------------
    Sorted by:  ea_code  hh_no
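
    describe also accepts a list of variables if you only want part of the table, and since the header reports that the dataset has notes, the notes command will list them. For a closer look at one variable's values and labels, codebook is a useful companion. A short sketch (output not shown):

    . * describe only a few of the 30 variables
    . describe province district stratum
    . * list the notes attached to the dataset
    . notes
    . * show the values, labels and missing count for one variable
    . codebook stratum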
    
  3. To look at a categorical variable, use the command tab (short for tabulate), which gives you a breakdown of the different values the variable takes on, with their absolute and relative frequencies.
    . tab religion
    
       religion |      Freq.     Percent        Cum.
    ------------+-----------------------------------
      Christian |        252       23.01       23.01
         Muslim |        837       76.44       99.45
          Other |          6        0.55      100.00
    ------------+-----------------------------------
          Total |      1,095      100.00
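
    By default tab leaves out observations where the variable is missing; the missing option counts them as a separate row. Giving tab two variables produces a cross-tabulation, and the row option adds row percentages. A sketch using gender, a variable that appears later in this guide (output not shown):

    . * include missing values as their own category
    . tab religion, missing
    . * cross-tabulate religion by gender, with row percentages
    . tab religion gender, row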
    
  4. For a continuous variable, sum (short for summarize) shows you various useful facts like the minimum, the maximum, the mean and the standard deviation.
    . sum age
    
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |      1088     41.3557     15.4491         18         90
    
    Note that age has only 1,088 observations out of 1,095, so it is missing for 7 respondents. The age of respondents in our dataset ranges from 18 to 90, with a mean of 41.4.
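    If you also want the variance and a set of percentiles, add the detail option to sum. To count the missing values directly, use count with the missing() function; given the totals above it should report 7. A short sketch (output not shown):

    . * detailed summary: adds variance, skewness, kurtosis and percentiles
    . sum age, detail
    . * number of observations where age is missing
    . count if missing(age)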
  5. These two commands can be combined (strictly speaking, we use tab with the sum option) to look at average age by religion.
    . tab religion, sum(age)
    
                |           Summary of age
       religion |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
      Christian |   39.458167   15.367018         251
         Muslim |          42   15.440129         831
          Other |        31.5    11.84483           6
    ------------+------------------------------------
          Total |   41.355699   15.449099        1088
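
    If you want other statistics by group, tabstat is one alternative: it takes the grouping variable in by() and lets you choose the statistics to report. A sketch (the particular list of statistics is just an illustration; output not shown):

    . * mean, standard deviation, range and count of age for each religion
    . tabstat age, by(religion) statistics(mean sd min max n)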
    
  6. Now that we've had a look at our data, let's do some basic statistics. Suppose we want to do a t test of the hypothesis that the average age of male and female respondents is the same. We'll use the command ttest.
    . ttest age, by(gender)
    
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        male |     559    43.65653     .663978    15.69855    42.35233    44.96073
      female |     529    38.92439    .6439895    14.81176    37.65929    40.18948
    ---------+--------------------------------------------------------------------
    combined |    1088     41.3557    .4683696     15.4491    40.43669    42.27471
    ---------+--------------------------------------------------------------------
        diff |            4.732144    .9264646                2.914281    6.550007
    ------------------------------------------------------------------------------
        diff = mean(male) - mean(female)                              t =   5.1077
    Ho: diff = 0                                     degrees of freedom =     1086
    
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
    
    This command produces a lot of output; the most important parts are the group means and the p-value. The average age of men is 43.7 and of women is 38.9. The two-sided p-value for our hypothesis, Pr(|T| > |t|), is essentially 0, so we reject the hypothesis that male and female respondents have the same average age.
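    The t statistic is the difference in means divided by its standard error, and the degrees of freedom are the two group sizes added together minus 2. Both can be checked by hand with display, using the numbers from the table above:

    . * t = diff / Std. Err. of diff; should agree with t = 5.1077 up to rounding
    . display 4.732144/.9264646
    . * degrees of freedom = 559 + 529 - 2, which is the 1086 reported above
    . display 559 + 529 - 2

    The test above assumes the two groups have equal variances. If you would rather not assume that, ttest also accepts the unequal option (output not shown):

    . ttest age, by(gender) unequal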
  7. Next, let's run a regression. Since most of our respondents are the heads of their households (or their spouses) we would expect that older respondents have more children and therefore bigger households. Let's regress household size on the age of the respondent to see if this is true, using the command regress.
    . reg hhsize age, r
    
    Linear regression                                      Number of obs =    1076
                                                           F(  1,  1074) =   10.31
                                                           Prob > F      =  0.0014
                                                           R-squared     =  0.0140
                                                           Root MSE      =  4.6486
    
    ------------------------------------------------------------------------------
                 |               Robust
          hhsize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0358748    .011174     3.21   0.001     .0139494    .0578001
           _cons |   6.013781   .4596593    13.08   0.000     5.111849    6.915714
    ------------------------------------------------------------------------------
    
    (The option , r specifies that we want robust standard errors.) This command estimates that hhsize = 6.01 + 0.036 * age. The coefficient on age is positive (older respondents have bigger households on average) as we expected and statistically significant (p-value of 0.001).
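    Right after a regression, Stata keeps the estimated coefficients in _b[], so the fitted equation can be evaluated directly. As a sketch, the predicted household size for a 40-year-old respondent (the age 40 is just an illustration) is 6.01 + 0.036 * 40, or about 7.4. The predict command stores the fitted value for every observation in a new variable; the name hhsize_hat is a hypothetical choice.

    . * fitted household size at age 40, using the stored coefficients
    . display _b[_cons] + _b[age]*40
    . * save the fitted values for all observations in a new variable
    . predict hhsize_hat, xb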
  8. To explore the relationship between age and household size, we might want to fit a quadratic model -- that is, estimate an equation of the form hhsize = a + b1 × age + b2 × age^2. To do this, we need to create a variable equal to age squared and add it to the regression. To create a new variable, we use the command generate. (To create more complicated new variables you'll also need replace.)
    . gen age2 = age*age
    (7 missing values generated)
    
    . reg hhsize age age2, r
    
    Linear regression                                      Number of obs =    1076
                                                           F(  2,  1073) =    5.37
                                                           Prob > F      =  0.0048
                                                           R-squared     =  0.0143
                                                           Root MSE      =  4.6501
    
    ------------------------------------------------------------------------------
                 |               Robust
          hhsize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0111948   .0614485     0.18   0.855    -.1093781    .1317677
            age2 |   .0002617   .0006779     0.39   0.700    -.0010685    .0015919
           _cons |   6.524329    1.28781     5.07   0.000     3.997417    9.051241
    ------------------------------------------------------------------------------
    
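    (Note that the coefficient on age squared is not significant, with a p-value of 0.700, and R-squared barely changes, from 0.0140 to 0.0143, so the quadratic model does not fit the data noticeably better. The two age terms are still jointly significant, with Prob > F = 0.0048, but neither is individually significant once both are included.)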
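    As a sketch of how generate and replace work together, the following builds a three-category age group variable. The name agegroup and the cutoffs 30 and 60 are illustrative choices, not part of the dataset. Stata treats a missing value as larger than any number, so the last replace needs an explicit check to keep the 7 respondents with missing age out of the oldest group.

    . * start with the youngest group; everyone else is missing for now
    . gen agegroup = 1 if age < 30
    . * fill in the middle group
    . replace agegroup = 2 if age >= 30 & age < 60
    . * oldest group; exclude missing ages, which would otherwise satisfy age >= 60
    . replace agegroup = 3 if age >= 60 & !missing(age)
    . * check the result, showing missing values as their own row
    . tab agegroup, missing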
  9. Let's explore our data graphically now and create a histogram of household sizes.
    . histogram hhsize
    (bin=30, start=0, width=1.1666667)
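
    Because household size is a whole number, the default bins (width 1.17 here) do not line up with the values. One possible alternative, offered as a sketch: the discrete option gives each distinct value its own bar, and frequency labels the vertical axis with counts instead of densities.

    . * one bar per household size, with counts on the vertical axis
    . histogram hhsize, discrete frequency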
    
  10. Now let's look at a scatter plot of household size versus the respondent's age.
    . scatter hhsize age
    

    (This is not a very useful graphic, and the relationship between age and household size that we saw in our regression does not show up plainly here.)
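
    One way to make the pattern easier to see is to overlay the fitted line from the simple regression using lfit. Since both variables are whole numbers, many points sit on top of each other, and the jitter() option spreads them out slightly. A sketch (the jitter amount 2 is an arbitrary choice):

    . * scatter plot with slightly jittered points, plus the linear fit of hhsize on age
    . twoway (scatter hhsize age, jitter(2)) (lfit hhsize age)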

contact: djiboliz@gmail.com
last modified: 17 Sept 2008