# Liz's Stata Guide

## Fixed Effects

Suppose we want to study the relationship between household size and satisfaction with schooling*. We can run a simple regression for the model
sat_school = a + b hhsize
(First, we drop observations where sat_school is missing -- this is mostly households that didn't have any children in primary school).
```
. drop if sat_school >= .;
(398 observations deleted)

. reg sat_school hhsize, r;

Regression with robust standard errors                 Number of obs =     692
                                                       F(  1,   690) =    3.92
                                                       Prob > F      =  0.0482
                                                       R-squared     =  0.0081
                                                       Root MSE      =  .76476

------------------------------------------------------------------------------
             |               Robust
  sat_school |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -.0140157    .007082    -1.98   0.048    -.0279205   -.0001109
       _cons |   3.476232   .0635027    54.74   0.000      3.35155    3.600914
------------------------------------------------------------------------------
```
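The condition `sat_school >= .` works because Stata treats missing values as larger than any number, and the extended missing codes `.a` through `.z` are larger still than plain `.`. Here is a minimal sketch on a toy variable (`v` is made up for illustration, and `clear` wipes whatever data are in memory, so try it in a fresh session). It shows that `>= .` and `missing()` pick up the same observations, while `== .` skips the extended codes.

```stata
#delimit ;
/* Toy data, not the survey: v takes the values 1, 2, . and .a */
clear;
set obs 4;
gen v = _n;
replace v = .  in 3;
replace v = .a in 4;

count if v == .;        /* plain missing only */
count if v >= .;        /* plain and extended missing */
count if missing(v);    /* same observations as v >= . */
#delimit cr
```

If your data never use extended missing codes the three conditions agree, but `>= .` or `missing()` is the safer habit.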
We see a significant relationship where larger households are less satisfied with the schooling received by their children. We might be worried that larger families are found in poorer, more rural areas where the overall quality of education is lower. To control for this we can add fixed effects for the census enumeration area or EA (this is the level on which our data is clustered -- we have 5 households in each census enumeration area). This controls for the socio-economic status of the community and (in most cases) the school the children attend. Thus we want the model:
$$\text{sat\_school}_{it} = a + b\,\text{hhsize}_{it} + c_t$$
where t indexes the EA and i indexes the households within it. There are many, many ways to do this in Stata.

### Make and Add Dummy Variables

We can make a dummy variable for each EA and add them to the regression. This creates a lot of output. We see that the coefficient on household size is still negative and significant.
```
. qui tab ea_code, gen(eac_);

. reg sat_school hhsize eac_*, r;

Regression with robust standard errors                 Number of obs =     692
                                                       F(187,   484) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.4850
                                                       Root MSE      =  .65793

------------------------------------------------------------------------------
             |               Robust
  sat_school |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -.0194672   .0066937    -2.91   0.004    -.0326195   -.0063149
       eac_1 |   .0389344   .0212165     1.84   0.067    -.0027535    .0806222
       eac_2 |   .0778688   .3477253     0.22   0.823    -.6053689    .7611064
...
     eac_205 |   1.111936    .051018    21.79   0.000     1.011692    1.212181
     eac_206 |   .7190059   .2901567     2.48   0.014     .1488836    1.289128
     eac_207 |   1.077869   .0267748    40.26   0.000      1.02526    1.130478
       _cons |   3.038934   .0133874   227.00   0.000      3.01263    3.065239
------------------------------------------------------------------------------
```
If we want to test whether the fixed effects are jointly significant, we would use
```
. testparm eac_*;

( 1)  eac_1 = 0
( 2)  eac_2 = 0
...
F(187,   484) =  523.69
Prob > F =    0.0000

```
This method works perfectly fine, but it is unwieldy and involves three separate commands.

### Use the prefix xi

The prefix xi allows you to include terms of the form i.varx in a variable list. Stata will automatically create a dummy variable for each value of varx and include them. This saves us one line of code, but again the output is bulky and we would have to do a separate test to determine the joint significance of the dummy variables.
```
.  xi: reg sat_school hhsize i.ea_code, r;
i.ea_code         _Iea_code_11020308-42080602(naturally coded;
_Iea_code_11020308 omitted)

Regression with robust standard errors                 Number of obs =     692
                                                       F(186,   484) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.4850
                                                       Root MSE      =  .65793

------------------------------------------------------------------------------
             |               Robust
  sat_school |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -.0194672   .0066937    -2.91   0.004    -.0326195   -.0063149
_Ie~11040201 |   .0389344   .3466722     0.11   0.911     -.642234    .7201028
_Ie~11040503 |  -.0389344   .0212165    -1.84   0.067    -.0806222    .0027535
...
_Ie~42080508 |   .6800716   .2864624     2.37   0.018     .1172081    1.242935
_Ie~42080602 |   1.038934   .0212165    48.97   0.000     .9972465    1.080622
       _cons |   3.077869   .0314294    97.93   0.000     3.016114    3.139624
------------------------------------------------------------------------------
```
This is the most efficient method when you have a small number of categories and care about the estimated value of the fixed effect for each category.
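In Stata 11 and later, factor-variable notation lets you put i.ea_code straight into the variable list without the xi prefix, and testparm accepts the same notation. The sketch below runs on simulated stand-in data: the data-generating lines are my own assumption, not the survey, and `clear` wipes the data in memory. Only the last two commands matter for your own data.

```stata
#delimit ;
/* Stand-in data (hypothetical): 40 EAs with 5 households each */
clear;
set seed 2024;
set obs 200;
gen ea_code = ceil(_n/5);
gen hhsize = 1 + floor(10*runiform());
gen sat_school = 3 + 0.1*mod(ea_code, 7) - 0.02*hhsize + rnormal();

/* i.ea_code adds one dummy per EA, omitting a base category */
reg sat_school hhsize i.ea_code, vce(robust);

/* Joint test that all the EA dummies are zero */
testparm i.ea_code;
#delimit cr
```

The output is just as bulky as with xi, so this is a convenience rather than a different estimator.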

### Use areg or xtreg

Stata has two built-in commands to implement fixed effects models: areg and xtreg, fe.
```
. areg sat_school hhsize, a(ea_code) r;

Regression with robust standard errors                 Number of obs =     692
                                                       F(  1,   484) =    8.46
                                                       Prob > F      =  0.0038
                                                       R-squared     =  0.4850
                                                       Root MSE      =  .65793

------------------------------------------------------------------------------
             |               Robust
  sat_school |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -.0194672   .0066937    -2.91   0.004    -.0326195   -.0063149
       _cons |   3.522633   .0626253    56.25   0.000     3.399582    3.645684
-------------+----------------------------------------------------------------
     ea_code |   absorbed                                     (207 categories)

. xtreg sat_school hhsize, fe i(ea_code);

Fixed-effects (within) regression               Number of obs      =       692
Group variable (i): ea_code                     Number of groups   =       207

R-sq:  within  = 0.0188                         Obs per group: min =         1
       between = 0.0003                                        avg =       3.3
       overall = 0.0081                                        max =         5

                                                F(1,484)           =      9.29
corr(u_i, Xb)  = -0.0505                        Prob > F           =    0.0024

------------------------------------------------------------------------------
  sat_school |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -.0194672   .0063855    -3.05   0.002     -.032014   -.0069204
       _cons |   3.522633   .0598293    58.88   0.000     3.405075     3.64019
-------------+----------------------------------------------------------------
     sigma_u |  .56106439
     sigma_e |  .65793019
         rho |  .42103494   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(206, 484) =     2.18            Prob > F = 0.0000
```
Note that xtreg, in the version of Stata used here, does not allow the , r option for robust standard errors; newer releases accept vce(robust) with xtreg, fe. areg is my favorite command for fixed effects regressions, although it doesn't display the joint significance of the fixed effects when you have a large number of categories.
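If you are on a recent Stata, both commands take a vce() option, so you can ask either one for standard errors clustered on the EA. This is a sketch on simulated stand-in data (the data-generating lines are hypothetical, not the survey, and `clear` wipes the data in memory). The scalars b_xt and b_ar are names I made up to compare the two coefficients.

```stata
#delimit ;
/* Stand-in data (hypothetical): 40 EAs with 5 households each */
clear;
set seed 2024;
set obs 200;
gen ea_code = ceil(_n/5);
gen hhsize = 1 + floor(10*runiform());
gen sat_school = 3 + 0.1*mod(ea_code, 7) - 0.02*hhsize + rnormal();

/* Declare the group variable, then run the within estimator */
xtset ea_code;
xtreg sat_school hhsize, fe vce(cluster ea_code);
scalar b_xt = _b[hhsize];

/* Same model through areg */
areg sat_school hhsize, absorb(ea_code) vce(cluster ea_code);
scalar b_ar = _b[hhsize];

display b_xt "  " b_ar;
#delimit cr
```

The two coefficients should agree. The clustered standard errors can differ somewhat between the two commands because they apply different degrees-of-freedom adjustments, so check which one you are reporting.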

### Demeaning

This is a technique to manipulate your data before running a simple regression. Consider our model
$$y_{it} = a + b\,x_{it} + c_t$$
where y = sat_school and x = hhsize. This means that for each EA we have the set of equations
$$
\begin{aligned}
(1)\quad y_{1t} &= a + b\,x_{1t} + c_t \\
(2)\quad y_{2t} &= a + b\,x_{2t} + c_t \\
(3)\quad y_{3t} &= a + b\,x_{3t} + c_t \\
(4)\quad y_{4t} &= a + b\,x_{4t} + c_t \\
(5)\quad y_{5t} &= a + b\,x_{5t} + c_t
\end{aligned}
$$
By making the linear combination (1) - 1/5 [(1) + (2) + (3) + (4) + (5)], both the constant a and the EA effect $c_t$ cancel, and we see that
$$y_{1t} - \bar{y}_t = b\,(x_{1t} - \bar{x}_t)$$
where $\bar{y}_t$ is the average value of y within EA t and similarly for $\bar{x}_t$. We could do the same thing for $y_{2t}$ etc. If we "demean" our variables sat_school and hhsize -- that is, subtract off the average value for the EA -- we can then run a simple regression on the demeaned variables.
```
. bys ea_code: egen h_m = mean(hhsize);

. bys ea_code: egen s_m = mean(sat_school);

. gen h_dm = hhsize - h_m;
(10 missing values generated)

. gen s_dm = sat_school - s_m;

. reg s_dm h_dm, r;

Regression with robust standard errors                 Number of obs =     692
                                                       F(  1,   690) =   11.93
                                                       Prob > F      =  0.0006
                                                       R-squared     =  0.0188
                                                       Root MSE      =  .55111

------------------------------------------------------------------------------
             |               Robust
        s_dm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        h_dm |  -.0194672    .005635    -3.45   0.001     -.030531   -.0084034
       _cons |   -.000289   .0209499    -0.01   0.989    -.0414222    .0408442
------------------------------------------------------------------------------
```
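Note that all these models give exactly the same value for b, the coefficient on hhsize. Demeaning gives different (slightly inaccurate) standard errors, because the regression on the demeaned variables does not account for the EA means having been estimated: it reports 690 residual degrees of freedom where areg and xtreg report 484.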
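For conventional (non-robust) standard errors, the gap comes entirely from the degrees of freedom, so it can be undone by hand. The sketch below uses simulated data, not the survey, and every variable and scalar name in it is made up for illustration. It demeans by hand, runs plain OLS, and rescales the standard error by sqrt((N-2)/(N-G-1)), where G is the number of groups, to compare it with what xtreg, fe reports. I have not checked the analogous adjustment for the robust standard errors shown above.

```stata
#delimit ;
/* Simulated data: 200 groups of 5, group effect c correlated with x */
clear;
set seed 12345;
set obs 1000;
gen ea = ceil(_n/5);
gen c = rnormal();
bysort ea: replace c = c[1];
gen x = c + rnormal();
gen y = 1 - 0.5*x + c + rnormal();

/* Benchmark: the within (fixed effects) estimator */
xtset ea;
xtreg y x, fe;
scalar se_fe = _se[x];
scalar G = e(N_g);

/* Demean by hand and run plain OLS */
bysort ea: egen x_m = mean(x);
bysort ea: egen y_m = mean(y);
gen x_dm = x - x_m;
gen y_dm = y - y_m;
reg y_dm x_dm;

/* OLS divides the residual sum of squares by N-2, the within
   estimator by N-G-1, so rescale the OLS standard error */
scalar se_adj = _se[x_dm] * sqrt(e(df_r) / (e(N) - G - 1));
display se_fe "  " se_adj;
#delimit cr
```

The two displayed numbers should match up to rounding, which is a quick way to convince yourself that areg and xtreg are doing nothing more than demeaning plus a degrees-of-freedom correction.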

Disclaimer: I have no explanation of why this relationship exists, is negative, and is so robust. It is robust to including controls for the education level of the respondent, religion and ethnicity, type of school attended and household income proxies.

contact: djiboliz@gmail.com