CHAPTER 8 - REGRESSION ANALYSIS
Example of using regression analysis: Mpls Star Tribune article on test scores in 11 school districts as a function of poverty (operationalized as % of students receiving free/reduced lunches).

Regression analysis is the preferred technique for analyzing variables that can be precisely measured quantitatively in dollars, years, percent, etc.  Regression analysis conforms most closely to the scientific method, and allows us to draw conclusions such as: After controlling for education, years of continuous experience, marital status, racial status, number of children, etc., we find that women make X% less than men, and our results are significant at the 1% level.  Regression analysis is the closest we can get to doing a controlled, laboratory experiment in social sciences, policy research, etc.

Example:  We want to examine the relationship between: a) the number of friends who use drugs (X), and b) the number of opportunities to use drugs (Y), over the past six months for a group of people who have completed a drug rehab program.  Our hypothesis is the number of opportunities to use drugs (dependent variable) is a function of the number of friends who use drugs (independent variable).

Assume we look at a sample of 5 people and observe the following (unit of analysis = ________________ ):

            Drug-using friends (X)    Opportunities to use drugs (Y)
Person 1              2                            12
Person 2              4                            24
Person 3              6                            36
Person 4              8                            48
Person 5             10                            60

Scatter diagram of the relationship is shown on p. 213.

EVIEWS command: scat y x

Default: simple scatter diagram
Graph options Menu: "Scatter Diagram" in lower right corner, two options:
a. Connect points
b. Regression line

Or: scat(r) y x will automatically put a regression line in the diagram.

Or: Quick/Graph, then list the series, Select "Scatter Diagram" under Graph Type, Click "Show Options" and click "Regression line" to add line.

Type: "show X Y" then select "View/Graph/Scatter/Scatter with Regression/OK."

See page 213.  We have a deterministic, linear relationship between Y and X. All data points fit exactly on the line.

Y = 6X describes the line, and 6 is the slope of the regression line.

Says that drug use opportunities (Y) increase by 6 for every drug-using friend.

6 is the coefficient for the variable X.  Interpretation: for every one-unit change in X, Y changes by 6 units. In this case, for every additional friend (one unit of X), drug opportunities increase by 6 (6 units of Y).  Knowing the number of drug-using friends allows us to precisely predict the number of drug opportunities a person will have.  Quantifying the relationship allows us to predict the number of drug opportunities, knowing just the number of drug-using friends, even for values outside of the range studied.  For example, we could predict the number of drug opportunities for someone with 1, 12 or 5 friends, even though those values are either outside the range studied (1 and 12) or not one of the original observations (5).
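The deterministic line above can be sketched in a few lines of Python (illustrative only; the course software is EViews). The data are the five observations from the table:

```python
# Deterministic example: opportunities (Y) = 6 * drug-using friends (X).
friends = [2, 4, 6, 8, 10]            # X: drug-using friends
opportunities = [12, 24, 36, 48, 60]  # Y: opportunities to use drugs

def predict(x, slope=6):
    """Predict drug-use opportunities from the number of drug-using friends."""
    return slope * x

# Every observation falls exactly on the line -- zero error.
assert all(predict(x) == y for x, y in zip(friends, opportunities))

# Predictions outside the observed range (1, 12) or between observations (5):
print(predict(1), predict(5), predict(12))  # 6 30 72
```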

Regression analysis, OLS, assumes linear relationships, also called linear regression.  Used most often by social science, policy research, econ., etc.

In this case, the value of regression is that we can derive a reasonably precise estimate of the relationship between two variables.

LOGIC of REGRESSION

We almost never have "perfect"/deterministic relationships between Y and X like the example above.  Data almost never fall exactly on the line, meaning there is always some error when estimating from a sample.  It is impossible to predict Y using X without some error, but we want to achieve the closest fit possible between Y and X.  Explanatory power increases the closer we are to a deterministic relationship.  OLS finds the best-fitting line (solving for the intercept and the slope), and that gives us an idea of how closely Y and X are related.   The closer the observations are to the fitted line, the closer the relationship, and the greater the explanatory power.

Example of regression: Study the impact of juvenile unemployment (operationalized as the local teen unemployment rate) on juvenile delinquency in 100 cities (operationalized as the number of teen drug arrests over the past year per 1000 teen population).  Unit of analysis: city, and N=100 cities.  For every city there is a combination of data points - Y (del. rate) and X (un. rate).

Some of the data is plotted out on page 215. Scatter diagram.  For every un. rate (whole number), there are several different rates of juv. del (drug arrests), indicating that we cannot predict drug arrests from un rate without error.  The data points represent different cities, and show the city's un rate and # arrests.  Y-bar represents the average juv. del. arrest rate for all cities.  The connecting line is upward sloping, indicating a pos. relationship between un. rate and del. arrest rate.

If we were asked to predict the number of arrests for an individual city, and had no other information except the average for all cities, we would use the overall mean as our best guess for any individual city.  That would be low half the time, and high the other half.  Obviously, using the un rate improves our ability to predict arrest rates, over and above just using the mean value of arrests, but we can't predict perfectly; there will be some amount of error, some amount of variation, for each un rate.

Imagine if the un rate (X) had NO predictive power to explain arrests in a city, if there was no relationship between Y and X, we would have fig 8-3 on p. 216, a pattern of NO relationship. Un rate has no influence on del. rate.  Variation in the number of arrests is NOT explained by the variation in the teenage un rate.  The number of arrests is INDEPENDENT of the un rate, un rate has no explanatory power.

There will always be some error, except in hypothetical, deterministic examples.  The way to show error in our estimations, mathematically:

Yi = a + b Xi + ei (i = observations, in this case cities, i represents the unit of analysis; for time series data t is the subscript).

a = constant or the intercept, EVIEWS: (alpha) c, always added by default
b = slope of the line
e = errors, or residuals, EVIEWS: resid always added by default as a variable

The actual observations will fall above or below the estimated regression line, so for an individual observation Yi, there are two components.

Total Variation in Y = Explained/Predicted Variation accounted for by variation in X + Unexplained Variation (error).

The explained variation is represented by the regression line.  Or the distance from the horizontal line (which is Y-Bar, the mean of Y) to the regression line. The distance between the regression line and the actual observation is the error, or residual. If the observation is above the line the error is positive, below the line it is a negative error.  We want to maximize the explained variance and minimize the unexplained variance.  OLS does that automatically.

In this case, variation in the un rate among cities explains much of the variation in teenage drug arrests (explained variation), but not all of the variation (unexplained, or explained by factors other than teenage un rate).  The explained variation is represented by the estimated regression line on page 217, and the unexplained variation is represented by the distance away from the line.

Warning: we can find a stat. relationship, but we can't prove "causality."  There could be "spurious" relationship - co-movement between two variables without real causality, e.g. Type I Error.  OLS allows us to precisely measure the probability of Type I Error.

LINEAR REGRESSION: Y = a + b X + e

Based on linear relationship between the independent variable X, and the dependent variable Y.  b = regression coefficient or the slope coefficient.  Remember from algebra that the slope of a line is the: 1) rise/run or 2) the change in Y / change in X, or 3) dY/dX, where d = change.  See p. 218.

b = slope = (Y4 - Y3) / (X4 - X3) = dY / dX = rise / run

Interpretation of beta: For a one unit change in X, how much does Y change?  Slope tells us the answer.  In this case, a one unit change in X is a one unit (%) change in the un. rate.  For every one unit change in the un. rate, how much does the drug arrest rate change?  For example, suppose we estimate b and find that:

b = 15, that would mean that for a one percent increase in X (teen un rate), the number of arrests/1000 goes up by 15.
b = 5, that would mean that for a one percent decrease in X (teen un rate), the number of arrests/1000 goes down by 5, etc.

Interpretation of b: The predicted amount that Y changes (in its units) with a one-unit change in X (measured in X's units).

The greater the slope (steeper the line), the greater the influence X has on Y.  Regression coefficient (b) is always more important than the intercept (a).  There is usually no economic or scientific interpretation of the constant (a).  It is just a scaling variable - up or down the Y axis to get the best fit.

The error terms e (also called the residuals or disturbances) reflect the unexplained variation in Y, the variation NOT explained by X.  We want to minimize the errors, and OLS does that.  OLS, or linear regression, will fit a line through the data that will minimize the sum of the squared errors.  OLS selects a and b (intercept and slope of line) to get the best possible fit, which is to minimize the sum of the squared errors.  Squared errors are used, to a) give equal weight to pos and neg errors and b) penalize large errors.
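A minimal sketch of that minimization property, using a small hypothetical data set (not from the book): the slope computed as covariance over variance, with the intercept forcing the line through the means, yields a smaller sum of squared errors than any nearby line.

```python
# Toy data (hypothetical): X and Y with an imperfect, roughly linear fit.
X = [1, 2, 3, 4, 5]
Y = [2, 5, 4, 8, 9]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

# OLS slope b = COVxy / VARx; intercept a puts the line through (X-bar, Y-bar).
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar

def sse(a_, b_):
    """Sum of squared errors for the candidate line Y = a_ + b_ * X."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(X, Y))

# Perturbing the slope or intercept in either direction raises the SSE,
# so the OLS line is the best-fitting line in the squared-error sense.
assert sse(a, b) < sse(a, b + 0.3)
assert sse(a, b) < sse(a, b - 0.3)
assert sse(a, b) < sse(a + 0.5, b)
```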

DERIVING VALUES FOR a AND b IN LINEAR REGRESSION

Linear regression (OLS) assumes that we can estimate a linear, straight line relationship between Y and X, and we calculate coefficients (or parameters) a and b to find the best fitting line through the data points.  To calculate slope (b): see book page 219.  Line always goes through X-Bar and Y-Bar, so once we know b, we can calculate a (a = Y-Bar - (b * X-Bar)).
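The through-the-means property can be checked directly in Python (toy data, hypothetical): once b is computed, setting a = Y-bar - b * X-bar makes the fitted line pass through (X-bar, Y-bar), and as a consequence the residuals sum to zero.

```python
# Hypothetical data, for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 5, 4, 8, 9]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar  # a = Y-bar - (b * X-bar)

# The fitted line evaluated at X-bar returns exactly Y-bar...
assert abs((a + b * xbar) - ybar) < 1e-9
# ...and the positive and negative errors cancel out.
residual_sum = sum(y - (a + b * x) for x, y in zip(X, Y))
assert abs(residual_sum) < 1e-9
```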

Numerator of b is based on the covariation, or joint variation, between Y and X.  When X is above its mean, is Y above or below its mean?  If X is above its mean when Y is above (below) its mean, on average, the relationship will be pos. (negative).  How do variations in X compare/coincide with variations in Y?

Denominator of b is the variance of X.  b = COVxy/VARx.

Mathematically answers the question: how does the covariation between X and Y compare to the variation of X by itself, or the total variation of X.  Ratio of: COVARIANCE to VARIANCE.  Determines the slope, which expresses the mathematical relationship between X and Y.

See page 220 for sample calculations of COV, large pos COV, large neg COV and small neg COV.  In all cases X-Bar and Y-Bar = 3, what varies is the degree to which X and Y co-vary (co-varation = co-movement).   Also see graphs on page 221 of the data sets in Table 8-1.
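A quick sketch of how the covariance captures co-movement (the data sets below are hypothetical, not the book's Table 8-1 values, though X-bar = 3 in each, echoing the text):

```python
def cov(X, Y):
    """Population covariance: average joint deviation from the means."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / n

X = [1, 2, 3, 4, 5]
pos = cov(X, [1, 2, 3, 4, 5])    # Y rises with X     -> positive COV
neg = cov(X, [5, 4, 3, 2, 1])    # Y falls as X rises -> negative COV
none = cov(X, [3, 3, 3, 3, 3])   # Y doesn't co-move  -> COV = 0

print(pos, neg, none)  # 2.0 -2.0 0.0
```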

LINEAR REGRESSION: AN EXAMPLE

Table 8-2 (p. 222) shows calculations involved in linear regression, estimating a and b for 10 cities, given their teenage un rate and teenage arrest rate.  First step, calculate the mean of X (15%) and Y (470 arrests), then calculate the COVxy and the VARx.

b = COVxy / VARx = 6900 / 258 = 26.7

a = Y-BAR - b * (X-BAR) =  470 - (26.7 * 15) = 69.5.

Regression line: YP = 69.5 + 26.7 * X

Interpretation of b (26.7) is: For a one unit change (1%) in teenage un rate, the number of teen drug arrests changes by 26.7 arrests.  If the un rate goes up (down) by 1%, the number of arrests will go up (down) by 26.7 arrests.

Applications of the estimated regression equation:
a) For a city with teen un rate of 20%, what is the predicted number of drug arrests?  YP = 69.5 + (26.7 * 20) = 603.50 teen arrests.
b) If the teenage un rate goes up by 3% in a city what will happen to the number of arrests?  3% * 26.7 = 80.10 increase in arrests.
c) If we can reduce teenage un rate by 5%, what will happen to teen drug arrests?  -5% * 26.7 = 133.50 fewer arrests.
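The three applications above reduce to plugging values into the estimated line; a short Python check of the arithmetic:

```python
# Estimated regression line from the 10-city example: YP = 69.5 + 26.7 * X.
a, b = 69.5, 26.7

def predicted_arrests(un_rate):
    """Predicted teen drug arrests per 1000 for a given teen un. rate (%)."""
    return a + b * un_rate

# a) city with a 20% teen unemployment rate:
print(round(predicted_arrests(20), 1))  # 603.5 arrests
# b) effect of a 3-point rise in the un. rate:
print(round(3 * b, 1))                  # 80.1 more arrests
# c) effect of a 5-point cut in the un. rate:
print(round(5 * b, 1))                  # 133.5 fewer arrests
```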

GOODNESS OF FIT

After estimating a regression line, we would like to evaluate and measure the "goodness of fit," i.e. how closely the line fits the data.  Review the regression lines on p. 221, A and B have a perfect fit, and C has a very poor fit.   EVIEWS (and other software) typically reports several measures of "goodness of fit" as part of the standard regression output.  Goodness of fit measures:

1. Correlation coefficient (for two variables only):   -1  ≤  rxy  ≤ +1, measures the association between two variables X and Y.  rxy = +1 is perfect positive association between X and Y (see page 227, graph A),  rxy = -1 is perfect negative association between X and Y (panel B), and rxy = 0 means that there is no association between X and Y (X and Y are not related), see page 216.

rxy = COVxy / (Std Dev X * Std Dev Y)

rxy  =   Covariance between X and Y divided by the product of the standard deviations of X and Y.  As the covariance between X and Y gets closer to 0, the correlation coefficient approaches 0.  Dividing the COV by the product of the std deviations forces the correlation coefficient to fall between -1 and +1.

Correlation coefficient is not part of the standard regression output, because it ONLY applies to two variables, most regressions have more than one X variable.  EVIEWS command for correlation coefficient: cor X Y.
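The formula can be sketched in Python on the deterministic friends/opportunities data from earlier in the chapter, where the association is perfectly positive:

```python
from math import sqrt

# rxy = COVxy / (sd(X) * sd(Y)), computed on the Y = 6X example.
X = [2, 4, 6, 8, 10]
Y = [12, 24, 36, 48, 60]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / n
sx = sqrt(sum((x - xbar) ** 2 for x in X) / n)
sy = sqrt(sum((y - ybar) ** 2 for y in Y) / n)

r = cov / (sx * sy)
print(round(r, 6))  # 1.0 -- perfect positive association
```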

2. Coefficient of Determination (R2) ranges from 0 to 1, IS always part of the standard regression output; the most reported, most important measure of goodness of fit.

R2 = correlation coefficient (r) squared, since the range of r is from -1 to +1, squaring r forces R2 to fall between 0 and 1.

Intuition for R2: Remember that the Total Variation in Y (dependent variable) = Explained Variation by X (independent variable) + Unexplained Variation.

R2 equals the ratio of Explained Variation in Y / Total Variation in Y.

R2 = Percentage (%) of the total variation in Y that is explained by the regression equation, or explained by X.

See page 228.  For a given observation Yi (when X = 2):
The total variation in Yi from its mean = (Yi - YBar)
The explained variation of Y from its mean (the portion the regression explains) = (Yp - YBar)
The unexplained portion of the variation in Y from its mean (error, or residual) = (Yi - Yp)

Therefore: (Yi - YBar)   =   (Yp - YBar)   +   (Yi - Yp)
Total Variation          =   Explained     +   Unexplained
100%                     =   % Explained   +   % Unexplained

To measure the percentages, the deviations are squared and summed over all observations:

R2 = Explained Variation / Total Variation = SUM (Yp - YBar)^2 / SUM (Yi - YBar)^2

In the case of teenage unemployment and drug arrests, we have R2 = .66, indicating that variation in X (un. rate) explains about 66% of the variation in Y (drug arrests). 34% of the variation in drug arrests is unexplained by teenage un. rate, explainable by other factors, omitted variables, random variation, etc.  Or could be a non-linear relationship - specification error (OLS assumes linear relationship).

Note: the correlation coefficient (r) of Y and X is .81. The R2 is .66, or (.81)^2.
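Both facts, that the total variation splits into explained plus unexplained, and that R2 equals the correlation coefficient squared, can be verified on a small hypothetical data set (not the book's 10-city data):

```python
from math import sqrt

# Toy data (hypothetical), same idea as the un.-rate / arrests example.
X = [1, 2, 3, 4, 5]
Y = [2, 5, 4, 8, 9]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

# Fit the OLS line and compute predicted values.
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar
Yp = [a + b * x for x in X]

sst = sum((y - ybar) ** 2 for y in Y)               # total variation
ssr = sum((yp - ybar) ** 2 for yp in Yp)            # explained variation
sse = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))    # unexplained (residual)
r2 = ssr / sst

# Correlation coefficient for the same data.
cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / n
r = cov / (sqrt(sum((x - xbar) ** 2 for x in X) / n)
           * sqrt(sum((y - ybar) ** 2 for y in Y) / n))

assert abs(sst - (ssr + sse)) < 1e-9   # total = explained + unexplained
assert abs(r2 - r ** 2) < 1e-9         # R2 is the correlation squared
print(round(r2, 3))
```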

See page 154 in Practice of Econometrics, A Word of Warning About the R2 Value (handout).  We shouldn't put too much emphasis on R2, t-stats are more important. However, R2, or some other measure of goodness of fit is expected in reported empirical results.

3. Adjusted R2 or R2-BAR is usually part of the regression output, e.g. EVIEWS.  Adjusted R2 calculation adjusts for the degrees of freedom.  Reason: adding a variable like X2 to the model can NEVER decrease the original R2, it can ONLY increase it.  Adding variables, even those without any explanatory power, garbage, will always increase R2.  Adjusted R2 accounts for, and adjusts for, the loss of a degree of freedom for every additional variable added.  Can be used to evaluate the addition of a variable to a regression equation, or can be used to evaluate alternative specifications/models.

4. Standard error of estimate or standard error of regression, also part of the standard regression output in EViews.  The standard error of estimate is the standard deviation of the errors (e): Yi - Yp = ei, or the standard deviation of the unexplained variation in Y.  See equation on page 230.  Without the square root, the calculation is the error variance.  Taking the square root, we have the std. dev. of the error terms, or the standard error of estimate (s.e.).  The smaller the standard error, the better the fit.  When all points fall on the line the standard error is 0.  The upper limit of the standard error is the standard deviation of Y, reached when X has NO predictive power at all.  In that case the regression line is horizontal, parallel to the X axis.  The predicted value of any Y is the mean of Y, so Yp would be replaced by Y-bar in the formula for the standard error, and the standard error of estimate would equal the s.d. of Y.

0  ≤  Standard Error  ≤  Sigma-y (the s.d. of Y)

The lower the s.e. the better.  Also, the s.e. has an interpretation in terms of the normal curve.  If the errors are normally distributed, 68% of the actual values of Y should fall within ± 1 s.e. of the regression line, and 95% within ± 2 s.e.

See page 230-231 for a demonstration of how the standard error is calculated.
Step 1.  Calculate the predicted values of Y using the regression equation: Y = 69.5 + 26.7 (X), putting in actual X values.
Step 2.  Subtract the predicted values for Y from the actual values of Y.
Step 3.  Square the values from Step 2.
Step 4.  Sum the values from Step 3, divide by N, to get the error variance.
Step 5.  Take the square root of the error variance to get the standard deviation of the errors, or s.e. of estimate.

For the regression from the example of teenage un rate vs. drug arrests, the standard error is 98.2 (page 230).  If the errors are normally distributed, then we can predict that 68% of the cities should fall within a range of ± 98.2 arrests of the regression line.  (Note: the std. error reported by EViews is 109; N-2 is used.)
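The five steps can be sketched in Python (toy data, hypothetical; the book's 10-city data are not reproduced here), including the N vs. N-2 divisor difference. Note that the two versions differ by the factor sqrt(N/(N-2)), so the book's 98.2 with N = 10 becomes 98.2 * sqrt(10/8) ≈ 109.8, matching the EViews value of 109.

```python
from math import sqrt

# Hypothetical data with an imperfect fit.
X = [1, 2, 3, 4, 5]
Y = [2, 5, 4, 8, 9]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar

# Steps 1-2: predicted values and errors.  Steps 3-5: square, average, root.
errors = [y - (a + b * x) for x, y in zip(X, Y)]
se_n = sqrt(sum(e ** 2 for e in errors) / n)          # book's version: divide by N
se_n2 = sqrt(sum(e ** 2 for e in errors) / (n - 2))   # EViews: divide by N - 2

print(round(se_n, 3), round(se_n2, 3))  # 0.927 1.197
```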

T-TESTS FOR SIGNIFICANCE OF INDIVIDUAL VARIABLES:

We are usually concerned with the statistical significance of individual variables in a multiple regression (more than one X variable).

Example: Yi = a + B1 X1i + B2 X2i + ei

i = 1, ..., n

Linear regression will now estimate three coefficients: a, B1 and B2. a = constant, and B1 and B2 are slope coefficients.

We are interested in the statistical significance of B1 and B2, the estimated coefficients.

Ho: B1 = 0
Ha: B1 ≠ 0

Ho: B2 = 0
Ha: B2 ≠ 0

t-stat = (B-hat - BHo) / S.E.(B)

Since Ho is usually Ho: B=0, the t-stat usually simplifies to B-hat / S.E.(B)

S.E.(B) is the standard error of the regression coefficient, and B-hat is the estimated beta coefficient.  S.E.(B) measures the spread of the sampling distribution of B-hat, which under Ho is centered at 0.  We have a point estimate, one value of Beta out of a distribution of infinitely many possible values around 0.

t-stat = B-hat / S.E. measures the number of standard errors/std. deviations away from zero our point estimate is.

Our B-hat is 26.744, our standard error is 6.83, so our t-stat is 3.91 (26.744 / 6.83), and the probability = .0045.

The probability is the exact level of stat sig, the probability of making Type I Error.  A t-stat of 3.91 indicates that the estimated B falls almost 4 s.d. away from 0, so we are very confident that B is not 0, and the teen un rate affects teen drug arrests.  We would say that the un. rate has a pos. and stat sig effect on teen drug arrests at the 1% level of sig.

We are actually comparing the t-stat of 3.91 to the critical value at the 1% level (two-tailed test) of 3.355 (page 377). D.F. = N - k = 10 - 2 = 8.  EVIEWS does this automatically.
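The t-test arithmetic above can be checked in a short Python sketch (the 1% critical value is the one quoted from the table on page 377; EViews reports the exact probability automatically):

```python
# t-test for the un.-rate coefficient: Ho: B = 0, so t = B-hat / S.E.(B).
b_hat, se_b = 26.744, 6.83
t_stat = b_hat / se_b

crit_1pct = 3.355  # two-tailed 1% critical value with d.f. = N - k = 10 - 2 = 8

print(round(t_stat, 2))       # about 3.92 (reported as 3.91 in the text)
print(t_stat > crit_1pct)     # True -> reject Ho at the 1% level
```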

The prob. of .0045 is the probability of Type I error.  There is only a 4.5/1000 chance that we could get a point estimate of B 3.91 s.e. away from 0, when the true value of B=0.  Or there is only a 4.5/1000 chance that we would get our results of a sig relationship between un rate and teen drug arrest due to RANDOMNESS or CHANCE.

There is only a 4.5/1000 chance of making Type I Error - falsely rejecting the Ho in favor of the Ha when the Ho is actually true.  A 4.5/1000 chance of finding a sig relationship between teen un rate and drug arrests when there really is none.

Pretty remote chance that we would ever find a point estimate of Beta almost 4 s.d. away from zero due to chance; there is less than a 1% chance of that ever happening.