Chapter 4 - USING THE COMPUTER IN POLICY RESEARCH

PLANNING FOR COMPUTER ANALYSIS:

Statistical packages like EVIEWS are designed to be efficient and powerful when it comes to actual statistical analysis and estimation.  However, they are often not as efficient as spreadsheets for the initial data collection, data entry, data management, etc.  Data can easily be entered in EVIEWS, but for large data sets it may be easier to manipulate and manage the data initially in Excel or Lotus, or some other spreadsheet.  Also some data that you purchase or retrieve off the Internet may come in a spreadsheet format, so it is useful to know how to use a spreadsheet in your research.  Example: Jeff Wilson's data on police stops/tickets.

EVIEWS is fully compatible with Excel: you can import data from Excel and export data to Excel, as the computer demo will show.

Example of data collection/management, see page 90.  You are going to conduct research on the income of 5000 state employees, to test for the possibility of sex or racial discrimination.  Income = f (Sex, race, job classification, years experience (total), years experience with the state, years of education, age).  Issue: After controlling for education, experience, age, job classification, etc., to what degree does sex or racial discrimination exist?   Unit of analysis: INDIVIDUAL, employees.

Entering the data (p. 93): For such a huge data set (5000 employees and 8+ variables for each one), you should use a spreadsheet, e.g. Excel, and not EVIEWS.  Each row will represent one employee, so you will have 5000 rows.  Each column will represent one variable: Income, sex, race, education, age, experience, etc., so you end up with a huge matrix with 5000 rows and 8 or more columns.

You decide on 8 or more variables for each employee: Gender, Ethnic status, Years of state employment, Years in current position, Years of Education, Salary, Job Classification, and Age.  Also, each employee should have a unique identification number for several reasons: a) confidentiality, b) to identify individuals in case you need to update or edit data later (e.g. for individuals who retire or change jobs), and c) to transfer data to EVIEWS or another stat package (numbers only).

Columns of Data:

1.  Identification number

2.  Income: Annual salary in dollars for each employee, or if hourly: Wage x 2080 hours.

3.  Gender: Use dummy variable, e.g. G = 0 if the employee is female and G = 1 if employee is male.  The variable G can be used to classify two categories: M and F, and will be a column of 0s and 1s.   Remember that this is nominal-level data, so the numbers are only labels: they are not ordinal or ranked.

4.  Ethnic status:  Use dummy variables, e.g. D1 = 1 if black and 0 otherwise, D2 = 1 if white and 0 otherwise, D3 = 1 if Hispanic and 0 otherwise and D4 = 1 if Asian and 0 otherwise.

5.  Education: Number of years in school, e.g. 8 - 24 years, or you could classify/group like the book (p. 92).

6.  Years of state service

7.  Years in current position

8.  Years total work experience (not in book)

9.  Age (years)

10.  Job classifications: a series of dummy variables, J1 = 1 if administration and 0 otherwise, J2 = 1 if professional and 0 otherwise, etc.

You can create a code book like the one on pages 92-93 to keep track of your variables and the various classifications.  Point: You usually want to end up with ALL of your data converted to NUMBERS.
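As a rough illustration of converting everything to numbers, the sketch below turns one employee record into a row of the spreadsheet using the G and D1-D4 dummies described above. The function name, the category labels and the sample employee are assumptions made up for this example; they are not from the book.

```python
# Hypothetical category labels, in the order D1, D2, D3, D4.
RACE_DUMMIES = ["black", "white", "hispanic", "asian"]


def encode_employee(emp_id, salary, gender, race):
    """Return one all-numeric row: ID, income, G, D1, D2, D3, D4."""
    g = 1 if gender == "M" else 0                      # G = 1 if male, 0 if female
    d = [1 if race == r else 0 for r in RACE_DUMMIES]  # one dummy per category
    return [emp_id, salary, g] + d


# A made-up employee: ID 1001, salary $42,000, female, Hispanic.
row = encode_employee(1001, 42000, "F", "hispanic")
print(row)  # [1001, 42000, 0, 0, 0, 1, 0]
```

An hourly employee would be entered the same way after converting the wage to an annual figure (wage x 2080 hours).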

Note: It is very easy to edit data in EVIEWS or enter small data sets, but large data is more manageable in Excel.

See criteria on page 95 for statistical packages: EVIEWS has excellent graphics, is compatible with Excel/Lotus/Text files, is both easy to use and powerful/flexible, etc.
 
 
 





Chapter 5 - SIMPLE DESCRIPTIVE STATISTICS


This chapter - single variable analysis. Next we look at multivariate statistics. In many cases/applications, it is very useful/appropriate to provide simple descriptive stats, or summary statistics, of a SINGLE variable.

How many full time students at UM-F?
How many grad students at UM-F?
How many students graduate in 4 years?
What is average class load?

Every job/company/agency requires some sort of descriptive stats, sometimes not frequently (once a year), other times frequently - daily or hourly.  Examples: How many library books were checked out in a year, how many people were treated at regional mental health centers, how many people were charged with/convicted of felonies, etc.

How to show or display data for a presentation, study, article, to make it easy to understand?
 

FREQUENCY DISTRIBUTIONS - see page 107.

Frequency distribution table of response time of fire dept. in Oregon.

Key points on Table 5-1:

1. Descriptive, clearly understandable, specific title/label, with a specific time period.

2. Clear labels of each column.

3. Appropriate categories. For these data, categories of < 5 min, 5-10 min, 10-15 min, and > 15, seem appropriate. There is some data in each category.  Not too concentrated.  Some optimal number of categories - not too many and not too few. May depend on audience - public vs. consultant for fire dept. to improve response time.

4. Present both frequencies and corresponding percentages, for better understanding.  In the example, we can easily see that 80% of calls have response time of less than ten minutes.  Also allows comparison to previous years' data.

5. Summary statistics - may be appropriate. Mean (average), median, standard deviation. Measures of: a) central tendency (mean, median), and b) dispersion (standard deviation).

6. Identify source of data. Important.

Note: The burden falls on the researcher to make the table/graph/figure easily understandable, complete, clear, etc.

Putting together a frequency distribution - see book, pages 108-110. Start with raw data, then collapse it into an appropriate set of categories with their frequencies.
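A minimal sketch of collapsing raw data into a frequency distribution with both counts and percentages. The ten response times are invented for illustration (they are not the Oregon data in Table 5-1), and treating each category as "from the lower bound up to, but not including, the upper bound" is an assumption.

```python
from collections import Counter

LABELS = ["< 5 min", "5-10 min", "10-15 min", "> 15 min"]


def category(minutes):
    """Assign a response time to one of the four categories."""
    if minutes < 5:
        return LABELS[0]
    if minutes < 10:
        return LABELS[1]
    if minutes < 15:
        return LABELS[2]
    return LABELS[3]  # assumption: a value of exactly 15 would land here


# Hypothetical raw response times, in minutes.
raw = [3.2, 4.5, 6.1, 7.8, 8.0, 9.4, 11.0, 12.5, 16.3, 4.9]

counts = Counter(category(m) for m in raw)
table = [(label, counts[label], 100 * counts[label] / len(raw)) for label in LABELS]

for label, freq, pct in table:
    print(f"{label:<10} {freq:>3} {pct:>6.1f}%")
```

With this made-up data, 70% of calls fall under ten minutes, which is the kind of statement the percentage column makes easy to read off.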
 

GRAPHS AND CHARTS

Allows us to see/visualize/picture the data, more intuitive than a table of data, especially for a general audience.  Book - example - former student works as a police planner for a boss who will ONLY look at charts and graphs, and no tables.

EVIEWS - good for bar graphs, histograms, line graphs, pie charts, scatter diagrams, etc.

1. Bar graphs, see examples on page 112 - 113. Comparison of M/F professor salaries.  Same data, presented two ways in Figures 5-1 and 5-2.  Easy to understand.  Cross-sectional data, collected at one point in time, 97-98 school year.  Note: data source is indicated.

Limitation: the graph doesn't reveal if the difference in salaries is statistically significant.

2. Line graphs. Typically used for time series data, to show how a variable moves through time, e.g. salaries, home prices, population, sales/revenue, GDP, unemployment, test scores on SAT, building permits, stock prices, etc.  Usually time is on the horizontal axis, so time is the "independent" variable, see page 115 which plots three variables over time.

EVIEWS easily allows "dual scale" when the variables are in different units of measurement, e.g. unemployment rates (%) and school enrollments (# people), or are extremely different (average hourly wage ($) vs. retail sales($m)).

Note: Line graphs should always be clearly labeled: both the X and Y axes (with appropriate units) and each individual line.

3. Pie charts - visually show a breakdown by percentage. Useful for budgeting applications, e.g. sources of revenue or expenditures.  See page 118.
 
 

MEASURES OF CENTRAL TENDENCY

Measures of central tendency convey information about "typical" or "average" values. "Average" here is used in the everyday sense; it is not a precise statistical term.

Three measures of central tendency: 1) mean, 2) median and 3) mode. NOTE: for normal distribution (bell shaped curve), all measures of central tendency will be exactly the same value/number.

We will also look at the "weighted average" or "expected average."

1) MEAN is used most often to measure central tendency, or average.

         X-BAR  =  SUM    Xi  /  N

where X-BAR is the mean of the Xs, N is the number of X values, the Xis are the individual Xs, and Sigma (SUM) is the summation sign.

Example: 4, 5, 6, 7 and 8.  X-BAR =  (4 + 5 + 6 + 7 + 8) / 5 = 6 (mean of Xs)

Example: Book, page 120. # Days in hospital for 24 patients. Range 1 - 150 days (MIN = 1 and MAX = 150). The mean (typical) hospital stay is 15.8 days per patient, which would possibly indicate that: a) most patients stayed in for about 16 days, or b) about 50% of patients stayed for more than 16 days and 50% stayed for fewer than 16 days.  However, this example shows the limitation of using the mean.  Nobody stayed exactly 16 days, and ALL patients except 3 stayed fewer than 16 days. Mean in this case is not a good measure of central tendency.  Why????

2. MEDIAN - this is a better measure of central tendency in the Hospital case.

Median - like median on highway - 50% above and 50% below when data is ranked ordinally, from lowest to highest or vice versa, in ascending or descending order.  For the data (7, 5, 3, 6, 9), we order/rank the data (3, 5, 6, 7, 9) or (9, 7, 6, 5, 3) and find the middle number: 6, which is the MEDIAN.  Half of the numbers (2) are below the median and half (2) are above.  When N is odd, we find the middle value.  When N is even, we find the midpoint BETWEEN the TWO middle values.

For example, N=8: (2, 5, 8, 11, 13, 20, 25, 30), so we find the two middle values (11 and 13), and take the midpoint or mean (12), which is the MEDIAN.  In this case the median (12) is not one of the actual values in the data set.
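The small examples above can be checked with Python's standard statistics module. This is only a sketch for verifying the arithmetic by hand; the notes themselves use EVIEWS.

```python
import statistics

# Mean example: (4 + 5 + 6 + 7 + 8) / 5
mean_x = statistics.mean([4, 5, 6, 7, 8])

# Median with N odd: the data are sorted internally, middle value is taken.
median_odd = statistics.median([7, 5, 3, 6, 9])

# Median with N even: midpoint between the two middle values (11 and 13).
median_even = statistics.median([2, 5, 8, 11, 13, 20, 25, 30])

print(mean_x, median_odd, median_even)  # 6 6 12.0
```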

In the Hospital case, there are 24 observations. When ranked from lowest to highest (page 122), Observations #12 (4) and #13 (5) are the middle, or median observations. The median stay would be 4.5 days, meaning that exactly 50% of patients stayed fewer than 4.5 days and 50% stayed longer than 4.5 days.

Note that the median (4.5 days) is much different than the mean (15.8 days).  Why the difference?

Reasoning: Mean is much more affected by extreme values (low or high), OUTLIERS.  In this case the two patients who stayed 100 and 150 days had a large influence on the calculation of the mean.  Note: If the patient who stayed 150 days increased their stay to 200 or 300 days, the mean would go UP, but the median would stay exactly the same, it would not be affected by the OUTLIER.  What would be another approach in this case to adjust the MEAN to make it a better measure of central tendency?
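To see the effect of an outlier on the mean but not the median, here is a small sketch. The eight stays are hypothetical numbers chosen for illustration; they are not the 24 observations from the book.

```python
import statistics

# Hypothetical hospital stays in days, with one extreme value.
stays = [1, 2, 3, 3, 4, 5, 6, 150]
mean_before = statistics.mean(stays)      # 21.75
median_before = statistics.median(stays)  # 3.5

# Lengthen only the longest stay from 150 to 300 days.
longer = stays[:-1] + [300]
mean_after = statistics.mean(longer)      # 40.5
median_after = statistics.median(longer)  # 3.5

print(mean_before, median_before)
print(mean_after, median_after)
```

The mean nearly doubles while the median does not move, because the median depends only on the middle positions in the ranking.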

In which types of cases would the MEDIAN be a better measure/indicator of central tendency than MEAN?
 

Examples:  Mean or Median?
GPA -

Life expectancy -

Housing prices -

Income -
 

3. MODE - value occurring most frequently. Example: mode is 3 days, which was the most frequent hospital stay. More people (6) stayed for exactly 3 days than any other single value.

Summary: The mean stay is 15.8 days, the median stay is 4.5 days, and the mode is 3 days.  In practice, the mean and median are used most often; the mode is rarely used.  EVIEWS does not report the mode, only the mean and median.  To get the mean and median of X in EVIEWS, type the command: stats X (where X is the variable name).
 

4. WEIGHTED AVERAGE

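Weighted average = SUM (Wi * Xi), summed over i = 1, ..., N

where Wi is the weight on value Xi, and the weights must sum to one: SUM Wi = 1.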

Examples:

a. Three 100 point tests with different weights. Test 1 - 25%. Test 2 - 35%. Test 3 - 40%. Scores: 80, 85 and 90.

Wtd. Average: .25(80) + .35(85) + .40(90) = 85.75  (Note: .25 + .35 + .40 = 1)

b. Portfolio = 60% Bonds with 8% return, and 40% stocks with 26% return.

Wtd. average = .6(8) + .4(26) = 15.2%

c. Expected value (return) of a risky investment.

20% chance of a 20% loss (a return of -20%), 50% chance of a 25% return and a 30% chance of a 50% return.
Exp value (return) = .20 (-20%) + .50 (25%) + .30 (50%) = 23.5%
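The three examples can be reproduced with one short function. This is a sketch; the function name and the check that the weights sum to one are choices made for this example.

```python
def weighted_average(weights, values):
    """Return SUM(Wi * Xi); the weights are required to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))


test_grade = weighted_average([0.25, 0.35, 0.40], [80, 85, 90])  # about 85.75
portfolio = weighted_average([0.60, 0.40], [8, 26])              # about 15.2 (%)
expected = weighted_average([0.20, 0.50, 0.30], [-20, 25, 50])   # about 23.5 (%)

print(round(test_grade, 2), round(portfolio, 2), round(expected, 2))
```

In the expected value case the weights are probabilities, which is why they must also sum to one.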
 
 

MEASURES OF DISPERSION

Two distributions could have the same mean, but very different degrees of dispersion - measure of the variability of the values. What is the probability of an individual value being very far above or below the mean? The greater the probability, the greater the dispersion. Tight distribution (low dispersion) vs. wide distribution (high dispersion).

RANGE - Simple measure of dispersion, shows the extreme values, MIN and MAX, or the difference:

MAX VALUE - MIN VALUE = RANGE

Example: page 124.  EZ Care: range is 10 (min) - 50(max) patients/doctor/day or 40,  Welrun: range is 28 - 32, or 4 patients/doctor.  Mean is 30 patients/doctor/day for both Clinics.

AVERAGE DEVIATION - not used very often, don't worry about it.

STANDARD DEVIATION - most commonly used measure of dispersion. Formula:

SIGMA = SQRT [ SUM (Xi - X-BAR)^2 / N ]
 

Steps for computing std deviation:

1. Calculate the mean: X-Bar.
2. Subtract the mean from each observation/value (Xi), to calculate the "deviations from the mean."
3. Square each value from step 2 to calculate the "squared deviations from the mean."
4. Sum the squared deviations from the mean over all N observations.
5. Divide the sum by N (# observations) = VARIANCE (Sigma squared).
6. Take square root of variance to get STD DEV (Sigma).

Example in book, page 127. EZ Care and Welrun.

EZ Care - Variance = 200, STD DEV = 14.14

Welrun - Variance = 2, STD DEV = 1.41.

Mean is 30 for both.  Welrun has a tight distribution, EZ Care has a wide distribution. The larger the std deviation in relation to the mean, the greater the dispersion of values around the mean, and the greater the possibility of a value being far away from the mean (very high or very low).
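The six steps for the standard deviation can be sketched as follows. The two five-value lists are assumptions chosen only because they are consistent with the mean (30), the ranges and the variances quoted for EZ Care and Welrun; they are not necessarily the book's actual data.

```python
import math


def population_std_dev(values):
    """Follow the six steps; returns (mean, variance, std dev)."""
    n = len(values)
    mean = sum(values) / n                    # step 1: mean
    deviations = [x - mean for x in values]   # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]    # step 3: squared deviations
    total = sum(squared)                      # step 4: sum of squared deviations
    variance = total / n                      # step 5: divide by N
    return mean, variance, math.sqrt(variance)  # step 6: square root


# Hypothetical patients per doctor per day at each clinic.
ez_care = [10, 20, 30, 40, 50]
welrun = [28, 29, 30, 31, 32]

ez_stats = population_std_dev(ez_care)
welrun_stats = population_std_dev(welrun)

print("EZ Care: mean=%.0f variance=%.0f std dev=%.2f" % ez_stats)
print("Welrun:  mean=%.0f variance=%.0f std dev=%.2f" % welrun_stats)
```

Note that this divides by N, as in step 5. Many statistical packages divide by N - 1 when reporting a sample standard deviation, so software output can differ slightly from a hand calculation.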

Advantage of Std Deviation over Variance: Unit of measurement for Std Dev is the same as the mean. Example: patients per doctor is the unit of measurement for mean and std dev. Units for variance are Patients Squared per doctor.  EVIEWS only reports Std Dev, not Variance.
 

NORMAL DISTRIBUTION

Many distributions of large samples will follow a normal distribution, or a normal curve, bell-shaped distribution. See page 128.  Examples: results from standardized academic tests would typically follow a normal curve. The mode, median and mean are all 50, and a majority of values are clustered near the mean.

If a distribution is normal, we know that about 68% of all observations fall within +/- 1 s.d. of the mean (from Mean - 1 Std Dev to Mean + 1 Std Dev), about 95.45% of all observations fall within +/- 2 s.d. and about 99.7% fall within +/- 3 s.d. of the mean.

Example: If the mean test score is 50 points and the std. deviation is 10 (page 128), then about 68% of test scores will fall between 40 - 60, about 95% will fall between 30 - 70, and about 99.7% between 20 - 80.

Example: If N=1000, then about 683 students would score between 40 - 60, about 954 would score between 30 - 70 and about 997 would score between 20 - 80.  About three students would score either below 20 or above 80.

Interpretation: Back to the previous example, if you have a normal distribution of doctors, and you find that the average patient load is 30 with an s.d. of 5, that would mean that about 68% of all doctors see between 25-35 patients/day, about 95% see between 20-40 and about 99.7% see between 15-45 patients per day.
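The percentages for 1, 2 and 3 standard deviations can be computed directly from the standard normal distribution. This sketch uses NormalDist from Python's standard library; the helper name share_within is made up for the example.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within +/- k std deviations."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/- {k} s.d.: {share_within(k):.4f}")

# Applied to the doctor example (mean 30, s.d. 5): the range for k = 2.
low, high = 30 - 2 * 5, 30 + 2 * 5
print(low, high)  # 20 40
```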
 

STANDARD Z SCORES

We may want to compare scores/values in two different distributions that have different means and std deviations. Example: GRE/GMAT/LSAT for grad school, SAT/ACT for undergrad school, or comparing standardized test scores for elementary or secondary schools in different states with tests scaled differently.

Using Z-scores allows us to do that.

Example: Test A:  mean = 100,     s.d. = 10
               Test B:  mean = 750,     s.d. = 100

How does a score of 75 on Test A compare with a score of 600 on test B?

We first convert the test scores to standard scores, or Z scores, then compare.

                Z = (Xi - X-BAR) / s.d.

(Z-score = Number of std dev above/below the mean.)

Note: Z-scores will be distributed with a mean of 0 and a std deviation of 1.
Example: If mean = 10 and std dev = 5, then Z-score for mean = (10 - 10) / 5 = 0
The Z for the "mean + 1 std dev" = (15 - 10) / 5 = 1

Test A:  Z-score = (75-100) / 10 = -2.5
Test B:  Z-score = (600-750) / 100 = -1.5

Score A (75 points) was 2.5 std deviations below the mean, Score B (600 points) was 1.5 std deviations below the mean.  Result: Score A was worse.  Converting to Z-scores allows comparison because both scores are then measured on the same scale (std deviations from the mean), with a mean of 0 and a std dev of 1.
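The Z-score comparison is easy to check in code. A minimal sketch, with a function name chosen for this example:

```python
def z_score(x, mean, sd):
    """Number of std deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_a = z_score(75, mean=100, sd=10)    # Test A
z_b = z_score(600, mean=750, sd=100)  # Test B

print(z_a, z_b)  # -2.5 -1.5
print("Score A is worse" if z_a < z_b else "Score B is worse or equal")
```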
 

CONVERTING TO A DIFFERENT DISTRIBUTION

We can also compare scores from tests with different scales by converting a score from Test A to a score on the scale for Test B (e.g. convert SAT to ACT score).

Conversion formula = (Z-score from Test A * Std dev of Test B) + mean of Test B = converted score from A to B.
Conversion formula = (Z-score from Test B * Std dev of Test A) + mean of Test A = converted score from B to A.

Converting score on Test A (75 points) to a score on Test B: -2.5 (100) + 750 = 500 points.  500 points on Test B would be equivalent to a score of 75 points on Test A, they are both 2.5 std deviations below the mean.

Convert B (600 points) to score on Test A: -1.5 (10) + 100 = 85.  600 points on Test B is equivalent to 85 points on Test A, they are both 1.5 std dev below the mean.  
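Both conversions follow the same two moves: compute the Z-score on the original test, then rescale it with the other test's mean and std deviation. A sketch, with hypothetical function and parameter names:

```python
def convert_score(score, mean_from, sd_from, mean_to, sd_to):
    """Convert a score from one test's scale to another's via its Z-score."""
    z = (score - mean_from) / sd_from
    return z * sd_to + mean_to


# Test A: mean 100, s.d. 10.  Test B: mean 750, s.d. 100.
a_on_b = convert_score(75, 100, 10, 750, 100)   # 75 on A expressed on B's scale
b_on_a = convert_score(600, 750, 100, 100, 10)  # 600 on B expressed on A's scale

print(a_on_b, b_on_a)  # 500.0 85.0
```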
 

PROPORTIONS OF A DISTRIBUTION BETWEEN TWO VALUES

We can also use a Z value to find the proportion of observations that fall in a given range between the mean and some value above or below the mean.

Example: Test C, mean = 450 points and s.d. = 120

What percentage of all scores fall between 350 and 450?

Z-score  = (350 - 450) / 120 = -.83

Find .83 (ignore sign) in the normal distribution table on page 351. The value is .2967. That means that almost 30% of the scores will fall in the range of 350-450.

What percentage of scores falls between 450 and 550?  Z = (550 - 450) / 120 = .83
The value is also .2967, which means that almost 30% of scores fall within 450 - 550.

If we look at the range from 350 - 550 points, about 60% of scores fall within that range.  Intuition: +/- .83 std deviations represents about 60% of total observations, compared to about 68% for +/- 1 std deviation, for a NORMAL DISTRIBUTION.
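The table lookup can be reproduced with the standard normal CDF. In this sketch the Z value is rounded to two decimals on purpose, to mimic reading a printed table; the function name is an assumption for the example.

```python
from statistics import NormalDist


def share_between_mean_and(x, mean, sd):
    """Proportion of a normal distribution lying between the mean and x."""
    z = round(abs(x - mean) / sd, 2)  # ignore the sign, round like a Z-table
    return NormalDist().cdf(z) - 0.5


below = share_between_mean_and(350, mean=450, sd=120)  # 350 to 450
above = share_between_mean_and(550, mean=450, sd=120)  # 450 to 550

print(round(below, 4), round(above, 4), round(below + above, 4))
```

Using the unrounded Z value (.8333...) gives a slightly larger proportion than the table value, which is ordinary rounding error and does not change the interpretation.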

Note: Using the Z-table to find proportions assumes NORMALITY.
 

OTHER KINDS OF DISTRIBUTIONS

1. Bimodal distributions. Two distinct, separated, clusters or groups of observations - two distributions.

Example: test scores. Realistic. Students either get it or don't. No one is in the middle; most of the class gets either an A or a D/F.

Smoking distribution. Spiked distribution at 0 cigs/day, and then a normal distribution around 1 pack/day.

Graphing out the data, doing a histogram, is helpful to detect this.

2. Uniform distribution - uniformly distributed values over the entire range. No clustering in middle.

3. Skewed distribution - skewed right or skewed left.  See page 133.  Possible distribution of test scores.  Clustered on the right side.  Outliers on the low side.  In this case, the distribution is skewed LEFT.  The mean is to the LEFT of the median, because the low outliers pull the mean down.

In the other case, the distribution is skewed RIGHT.  The mean is on the RIGHT side of the median. The outliers are on the high side.  Examples: housing prices and income.
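A quick way to see which way a distribution is skewed is to compare the mean with the median, since the mean is pulled toward the outliers. Both data sets in this sketch are invented for illustration.

```python
import statistics

# Hypothetical test scores: clustered high, a few very low outliers (skewed left).
scores = [20, 35, 80, 85, 88, 90, 92, 95]

# Hypothetical incomes in $ thousands: clustered low, one very high outlier (skewed right).
incomes = [25, 30, 32, 35, 38, 40, 45, 250]

for name, data in (("scores", scores), ("incomes", incomes)):
    mean, median = statistics.mean(data), statistics.median(data)
    direction = "left" if mean < median else "right"
    print(f"{name}: mean={mean}, median={median}, skewed {direction}")
```

This mean-versus-median comparison is only a rule of thumb; a histogram or a formal skewness statistic is a more reliable check.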

Not all distributions are normal, but we typically assume normality because it lets us use the widest range of statistical tools and inferences.  Higher level, advanced and theoretical statistics and econometrics pay more attention to skewness and kurtosis.  EVIEWS reports skewness, kurtosis and a normality test (Jarque-Bera) as part of the standard output of the "stats" command (type command: stats X, where X is the variable name).