Chapter 7 - HYPOTHESIS TESTING AND STATISTICAL SIGNIFICANCE

Illustrations of the topics in this chapter:

1. Study of 100 new state employees. Females w/MPA and no experience are paid an average of \$21,400 and men w/MPA, no experience, are paid \$23,000.  Does this difference of \$1600/year indicate discrimination?  How likely is it that this salary difference would exist due totally to chance, random variation?

2. Suppose you are a lawyer for a black defendant in a jury trial.  The community is 50% black, but the jury is 100% white.  Does this indicate discrimination in jury selection process?  How likely is it that we would find this jury composition due totally to chance, random variation?  Unlikely, but exactly how unlikely is it?

3. You are evaluating a counseling program for juvenile offenders.  You find that 25% of those in program commit a crime within a year compared to 40% rate for those NOT in the program.  Does this difference indicate success of program?  How likely would that difference happen due to chance?

In these cases, we are analyzing a limited sample - 100 employees, 1 jury (12 members), X number of individuals in the program - to assess the population - ALL state employees, ALL juries, ALL juveniles.  Statistical inference is the process that allows us to assess the stat. significance of the findings of our sample, and assess the degree to which we can generalize our results.  Or how likely is it, given our findings for a small sample, that we would find those patterns entirely due to chance and NOT due to some hypothesized variable or relationship?

How likely is it that a \$1600 salary difference would exist due to CHANCE and NOT to discrimination?  How likely is an all white jury due to chance and NOT due to discrimination?  How likely is a 15% difference in crime rates due to chance and NOT due to the program?

To answer these questions, we need tests of statistical significance.  As the text mentions, this is one of the most difficult concepts in the book, but also probably the MOST important concept of the course/text.

THE CONCEPT OF STATISTICAL SIGNIFICANCE

Due to constraints of time, money and resources, we usually only look at a small sample (or set) of all available units of interest.  Unit of analysis could be individuals, counties, cities, states - it would usually be too expensive to sample the entire population - so for practical purposes we look at a sample.  Ex pede Herculem (from the foot, Hercules: we judge the whole from a part).

Possible Exception: The census, taken every 10 years, is supposed to investigate the entire POPULATION, and not just a sample.  Using sampling for the census was proposed for the 2000 census and became a controversial political issue.

Examples of sampling:
blood sample
TV ratings
Polls for elections
Medical research
FDA drug approval
Failure rate testing of manufacturing process. Batch testing of sample.
Water tests - river, lake, ocean, drinking water, etc.
Air tests

Statistical inference: making inferences about the population from a sample (or inferences from animals to humans, etc.).

Important issue: How large should the sample be to ensure representation of the population?  Ex pede Herculem.

In our three examples, we are interested in:

1. Does the salary pattern of the sample of 100 new employees accurately reflect the salary pattern for all new managers working for the state?

2. What is the exact probability of an all white jury from a 50% black community?  How likely is it that manipulation/bias was involved to get that result?

Point: we can't actually "prove" with statistical analysis that manipulation exists with 100% certainty, we can only measure how likely it is.

Example: "At the 1% level of statistical significance...X has a positive and statistically significant effect on Y."

3. Was the 15% difference in recidivism due to the small number of people who had gone through the program?

Are the properties of the sample accurate reflections of the population?  100% certainty is not possible; we can only measure levels of stat. sig.  We CAN estimate precisely the likelihood (probability) that differences found between two groups are due to chance.  For example, we might be able to say, after performing statistical analysis: There is only a 1% chance that a \$1600 earnings differential between men and women is due to chance, and NOT to discrimination (assuming ceteris paribus).  In other words, a study may show that we are 99% confident that discrimination does exist.

Generally accepted practice is to report one of three levels of significance:
1% level of statistical significance = 99% confidence level (HIGHEST LEVEL OF STAT SIG)
5% level of statistical significance = 95% level of confidence and
10% level of statistical significance = 90% confidence level.

SAMPLING AND SAMPLING PROCEDURES

How to get a valid, random sample?

Probability sample is a way to guarantee a random sample.

Simple Random Sample technique - drawing names out of a hat.  Example: 20,000 students in a university.  We want to interview a sample of 200.  Every student has a 200/20,000 chance or 1/100 chance, or 1% chance, if we do a simple random sample.

One way to guarantee a random sample is to use a random number table. See page 349, and the example on page 170-171.

In EVIEWS, you can generate random numbers: Example:

set sample size = 1000
genr x = rnd

This will create 1000 random numbers between 0-1 - uniform distribution, not normal distribution.
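
For readers without EVIEWS, the same idea can be sketched in Python. This is only an illustration: the seed value and the variable names are arbitrary choices, not from the text. It draws 1000 uniform random numbers and then takes a simple random sample of 200 from the 20,000 students in the earlier example.

```python
import random

# A fixed seed (arbitrary choice) so the run is repeatable.
rng = random.Random(42)

# 1000 uniform random numbers on [0, 1), analogous to "genr x = rnd".
x = [rng.random() for _ in range(1000)]

# Simple random sample: 200 of 20,000 student ID numbers,
# drawn without replacement so each student has a 1% chance.
student_ids = range(1, 20001)
sample = rng.sample(student_ids, 200)

print(len(x), min(x) >= 0.0, max(x) < 1.0)
print(len(sample), len(set(sample)))
```

Because the draw is without replacement, no student can appear twice in the sample.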

Systematic sample - pick one name at random from 1-100. Count down 100 names and pick that name, keep counting down every 100 names to get 200.

Example: Traffic checkpoints, stop every third car.
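
The systematic procedure can be sketched in a few lines of Python. The function name and the seed are illustrative assumptions; the numbers (20,000 names, every 100th name) follow the university example.

```python
import random

def systematic_sample(population_size, interval, rng):
    """Pick a random start in 1..interval, then take every interval-th unit."""
    start = rng.randint(1, interval)
    return list(range(start, population_size + 1, interval))

rng = random.Random(7)  # arbitrary seed for repeatability
picked = systematic_sample(20000, 100, rng)

print(len(picked))   # number of names selected
print(picked[:3])    # first three positions on the list
```

Only the starting point is random; every later pick is determined by it, which is why the ordering of the list matters for this method.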

Stratified Samples - Divide population into strata, or classes, and draw random samples from the groups.

Example: study on a campus of 20,000 students to assess differences in the experience of African American and white students (two classes or strata).  You would like to interview 200 black students and 200 white students.  However, there are 1000 black students (5% of student body) and 19,000 white students (95% of student body).

If you selected 200 African American students out of 1000, that would be 20% of black students.  If you used the same sampling fraction (1 out of 5), you would select 20% of 19,000 white students or 3800 students, so you would have to interview a total of 4000 students (sample size = 4000).  That may be too costly and too time-consuming.

Possible Solution: Go ahead with 200 AA and 200 W, make an adjustment later using the weighted mean approach to weight the two classes (strata) differently, according to the proportion they represent of the total.

Example: AA are 5% (.05) of student body (1000 / 20,000) and W are 95% (.95).  Assume that you find that the mean study time for AA is 35 hours/wk and the mean study time for W is 25 hours.  If you failed to adjust your scores, you would find an average (unweighted, or equally weighted) of  (25 + 35) / 2 = 30 hours/wk.

To adjust for the unequal weighting, you would use weighted average (mean): (pages 172-173)

WTD MEAN = .05 (35 hours for AA) + .95 (25 hours for W) = 25.5 hours / week (vs. 30 hours unweighted).
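
The same adjustment as a short Python sketch (the function and variable names are illustrative, not from the text):

```python
def weighted_mean(values, weights):
    """Mean of values, each weighted by its share of the population."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

study_hours = [35, 25]                      # AA, W mean study hours per week
shares = [1000 / 20000, 19000 / 20000]      # .05 and .95 of the student body

unweighted = sum(study_hours) / len(study_hours)
weighted = weighted_mean(study_hours, shares)

print(unweighted)          # equal weighting of the two strata
print(round(weighted, 2))  # weighting by population share
```

Dividing by the total weight means the shares do not have to sum to exactly 1, so raw counts (1000 and 19,000) would give the same answer.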

Cluster Sample - Assume there are 100,000 households in a city and you want to sample 1000 residents for a study on satisfaction with city parks or some other citywide program.  You want to do face-to-face interviews of 1% or 1/100 households.  Local power companies and the Census Bureau are a source of data for the number of households.  You could interview every 100th household, but it would be time consuming and expensive.  Alternative strategy: identify every city block, pick 1% of the blocks at random and interview EVERY household on the block.  Saves time of going around the entire city to 1000 randomly located houses.  Each household still has 1/100 chance of being selected.

Or you could apply the clustering approach to voter registration lists, city directories, etc., especially for a mail questionnaire.  Pick every 1000th name and then pick the cluster of 10 names before the name.  Cautions: Voting records should be used ONLY if your population is all registered voters.  Also, if voting records are organized by voting precinct, you would have to be careful that using clustering didn't result in certain precincts being over represented (homogeneous, nonrandom sample).

SRC (Survey Research Center) of UM-AA conducts national election studies every two years using cluster sampling for in-person interviews. See page 175-177.

Random-digit dialing is a method for telephone surveys (page 177).  Telephone surveys have largely replaced in-person surveys - much cheaper to call than to visit.

Determine the sampling area (city, state, county, country, etc.), get a list of all telephone prefixes (first 3 digits in a 7 digit phone number) in that area, and then randomly call numbers from that list.  For example, if one prefix was 762, there would be 10,000 numbers, 762-0000 to 762-9999, and you would sample from within those numbers using a random number table or a computer.  You would sample more numbers than the desired sample size, since only about 22% of phone numbers yield a household contact; the rest are unassigned numbers, business numbers, people not home in five tries, people who refuse to participate, etc.

Solution/strategy (page 177): start with a primary, random phone number like 864-5347, and if that number doesn't result in a valid contact, call another secondary number between 864-5300 and 864-5399, and keep calling within that range until a pre-determined number of contacts is established, e.g. six secondary numbers.  Then start with a new primary number and repeat the process.

New issue: Cell phones.  Why??

Nonprobability Samples - Samples NOT based on random selection, and we don't know how nonrandom the sample actually is.  Limited scientific validity, but may be convenient or cheap, e.g. sampling users of a public park or facility over a week's period or interviewing the first 100 clients to appear at a health clinic or a social agency (clients coming early may not represent the population - more often unemployed?  more children? etc.)  Exit polls?

Example of the limitations of NONRANDOMLY selected samples (p. 180): Literary Digest presidential poll of 1936.  They sampled/polled 10m voters, almost 25% of population (all voters).  Poll results showed that Alfred Landon, Rep, would win with 60% of the vote, when FDR actually won in a landslide with 62%.  Sample was taken from telephone directories and auto registration, and was NOT a truly representative sample of voters.  Only 40% of Americans had phones and 55% owned cars, so mostly upper income voters were being polled, and they were more likely to vote Republican.

POINT: With a random sample, only about 1000-2000 households were needed in the previous example to get accurate results.  In this case, sample size was 10m for a population of 40m and the results were flawed.

Examples: Internet polls and TV viewer polls (900 numbers).  Non random samples - WHY?

SAMPLE SIZE - Two issues:
1. randomly selected sample (is sample random?)
2. sample size (is the random sample large enough to represent the population?)

How big should the sample be to guarantee that it is representative of the population?  If we use a PROBABILITY SAMPLE (random) then we can determine the exact margin of error, using formulas and tables with precise measures of expected margin of error, or sampling error.

Sampling error - the expected difference between the results obtained from the sample, and the results from using the entire population.

Exact sample size depends on the error that can be tolerated in making inferences about the pop from the sample.  The error that can be tolerated determines the appropriate sample size.

Example: very close race between two candidates. Commercial pollsters would want a very small margin of error, compared to an expected landslide election, where a larger sampling error could be justified.

Example: policy analyst wants to get an idea of how many people in a rural community would use a mobile library.  An approximate number is acceptable, precision and exactness are not critical.  Margin of error could be quite large.  For example, suppose that a survey showed that 60% of people would use a mobile library at least once a year, with a margin of error of 10%.  That margin of error might be acceptable in this case.

Example p. 181: Assume we are interested in the income levels of parents of children in a free breakfast program.  N=10 children.  Parents' income is: \$3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 11,000 and 12,000.

Mean income = \$7500, denoted as µ (pop mean).  Suppose that we try to estimate µ (population mean) from a sample mean of 2 incomes.  The range of sample means would be \$3500 (mean of two lowest - \$3000 and 4000) to \$11,500 (mean of two highest - \$11,000 and 12,000).

The total number of possible combinations of samples of two from ten values is: N (N-1) / 2 = (10 x 9) / 2 = 45.  The distribution of all 45 possible outcomes of pairwise combinations is graphed on page 184 - Sampling Distribution of Mean.  Mean of the sampling distribution is equal to the pop mean of \$7500.
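
The sampling distribution in this example is small enough to enumerate directly. The sketch below (variable names are illustrative) lists every sample of two incomes and checks the count, the range, and the mean of the sample means.

```python
from itertools import combinations

# Parents' incomes for the N = 10 children in the example.
incomes = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000]

# Mean of every possible sample of two incomes (order does not matter).
sample_means = [sum(pair) / 2 for pair in combinations(incomes, 2)]

population_mean = sum(incomes) / len(incomes)
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(len(sample_means))                     # number of possible samples
print(min(sample_means), max(sample_means))  # range of the sample means
print(population_mean, mean_of_sample_means)
```

Any single sample mean can miss the population mean by a wide margin, but the sample means average out to the population mean.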

With a large sample, the sampling distribution of the mean approaches a normal distribution (bell-shaped curve).  And we know some characteristics of a normal dist.  We know that about 68% of the observations fall within one std. dev. of the mean and about 95% fall within two std. dev, etc.  By convention, we normally like to use the standard values of 90%, 95% and 99% - as "confidence intervals."

90% of observations fall within 1.64 std deviations of the mean.  (see Z table on p. 351, find .4500)
95% of observations fall within 1.96 std deviations of the mean.  (see Z table, find .4750)
99% of observations fall within 2.58 std deviations of the mean.  (see Z table, find .4950)

Means that if we sample from a population 100 times, the sample means of 99 would be within 2.58 std deviations, 95 would be within 1.96 std. deviations and 90 would be within 1.64 std. deviations.
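
The Z-table lookups can be reproduced with Python's standard library (the `statistics.NormalDist` class, available in Python 3.8 and later). This is a sketch for checking the table values, not a replacement for the table in the text.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of observations within 1 and 2 standard deviations of the mean.
within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
print(round(within_1, 4), round(within_2, 4))

# Two-sided critical Z values for the conventional confidence levels.
critical = {}
for level in (0.90, 0.95, 0.99):
    tail = (1 - level) / 2                  # area left in each tail
    critical[level] = std_normal.inv_cdf(1 - tail)
    print(level, round(critical[level], 3))
```

The computed values are slightly more precise than the rounded 1.64, 1.96 and 2.58 used in these notes.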

The std deviation of a sampling distribution can also be called the "standard error of the mean."

Standard error = standard deviation of the sampling distribution.

The ranges - 90%, 95% and 99% - are called confidence intervals.  The sample size you need depends on your choice of a confidence interval.

A 99% confidence interval means that your sample will be wrong (not reflect the population) 1 time out of a 100.

A 95% confidence interval means that your sample will be wrong (not reflect the population) 5 times out of a 100.

A 90% confidence interval means that your sample will be wrong (not reflect the population) 10 times out of a 100.

A 90% confidence interval requires a smaller sample than a 99% level.

There are specific formulas to determine sample size, see page 184.  Also see graph page 186 showing that as the sample size increases the sampling error at each confidence level (90, 95, 99%) decreases.  There are also tables that summarize the calculations based on the formula on page 184.  For very large populations (500,000 and up), the sample size requirements are summarized on p. 187, Table 7-1 for the 95% and 99% confidence levels at different margins of error (1 to 7).

For a 1% tolerated error and a 95% confidence level, we would need a sample size of 9,604, and a sample size of 16,587 for a 99% confidence level.

A 3% or 4% margin of error is usually acceptable at the 95% confidence level for policy research and polling, so N = 1067 (for a 3% margin of error) or 600 (for a 4% margin).

A more comprehensive table is on page 198 showing sample size requirements for smaller populations, at the 95% level of confidence.

For example, from a population of 10,000, we would need a sample of 964 at the 3% margin of error and 566 for a 4% margin.

From a population of 500,000 (or more) with a margin of error of 3%, we need a sample of 1065. For a 4% margin of error, we would need 600 people.

POINT: Once you get a random sample of about 1000 people, you have exhausted most of the benefits of randomization, and increasing sample size has very few benefits, even as POP approaches infinity.  A sample size of 1000 is like the normal amount of blood drawn for a blood test.  You could more than double the sample size from 1065 to 2,390 and would only reduce the margin of error from 3% to 2%.   Or once you get about 1000 people, you reach "critical mass" sample for a population anywhere between 10,000 and 10 billion.
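
The formula on page 184 is not reproduced in these notes. As an assumption, the sketch below uses the standard sample-size formula for a proportion (worst case p = .5, Z = 1.96 for the 95% level) with a finite population correction. Under that assumption it gives the same 95%-level figures quoted above, but it should be read as an illustration rather than as the textbook's own formula.

```python
def sample_size(margin, population=None, z=1.96, p=0.5):
    """Sample size for a given margin of error on a proportion.

    margin     -- tolerated error as a fraction, e.g. 0.03 for 3%
    population -- population size; None treats the population as infinite
    z          -- critical Z value (1.96 corresponds to 95% confidence)
    p          -- assumed population proportion (0.5 is the worst case)
    """
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: smaller populations need fewer cases.
        n = n / (1 + (n - 1) / population)
    return round(n)

for margin in (0.02, 0.03, 0.04):
    print(margin, sample_size(margin, 10_000), sample_size(margin, 500_000))
```

Cutting the margin of error in half roughly quadruples the required sample, which is why gains beyond about 1000 respondents are expensive.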

Reporting results: At a 95% confidence level, or 95% of the time, the results from this poll should differ by no more than 3%, in either direction (± 3%), from the results that would have been obtained from interviewing the entire population of all adults (or registered voters) in the U.S. (Requires sample of 1065 if the population is the entire US).  If the sample size was 600 the margin of error would be 4%.  For 384 people, it would be 5%. If they had 1843 people in sample, it could be reported at the 99% level (page 187).  The latest Reuters poll shows Bush with 43% and Gore with 42%, meaning that it is too close to call, since Bush could have between 40-46% and Gore 39-45% of the entire population.  Or Clinton now has a 59% favorable job rating from a sample of 1000, when 95% of the time the actual population would actually range between 56-62%.

Estimating total sample size: If we have a desired sample size of 1000, we have to start with a much larger initial sample to account for a failure rate of about 1/3 when dealing with a human population: some people won't be home or will refuse to be interviewed, or won't answer all questions, etc.  For example, if you start with an initial sample 1500 and you have a 1/3 refusal rate, 500 people won't participate, and you will end up with your desired sample of 1000 actual responses/interviews, see formula on page 187.
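
The formula on page 187 is not shown in these notes; the arithmetic described above amounts to dividing the desired sample by the expected response rate. A minimal sketch under that assumption (the function name is illustrative):

```python
import math
from fractions import Fraction

def initial_sample_size(desired, nonresponse_rate):
    """Initial sample needed to end up with `desired` completed interviews."""
    response_rate = 1 - Fraction(nonresponse_rate)
    return math.ceil(desired / response_rate)

# Desired sample of 1000 with a 1/3 failure rate.
needed = initial_sample_size(1000, Fraction(1, 3))
print(needed)
```

Exact fractions are used so that a rate like 1/3 does not introduce rounding error before the result is rounded up.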

STATISTICAL INFERENCE AND HYPOTHESIS TESTING

STATISTICAL INFERENCE - We are interested in testing a hypothesis about the relationship between an independent variable and a dependent variable.  Even when we find a statistically significant relationship, there is still always a small chance that our results are due to chance and not due to a real relationship.  Since we are using a sample to make inferences about the population, there is always the possibility, even with the correct sample size, that we have gotten our results just due to randomness, or chance. We can never really "prove" or "disprove" a hypothesis with 100% accuracy, but we can evaluate the probability that our results are due to chance.

For example, an all white, or an all black jury is possible due to either 1) jury tampering or 2) chance.  Our sample size here is only 1.  What is the probability that we could get an all-white jury by chance in a community that is only 50% white?

Statistical significance is a way to measure the likelihood that chance explains our results. If the probability that chance can explain our results is only 1/1000, then we can be very confident of our results. There is a 99.9% chance, or level of confidence, that we have established a relationship.

What is the probability of getting an all white (or all AA) jury?  P = .5^12 (.5 is the percentage of the population that is AA (and white), raised to the 12th power for 12 jurors).  P = .0002, or about a 1/4000 chance (1/4096 exactly) that an all white jury is due to chance with random selection of jurors.  See page 191 for a complete probability table of all possible outcomes.

How to determine that the jury selection process is biased? Would 9 W and 3 AA indicate bias?  Or how do we define "bias" in this case?  How many AAs would it take to not be a biased jury?  An equal outcome of 6 AA and 6W only occurs 22.56% of the time, so that in over 75% of the cases of randomly selected juries we would expect more of one race than the other.
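
The probability table can be regenerated from the binomial formula, assuming each of the 12 jurors is drawn independently with a 50% chance of being AA. The sketch below (names are illustrative) prints the probability of every possible jury composition.

```python
from math import comb

def jury_probability(n_black, n_jurors=12, p_black=0.5):
    """Binomial probability of exactly n_black black jurors out of n_jurors."""
    return (comb(n_jurors, n_black)
            * p_black ** n_black
            * (1 - p_black) ** (n_jurors - n_black))

probs = {k: jury_probability(k) for k in range(13)}

for k, prob in probs.items():
    print(f"{k:2d} AA / {12 - k:2d} W: {prob:.4f}")
```

The all-white jury has probability 1/4096, and the even 6/6 split is the single most likely outcome even though it occurs less than a quarter of the time.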

Formulate a Hypothesis - Starting point: null hypothesis.

Null Hypothesis (Ho): There is NO bias in jury selection, 12W/0B jury happened by chance.
Alternative Hypothesis (Ha): The jury selection IS biased (or the jury selection is biased in favor of whites)

Ha: alternative hypothesis, or the "research" hypothesis.  Null hypothesis is usually stated that there is no or zero or "null" effect of the independent var. on the dependent variable, as the "strawman."  We suspect discrimination or jury tampering, but we set up the hypothesis in null form: There is NO bias.  If we reject the Ho, then we accept the Ha, that there IS in fact bias.  If we fail to reject the Ho, then we accept the Ho, that there is NO bias.

Jury selection = f (discrimination)
Ho: there is NO discrimination/bias
Ha: there is bias against blacks

Income for Women = f (discrimination)
Ho: there is no difference between M income and F income / NO discrimination
Ha: there is a difference/discrimination against women.

Y = f (X)
Ho: X has NO effect on Y
Ha: X has an effect on Y

SPECIFYING A SIGNIFICANCE LEVEL, CRITICAL VALUE AND CRITICAL REGION

Level of statistical significance is the probability of rejecting the Ho when in fact it is TRUE. This is called Type I error.

TYPE I ERROR - the probability of FALSELY rejecting the Ho.

In this case, it would be the probability that we reject the Ho of no bias and "accept" the Ha of bias, when in fact there is no bias.  Type I Error in this case is the probability that we could find bias when in fact there really isn't any.

We have to select a level of probability that we think is reasonable.  Standard levels of statistical significance are: 10%, 5% and 1%.

Levels of significance = Prob. of Type I Error.

By reducing the level of sig (from 10% to 5%, or from 5% to 1%) we reduce Type I error, so we reduce the probability of rejecting the Ho when we should accept it.

Example: Court Trial.
Ho: Defendant is NOT guilty.
Ha: Defendant is guilty.

Type I Error: Falsely convicting an innocent person, sending an innocent person to jail.  We have falsely rejected the Ho and falsely accepted the Ha.
Type II Error: Letting a guilty person go free, freeing a guilty defendant. We have falsely accepted the Ho, and falsely rejected the Ha.

Type II Error is the probability of falsely "accepting" the Ho when in fact it is false.

POINT:  when you reduce Type I error, you increase Type II error.  Criminal trial - burden of proof is very strong.  Evidence has to be overwhelming.  Protection of our rights.  "Better to let 100 guilty people go free than to convict an innocent person."  Burden of proof: Beyond a reasonable doubt. If error is to occur, we would rather commit Type II error than Type I.

Civil trial - the burden of proof is lower - only a preponderance of the evidence. This explains why you could have a different outcome in a criminal vs civil trial. The level of stat. sig is stricter (smaller) in a criminal trial (1%) than in a civil trial (10%).
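
The trade-off between the two errors can be illustrated with the 12-person jury example. Type II error can only be computed against a specific alternative, so the sketch assumes, purely for illustration, that under biased selection each seat has a 25% chance of going to a black juror. That 25% figure is hypothetical and is not from the text.

```python
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k) for a binomial count X with n trials and success chance p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

N_JURORS = 12
P_NULL = 0.5     # Ho: jurors drawn at random from a 50% black community
P_BIASED = 0.25  # hypothetical Ha, assumed only for this illustration

rows = []
for cutoff in (3, 2, 1):  # reject Ho when the jury has `cutoff` or fewer blacks
    type_1 = prob_at_most(cutoff, N_JURORS, P_NULL)        # reject a true Ho
    type_2 = 1 - prob_at_most(cutoff, N_JURORS, P_BIASED)  # miss a real bias
    rows.append((cutoff, type_1, type_2))
    print(f"reject if <= {cutoff}: Type I = {type_1:.4f}, Type II = {type_2:.4f}")
```

As the rejection rule gets stricter, the Type I probability falls and the Type II probability rises, whatever alternative is assumed.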

See page 193.
Ho: There is NO bias in jury selection.
Ha: Jury selection is racially biased.

Suppose that we do KNOW the TRUTH about the Ho and Ha, there are four possible outcomes:

A. TRUTH: NO BIAS EXISTS

1. If the TRUTH is that there really is NO bias, and we accept (fail to reject) Ho, then we have made the correct decision.  Jury selection is actually unbiased, and we find that there is no bias.

2. If the TRUTH is that there really is NO bias, and we falsely reject the Ho in favor of Ha, then we have made a decision error.  We find bias, when there really isn't any. We have committed Type I Error.  We have falsely rejected a true Ho, and falsely accepted a Ha that is not true.

B. TRUTH: BIAS REALLY EXISTS

3. If the TRUTH is that bias does exist, and we correctly reject the Ho, we have made the correct decision.  The jury is biased and we find bias.  We correctly reject Ho and correctly accept Ha.

4. If the TRUTH is that bias does exist and we falsely accept the Ho (no bias), we have committed Type II Error.  We failed to find racial bias when there actually is bias. Falsely accepted a Ho that is NOT correct.  We falsely reject a true Ha.

Example: FDA Drug Testing
Ho: Drug is NOT safe
Ha: Drug is safe.

TRUTH: DRUG IS NOT SAFE

1. Truth: Drug is not safe. We accept the Ho. We make the correct decision.

2. Truth: Drug is not safe. We falsely reject Ho. Falsely reject a true Ho, and falsely find Ha. Type I Error - drug may be marketed and cause illness. Falsely found that drug is safe when it is really not safe.

TRUTH: DRUG IS SAFE

3. Truth: Drug is safe.  If we reject the Ho and find in favor of Ha, we have made the correct decision.  We allow a safe drug to be marketed.

4. Truth: Drug is safe. If we falsely accept the Ho, and find that the drug is NOT safe when it really is safe, we have made Type II Error.  We prevent people from getting the benefits of a safe drug by keeping a safe drug off the market.  We incorrectly determine a safe drug to NOT be safe.

This is the reason it takes 7 years for FDA approval.  They would rather prevent safe drugs from being marketed (commit Type II Error) than allow unsafe drugs on the market (commit Type I Error).  By requiring a strict approval process, they are minimizing Type I Error but increasing Type II Error.

MAKING THE DECISION TO ACCEPT OR REJECT THE Ho.

1. State the Ho and the Ha.
2. Choose a statistical test and derive the sampling distribution.
3. Specify a significance level (.10, .05, or .01), determine the critical value(s) and define the critical area (rejection region).
4. Decide whether to "accept" or reject the Ho.

Example: jury bias case. Assume that the defense lawyer chooses a significance level of 10%.  We would reject the Ho if the combined probabilities of having very few blacks on a jury are less than 10%.  The critical value is 10%.  The probability of 3 or fewer blacks on jury is:

3B and 9W:     .0537
2B and 10W:   .0161
1B and 11W:   .0029
0B and 12W:   .0002.

Adding these probabilities together equals .0729, which would represent the combined probability of having 3 or fewer blacks on any jury, and would be our "test statistic."

Our test statistic falls within the critical region, so we reject the Ho when there are 3 or fewer Blacks on jury.

If we set the level of stat. sig to 5%, we would reject the Ho (no bias) when there are 2 or fewer blacks on jury (.0161 + .0029 + .0002 = .0192).  At the 5% level, 3 or fewer blacks (.0729) would not fall in the rejection region (.0729 > .05), but .0192 would fall in the rejection region (.0192 < .05).

If we set sig level to 1%, we would reject only 1B/11W and 0B/12W (.0029 + .0002 = .0031), and .0031 < .01, so we would reject the Ho at the 1% level when there are one or 0 AA on the jury.  Conclusion: At the 1% level of stat sig we find that there is bias in jury selection when there are juries with only 0 or 1 African Americans.  We reject the Ho of NO BIAS in favor of the Ha (BIAS), and there is less than a 1% chance that we have found BIAS when in fact there is none.  Only 1% chance of falsely finding bias (Type I Error = 1%).  Or there is less than a 1% chance that you would find a jury with 0 or 1 AA by CHANCE and NOT due to BIAS, so we are 99% confident that we have correctly found bias when a jury has 1 or fewer AA jurors.
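
The one-tailed rejection regions can be checked by summing binomial probabilities from the low end of the distribution. This is a sketch with illustrative function names; it assumes 12 jurors and a 50% chance for each seat, as in the example.

```python
from math import comb

def prob_at_most(k, n=12, p=0.5):
    """Combined probability of k or fewer black jurors on a random jury."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def largest_rejecting_count(alpha, n=12, p=0.5):
    """Largest k for which 'k or fewer' still falls below the level alpha."""
    best = None
    for k in range(n + 1):
        if prob_at_most(k, n, p) < alpha:
            best = k
    return best

for alpha in (0.10, 0.05, 0.01):
    k = largest_rejecting_count(alpha)
    print(f"{alpha:.2f} level: reject Ho with {k} or fewer "
          f"(combined probability {prob_at_most(k):.4f})")
```

The exact sums come out about .0001 higher than the figures above because the notes add probabilities that were already rounded to four decimals.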

The jury example is of a ONE-TAILED test.  We didn't suspect bias in general, we suspected, or tested for, bias against AAs only.  We had only one rejection region.

If we wanted to test for bias in general, against AA OR W, we would use a two-sided test.  See page 196.  For a significance level of 10%, we would have two rejection regions/critical areas, one at each tail of the distribution, each area representing a .05 (5%) probability.

Ho: No Bias
Ha: Racial bias in either direction (AA or W).

At the 10% level, we would reject the Ho and find bias if either race had 10 or more members (prob = .0161 + .0029 + .0002 = .0192), since .0192 falls in the rejection region (.0192 < .05).  At the 1% (.01) level, we would have two areas of .005 probability each, and we would find bias only if there were 11 or 12 of one race on the jury (.0029 + .0002 = .0031), since .0031 would fall in the critical area (rejection region).
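
For the two-tailed version, the significance level is split evenly between the two tails. The sketch below (illustrative names, same 12-juror and 50% assumptions) finds how lopsided a jury must be, in either direction, before Ho is rejected.

```python
from math import comb

def prob_at_most(k, n=12, p=0.5):
    """Combined probability of k or fewer jurors of one race."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def two_tailed_majority_cutoff(alpha, n=12):
    """Smallest majority size (of either race) that leads to rejecting Ho."""
    per_tail = alpha / 2
    minority = None
    for k in range(n + 1):
        if prob_at_most(k, n) < per_tail:
            minority = k
    return None if minority is None else n - minority

for alpha in (0.10, 0.01):
    print(f"{alpha:.2f} level: reject Ho if either race has "
          f"{two_tailed_majority_cutoff(alpha)} or more members")
```

With p = .5 the distribution is symmetric, so checking one tail against half the significance level is enough.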

Two-tailed test is appropriate when the predicted relationship between Y and X is either POS or NEG.

Examples:
Ho: Smaller class size has NO effect on test scores.
Ha: Smaller class size increases test scores. (One-tailed test).
vs.
Ho: Class size has NO effect on test scores.
Ha: Class size affects test scores. (either pos or neg, two-tailed test).

EVIEWS (and most software) performs 2-tailed tests by default.

Example: Y = f (X)
Ho: X has NO effect on Y.
Ha: X has an effect on Y. (pos or neg).

SUMMARY:
1. State Ho and Ha.
2. Perform a stat. test (run OLS/regression)
3. Specify a sig level (10/5/1%)
4. Reject Ho or "accept" / fail to reject Ho.

Typical procedure: Y=f (X1, X2, X3, X4)

Joint Ho: X1=X2=X3=X4=0 (all coefficients are zero)
Ha: at least one of X1, X2, X3, X4 n.e. 0

Y = 10 + 2.3 X1 (***) - .45 X2 (**) + .01 X3 (*) + 1.5 X4

*** = sig at 1% level
** = sig at 5% level
* = sig at 10% level

The variable X1 is positive and sig at the 1% level.
The variable X2 is negative and sig at the 5% level
The variable X3 is pos and sig at the 10% level
The variable X4 is insignificant. (insignificantly different from 0).

SOME USEFUL STATISTICAL TESTS INVOLVING ONE VARIABLE

Z test can be used for: Is a given sample random?  Does it represent the population?

Example: we draw a random sample of high school seniors in Indiana.  The sample has a higher mean test score than the population. Smarter students are overrepresented, so we are concerned that the sample may not be random.  We can perform a Z test, to test the randomness of a sample.

Ho: Sample is NOT different from the population
Ha: Sample has a disproportionate share of smart students. (one-tailed test)

For a two-tailed test: Ha: Sample is different from population.

We can use the Central Limit Theorem (CLT): Regardless of the shape of the population distribution (normal, or skewed, or uniform, etc.) the sampling distribution of sample means will approach a normal distribution.  In general we can invoke the CLT as long as the sample size > 30.

The significance of the CLT is that we don't have to know the distribution of the population, and even if we do know it, it doesn't matter if it is skewed or non normal in any way.

CLT also states that the mean of the sampling distribution (X-BAR) will equal the population mean (µ). X-BAR = µ

Z score = (X-BAR - µ) / (sigma / sqrt(N))

X-BAR = sample mean
µ = pop mean
sigma = pop std dev. (we can substitute sample std dev when n > 30.)

With X-BAR = 620, µ = 600, sigma = 120 and N = 100:

Z = (620 - 600) / (120 / sqrt(100)) = 20 / 12 = 1.67

Z test statistic = 1.67

The sample mean is 1.67 standard errors away from (above) the population mean. Is this far enough away from the mean to say that the sample is not random, over-represented by smart students?

Steps: 1) 1 or 2 tailed test?
2) level of sig? (10, 5 or 1%)

1) One-tailed test is appropriate, since we are concerned that there is a disproportionately high number of students who are above average, above the pop mean.

2) Assume we decide on a significance level of .05 or 5%.  We then determine the CRITICAL VALUE, which will establish the CRITICAL AREA/REJECTION REGION.

To determine the critical Z-value, we have two options:

1. A .05 rejection region in one tail corresponds to a table area of .4500 (.5000 - .0500). Go to page 351 and find the value .4500, and then read off the corresponding Z-statistic.  Z-stat. is equal to about 1.65. This is the CRITICAL Z-VALUE.

2. Use the t-statistic Table on page 353.  In the limit, with enough observations, the Z and t-stats. are exactly the same.  T-stats. values in the table are adjustments for the sample size.  On the top, we select the .05 level of sig for a one-tailed test and read down to the bottom: 1.645. This is the critical value: 1.645.

The critical value of 1.65 establishes the beginning of the Rejection Region.  If our test statistic is LESS than the critical value, it falls in the "acceptance" region, and we accept / "fail to reject" the Ho, find that the sample is RANDOM and NOT different than the population.  If the test stat. is GREATER than the critical value, it falls in the rejection region, we reject Ho and "accept" the Ha: sample is disproportionately represented with smart students.

In this case, the Z-stat. of 1.67 > critical value of 1.65 and falls in the rejection region, so we reject Ho at the 5% level of stat sig and find that the sample is overrepresented with smart students.  If Ho is true and the sample really is random, there is only a 5% probability that a Z-stat this large would happen just by chance.  That 5% is also our risk of a Type I Error: falsely rejecting the Ho when in fact it really is TRUE.  In other words, in 5% of the cases, even a truly randomly selected group could have those characteristics (disproportionate number of smart students).
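As a rough check on the table lookup, the critical value and the decision can be reproduced with Python's standard library (statistics.NormalDist). This is a sketch of the one-tailed test described above, not a replacement for the tables; the variable names are chosen for this illustration only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, std dev 1

alpha = 0.05                               # chosen level of significance
z_stat = (620 - 600) / (120 / 100 ** 0.5)  # test statistic from the example

# One-tailed (upper) test: the rejection region is the top 5% of the curve
critical_value = std_normal.inv_cdf(1 - alpha)

# Probability of a Z this large or larger if Ho is true
p_value = 1 - std_normal.cdf(z_stat)

reject_h0 = z_stat > critical_value
print(round(critical_value, 3))  # 1.645
print(round(p_value, 3))         # 0.048
print(reject_h0)                 # True
```

The p-value of about .048 shows how close this call is: the test statistic only just clears the 5% cutoff.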

If we are dealing with percentages/proportions, we adjust the Z-stat. formula to:

***see page 199***

Example (p. 199):  Is motor vehicle inspection operating efficiently in County X?  Statewide inspection standard is that at least 90% of vehicles be inspected.  A county finds that only 85% of a 50 car sample in that county have been inspected.  Question: Is inspection enforcement below state standards in County X?   Is the 85% outcome due to low standards or just due to random variation in the sample of 50 cars?  We can use the Z test to find out:

Ho: The 85% inspection rate in County X is NOT different from the state standard of 90%. (the difference is due to chance, random variation, and not due to substandard conditions). (85% is NOT statistically significantly different from 90%.)

Ha: The difference between 90% and 85% is due to substandard inspection process in County X.

We calculate the Z-stat. of -1.18 (p. 199) and compare that test statistic to the critical value of -1.65 for a one-tailed test at the 5% level of statistical significance.  The Z-stat. is not as far below zero as the critical value (-1.18 > -1.65, or in absolute terms 1.18 < 1.65), so it falls in the "accept" region and we accept (fail to reject) Ho.  We cannot rule out that the 85% inspection rate is due to chance rather than a substandard inspection process.
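The inspection example can be sketched as follows. The formula on page 199 is not reproduced in these notes, so this sketch assumes the usual one-sample Z formula for a proportion, Z = (p - P) / sqrt(P(1 - P) / N), which does give the -1.18 reported above. The function name z_proportion is made up for the illustration.

```python
import math
from statistics import NormalDist

def z_proportion(sample_prop, hyp_prop, n):
    """Z for a sample proportion against a hypothesized proportion.

    Assumed formula (not copied from the text): (p - P) / sqrt(P * (1 - P) / N).
    """
    standard_error = math.sqrt(hyp_prop * (1 - hyp_prop) / n)
    return (sample_prop - hyp_prop) / standard_error

# County X: 85% of a 50-car sample inspected vs. the 90% state standard
z = z_proportion(sample_prop=0.85, hyp_prop=0.90, n=50)

# Lower-tail test at the 5% level: the rejection region is BELOW the critical value
critical_value = NormalDist().inv_cdf(0.05)
reject_h0 = z < critical_value

print(round(z, 2))               # -1.18
print(round(critical_value, 3))  # -1.645
print(reject_h0)                 # False
```

Because the alternative is "below standard," the comparison runs in the lower tail: Ho is rejected only when Z is more negative than -1.645.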

t-TESTS

t-tests are the main stat. tests that are used in regression analysis, OLS, to test the sig level of the independent variables.

We usually don't know the pop variance.  If we did know the pop variance, it would mean that we had a lot of information about the population, and then we probably wouldn't need a sample.  We would just use the entire population.

Example: If we knew the voting preferences of the entire voting population, you wouldn't need a poll.

A t-test and a t-stat. allow us to use the sample std deviation to determine statistical significance. We don't need to know the pop variance/std. dev.

To use t values we have to assume that the sample is drawn from a population that is normal.

t-formula: see page 200. The only difference between t and Z formula:

Z-test:      sigma = pop std dev and N (sample size) is in denominator
t-test:       sigma = sample std dev and N-1 is in denominator

Z test - assumes that the sample size (N) is large, > 30, or > 50.  Can't be adjusted or used for small samples.

t-test can be used for any sample size.  Example: Mpls Star/Trib piece.  The sample size was 11 school districts.  N=11 is too small for Z, but OK for t-tests.

See t-table on page 353.  The values in the last row are the same as for the Z table.  For small samples we use the appropriate CRITICAL VALUES.  Computer programs such as EVIEWS will automatically take into account the sample size, or more accurately the d.f.'s (degrees of freedom).

"There is a different t distribution for each of the possible degrees of freedom."

Degrees of freedom - the number of values that we can choose freely.

Usually operationalized as: d.f. = N - k

Where N = sample size and k = number of unknowns, or number of variables, number coefficients to be estimated.

Example: X + Y = 30.  There are two unknowns, X and Y, but there is only one degree of freedom.  That is because as soon as we pick a value for X, say 10, then the value of Y is automatically determined: 20.  Or if we pick a value for Y, then X is determined.  So we have only 1 d.f.

Example: X + Y + Z, there are 3 d.f., but as soon as we say X + Y + Z = 30, then there are only 2 d.f.  Knowing/selecting any two values for X and Y would then determine what Z is.

If we say that Y = f (X) and we plot a straight line we would say that:

Y = a + b X, where a is the intercept and b is the slope. For a sample size N, we would have d.f. = N - 2, where two is the number of coefficients being estimated (k = 2), a and b.

For Y = f(X1, X2, X3, X4, X5) and N = 100, there are six coefficients to estimate (the intercept plus five slopes), so k = 6 and d.f. = N - k = 100 - 6 = 94.

T-test of test score example: see page 201.  As before:

Ho: Sample of students is NO different from all students
Ha: Sample of students is over represented with smart students.

Here we use the sample std dev of 125 instead of the pop std dev (120), and we use N-1 (100 - 1 = 99) instead of N (100).  We get a t-stat. of 1.59 compared to the Z-stat. of 1.67, and d.f. = 99 (N - 1).  In the t-table, we find the d.f. (degrees of freedom) to get the appropriate critical value.  The degrees of freedom go from 40 to 60 to 120 to Infinity, and 99 is not listed, so we look for the closest d.f., which is 120.  We compare the t-stat. of 1.59 to the critical value of 1.658 (page 353, use d.f. = 120) for one-tailed test at the 5% level of significance.  Or we could approximate the critical value for d.f. = 99 by interpolating between the table entries for d.f. = 60 and d.f. = 120 and get about 1.663.

Conclusion: Since the t-stat (1.59) is less than the critical value (1.658), it falls in the "Acceptance" region, so we accept (fail to reject) the Ho and continue to assume that we have a random sample.
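The t-test above, including the interpolation step, can be sketched as follows. The 1.658 entry for d.f. = 120 is quoted in these notes; the 1.671 entry for d.f. = 60 is the usual t-table value and is added here as an assumption, since the notes do not state it. The function names are hypothetical.

```python
import math

# One-tailed 5% critical values from a t-table.
# 1.658 (d.f. = 120) is quoted in the notes; 1.671 (d.f. = 60) is assumed
# from a standard t-table and does not appear in the notes.
T_CRIT_05 = {60: 1.671, 120: 1.658}

def t_statistic(sample_mean, pop_mean, sample_sd, n):
    """t = (X-BAR - mu) / (s / sqrt(N - 1)), the form used in these notes."""
    return (sample_mean - pop_mean) / (sample_sd / math.sqrt(n - 1))

def interpolate_critical(df, lo_df, hi_df, table):
    """Linear interpolation between two listed d.f. rows (an approximation)."""
    fraction = (df - lo_df) / (hi_df - lo_df)
    return table[lo_df] + fraction * (table[hi_df] - table[lo_df])

t = t_statistic(sample_mean=620, pop_mean=600, sample_sd=125, n=100)
critical_value = interpolate_critical(99, 60, 120, T_CRIT_05)

print(round(t, 2))               # 1.59
print(round(critical_value, 3))  # 1.663
print(t > critical_value)        # False: fail to reject Ho
```

Either critical value (1.658 from the d.f. = 120 row or the interpolated 1.663) leads to the same decision, since 1.59 falls short of both.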

TESTS FOR TWO OR MORE GROUPS

Difference of Means Tests: Z-test (for large samples) and T-test (small samples) for measuring the difference between the means of two groups in terms of std deviations, or standard error units.

See formulas on page 202 and 203 for Z-statistic and t-statistic.

Example from the beginning of chapter about MPAs working for the state.  We want to test for discrimination using a difference of means Z-test.  Assume that we select 50 M and 50 F MPAs at random. Our sample size is large enough (N = 100) to use the Z-test and is large enough to substitute the sample std dev for the pop std dev.

Ho:  Sex has no effect on salary (no sex discrimination)
Ha:  Women make less than men (there is discrimination)

Calculation on page 203. Z-stat. = +3.92. (since we had (M - F), if we had (F - M) we would have Z = -3.92).

We can use the critical value of 1.645, which is the critical value for a one-sided test at the 5% level (page 353).

We compare the Z-stat. of 3.92 to the critical value of 1.645.  The Z-stat. > critical value and falls in the rejection region, so we reject the Ho at the 5% level and find that there is sex discrimination.  Ceteris paribus: assuming that other variables like experience and education and quality of university are comparable.  The critical value at the 1% level is 2.326, so we could report the level of significance at the 1% level.  Conclusion: At the 1% level, there is a statistically significant difference between M (\$32,000) and F (\$30,000) salaries, and the earnings differential is NOT due to chance.
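A sketch of the difference of means logic follows. The formula on page 202 and the std deviations used on page 203 are not reproduced in these notes, so the function below assumes the common large-sample form Z = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2) and is not called with the textbook numbers. The decision step uses only the reported Z-stat. of 3.92. All names are hypothetical.

```python
import math
from statistics import NormalDist

def diff_of_means_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z for the difference between two independent sample means.

    Assumed large-sample form; the text's own formula (page 202) may
    differ in detail.
    """
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

z_reported = 3.92  # Z-stat. reported in the notes for (M - F)

# One-tailed critical values at the 5% and 1% levels
for alpha in (0.05, 0.01):
    critical_value = NormalDist().inv_cdf(1 - alpha)
    print(alpha, round(critical_value, 3), z_reported > critical_value)
# 0.05 1.645 True
# 0.01 2.326 True
```

Because 3.92 exceeds both cutoffs, the result can be reported at the stricter 1% level.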

Example: t-test is used in this case because sample size is small (page 204).  A state experiments with raising admission fees at state parks.  Of 24 state parks, 11 have an increase of \$1 and 13 parks keep the current fee.  Before the change, the mean attendance for the two groups was the same.  After three months, the results are that the 11 parks with the higher fee have a lower mean attendance of 3500 people vs 3590 people for the other 13 parks that did NOT raise fees.  Is the difference in attendance stat. sig?  Has the increase in fees negatively affected park attendance or is the difference explainable by chance?

Ho: Attendance at parks that increased entrance fees is NOT different from parks that did not raise fees (attendance is the same at both groups of parks).
Ha: Attendance is lower at parks that increased entrance fees.

t-test statistic = -.192

d.f. = (N - k) = 24 - 2 = 22.  At the 5% level of sig. for a one-tailed test with 22 d.f., the critical value on page 353 is 1.717.

Therefore, our test statistic is smaller in absolute value than the critical value (.192 < 1.717) and falls in the acceptance region, so we accept (fail to reject) the Ho.  We can conclude that the difference between the means of the two groups is NOT statistically significant, and we have no evidence that the change in fees has affected park attendance.
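The sign of the comparison is easy to get wrong in a lower-tail test, so the decision for the park example is sketched below using only the numbers given in the notes. The last step is a back-of-the-envelope extra, not from the text: it backs out the standard error implied by the rounded t-stat. and asks roughly how large the attendance gap would have to be to reach significance.

```python
t_stat = -0.192          # reported t-stat. (fee-increase parks minus others)
critical_value = 1.717   # one-tailed, 5% level, d.f. = 22 (from the t-table)

# Lower-tail test: reject Ho only if t is MORE NEGATIVE than -critical_value
reject_h0 = t_stat < -critical_value
print(reject_h0)  # False

# Rough illustration (approximate, since the t-stat. is rounded):
observed_gap = 3500 - 3590              # -90 visitors
implied_se = observed_gap / t_stat      # standard error implied by t
needed_gap = critical_value * implied_se
print(round(implied_se), round(needed_gap))  # 469 805
```

Under these assumptions, attendance at the fee-increase parks would need to be roughly 800 visitors lower, not 90, before the difference counted as significant at the 5% level.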