Regression
Chapter 13 of your book was focused on categorical data. Chapter 14 looks at numerical data. With numerical data, the relationship between two or more variables is explored using linear regression.
With regression, we think of one variable, which we call Y, as the dependent variable. Another variable, which we call X, is the independent variable. In practice, there often are many independent variables, so we speak of many X's. When there are many X's, this is called multiple regression. When there is one X, we call it single regression or simple regression.
Single regression fits a straight-line relationship between X and Y. For example, suppose that Y is the cost of a hard disk and X is the amount of storage on the hard disk. We might go on the Internet, look up the prices and storage capacities for a bunch of disk drives, and plot these as points on a graph. Then, you could try to fit a linear relationship:

Y = a + bX
In this linear relationship, a is the intercept and b is the slope. You could just eyeball the data and try to draw the line of best fit. Alternatively, you can fit the line by choosing a and b according to the mathematical formula known as the least-squares regression line. To arrive at the formula for the regression line, do the following.
1. Think of each observation as a pair (xᵢ, yᵢ).
2. Define the "fitted value" ŷᵢ = a + bxᵢ.
3. Define the residual eᵢ as the difference between yᵢ and ŷᵢ, that is, eᵢ = yᵢ − ŷᵢ.
4. Choose the parameters a and b to minimize the sum of the squared residuals, that is, to minimize Σeᵢ² (a small numerical sketch follows this list).
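Here is the sketch, in Python. The storage sizes and prices are made-up numbers, and the formulas for a and b are the standard least-squares ones that your calculator or computer applies for you:

    # Minimal sketch of the four steps, on made-up (x, y) data.
    xs = [40, 80, 120, 250, 500]   # storage in GB (invented)
    ys = [55, 60, 70, 95, 150]     # price in dollars (invented)

    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Step 4 has a closed-form answer: these a and b minimize the sum of squared residuals.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar

    fitted = [a + b * x for x in xs]                   # step 2: fitted values
    residuals = [y - f for y, f in zip(ys, fitted)]    # step 3: residuals
    sse = sum(e ** 2 for e in residuals)               # step 4: sum of squared residuals

    print(f"a = {a:.2f}, b = {b:.3f}, sum of squared residuals = {sse:.2f}")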
We can arrive at multiple regression by changing step two in the process to one that allows multiple independent variables. That is,
2. Let ŷᵢ = a + b₁x₁ᵢ + b₂x₂ᵢ + ...
With multiple regression there is still only one intercept, a, but there are multiple slope coefficients. Single regression is easier to show on a two-dimensional graph and easier to compute with a calculator (although it gets pretty ridiculous with more than about 10 observations). In practice, however, multiple regression is used more often than single regression, because numerical relationships usually involve multiple factors. Your book does not cover multiple regression, so I will assume that it will not come up on the AP exam.
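Still, if you are curious what multiple regression looks like in software, here is a short sketch using NumPy's least-squares routine. The data are invented, and x1 and x2 are just two hypothetical independent variables:

    import numpy as np

    # Invented data: two independent variables and one dependent variable.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
    y  = np.array([4.1, 5.0, 9.2, 9.8, 14.1])

    # Design matrix: a column of ones for the single intercept a, plus one column per X.
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Least squares chooses a, b1, b2 to minimize the sum of squared residuals.
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coefs
    print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")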
For both single and multiple regression, there is a relationship among the actual values for Y, the fitted values, and the residuals. I call this the Pythagorean relationship:

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σeᵢ²
That is, you can think of the fitted values and the residuals as perpendicular to one another, with the actual values as a hypotenuse. What the Pythagorean relationship tells you is that the variation of the actual y's always is greater than the variation of the fitted y's. This property of lower variation for the regression line is one interpretation of the phrase "regression to the mean."
One statistic from a regression is this ratio, called R²:

R² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)²
We can see from the Pythagorean relationship that this ratio can never be greater than one. The closer it is to one, the smaller the residuals and the better the fit of the regression line.
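You can check both the Pythagorean relationship and R² numerically. Here is a sketch on the same sort of made-up data as before; the identity holds (up to rounding) for any data you try:

    # Verify the sum-of-squares decomposition and compute R^2 on made-up data.
    xs = [40, 80, 120, 250, 500]
    ys = [55, 60, 70, 95, 150]

    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar

    fitted = [a + b * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]

    ss_total  = sum((y - y_bar) ** 2 for y in ys)        # variation of the actual y's
    ss_fitted = sum((f - y_bar) ** 2 for f in fitted)    # variation of the fitted y's
    ss_resid  = sum(e ** 2 for e in residuals)

    print(ss_total, ss_fitted + ss_resid)    # equal, up to rounding
    print("R^2 =", ss_fitted / ss_total)     # never more than one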
If we take the square root of R², we get R, which is called the correlation coefficient. When the slope of the regression line is negative (b < 0), we report a negative value for R. R² itself, on the other hand, is never negative.
Statistical inference can be carried out in the regression model, under certain assumptions. One of the assumptions is that if you knew the true value of the intercept and slope coefficients, then the residuals would follow a normal distribution. In what follows, we will take these assumptions as true.
The intercept and each slope coefficient will have a t distribution. Each coefficient will have a different standard error, so confidence intervals will be different for each coefficient.
Computer output tends to focus on the null hypothesis that a coefficient is equal to zero. When you see a high value for t (an absolute value greater than 2 is generally considered high), the coefficient is significantly different from zero.
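Most statistics packages compute the standard error and t statistic for you. Here is one way to get them in Python, using SciPy's linregress function; the data are invented, and the reported p-value is for the null hypothesis that the slope is zero:

    # scipy.stats.linregress: slope, standard error, p-value, and R in one call.
    from scipy import stats

    xs = [40, 80, 120, 250, 500]
    ys = [55, 60, 70, 95, 150]

    result = stats.linregress(xs, ys)
    t_stat = result.slope / result.stderr    # t = coefficient / its standard error
    print(f"slope = {result.slope:.3f}, stderr = {result.stderr:.4f}")
    print(f"t = {t_stat:.2f}, p-value = {result.pvalue:.4f}, R = {result.rvalue:.3f}")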
For the regression as a whole, significance is measured by a test statistic based on R², which follows an F distribution. If that statistic is significant, it means that you can reject the null hypothesis that Y is independent of all of the X variables in the regression. Your book does not cover this topic, so I will assume that it will not come up on the AP exam.
There are many aspects of statistical analysis of relationships in numerical data. Here are some that I think show up quite often.
When you leave a variable out of a regression, and the omitted variable is correlated both with the included independent variable(s) and with the dependent variable, the results are biased. For example, if you take a particular college's freshman class and predict their grade-point average on the basis of SAT scores, you could get a negative coefficient.
Suppose that to get into this college with low SAT scores, you need a really good high school GPA, and if you have a low high school GPA, you need high SAT scores. In that case, high school GPA will be negatively correlated with SAT scores for students admitted as freshmen to this college. If high school GPA is a good predictor of college GPA, then the students with low SAT scores may get the best grades.
However, if you were to try a multiple regression, including both SAT scores and high school GPA as independent variables, the coefficient on SAT score might be positive. That is, once you control for high school GPA, SAT score has the predicted positive effect on college GPA.
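Here is a toy simulation of this SAT example. All of the numbers (the admission rule, the coefficients, the noise) are invented just to illustrate how omitting high school GPA can flip the sign of the SAT coefficient:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    sat = rng.normal(0, 1, n)       # standardized SAT score (invented)
    hs_gpa = rng.normal(0, 1, n)    # standardized high school GPA (invented)

    # Hypothetical admission rule: combined credentials must be strong, which makes
    # SAT and HS GPA negatively correlated among admitted students.
    admitted = (sat + hs_gpa) > 1.0
    college_gpa = 0.3 * sat + 0.7 * hs_gpa + rng.normal(0, 0.5, n)   # invented "true" model

    s, g, c = sat[admitted], hs_gpa[admitted], college_gpa[admitted]

    # Simple regression of college GPA on SAT alone (HS GPA omitted): slope comes out negative.
    b_simple = np.polyfit(s, c, 1)[0]

    # Multiple regression including both predictors: the SAT coefficient is positive again.
    X = np.column_stack([np.ones_like(s), s, g])
    _, b_sat, b_hs = np.linalg.lstsq(X, c, rcond=None)[0]

    print(f"simple-regression SAT slope:   {b_simple:.3f}")
    print(f"multiple-regression SAT slope: {b_sat:.3f}")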
Suppose that we are trying to estimate the effect of salary on the ability of companies to recruit computer programmers. We look at 50 companies with similar systems requirements, and for each company we look at average salaries and the number of computer programmers. To our surprise, we find that those with higher salaries have fewer programmers!
What is going on is that there is more than one relationship between salary and the number of programmers. The relationship we were looking for is a supply relationship--programmers would rather supply their labor to higher-paying firms. However, there also is a demand relationship--firms that pay lower salaries can afford to hire more programmers.
The problem is that there is more than one relationship between the independent variable and the dependent variable. Sometimes the relationships conflict; in other cases they reinforce each other. There are ways of estimating two relationships simultaneously, if you have certain types of data, but that is beyond the scope of this course. The point to remember is that if you estimate a regression looking for one relationship and there are other relationships that you had not considered, your estimates will be biased.
Often, the relationship between the independent variable and the dependent variable is nonlinear. For example, if you look at personal computer power over the past twenty years, growth has been exponential. If you estimate a linear model of Millions of Instructions Per Second (MIPS) vs. time over the years 1980-2000, the coefficients will be biased.
A reliable way to detect nonlinearity is to split your sample according to X, the independent variable. Put all observations where X is below the mean into a separate bucket from observations where X is above the mean. Then either run separate regressions or draw separate graphs using the two buckets. If there is nonlinearity, the slopes will look quite different. (In the case of computer power, you would see a steeper slope from 1990-2000 than from 1980-1990.)
On the AP exam, you might be asked to detect nonlinearity by looking at a plot of the residuals. Try drawing a line through the residuals in the left half of the sample, and another line through the right half of the sample. If those lines have different slopes, then this shows nonlinearity.
When a relationship is nonlinear, you can transform one or both variables and estimate a linear relationship on the transformed data. For example, we could try using the log of MIPS as the dependent variable.
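Here is a sketch of both ideas (the split-sample check and the log transform) on invented MIPS-style data that grows exponentially over 1980-2000:

    import math

    years = list(range(1980, 2001))
    mips  = [0.5 * 1.35 ** (y - 1980) for y in years]   # made-up exponential growth

    def slope(xs, ys):
        """Least-squares slope of y on x."""
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
                / sum((x - x_bar) ** 2 for x in xs))

    # Split the sample at the mean of X and compare slopes: very different slopes suggest nonlinearity.
    cut = sum(years) / len(years)
    early = [(x, y) for x, y in zip(years, mips) if x <= cut]
    late  = [(x, y) for x, y in zip(years, mips) if x > cut]
    print("slope, 1980-1990:", slope(*zip(*early)))
    print("slope, 1991-2000:", slope(*zip(*late)))

    # After a log transform of Y, the two slopes come out nearly equal.
    log_mips = [math.log(m) for m in mips]
    k = len(early)
    print("log slope, 1980-1990:", slope(years[:k], log_mips[:k]))
    print("log slope, 1991-2000:", slope(years[k:], log_mips[k:]))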
For more on nonlinearity, see p. 183 and p. 190 in your book.
When you have one value of the independent variable, X, that is far away from all of the others, that point will be highly influential. Technically, the estimated slope and intercept are not biased, but in practice they are unreliable. Basically, you are building a regression on two data points--the extreme value of X and "the rest of the data."
When you have an outlier value for X, you face a dilemma. If you were to eliminate the outlier, you would get a more reliable estimate from the remaining data. However, you do so at the cost of throwing out what may be your most interesting data point.
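A quick way to see the influence is to fit the line with and without the extreme X value. The data below are invented, with one X value far from the rest; the slope changes dramatically when that single point is dropped:

    import numpy as np

    # Invented data: one X value far away from all the others.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 50.0])
    y = np.array([2.1, 2.9, 4.2, 5.1, 5.8, 9.0])

    slope_all, intercept_all = np.polyfit(x, y, 1)               # fit using every point
    slope_trim, intercept_trim = np.polyfit(x[:-1], y[:-1], 1)   # fit without the extreme X

    print(f"with the extreme point:    slope = {slope_all:.2f}")
    print(f"without the extreme point: slope = {slope_trim:.2f}")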
For the dependent variable, Y, the concern would be with a pattern of outliers. By construction, the residuals in the regression will average to zero and will not be correlated with X. However, you may see that the largest residuals, both positive and negative, are associated with higher values of X. This phenomenon, called heteroskedasticity, suggests that more care needs to be taken in handling the data. The techniques are beyond the scope of this course.
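Spotting the pattern, at least, is straightforward. In the invented data below, the spread of the noise grows with X, so the absolute residuals end up correlated with X even though the residuals themselves average to zero:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(1, 100, 200)
    y = 3.0 + 0.5 * x + rng.normal(0, 0.05 * x)    # noise spread grows with X (invented)

    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    print("mean residual:", round(residuals.mean(), 4))                                   # essentially zero
    print("corr(|residual|, X):", round(np.corrcoef(np.abs(residuals), x)[0, 1], 3))      # clearly positive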