Regression in Microsoft Excel
In Microsoft Excel, Regression is under "Tools/Data Analysis". (If you do not see "Data Analysis" under "tools", then you have to install it. To install it, go to "tools/add-ins" and add the "analysis toolpak.")
Here are the steps to do regression in Excel.
y | x1 | x2 |
---|---|---|
2 | 1 | 3 |
4 | 2 | 3 |
5 | 3 | 3 |
7 | 4 | 4 |
0 | 5 | 2 |
Multiple R | 0.928 |
R Square | 0.861 |
Adjusted R Square | 0.723 |
Standard Error | 1.421 |
Observations | 5 |
df | SS | MS | F | Significance F | |
Regression | 2 | 25.2 | 12.6 | 6.22 | 0.138 |
Residual | 2 | 4.04 | 2.02 | ||
Total | 4 | 29.2 |
The regression means the fitted values, the residual means the residuals, and total means the overall Y values.
The degrees of freedom is the number of observations minus the number of X variables. Assuming that the regression includes a constant term (the default option), that counts as an X variable. Thus, if we have x1, x2, and a constant term, then there are three x variables, which if there are five observations means there are only two degrees of freedom for the regression
The SS column stands for sum of squares. It shows the Pythagorean relationship, where the sum of squared fitted values plus the sum of squared residuals add up to the sum of squared Y values.
The MS column probably stands for something like "mean square." I've never used it.
The F statistic is used to calculate the significance of the regression. It is analogous to the chi-square statistic in categorical data. The significance level for F is like a P-value. In this case, an F of .138 suggests that the regression is not signicant at the 10 percent level (because the sample size is so small).Finally, Excel provides a table of coefficients, their standard errors, and confidence intervals. Below, I just show the coefficients and standard errors.
Coefficient | Standard Err | t Stat | |
Intercept | -8.08 | 3.72 | -2.18 |
X variable | .263 | .46 | .57 |
X variable | 3.63 | 1.03 | 3.52 |
This says that the regression equation is
Y = -8.08 + .263X1 + 3.63X2
The t-value for the coefficient on X1 is very low. This means that the coefficient is not significantly different from zero and that this variable is not adding any explanatory power to the equation. If your goal was strictly to predict Y, you would drop this variable and re-estimate an equation using only a constant term and X2 on the right-hand side.
Finally, the residuals and the "residual plots" let you look for patterns in the residuals. These can allow you to detect nonlinearity, influential points, and other complications.
Download a spreadsheet by clicking here and saving the file as something like "mytraffic.xls."
This is based on the monthly page views that we got when I was running a web site called Homefair. The goal of the project is to come up with a regression equation to predict page views in subsequent months. The challenge is that the relationship between page views and time is nonlinear.
You can start by estimating a linear regression with "traffic" as the dependent variable and "period" as the independent variable. You can then try a number of alternatives. For example, you can create a column that is the log of "traffic"--call this lntraffic. You can use that as the dependent variable. Also, you could try estimating two separate regressions, one for periods 1-40 and another for periods 41-79.
Other ideas you might consider would be inserting other columns for an indepemt variable. You might create a variable that is set equal to 1 each January and equal to 0 all other months. You can call this variable "January." See if using that variable along with "period" in the regression improves the fit, as measured by adjusted R-squared.
Plots of the residuals may suggest to you other ways to modify "period" to help improve the fit. Try some of them.