PredInt.XLS 8.3  Regression with Prediction Intervals on a Microsoft Excel Spreadsheet
by John O. McClain    Links: [ Free Software | John McClain's Page | Errors ]   December 6, 2007

NOW COMPATIBLE WITH EXCEL 2007 AND EXCEL 2003.

The Regression tool in Excel's Data Analysis tool package does not calculate prediction intervals. The PredInt.xls spreadsheet contains a macro to do that for you. Some new features have been added, including a Normal Probability Plot, residual plots, line-fit plots, and standardized residuals that match the ones calculated in most commercial statistics packages (Excel's program calculates them differently). Outliers and influential observations are also flagged.

Example: A director of admissions wishes to predict GPA, the grade point average of a prospective student, based on what is known from the application. The prediction model is to be based on data from previous students. The data set consists of 150 observations, with recorded values for GPA and 4 other variables. Using the PredictionInterval macro, which is on a hidden module in this spreadsheet, the user may select which of the varibles to use in the model. The picture below shows the information box that pops up for selecting variables. In the picture, the model has GPA as a dependent variable, and two predictor (independent) variables: Gender and GMAT.

The output is shown below. The Significance of the F statistic (also known as a p-value), being quite small, confirms that there is a relationship. The coefficients for the regression model are shown a little farther down, including standard errors, Student's t Statistic, two-tailed p-values and a 95% confidence interval for each. Since the p-values are very small, one may conclude that both Gender and GMAT are related to GPA.
    The 95% confidence intervals for the regression coefficients are the first places where this output differs from that from Excel's regression tool. If you change the number in the blue box (shown as 0.95 below) the lower and upper confidence intervals automatically change. (With Excel's regression tool, you must pre-specify the confidence level, and if you want a different one, you have to run the program again.)

The information needed by the admissions director is given at the right side of the picture above, but requires a change. Specifically, the user needs to enter the Gender and GMAT of the prospective student into the yellow boxes under “X Values for Prediction”. (The program automatically enters the averages for each X variable.)

The picture below shows that the director put in the value 1 for Gender and 635 for GMAT, presumably obtained from the prospective student's application. The result is a predicted GPA of 3.508. Unfortunately, the 95% prediction interval runs from about 2.61 to 4.41, which shows that this prediction is not very accurate. Clearly, the admissions director should not base decisions on this model.

Residual Plots: Validity of the regression statistics (such as the p-values) depends on a number of assumptions about the residuals, which are the differences between the actual values and the predicted values of the dependent variable. The PredictionInterval macro produces a number of statistics and charts to help determine whether those assumptions are satisfied. This includes identifying “outliers” and “influential observations.” Graphs are also produced in profusion, including residual plots against every independent variable and against the predicted values, and “Line-fit” plots that show how well the model fits. Examples are shown below.

Normal Probability Plot:

The latest feature of the PredictionInterval macro is the Normal Probability Plot, shown to the right for the example. This is used to test the assumption that the residuals have the normal probability distribution. When the assumption is satisfied, the Normal Probability Plot is nearly a straight line, as in the figure below. More information about this plot may be found in many statistics texts.

 

      Unfortunately, Excel’s Data Analysis Toolpak produces a Normal Probability Plot that differs from the standard, and therefore it will differ from the output of the PredictionInterval macro.

CAUTION: The macro used to calculate the prediction interval does not include very much error checking. If the regression can be calculated, you get an answer. If you need more diagnostics than this program provides, you should use a real statistics package such as SAS, SPSS, MINITAB or others. However, there is one additional part of the output that you may find useful: the correlations (and covariances) of the coefficients. The picture below shows that the coefficients for Gender and GMAT are almost uncorrelated. In general, low correlation among coefficients is good. When the correlations are large (near plus or minus 1) the estimates can be very unstable: small changes in either the data or the model can cause very large changes in the coefficients.

This is a very simple program, intended for teaching purposes. It includes three pages of instructions, with pictures. You are welcome to use it in any manner, and change it as you see fit. The program comes without any guarantee whatsoever, and is distributed free of charge.


Download: This is a very simple program, intended for teaching purposes. It includes three pages of instructions, with pictures. You are welcome to use it in any manner, and change it as you see fit. The program comes without any guarantee whatsoever, and is distributed free of charge.

 Click Here  to get a free copy of the Microsoft Excel Spreadsheet.


If PredInt.xls Doesn't Work:
Some users find that the macros in PredInt.xls do not appear in the Macro menu. If this happens, Excel has automatically disabled the macros that make it run.

Instructions to fix this are included in the file, on a worksheet named Warning.

Back to: [ Free Software | John McClain's Page ]