In the previous post, I showed how to call multivariate linear regression functions in the Apache Commons Math Statistics library. You might want to compare your results to Excel, perhaps to check your implementation or because you manually develop your analytic process before automating it in code. Here are some tips for comparing Excel results with those from code.
First, to perform a multivariate linear regression in Excel, use the LINEST function. It requires an array formula to show all of its results, so control-U and command-enter are your friends (on Windows, commit the array formula with control-shift-enter). Select a 5-row by (n + 1)-column group of cells, where n is the number of variables you're fitting; LINEST reserves the extra column for the intercept slot. Hit control-U to edit. Enter "=linest(y,x,0,1)", where y covers your output values and x covers the n columns of input values. Then hit command-enter to set the array formula for all of the cells. Check the Excel help for more information about LINEST usage.
Once you've figured out the array formula gymnastics, note that the coefficients are returned backwards! That's right: the first row of the array formula contains the calculated coefficients for your fit, but in reverse order compared to the order of your input x columns. Go figure. You can even see this reversal in the third example in the LINEST help. Look for the section where the regression equation is constructed from the new coefficients.
The third parameter may cause confusion. If you've used LINEST for simple regressions, you may have written linest(y,x,,1), where x and y are single columns. In that case, LINEST provides the slope in row 1, column 1 of the output array, and the intercept in row 1, column 2. Above we set the third parameter to 0 (FALSE), which tells Excel to force the regression plane through the origin rather than compute its own intercept. Doing so allows us to compare results between Excel and the Commons Math regression functions. But wait, what if I don't want the fit to go through the origin? What if I have no reason to think my function goes through 0?
Forcing the fit through the origin is actually the standard way of doing regression. Derivations of the regression equations don't treat the intercept as a separate calculation; it's built in. Remember, linear regression is linear in the regression coefficients, and that function of the coefficients goes through the origin. The function you construct with those coefficients may or may not pass through the origin.
If you want a constant term, add it to your regressor variables as a column. The resulting coefficient is the intercept. More specifically, regression produces one output coefficient for each column of input. To create a non-zero intercept, add a column of 1s to your data. Regression will find a coefficient for that column, and that coefficient is the constant part of your function, the intercept.
For example, when fitting a single column of data, Excel allows you to input an x column and a y column. That's one input variable, x, and linest(y,x,,1) will return two values, the slope and the intercept. A more general way of thinking about the problem is to use two input columns: the original x values and a column of 1s. Then linest(y,x,0,1) returns a coefficient for each column. The resulting equation is y = b1 * x1 + b0 * x0, where x0 is always 1.
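As a sanity check on the column-of-1s idea, here is a minimal plain-Java sketch (no libraries; the class and method names are mine, purely illustrative) that fits y = b1*x + b0 by solving the 2x2 normal equations for the columns [x, 1]:

```java
// Sketch: treat the intercept as the coefficient of a column of 1s.
// The design matrix X has two columns, [x, 1], and we solve the 2x2
// normal equations (X^T X) b = X^T y by Cramer's rule.
public class OnesColumnFit {
    // Returns {b1, b0} for y = b1*x + b0; b0 arises as the
    // coefficient of the all-ones column.
    public static double[] fit(double[] x, double[] y) {
        double sxx = 0, sx1 = 0, s11 = 0, sxy = 0, s1y = 0;
        for (int i = 0; i < x.length; i++) {
            sxx += x[i] * x[i]; // dot(x, x)
            sx1 += x[i];        // dot(x, ones)
            s11 += 1;           // dot(ones, ones) = n
            sxy += x[i] * y[i]; // dot(x, y)
            s1y += y[i];        // dot(ones, y)
        }
        // Solve [sxx sx1; sx1 s11] [b1; b0] = [sxy; s1y].
        double det = sxx * s11 - sx1 * sx1;
        double b1 = (sxy * s11 - sx1 * s1y) / det;
        double b0 = (sxx * s1y - sx1 * sxy) / det;
        return new double[]{b1, b0};
    }

    public static void main(String[] args) {
        // y = 2x + 5 exactly, so the fit should recover slope 2, intercept 5.
        double[] x = {0, 1, 2, 3};
        double[] y = {5, 7, 9, 11};
        double[] b = fit(x, y);
        System.out.println("slope=" + b[0] + " intercept=" + b[1]);
    }
}
```

The intercept is nothing special here; it is just the coefficient the solver assigns to the constant column.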
With the generalized view, we can now compare results between Excel and code such as Commons Math. The Excel coefficients in the first row of the LINEST array function and the values returned by estimateRegressionParameters() should match, once you account for Excel's reversed ordering.
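Because Excel hands the coefficients back highest column first, a small helper that flips the LINEST row into column order makes the comparison mechanical. This is a plain-Java sketch; the class and method names are mine, not part of Excel or Commons Math:

```java
// Sketch: reverse the first row of a LINEST result so it lines up with
// the column order of your input data (and with the order your
// regression code reports coefficients in).
public class LinestOrder {
    // Returns a reversed copy of a LINEST coefficient row.
    public static double[] toColumnOrder(double[] linestRow) {
        double[] out = new double[linestRow.length];
        for (int i = 0; i < linestRow.length; i++) {
            out[i] = linestRow[linestRow.length - 1 - i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Excel order: coefficient for the last x column comes first.
        double[] excelRow = {3.0, 2.0, 1.0};
        // Column order: {1.0, 2.0, 3.0}
        System.out.println(java.util.Arrays.toString(toColumnOrder(excelRow)));
    }
}
```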
For example, to fit a parabola in both systems, we would perform a linear regression on a*x^2 + b*x + c. Create columns containing x^2, x, and 1. Then the regression will return a, b, and c.
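The parabola fit can be sketched in plain Java as well (no libraries; names are mine, for illustration only): build the x^2, x, and 1 columns, then solve the 3x3 normal equations.

```java
// Sketch: fit a*x^2 + b*x + c by linear regression on the constructed
// columns x^2, x, and 1, solving (X^T X) beta = X^T y by
// Gauss-Jordan elimination on the augmented 3x4 matrix.
public class ParabolaFit {
    public static double[] fit(double[] x, double[] y) {
        int n = x.length, p = 3;
        double[][] X = new double[n][p];
        for (int i = 0; i < n; i++) {
            X[i][0] = x[i] * x[i]; // x^2 column -> coefficient a
            X[i][1] = x[i];        // x column   -> coefficient b
            X[i][2] = 1.0;         // 1s column  -> intercept c
        }
        // Build the augmented normal-equations matrix [X^T X | X^T y].
        double[][] A = new double[p][p + 1];
        for (int r = 0; r < p; r++) {
            for (int c = 0; c < p; c++)
                for (int i = 0; i < n; i++) A[r][c] += X[i][r] * X[i][c];
            for (int i = 0; i < n; i++) A[r][p] += X[i][r] * y[i];
        }
        // Gauss-Jordan elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
            double[] tmp = A[col]; A[col] = A[pivot]; A[pivot] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) continue;
                double f = A[r][col] / A[col][col];
                for (int c = col; c <= p; c++) A[r][c] -= f * A[col][c];
            }
        }
        double[] beta = new double[p];
        for (int r = 0; r < p; r++) beta[r] = A[r][p] / A[r][r];
        return beta; // {a, b, c}
    }

    public static void main(String[] args) {
        // y = x^2 - 2x + 3 exactly at five points, so the fit should
        // recover a = 1, b = -2, c = 3.
        double[] x = {0, 1, 2, 3, 4};
        double[] y = {3, 2, 3, 6, 11};
        double[] beta = fit(x, y);
        System.out.println("a=" + beta[0] + " b=" + beta[1] + " c=" + beta[2]);
    }
}
```

The same constructed columns are what you would feed to Excel's LINEST or to a library regression routine; only the solver differs.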
Finally, note that Excel also returns the residual and regression sums of squares. You can use these to check the r-squared calculation in your code: r^2 = 1 - ssRes/(ssRes + ssReg).
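That check is a one-liner; a quick illustrative sketch (the names are mine, and the example numbers are made up):

```java
// Sketch: recover r-squared from the two sums of squares that LINEST
// reports. ssReg is the regression sum of squares, ssRes the residual
// sum of squares; together they form the total sum of squares.
public class RSquared {
    public static double rSquared(double ssReg, double ssRes) {
        return 1.0 - ssRes / (ssRes + ssReg); // equivalently ssReg / ssTot
    }

    public static void main(String[] args) {
        // Made-up example: ssReg = 90, ssRes = 10 gives r^2 = 0.9.
        System.out.println(rSquared(90.0, 10.0));
    }
}
```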