Linear Regression in Scala is as easy as calling the routines in the Apache Commons Math jar. We just need to add the calculation of the coefficient of determination, aka r^2, to see how well the model fits. Here’s how:

Download the library from the Apache Commons Math site and put the commons-math-2.1.jar file on your classpath. Here’s the example from the documentation, converted to Scala:

		import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression

		val regression = new OLSMultipleLinearRegression()
		// example from apache: http://commons.apache.org/math/userguide/stat.html#a1.5_Multiple_linear_regression
		val y = Array(11.0, 12.0, 13.0, 14.0, 15.0, 16.0)
		// six observations; the leading 1.0 in each row is the intercept term
		val x = Array.ofDim[Double](6, 6)
		x(0) = Array(1.0, 0, 0, 0, 0, 0)
		x(1) = Array(1.0, 2.0, 0, 0, 0, 0)
		x(2) = Array(1.0, 0, 3.0, 0, 0, 0)
		x(3) = Array(1.0, 0, 0, 4.0, 0, 0)
		x(4) = Array(1.0, 0, 0, 0, 5.0, 0)
		x(5) = Array(1.0, 0, 0, 0, 0, 6.0)

		regression.newSampleData(y, x)

		// estimated coefficients, intercept first
		val beta = regression.estimateRegressionParameters()
		println("betas: " + beta.map("%.3f".format(_)).mkString(", "))

		// residuals, if needed
		val residuals = regression.estimateResiduals()
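
To make the meaning of the betas concrete, here is a small sketch that needs no library. The coefficient values are an assumption of this sketch: they were worked out by hand from the sample data above (six rows and six unknowns, so the system can be solved exactly) and are not copied from library output. The names PredictSketch and predict are made up for illustration.

```scala
object PredictSketch {
  // Hand-derived coefficients for the sample data (an assumption of this sketch,
  // not library output). The first entry pairs with the leading 1.0 column.
  val beta: Array[Double] = Array(11.0, 1.0 / 2, 2.0 / 3, 3.0 / 4, 4.0 / 5, 5.0 / 6)

  // Predicted value for one row of x: the dot product of the row with beta.
  def predict(beta: Array[Double], row: Array[Double]): Double =
    beta.zip(row).map { case (b, v) => b * v }.sum

  def main(args: Array[String]): Unit = {
    val row = Array(1.0, 2.0, 0.0, 0.0, 0.0, 0.0) // same values as x(1)
    println(predict(beta, row))
  }
}
```

For the row matching x(1), the prediction should come out at 12.0, which is y(1), so the residual for that observation is zero.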

The call into Commons Math works so cleanly because Scala compiles Array[Double] to double[]. The arrays are directly compatible with the Java library, while still allowing all the cool Scala functions like map().
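
The same interop holds for the JDK's own array utilities. As a minimal sketch (the object name, helper name and sample values are made up for illustration), a Scala Array[Double] can be handed straight to java.util.Arrays.sort, which expects a double[], and then transformed with map:

```scala
import java.util.Arrays

object ArrayInterop {
  // Hypothetical sample values, used only for illustration.
  val values: Array[Double] = Array(3.0, 1.0, 2.0)

  // java.util.Arrays.sort(double[]) accepts the Scala array as-is and sorts it
  // in place, so we sort a copy to leave the input untouched.
  def sortedCopy(in: Array[Double]): Array[Double] = {
    val copy = in.clone()
    Arrays.sort(copy)
    copy
  }

  def main(args: Array[String]): Unit = {
    val sorted = sortedCopy(values)
    // Scala collection methods still work on the very same array type.
    println(sorted.map("%.1f".format(_)).mkString(", "))
  }
}
```

No wrapper or conversion step is involved in either direction, which is why the regression example can pass y and x to newSampleData directly.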

Commons Math provides the matrix algebra needed to solve the normal equations and return the desired equation coefficients. But the 2.1 release doesn’t provide the coefficient of determination, often called “r-squared,” which tells you how much of the data’s variance is explained by the regression. We can add that calculation with the following code. Given the original Y values and the residuals, use:

	def calcRSquared(y: Array[Double], residuals: Array[Double]): Double = {
		// total sum of squares: variation of y around its own mean
		val yMean = sum(y) / y.size.toDouble
		val ssTot = sumSq(y.map(_ - yMean))
		// residual sum of squares: variation the regression leaves unexplained
		val ssRes = sumSq(residuals)

		1.0 - ssRes / ssTot
	}

	def sq(in: Double) = in * in
	def sum(in: Array[Double]) = in.foldLeft(0.0)(_ + _)
	def sumSq(in: Array[Double]) = sum(in.map(sq))

The conciseness of Scala means that in reading the code, you can quickly discern both what a function does and how it does it. ssTot, for example, is literally “the sum of the squares of the y values with their mean subtracted” or, more colloquially, “the sum of the squared deviations of y from its mean.”
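
To see the numbers involved, here is a small self-contained sketch that does not use Commons Math. It fits a straight line to five made-up points with the closed-form least-squares formulas for a single predictor, then computes r-squared as 1 minus the residual sum of squares over the total sum of squares around the mean of y. The object name, the data and the helper names are all assumptions made for this illustration.

```scala
object RSquaredSketch {
  // Made-up data for illustration only.
  val xs: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
  val ys: Array[Double] = Array(2.0, 4.0, 5.0, 4.0, 5.0)

  def mean(in: Array[Double]): Double = in.sum / in.length

  // Closed-form least squares for one predictor: returns (intercept, slope).
  def fitLine(x: Array[Double], y: Array[Double]): (Double, Double) = {
    val xMean = mean(x)
    val yMean = mean(y)
    val sxy = x.zip(y).map { case (a, b) => (a - xMean) * (b - yMean) }.sum
    val sxx = x.map(a => (a - xMean) * (a - xMean)).sum
    val slope = sxy / sxx
    (yMean - slope * xMean, slope)
  }

  // Residuals are observed y minus the fitted line at each x.
  def residualsOf(x: Array[Double], y: Array[Double]): Array[Double] = {
    val (intercept, slope) = fitLine(x, y)
    x.zip(y).map { case (a, b) => b - (intercept + slope * a) }
  }

  // r-squared = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y.
  def rSquared(y: Array[Double], residuals: Array[Double]): Double = {
    val yMean = mean(y)
    val ssRes = residuals.map(r => r * r).sum
    val ssTot = y.map(v => (v - yMean) * (v - yMean)).sum
    1.0 - ssRes / ssTot
  }

  def main(args: Array[String]): Unit = {
    val (intercept, slope) = fitLine(xs, ys)
    val r2 = rSquared(ys, residualsOf(xs, ys))
    println("intercept=" + intercept + " slope=" + slope + " r2=" + r2)
  }
}
```

For this data the slope works out to 0.6 and the intercept to 2.2. The residual sum of squares is 2.4 against a total of 6, so r-squared is about 0.6, up to floating-point rounding.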
