Stanford ML Week 2: Linear Regression with Multiple Variables

It’s been a tough week. I needed some time to understand the formulas and get used to them. I am not alone in this: surprisingly, PhD students also had a hard time with this course.

Last week, we learnt how to find a best-fit line for data with one variable. This week, we find the best fit for data with multiple variables. With multiple variables the result is no longer a straight line, but an equation that best fits the features x_1, x_2, x_3, ... to y.

The goal is to find an equation in the form:

h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ...

x_1, x_2, x_3, ... are the features, \theta_0, \theta_1, \theta_2, ... are the parameters, and h(x) is the predicted y value.

The equation can be condensed into vector form by defining x_0 = 1, so that \theta and x are both vectors of length n + 1. The T in the equation means transpose:

h(x) = \sum \limits^{n}_{i=0} \theta_i x_i = \theta^T x
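To make the vector form concrete, here is a minimal sketch in plain Python. The function name `hypothesis` and the numbers are my own made-up illustration, not values from the course; the only real assumption is that x_0 = 1 is placed at the front of the feature vector.

```python
def hypothesis(theta, x):
    """Return theta^T x as a plain dot product.

    Assumes x[0] is the constant 1 that pairs with theta_0.
    """
    return sum(t * xi for t, xi in zip(theta, x))


# Made-up parameters and one made-up house: [x_0 = 1, size, bedrooms]
theta = [80.0, 0.1, 20.0]
x = [1.0, 1500.0, 3.0]
price = hypothesis(theta, x)  # 80 + 0.1 * 1500 + 20 * 3
```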

We have to find the values of theta that give the smallest difference between h(x) and y to produce the regression equation. There are two methods to do this: first, minimizing the cost function from last week with gradient descent; second, the normal equation.

Cost function

J(\theta) = \frac{1}{2m} \sum\limits^{m}_{i=1} (h(x^i) - y ^ i ) ^ 2

or in vectorized form:

J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)
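As a rough sketch, the cost function can be written in plain Python like this. `compute_cost` is a name I picked, and the tiny data set is made up so the result is easy to check by hand; each row of X is assumed to start with x_0 = 1.

```python
def compute_cost(X, y, theta):
    """Return J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    total = 0.0
    for x_row, y_i in zip(X, y):
        prediction = sum(t * x for t, x in zip(theta, x_row))
        total += (prediction - y_i) ** 2
    return total / (2 * m)


# Made-up data lying exactly on y = x (first column is x_0 = 1)
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]

cost_perfect = compute_cost(X, y, [0.0, 1.0])  # perfect fit, so no error
cost_zero = compute_cost(X, y, [0.0, 0.0])     # (1 + 4 + 9) / (2 * 3)
```

The better theta fits the data, the smaller J becomes, which is why we search for the theta that minimizes it.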

To minimize the cost function, we change the values of theta on each iteration until J reaches its minimum. This is done by gradient descent, introduced last week:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits^{m}_{i=1}(h(x^i) - y^i)x_j^i

Remember that the parameters must be updated simultaneously: compute every new value from the current theta, store it in a temporary variable, and only then overwrite theta itself. \alpha is the learning rate. A low learning rate needs more iterations for the cost function to converge; a high learning rate is faster, but risks overshooting the minimum and not converging at all.
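Here is a small sketch of the simultaneous update in plain Python. It uses the averaged form of the update (the gradient of J with its 1/(2m) factor, so the step is scaled by 1/m). The function name, the toy data, the learning rate and the iteration count are all my own choices for illustration.

```python
def gradient_descent(X, y, theta, alpha, num_iters):
    """Run batch gradient descent and return the final theta."""
    m = len(y)
    n = len(theta)
    for _ in range(num_iters):
        # Errors h(x^i) - y^i, all computed with the CURRENT theta
        errors = [
            sum(t * x for t, x in zip(theta, row)) - y_i
            for row, y_i in zip(X, y)
        ]
        # Build every new theta_j in a temporary list first ...
        new_theta = [
            theta[j] - alpha / m * sum(e * row[j] for e, row in zip(errors, X))
            for j in range(n)
        ]
        # ... and only then overwrite theta (the simultaneous update)
        theta = new_theta
    return theta


# Made-up data generated from y = 1 + 2x (first column is x_0 = 1)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = gradient_descent(X, y, [0.0, 0.0], alpha=0.1, num_iters=2000)
```

On this toy data theta should end up very close to [1, 2]. If theta[0] were overwritten before theta[1] was computed, the second update would be using a mix of old and new parameters, which is exactly what the temporary list avoids.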

After getting the theta values that give the lowest J (the smallest “error”), we can substitute them back into the h(x) equation to get the regression equation.

Feature scaling

One point to note: the features x may have very different scales. For example, if our features are size (feet^2) and number of bedrooms, the former may range from 0 to 2000 while the latter may range from 0 to 5. It is wise to normalize the features into approximately a -1 \leq x_i \leq 1 range, which helps gradient descent converge faster. This can be done by subtracting the mean of the feature from each value and dividing by the range or the standard deviation:

x_1 = \frac{x_1 - \mu_1}{\sigma_1}
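A minimal sketch of this normalization for one feature column, using Python's `statistics` module. I am assuming the sample standard deviation here (`statistics.stdev`); dividing by the range would work just as well. The function name and the three house sizes are made up.

```python
import statistics


def normalize_feature(values):
    """Return (scaled values, mean, standard deviation) for one feature."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    scaled = [(v - mu) / sigma for v in values]
    return scaled, mu, sigma


# Made-up house sizes in square feet
sizes = [1000.0, 1500.0, 2000.0]
scaled, mu, sigma = normalize_feature(sizes)
```

For these three sizes the mean is 1500 and the standard deviation is 500, so the scaled values land on -1, 0 and 1. It is worth keeping mu and sigma around, because any new input has to be scaled with the same values before making a prediction.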

Normal equation

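The normal equation is an alternative to gradient descent: it solves for the values of theta analytically, in one step, with no learning rate to choose and no iterations. The trade-off is that it needs the inverse of X^T X, which gets slow when the number of features is very large.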

\theta = (X^T X)^{-1} X^T y

The normal equation doesn’t require feature scaling, and there is no need to loop as in gradient descent.
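Below is a sketch of the normal equation in plain Python. Note one swap: instead of forming the inverse explicitly, it solves the equivalent system (X^T X) \theta = X^T y with Gauss-Jordan elimination, which gives the same theta. The function name and the toy data are my own, and this is only meant for small examples.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y and return theta as a list."""
    n = len(X[0])
    # A = X^T X (n x n) and b = X^T y (length n)
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y_k for row, y_k in zip(X, y)) for i in range(n)]
    # Augmented matrix [A | b]
    M = [A[i] + [b[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular, so it cannot be inverted")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Made-up data generated from y = 1 + 2x (first column is x_0 = 1)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = normal_equation(X, y)
```

There is no learning rate and no iteration count to tune here, and on this toy data theta comes out as approximately [1, 2] directly.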


This week, we used Octave to plot the linear regression line that maps the relationship between house price and features such as the size of the house and the number of bedrooms. This is done through both gradient descent and the normal equation. Using the fitted equation, we predict the house price given the size of the house and the number of bedrooms.
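One detail that is easy to miss when predicting with a theta learned by gradient descent on scaled features: the new house has to be scaled with the same mean and standard deviation as the training data. A hedged sketch in plain Python, where `predict` and every number are hypothetical placeholders rather than results from the exercise:

```python
def predict(theta, features, mus, sigmas):
    """Scale raw features with the training mu/sigma, then return theta^T x."""
    scaled = [(f - mu) / s for f, mu, s in zip(features, mus, sigmas)]
    x = [1.0] + scaled  # prepend x_0 = 1
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical values, for illustration only
theta = [300000.0, 100000.0, -5000.0]  # learned on scaled features
mus = [2000.0, 3.0]                    # training means: size, bedrooms
sigmas = [800.0, 1.0]                  # training standard deviations

# A house exactly at the training means scales to [0, 0]
price = predict(theta, [2000.0, 3.0], mus, sigmas)
```

A house sitting exactly at the training means gets the price theta_0, since all of its scaled features are zero. With a theta from the normal equation on unscaled features, the raw values can be plugged in directly.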

Octave is optimized for working with matrices and matrix operations such as basic arithmetic, inverse, transpose, etc.
