Stanford ML Week 1: Linear Regression with One Variable

This is the start of Stanford’s Machine Learning course, taught by Andrew Ng. Andrew Ng is the director of the Stanford AI Lab and a cofounder of Coursera, where I will be getting the course resources. Notes are not provided in the Coursera version of the course, but they can be found on Stanford’s website.

Introduction

Machine learning grew out of artificial intelligence and became its own field. There is a need for machines to learn on their own because we now have large databases of information that can be mined, for example web click data (Facebook, Google), medical records, biology, and engineering data. Besides that, machine learning enables machines to do things that cannot be programmed by hand, such as teaching a helicopter to fly (there are too many factors to consider for it to be programmed easily; watch any attempt to program a helicopter to fly by hand), handwriting recognition, natural language processing (NLP), and computer vision. Lastly, machine learning is important in understanding how our brains learn.

Supervised and Unsupervised learning

There are two types of machine learning: supervised and unsupervised learning. In supervised learning, the computer is given a labeled dataset from which to make predictions. For example, we are given a dataset of patients where, for each patient, we know the size of the tumor and whether the tumor is malignant or benign. With this dataset, we can predict the chances of a person’s tumor being malignant or benign from the size of the tumor. In unsupervised learning, we are given a dataset with no labels: we do not know whether each tumor is malignant or benign, and the program has to learn to find patterns on its own. Another example of unsupervised learning is, given a dataset about a community, their lifestyle, age, and other factors, finding a pattern that separates the community into groups of similar people.

Unsupervised learning is also used in organizing computing clusters, market segmentation, social network analysis, and astronomical data analysis.

Linear Regression

Suppose we have a graph with the size of a house in square feet on the x-axis and the price of the house on the y-axis, and the graph is populated with data. Linear regression is a method of drawing a line that best fits the data. With this line, we can predict the price of a house given its size. How do we draw this line?

The line will be a function h(x) = \theta_0+\theta_1x, where x is the size of the house in square feet, and h(x) is the predicted price of the house. We have to find \theta_0 and \theta_1 so that h(x) is the line that best fits the data.
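As a quick illustration (not from the course, and the course itself uses Octave/MATLAB), here is a minimal Python sketch of the hypothesis; the function name and the example numbers are my own:

```python
def hypothesis(x, theta0, theta1):
    """Predicted house price: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameters: theta0 = 50, theta1 = 0.1
# (e.g. price in $1000s, size in square feet)
print(hypothesis(2000, 50.0, 0.1))  # 250.0
```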

A best-fit line is a line that minimizes the sum of the squared differences between the h(x) values and the data values. So we have something called a cost function, J(\theta_0, \theta_1) = \frac{1}{2m}\displaystyle\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2. The cost function measures the “error”, the differences between the line and the data, where m is the number of training examples we are given. Our goal is to minimize the cost function.
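Here is a minimal sketch of this cost function in Python with NumPy; the toy data values are made up purely for illustration:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data: sizes in sq ft, prices in $1000s (made up for illustration).
x = np.array([1000.0, 1500.0, 2000.0])
y = np.array([150.0, 200.0, 260.0])
print(cost(50.0, 0.1, x, y))  # the error for one particular choice of theta0, theta1
```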

[Figure: contour plot of the cost function J(\theta_0, \theta_1)]

Gradient descent allows us to reduce J(\theta_0, \theta_1) until we end up at a minimum. First, we start with any values for \theta_0 and \theta_1. Then we repeatedly update \theta_0 and \theta_1 using the steps below:

\theta_0:= \theta_0 - \alpha\frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)
\theta_1:= \theta_1 - \alpha\frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)

:= is the assignment operator: the value on the right is assigned to the variable on the left. \alpha is the learning rate; this value must not be too big, or else gradient descent may not converge to the minimum. \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) is the partial derivative of J with respect to \theta_j. For the cost function above, these derivatives work out to \frac{\partial}{\partial \theta_0}J = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}) and \frac{\partial}{\partial \theta_1}J = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}. The values of \theta_0 and \theta_1 are updated simultaneously until J reaches its minimum. Once the \theta_0 and \theta_1 that minimize the cost function are found, substitute them into h(x) = \theta_0+\theta_1x to get the best-fit line.
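Putting the pieces together, here is a rough sketch of batch gradient descent in Python. The names, the fixed number of iterations as a stopping rule, and the particular learning rate are my own choices for the sketch, not from the course:

```python
import numpy as np

def gradient_descent(x, y, alpha, iterations):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        predictions = theta0 + theta1 * x
        error = predictions - y
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data (made up for illustration): sizes in sq ft, prices in $1000s.
x = np.array([1000.0, 1500.0, 2000.0])
y = np.array([150.0, 200.0, 260.0])
# alpha is tiny here because x is not feature-scaled; a larger alpha would diverge.
theta0, theta1 = gradient_descent(x, y, alpha=1e-7, iterations=10000)
print(theta0, theta1)  # parameters of the (approximate) best-fit line
```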
