Introduction to Machine Learning with Linear Regression
Machine learning is all about computers learning from data in order to make predictions about new data. There are two main categories of machine learning:
- Supervised Learning
- Unsupervised Learning
In supervised learning the computer learns from pairs of inputs and outputs, building a general model that can predict the output for new inputs. Unsupervised learning is about discovering patterns in data without knowing explicit labels beforehand. In this post I’m going to talk about linear regression, which is a supervised learning method.
Jupyter Notebook
Jupyter Notebook is an interactive Python environment. You can install Jupyter Notebook with Anaconda, which will also install all the necessary packages. After installation, just run `jupyter notebook` in the command line, which will open a Jupyter Notebook instance in the browser. Click New and create a Python 3 notebook.
Enter the following lines of code in a cell and press Shift + Enter, which will execute the code. The full notebook of the code I used in this post is also available on GitHub.
import numpy as np
import matplotlib.pyplot as plt
NumPy is used for matrix operations. It is one of the most important libraries in the Python scientific computing ecosystem. Matplotlib is a plotting library; we will use it to visualize our data.
Data
In this post I’m not going to use any real data set. Instead I’m going to generate some random data.
rng = np.random.RandomState(42)  # seeded random generator for reproducibility
x = rng.rand(100) * 10           # 100 random values in [0, 10)
x = x[:, np.newaxis]             # reshape to a (100, 1) column vector
b = rng.rand(100) * 5            # random offsets in [0, 5) acting as noise
b = b[:, np.newaxis]
y = 3 * x + b                    # y depends roughly linearly on x
Here we generate 100 rows of random data, so we will say \(m = 100\). In machine learning, \(m\) usually denotes the number of training examples, in this case pairs of \(x\) and \(y\). Now we are going to plot the dataset using Matplotlib.
plt.scatter(x, y)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
This is the plot generated from the code. Every blue dot is a data point, and there are 100 of them. The data has a roughly linear shape because of the way we generated it.
Linear Regression
The purpose of linear regression is to find the equation of a line that fits all the data points as well as possible. We will define the equation of the line as \(h(x) = \theta_0 + \theta_1x\). In machine learning \(h(x)\) is called the hypothesis function, but this is just a fancy way of writing the old school \(y = mx + c\) where \(c = \theta_0\) and \(m = \theta_1\). Our goal is to find \(\theta_0\) and \(\theta_1\); once we know them we can draw the line. To find \(\theta_0\) and \(\theta_1\) we are going to use something called a cost function.
Cost Function
We are going to use the following equation, which is based on the least squares method.
\[J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]The goal is to minimize \((h(x) - y)\), the difference between the hypothesis output and \(y\). Suppose \(\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\) is \(0\); then \(J(\theta_0, \theta_1) = 0\) and the data fits the straight line \(h(x) = \theta_0 + \theta_1x\) perfectly. For a real dataset that may not be the case: if the data is spread all over the plot, this value may be large. So now you know what \((h(x) - y)\) means, but why square it? Depending on how the data is distributed, \(h(x^{(i)}) - y^{(i)}\) can be positive or negative for a given choice of \(\theta_0\) and \(\theta_1\). Squaring makes every term non-negative, so positive and negative errors cannot cancel each other out, and a cost of \(0\) really does mean a perfect fit. In a real-world situation where the data is not perfectly linear we can’t get \(J(\theta_0, \theta_1)\) to zero, but we try to make it as small as possible. We call this minimizing the cost function. We divide by \(m\) to average the error over the dataset, and by \(2\) so that the derivative we take later comes out simpler.
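For intuition, here is a tiny worked example with made-up numbers. Take \(m = 2\) points \((1, 2)\) and \((2, 4)\), and try \(\theta_0 = 0\), \(\theta_1 = 1\): \[J(0, 1) = \frac{1}{2 \cdot 2}\left((1 - 2)^2 + (2 - 4)^2\right) = \frac{1 + 4}{4} = 1.25\]A better choice such as \(\theta_1 = 2\) gives \(J(0, 2) = 0\), because the line \(h(x) = 2x\) passes through both points exactly.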
Now we are going to implement the cost function in Python. First we implement the hypothesis function \(h(x) = \theta_0 + \theta_1x\).
def h(x, theta0, theta1):
    # hypothesis function: h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x
If we pass ordinary Python numbers for `x`, `theta0`, and `theta1`, it will output a number. What happens if we change `x` to a NumPy array? Every value gets multiplied by `theta1`, then `theta0` is added to each of them, and the output is a NumPy array. This happens because of a NumPy feature called broadcasting, which saves us from writing a for loop to calculate the values one by one. Broadcasting is a very powerful concept, and you will see it used everywhere when doing machine learning in Python.
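For example, a quick check of the broadcasting behaviour described above (the array values are arbitrary):
h(5, 1, 2)                          # scalar input -> 1 + 2 * 5 = 11
h(np.array([1.0, 2.0, 3.0]), 1, 2)  # array input -> array([3., 5., 7.])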
Next we implement the cost function in Python.
def cost(x, theta0, theta1):
    # mean squared error cost J(theta0, theta1); uses the global array y defined above
    return np.sum(np.power(h(x, theta0, theta1) - y, 2)) / (2 * x.shape[0])
The `np.sum()` function takes the sum of all the values inside a NumPy array and returns a scalar value. `np.power()` raises every value to the given power, in this case squaring them, and returns an array of the same size with the modified values. We pass a NumPy array as `x`, so `x.shape[0]` is the number of elements in `x`, which is also equal to \(m\).
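As a small illustration of those NumPy functions (array values are arbitrary):
a = np.array([1.0, 2.0, 3.0])
np.power(a, 2)           # array([1., 4., 9.])
np.sum(np.power(a, 2))   # 14.0
a.shape[0]               # 3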
Now we need to find the `theta0` and `theta1` that minimize the cost function. One approach is to brute-force a range of `theta0` and `theta1` values and pick the pair for which the cost function returns the smallest value, as sketched below. But can we do better?
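Here is a minimal sketch of that brute-force idea, assuming the `cost` function and data from above; the search ranges and step counts are arbitrary choices:
best_theta0, best_theta1, best_cost = 0.0, 0.0, float('inf')
for t0 in np.linspace(0, 5, 50):        # candidate theta0 values
    for t1 in np.linspace(0, 5, 50):    # candidate theta1 values
        c = cost(x, t0, t1)
        if c < best_cost:
            best_theta0, best_theta1, best_cost = t0, t1, c
print(best_theta0, best_theta1, best_cost)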
Gradient Descent
Gradient descent is an iterative algorithm for finding the minimum of a function. This is what we are going to do.
\[\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\] \[\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)\]Let me explain step by step. First, \(:=\) is the assignment operator: in programming languages we use \(=\) for assignment, but in mathematics \(=\) means equality, so \(:=\) is used to denote assignment. \(\alpha\) is called the learning rate, which I will explain later in detail. \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\) is the partial derivative of the cost function with respect to \(\theta_0\).
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]This is what a plot of the cost function against \(\theta_0\) looks like after 100,000 iterations.
The derivative measures the slope, and you can see that the slope is negative where the plot is going down. \(\alpha\) is the learning rate, which here is \(0.001\). So
\[\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\]You can see that \(\theta_0\) will increase, because \(\alpha\) is positive and \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\) is negative. As we get closer to the minimum the slope shrinks toward zero, so the updates get smaller and \(\theta_0\) settles on a stable value.
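For example, with made-up numbers: if the current slope is \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = -4\) and \(\alpha = 0.001\), the update is \(\theta_0 := \theta_0 - 0.001 \times (-4) = \theta_0 + 0.004\), so \(\theta_0\) moves up, toward the minimum.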
Here are the derivative steps for \(\theta_0\).
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]I’ve pulled the constant \(\frac{1}{2m}\) out of the derivative. Since \(h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}\) we can write
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]Now we are ready to take the derivative.
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})\]By the chain rule the exponent \(2\) comes down and cancels the \(\frac{1}{2}\), and the derivative of the inner term with respect to \(\theta_0\) is \(1\). Next, the derivative steps for \(\theta_1\).
\[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_1} \sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\] \[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_1} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]After taking the derivative,
\[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)}\]This time the derivative of the inner term with respect to \(\theta_1\) is \(x^{(i)}\), which is why it appears as an extra factor. Now let’s look at the equivalent Python code.
m = x.shape[0]  # number of training examples
theta0 = 0
theta1 = 0
alpha = 0.001   # learning rate
iteration = 0
while True:
    # gradients of the cost function with respect to theta0 and theta1
    temp0 = (1 / m) * np.sum(h(x, theta0, theta1) - y)
    temp1 = (1 / m) * np.sum((h(x, theta0, theta1) - y) * x)
    # simultaneous update of both parameters
    theta0 = theta0 - alpha * temp0
    theta1 = theta1 - alpha * temp1
    iteration += 1
    if iteration > 100000:
        print("theta0: " + str(theta0) + " theta1: " + str(theta1))
        break
We initialize `theta0` and `theta1` to 0 and set the learning rate `alpha` to 0.001. You must choose a small learning rate, or gradient descent may not converge smoothly. Also note that you must compute the updates into temporary variables (`temp0`, `temp1`) before assigning, so that both parameters are updated simultaneously.
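For contrast, here is a sketch of the non-simultaneous (incorrect) update, shown only to illustrate the pitfall:
# Incorrect: theta0 changes before theta1's gradient is computed,
# so the theta1 update uses the new theta0 instead of the old one.
theta0 = theta0 - alpha * (1 / m) * np.sum(h(x, theta0, theta1) - y)
theta1 = theta1 - alpha * (1 / m) * np.sum((h(x, theta0, theta1) - y) * x)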
Now that we have found near-optimal `theta0` and `theta1` values, shown below, we can draw the fitted line on top of the first plot.
theta0: 2.567988282 theta1: 2.98323417806
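A minimal sketch of how this line can be drawn with Matplotlib, using the `x`, `y`, `theta0`, and `theta1` values from above:
plt.scatter(x, y)                     # the original data points
line_x = np.linspace(0, 10, 100)      # evenly spaced x values for the line
plt.plot(line_x, h(line_x, theta0, theta1), color='red')  # the fitted line
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()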
You can see that the algorithm has learned to draw a line that fits the data distribution well.
Prediction
Here comes the easy part. Now we have a model with learned parameters which can be used to make predictions, simply by calling our hypothesis function. For example, to predict the \(y\) value for \(x = 3\):
h(3, theta0, theta1)
11.517690816184686
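Thanks to broadcasting, we can also predict several values at once by passing a NumPy array (the inputs here are arbitrary, and the outputs are approximate):
h(np.array([1, 3, 7]), theta0, theta1)   # array([ 5.55122246, 11.51769082, 23.45062753]) approximately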
Scikit Learn
Scikit-learn is a Python library that implements many of the most widely used classical machine learning algorithms. Now that you know what happens under the hood of linear regression, you are ready to use a library to do production-level linear regression.
Here’s how you import the library and train a model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
The `fit()` function trains the model on the data. Note that `predict()` expects a 2D array of inputs, so we pass `[[3]]`.
model.predict([[3]])
11.51769082
You can see that the value is similar to the one from our own implementation.
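You can also inspect the parameters scikit-learn learned, which correspond to \(\theta_0\) and \(\theta_1\):
print(model.intercept_)  # corresponds to theta0
print(model.coef_)       # corresponds to theta1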
The GitHub repo of the example code is available here.