Introduction to Machine Learning with Linear Regression
Machine learning is all about computers learning from data so they can make predictions on new data. There are two main categories of machine learning:
- Supervised Learning
- Unsupervised Learning
In supervised learning the computer learns from input and output pairs, building a general model that can be used to predict the output for new inputs. Unsupervised learning is about discovering patterns in data without knowing explicit labels beforehand. In this post I’m going to talk about linear regression, which is a supervised learning method.
Jupyter Notebook
Jupyter Notebook is an interactive Python environment. You can install Jupyter Notebook with Anaconda, which will also install all the necessary packages. After installation just run jupyter notebook in the command line, which will open a Jupyter Notebook instance in the browser. Click New and create a Python 3 notebook.
Enter the following lines of code in a cell and press Shift + Enter, which will execute the code. The full notebook of the code I used in this post is also available on GitHub.
import numpy as np
import matplotlib.pyplot as plt
Numpy is used for matrix operations. It’s the most important library in the Python scientific computing ecosystem. Matplotlib is a plotting library; we will use it to visualize our data.
Data
In this post I’m not going to use any real data set. Instead I’m going to generate some random data.
rng = np.random.RandomState(42)
x = rng.rand(100) * 10
x = x[:, np.newaxis]
b = rng.rand(100) * 5
b = b[:, np.newaxis]
y = 3 * x + b
Here we generate 100 rows of random data, so we will say \(m = 100\). \(m\) is usually used in machine learning to denote the number of data points, in this case pairs of \(x\) and \(y\). The [:, np.newaxis] indexing turns each 1-D array into a column vector of shape (100, 1), which is the shape most machine learning code expects. Below is a quick sanity check of the shapes, after which we plot the dataset using Matplotlib.
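This is just a sanity-check sketch using the arrays generated above.
print(x.shape, y.shape)  # (100, 1) (100, 1)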
plt.scatter(x, y)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
This is the plot generated from the code. Every blue dot is a data point; there are 100 of them. The data has a roughly linear distribution because of the way we generated it.
Linear Regression
So the purpose of linear regression is to find the equation of a line that fits all the data points as well as possible. We will define the equation of the line as \(h(x) = \theta_0 + \theta_1x\). In machine learning \(h(x)\) is called the hypothesis function, but this is just a fancy way of writing the old school \(y = mx + c\) where \(c = \theta_0\) and \(m = \theta_1\). Now our goal is to find \(\theta_0\) and \(\theta_1\). If we know \(\theta_0\) and \(\theta_1\) we can draw the line. To find \(\theta_0\) and \(\theta_1\) we are going to use a function called the cost function.
Cost Function
We are going to use the following equation, which is based on the least squares method.
\[J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]The goal is to minimize \((h(x) - y)\), so that the difference between the hypothesis function output and \(y\) is as small as possible. Suppose \(\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\) is \(0\). That means \(J(\theta_0, \theta_1) = 0\), so the data fits perfectly to the straight line \(h(x) = \theta_0 + \theta_1x\). For a real dataset that will usually not be the case: if the data is spread all over the plot, this value can be high. So now you see what \((h(x) - y)\) means, but why square it? We square it because \(h(x^{(i)}) - y^{(i)}\) can be negative or positive, depending on how the data is distributed and on the values chosen for \(\theta_0\) and \(\theta_1\). By squaring we make sure there are no negative values, so positive and negative errors cannot cancel each other out. That way a value of \(0\) really means the line fits the data distribution. In a real-world situation, if the data is not perfectly linear we can’t get \(J(\theta_0, \theta_1)\) to zero, but we try to get the minimum value possible. We call this ‘minimizing the cost function’. We divide by \(m\) to get a relatively small value, and we divide by 2 because it makes the expression we obtain after a later derivative step much simpler.
Now we are going to implement the cost function in Python. First we implement the hypothesis function \(h(x) = \theta_0 + \theta_1x\).
def h(x, theta0, theta1):
    return theta0 + theta1 * x
If we pass normal Python integer values for x, theta0, and theta1, it will output an integer. What happens if we change x to a numpy array? All the values get multiplied by theta1, then theta0 gets added to each of them, and the output will be a numpy array. This happens because of a numpy feature called broadcasting. It saves us from writing a for loop to calculate all the values step by step. Broadcasting is a very powerful concept, and it is used everywhere when doing machine learning in Python. Here is a small sketch of it in action.
def cost(x, theta0, theta1):
    return np.sum(np.power(h(x, theta0, theta1) - y, 2)) / (2 * x.shape[0])
The np.sum() function takes the sum of all the values inside a numpy array and returns a scalar value. np.power() takes the power, in this case the square, of all the values and returns a same-size array with the modified values. We pass a numpy array as x, so x.shape[0] is the number of elements in x, which is also equal to \(m\).
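As a quick usage sketch, the parameter values below are arbitrary; a line far from the data should give a much larger cost than one close to it.
print(cost(x, 0, 0))    # the line y = 0 is far from the data, so the cost is large
print(cost(x, 2.5, 3))  # close to the underlying relationship, so the cost is much smaller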
Now we need to find the theta0 and theta1 values which minimize the cost function. One approach is to brute force over a range of theta0 and theta1 values and pick the pair for which the cost function returns the minimal value, as sketched below. But can we do better?
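Here is a rough sketch of that brute-force idea, reusing x and the cost function from above; the search range and grid resolution are arbitrary choices.
best = None
for t0 in np.linspace(0, 5, 51):      # candidate values for theta0
    for t1 in np.linspace(0, 5, 51):  # candidate values for theta1
        c = cost(x, t0, t1)
        if best is None or c < best[0]:
            best = (c, t0, t1)
print(best)  # (minimum cost found, theta0, theta1)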
Gradient Descent
Gradient descent is an iterative algorithm for finding the minimum of a function. Here is what we are going to do.
\[\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\] \[\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)\]I will explain this step by step. First, \(:=\) is the assignment operator. In programming languages we use the \(=\) operator for assignment, but in math \(=\) means equality, so \(:=\) is used to denote assignment. \(\alpha\) is called the learning rate, which I will explain later in detail. \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\) is called the derivative part.
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]This is what the plot of \(\theta_0\) against the cost function looks like after 100,000 iterations.
The derivative measures the slope, and you can see the slope is negative as the plot goes down. \(\alpha\) is the learning rate, which here is \(0.001\). So
\[\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\]You can see \(\theta_0\) will increase, because \(\alpha\) is positive and \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\) is negative. As the slope flattens out, the updates get smaller and \(\theta_0\) settles on a stable value.
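As a purely illustrative example with made-up numbers: suppose \(\theta_0 = 1\), \(\alpha = 0.001\) and the derivative is \(-5\). Then
\[\theta_0 := 1 - 0.001 \times (-5) = 1.005\]so \(\theta_0\) moves slightly upward, in the direction that reduces the cost.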
Here’s the derivative steps for \(\theta_0\).
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]I’ve taken the \(\frac{1}{2m}\) factor out of the derivative. Since \(h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}\) we can write
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]Now we are ready to take the derivative.
\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})\]Next derivative steps for \(\theta_1\).
\[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_1} \sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\] \[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_1} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]After taking the derivative,
\[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)}\]Now let’s look at the equivalent python code.
m = x.shape[0]  # number of data points
theta0 = 0
theta1 = 0
alpha = 0.001
iteration = 0
while True:
    temp0 = (1 / m) * np.sum(h(x, theta0, theta1) - y)
    temp1 = (1 / m) * np.sum((h(x, theta0, theta1) - y) * x)
    theta0 = theta0 - alpha * temp0
    theta1 = theta1 - alpha * temp1
    iteration += 1
    if iteration > 100000:
        print("theta0: " + str(theta0) + " theta1: " + str(theta1))
        break
We initialize theta0 and theta1 to 0, define m as the number of data points, and set the learning rate alpha to 0.001. You should choose a small value for alpha; otherwise gradient descent may not converge smoothly. Also note that you must compute the updates into temporary values (temp0 and temp1) so that both parameters are updated simultaneously.
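One simple way to check that the learning rate was small enough and that the updates actually converged is to compare the cost before and after training; here is a minimal sketch using the cost function from above.
print(cost(x, 0, 0))            # cost at the starting point
print(cost(x, theta0, theta1))  # cost at the learned parameters, far lower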
Now that we have found the optimal theta0 and theta1 values, below we draw the learned line on top of the first plot.
theta0: 2.567988282 theta1: 2.98323417806
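Here is a minimal sketch of how that line can be drawn over the scatter plot, reusing x, y and the learned theta0 and theta1 from above.
plt.scatter(x, y)
line_x = np.linspace(0, 10, 100)                 # the range our x values were drawn from
plt.plot(line_x, theta0 + theta1 * line_x, 'r')  # the learned line h(x)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()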
You can see that the algorithm learned to draw an optimal line over the data distribution.
Prediction
Here comes the easy part. Now we have a model with learned parameters, which we can use to make predictions with our hypothesis function.
y = h(3, theta0, theta1)
11.517690816184686
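Thanks to broadcasting, the same hypothesis function can predict several new inputs at once; here is a small sketch with made-up inputs.
new_x = np.array([1, 3, 5])
print(h(new_x, theta0, theta1))  # one prediction per input value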
Scikit Learn
Scikit Learn is a Python library that provides most of the widely used classical machine learning algorithms. Now you know what happens under the hood of linear regression, so you are ready to use a library to do production-level linear regression machine learning.
Here’s how you import the library and train a model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
The fit() function trains the model on the data.
model.predict([[3]])
11.51769082
You can see that the values are similar to our own implementation.
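If you want to compare the learned parameters directly, the fitted model exposes them as attributes; they should be close to the theta0 and theta1 we found with gradient descent.
print(model.intercept_)  # corresponds to theta0
print(model.coef_)       # corresponds to theta1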
The GitHub repo of the example code is available here.