Chamoda Pandithage

Public-key Cryptography Explained

2017-12-13T00:00:00+00:00

Public-key cryptography is one of the most widely used kinds of cryptosystem today. It refers to any system that uses a key pair: one key for encrypting data and another for decrypting it. If data is encrypted with one key, the other key is used to decrypt it. This seems pretty magical at first, but by the end of this blog post you will understand how it works. I’ll start with an analogy to explain the purpose of using a pair of keys. Then I’ll explain the mathematical concepts behind the algorithm. Then I’ll implement a toy algorithm to understand it further (but never design your own crypto algorithms). Finally, I’ll explain some openssl commands for generating RSA public and private keys which you can use in real-world applications.

Let’s start with an analogy

Suppose you are at home and need to send your passbook to the bank. You have to send it by a bad courier service that always tries to inspect what’s inside every package. So you buy a locker box with two identical keys. You keep one for yourself and need to get the other one to the bank. You can’t send the key with the package because the courier will open the locker box and inspect it. You can’t send it separately either, because this bad courier always makes copies of the keys they deliver in the hope of trying them out on future deliveries. So you walk into the bank yourself to hand over the key. Then you go home, put the passbook inside the locker box, lock it with your copy of the key and send it via the bad courier. They deliver the box without being able to see what’s inside. It works, but it was inefficient: you had to visit the bank yourself first. Can we do better?

This time the bank buys a new kind of locker box for this purpose. The new locker box has two keys, one for locking the box (public key) and another for unlocking it (private key). The key used to lock can’t be used to unlock the box, and the key used to unlock can’t be used to lock it. The bank sends the locker box to you along with the locking key (public key). As usual, the bad courier makes a copy of the key in the hope of future use. Now you put the passbook inside the box and lock it with the locking key (public key). You keep the key and send the box. The courier tries to unlock the box using the key they copied, but has no luck. The box can only be unlocked with the unlocking key (private key), which only the bank owns.

The modern Internet is like the bad courier, filled with attackers inspecting unencrypted packets. We need something similar to the second scenario to secure Internet communication.

The first scenario is an analogy for symmetric encryption. The second describes asymmetric encryption, which is the category public-key encryption belongs to. But how could we design such a lock digitally?

The Underlying Mathematics

We can write our encryption function as \(C = E(M)\) where \(M\) is the message we want to encrypt, \(E\) is the function that does the encryption and \(C\) is the encrypted message. Decryption is \(M = D(C)\). Let’s define our functions.

\[E(M) = M^e \bmod n\] \[D(C) = C^d \bmod n\]

\(e\) and \(d\) are the public and private keys respectively. \(n\) is a large number which is the product of two large prime numbers. \(\bmod\) means the modulo operation. Most programming languages represent this with the % symbol. Given two positive numbers \(a\) and \(n\), the result of \(a \bmod n\) is the remainder when \(a\) is divided by \(n\). For example \(7 \bmod 3 = 1\) because when \(7\) is divided by \(3\) the remainder is \(1\). We can write the above equations in another way using modular arithmetic notation.

\[E(M) \equiv M^e \pmod{n}\] \[D(C) \equiv C^d \pmod{n}\]

If \(a \equiv b \pmod{n}\) then \(a\) and \(b\) are congruent modulo \(n\). That means \(a - b\) is divisible by \(n\), so \(a \bmod n\) and \(b \bmod n\) give the same remainder.

Now we can write the following congruence relation. It must hold for encryption and decryption to work properly: \(D(E(M))\) must always return the original \(M\).

\[M \equiv D(E(M)) \equiv (M^e)^d \pmod{n}\]

Now we need to find a relationship between \(e\) and \(d\) such that \(D(E(M)) \equiv M \pmod{n}\) holds. To move forward we need another equation, and for that we are going to start with Fermat’s little theorem from number theory. The theorem states the following:

\[a^p \equiv a \pmod{p}\]

If \(p\) is a prime number and \(a\) is any integer, then \(a^p - a\) is an integer multiple of \(p\). We can also write this as

\[a \times a^{p - 1} \equiv a \times 1 \pmod{p}\]

If \(a\) is not divisible by \(p\), we can cancel \(a\) from both sides to get

\[a^{p - 1} \equiv 1 \pmod{p}\]
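The theorem is easy to check numerically for a small prime. The snippet below is only an illustrative sketch: the prime 13 is an arbitrary choice of mine, and Python's built-in three-argument pow is used for modular exponentiation.

```python
# Check Fermat's little theorem, a^(p-1) ≡ 1 (mod p), for a small prime.
p = 13  # arbitrary illustrative prime

# Every a in 1..p-1 is not divisible by p, so the theorem applies to all of them.
fermat_holds = all(pow(a, p - 1, p) == 1 for a in range(1, p))
print(fermat_holds)

# When a is a multiple of p, a^(p-1) is 0 modulo p rather than 1,
# which is why a can only be cancelled when p does not divide it.
multiple_case = pow(2 * p, p - 1, p)
print(multiple_case)
```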

Now let’s look at Euler’s totient function. In number theory, Euler’s totient function \(\phi(n)\) counts the positive integers up to a given integer \(n\) that are relatively prime to \(n\). Two numbers are relatively prime if they have no common divisor other than \(1\). If \(a\) and \(b\) are relatively prime then their greatest common divisor is \(1\), which we usually write as \(gcd(a, b) = 1\). For example \(10\) and \(7\) are relatively prime because \(gcd(10, 7) = 1\). Now to the totient function: notice that if \(n\) is a prime, all the positive integers less than \(n\) are relatively prime to \(n\). We can write this as

\[\phi(n) = n - 1\]

So what does it have to do with our previous equations? Euler’s theorem generalizes Fermat’s little theorem from a prime modulus to any modulus \(n\): for every \(M\) that is relatively prime to \(n\),

\[M^{\phi(n)} \equiv 1 \pmod{n}\]
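To make the totient function and this congruence concrete, here is a brute-force sketch. The helper name phi and the modulus 10 are my own choices for illustration; counting with gcd like this is only practical for tiny numbers.

```python
from math import gcd

def phi(n):
    """Count the integers in 1..n that are relatively prime to n (brute force)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

print(phi(7))   # 7 is prime, so phi(7) = 7 - 1 = 6
print(phi(10))  # 1, 3, 7 and 9 are relatively prime to 10, so phi(10) = 4

# The congruence M^phi(n) ≡ 1 (mod n) for every M relatively prime to n.
n = 10
euler_holds = all(pow(M, phi(n), n) == 1 for M in range(1, n) if gcd(M, n) == 1)
print(euler_holds)
```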

Now we are going to rework the above equation to match \(M^{ed} \equiv M \pmod{n}\), which is \(M \equiv D(E(M)) \equiv (M^e)^d \pmod{n}\) with some reordering. As the first step, raise both sides to the power \(k\), where \(k\) is any positive integer:

\[M^{k\phi(n)} \equiv 1^k \pmod{n}\]

Since \(1^k\) is \(1\)

\[M^{k\phi(n)} \equiv 1 \pmod{n}\]

Now let’s multiply both sides by \(M\)

\[M \times M^{k\phi(n)} \equiv M \times 1 \pmod{n}\] \[M^{k\phi(n) + 1} \equiv M \pmod{n}\]

We also need the following equation to hold

\[M^{ed} \equiv M \pmod{n}\]

Did you notice that both congruences have \(M\) on the right-hand side? They agree if we choose the exponents to be equal:

\[k\phi(n) + 1 = ed\]

By the definition of congruence, this is exactly the same as saying

\[ed \equiv 1 \pmod{\phi(n)}\]

We can expand \(\phi(n)\) further because \(n\) is the product of two distinct prime numbers.

\[n = p \times q\] \[\phi(n) = \phi(p) \times \phi(q)\] \[\phi(n) = (p - 1) \times (q - 1)\]

So we can write

\[ed \equiv 1 \pmod{(p -1 )(q - 1)}\]

Now choose a random private key \(d\) such that \(1 < d < \phi(n)\) and \(gcd(d, \phi(n)) = 1\) (which means they need to be relatively prime). Then we can find \(e\) as the modular multiplicative inverse of \(d\) modulo \(\phi(n)\).
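One common way to compute a modular inverse is the extended Euclidean algorithm. The sketch below is a minimal version under that assumption; the function names extended_gcd and mod_inverse are mine and do not come from any library.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def mod_inverse(d, phi_n):
    """Return e with (d * e) % phi_n == 1; d must be relatively prime to phi_n."""
    g, x, _ = extended_gcd(d, phi_n)
    if g != 1:
        raise ValueError("d and phi_n are not relatively prime")
    return x % phi_n

print(mod_inverse(7, 20))  # 3, because 7 * 3 = 21 ≡ 1 (mod 20)
```

The recursion mirrors the ordinary gcd computation, so it takes a number of steps proportional to the number of digits rather than to the size of \(\phi(n)\).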

Now that we have found keys \(e\) and \(d\) we can use them to encrypt and decrypt messages. In the next section I will give a numeric example which will clear up most of the details.

Numerical Example

Let’s choose two prime numbers \(p\) and \(q\). We choose small values to make the calculations easier, but in practice the RSA algorithm uses very large prime numbers.

\[p = 3\] \[q = 11\] \[n = p \times q = 3 \times 11 = 33\]

And let’s calculate the value of Euler’s totient function

\[\phi(n) = (p - 1)(q - 1) = 2 \times 10 = 20\]

Now we choose \(d\) such that it’s relatively prime to \(\phi(n)\). Let’s say

\[d = 7\]

Now compute \(e\) such that

\[d \times e \bmod \phi(n) = 1\] \[7 \times e \bmod 20 = 1\]

If \(e = 3\) then \(7 \times 3 \bmod 20 = 1\) satisfies the equation.

Now the private key is the pair \((d, n) = (7, 33)\) and the public key is the pair \((e, n) = (3, 33)\). Let’s encrypt some number \(M = 2\) using the public key. Of course, if you need to encrypt characters you must convert them to integers first.

\[C = M^e \bmod n\] \[C = 2^3 \bmod 33\] \[C = 8\]

Now let’s decrypt \(C\) using private key.

\[M = C^d \bmod n\] \[M = 8^7 \bmod 33\] \[M = 2\]
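The same numbers can be checked in a few lines of Python. This is just a sketch that replays the worked example with the built-in pow; the variable names are mine.

```python
p, q = 3, 11
n = p * q                  # 33
phi_n = (p - 1) * (q - 1)  # 20
d, e = 7, 3

# The keys must satisfy e * d ≡ 1 (mod phi(n)).
keys_match = (d * e) % phi_n == 1

M = 2
C = pow(M, e, n)           # encrypt with the public key: 2^3 mod 33 = 8
recovered = pow(C, d, n)   # decrypt with the private key: 8^7 mod 33 = 2
print(C, recovered)

# Every message smaller than n survives the round trip with these keys.
round_trip_ok = all(pow(pow(m, e, n), d, n) == m for m in range(n))
print(round_trip_ok)
```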

It works :). I’ve implemented this as a toy algorithm in Python. Never use your own implementation of a cryptographic algorithm in a production environment, though.

from Crypto.Util import number
import os
# fractions.gcd was removed in Python 3.9; math.gcd is the replacement
from math import gcd

Crypto.Util.number (from the PyCrypto / PyCryptodome package) is used for generating random primes.

class RSA:    
    p = q = n = d = e = pi_n = 0
    
    def __init__(self):        
        self.generate()
        
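    def generate(self):        
        self.p = number.getPrime(10, os.urandom)
        self.q = number.getPrime(10, os.urandom)
        # p and q must be distinct, otherwise (p - 1) * (q - 1) is not phi(n)
        while self.q == self.p:
            self.q = number.getPrime(10, os.urandom)
        self.n = self.p * self.q        
        self.pi_n = (self.p - 1) * (self.q - 1)        
        self.d = self.choose_d()        
        self.e = self.choose_e()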
        
    def choose_d(self):        
        self.d = self.find_a_coprime(self.pi_n)        
        return self.d
                    
    def find_a_coprime(self, a):        
        for i in range(2, a):            
            if gcd(i, a) == 1:                
                return i
            
    def choose_e(self):        
        for i in range(self.n):            
            if (i * self.d) % self.pi_n == 1:                
                return i
    
    def public_key(self):        
        return (self.e, self.n)
    
    def private_key(self):        
        return (self.d, self.n)
    
    def encrypt(self, m, key):        
        # three-argument pow reduces modulo n at every step instead of
        # building the full power first
        return pow(m, key[0], key[1])
    
    def decrypt(self, c, key):        
        return pow(c, key[0], key[1])

Now we can create an instance of the RSA class

rsa = RSA()

The following should output 99 if our algorithm works, and it does :)

rsa.decrypt(rsa.encrypt(99, rsa.public_key()), rsa.private_key())

How is it secure

The whole security of public-key encryption depends on nobody being able to derive the private key from the public key. Remember \(ed \equiv 1 \pmod{\phi(n)}\)? To calculate \(d\) from \(e\), an attacker needs to know \(\phi(n)\). But the only known practical way to calculate that is \((p - 1)(q - 1)\), and only the key generator knows \(p\) and \(q\). When \(p\) and \(q\) are large enough, no one can calculate them from \(n\) (yet!). The whole cryptosystem depends on this assumption.
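To see why the size of \(p\) and \(q\) matters, here is a sketch of the attack on the toy public key \((3, 33)\) from the numerical example. The function name factor is mine, and trial division like this only works because \(n\) is tiny.

```python
def factor(n):
    """Find p and q with p * q == n by trial division (hopeless for real key sizes)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 1
    raise ValueError("n has no nontrivial factor")

e, n = 3, 33  # the toy public key from the numerical example

p, q = factor(n)
phi_n = (p - 1) * (q - 1)
# With phi(n) known, the private exponent is just the inverse of e.
d = next(k for k in range(1, phi_n) if (k * e) % phi_n == 1)
print(p, q, d)  # 3 11 7
```

With a 2048-bit \(n\) the loop in factor would never finish in practice, and that is the assumption the security rests on.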

Practical Usage

Public-key cryptography is used everywhere. HTTPS runs on public-key cryptography (not only RSA, but it plays a huge role). If you need public-key cryptography in your own application you can use openssl. openssl is a full-featured toolkit which can generate RSA keys, among a lot of other things. To generate a key pair, type the following on the command line (assuming you are on a UNIX-based system).

openssl genpkey -algorithm RSA -out private_key.pem -pkeyopt rsa_keygen_bits:2048

The above command generates the private_key.pem file. This file contains enough information to derive the public key from it:

openssl rsa -pubout -in private_key.pem -out public_key.pem

public_key.pem contains only the details needed to encrypt data, so it can be shared freely. But you should never share your private_key.pem.

Now let’s encrypt some data using public key.

echo "reqviescat in constantia, ergo, repræsentatio cvpidi avctoris religionis" > key.bin

openssl rsautl -encrypt -inkey public_key.pem -pubin -in key.bin -out key.bin.enc

The encrypted data is written to key.bin.enc. Now let’s decrypt it again.

openssl rsautl -decrypt -inkey private_key.pem -in key.bin.enc -out key.bin

If you open the key.bin file you can see that the same data is there.

Additional Resources

Toy RSA Algorithm - Github
Original RSA Paper - A Method for Obtaining Digital Signatures and Public-Key Cryptosystems
Modular Arithmetic - Wikipedia
Fermat’s Little Theorem - Wikipedia
Coprime integers - Wikipedia
Euler’s totient function - Wikipedia
Modular multiplicative inverse - Wikipedia

Real time face detection using OpenCV and Python

2017-11-18T00:00:00+00:00

Detection is an important application of computer vision. In this post I’m going to detail how to do real-time face detection using the Viola-Jones algorithm introduced in the paper Rapid object detection using a boosted cascade of simple features (2001). Mind the word detection: we are not going to recognize whose face it is. This is merely detecting that there is a face in a given image. We are going to use OpenCV (Open Source Computer Vision Library). OpenCV is written in C++ but there are interfaces for other languages, so we will use Python.

Installation

These instructions are for OS X. The steps are as follows:

First install Anaconda.
Then create python 3 virtual environment with conda create -n py3 python=3.6.
Then type source activate py3 which will activate python 3 environment.
Now install opencv with conda install -c conda-forge opencv.
Check the installation with the command echo -e "import cv2; print(cv2.__version__)" | python. It should output 3.3.0.
Make sure your system has ffmpeg installed, which is required for reading a video stream from video formats like mp4 and mkv. Using brew you can install ffmpeg with one command: brew install ffmpeg.

Time for action

Let’s do a quick implementation.

import cv2

cap = cv2.VideoCapture("input.mp4")
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

while True:
    ret, frame = cap.read()
    # ret is False when no frame could be read, e.g. at the end of the video
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)

    # draw a red rectangle (BGR order) around every detected face
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)

    cv2.imshow('frame', frame)

    # stop when the q key is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

We are going to read a video file frame by frame, apply the Viola-Jones algorithm with trained parameters, draw a rectangle wherever a face is found, and display the modified output frame by frame.

cap = cv2.VideoCapture("input.mp4") opens the video file. You can also use a webcam instead of a video file: just pass 0 as the parameter, like cap = cv2.VideoCapture(0).

Next we load the trained model. haarcascade_frontalface_default.xml is a model already trained on a lot of faces and non-faces using a lot of computing power. Training good models sometimes takes days, not hours. Thankfully the above model has already been trained by Intel using a lot of data. The model we are using here comes with the installation of OpenCV. Generally you can find those models at your opencv-installation-directory/share/OpenCV/haarcascades, but this may differ depending on your OS and installation method.

Next, in the while loop, we read the video frame by frame. We take each frame and convert it to grayscale.

gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

A frame is an array of 3 matrices, one for each of the colors blue, green and red. Did you notice the reversed order? In OpenCV the default representation is BGR, not RGB. In the above step we convert it to a grayscale image, so gray is a single matrix now.

faces = face_cascade.detectMultiScale(gray, 1.3, 5) 

The detectMultiScale function detects faces and returns an array of position coordinates and sizes. The second parameter is scaleFactor, which must be greater than 1. To reduce false negatives (missed faces) you should use a value close to 1. Basically the algorithm can only detect faces at the size it was trained on, usually around 20x20 pixels. To detect larger objects, the detection area is scaled by scaleFactor at each step. If scaleFactor is \(1.05\), the next block size is \(20 \times 1.05 = 21\). If scaleFactor is 1.3, as in the code above, the next block size is \(20 \times 1.3 = 26\), so the sizes in between are skipped and you may miss some faces. The trade-off is accuracy vs performance.
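The effect of scaleFactor on the amount of work can be sketched by listing the block sizes that get tried. This is only an illustration of the arithmetic above, not OpenCV's internal code: the function window_sizes, the 20 pixel base size and the 200 pixel limit are all assumptions of mine.

```python
def window_sizes(base=20, scale_factor=1.3, max_size=200):
    """List the block sizes tried when the base size grows by scale_factor each step."""
    sizes = []
    size = float(base)
    while size <= max_size:
        sizes.append(int(round(size)))
        size *= scale_factor
    return sizes

coarse = window_sizes(scale_factor=1.3)
fine = window_sizes(scale_factor=1.05)

print(coarse[:3])             # [20, 26, 34]
print(fine[:3])               # [20, 21, 22]
print(len(coarse), len(fine)) # far fewer sizes are tried with 1.3 than with 1.05
```

A smaller scaleFactor tries many more sizes, so fewer faces fall between two steps, at the cost of more computation per frame.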

The third parameter is minNeighbors. It defines how many neighboring rectangles must be detected around a candidate for it to be retained. A higher value means fewer false positives.

Next we draw red rectangles around the detected faces. Notice how we pass the color red in BGR format, (0, 0, 255).

for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)

Next we write to a window frame by frame. If you press the q key the loop will break and the video will stop. I’ve created a quick video from the output here. Notice how it detects only frontal faces, because the model is trained for frontal faces.

How it works

The Viola-Jones algorithm is a machine learning algorithm developed by Paul Viola and Michael Jones in 2001. It was designed to be very fast, fast enough to run even on embedded systems. We can break it down into 4 parts:

Haar like features
Integral image
AdaBoost (Adaptive Boosting)
Cascading

Haar like features

Haar-like features, named after the Hungarian mathematician Alfred Haar, are a way of identifying features in an image in a more abstract way.

To calculate a feature the following equation is used.

\[Value = \text{(Sum of pixels in black area)} - \text{(Sum of pixels in white area)}\]

In the case of face detection, a feature placed over the eyes and the area just below them gives a strong response, and that response defines the feature.

The eye area is generally darker than the area under the eyes, so the difference between the black area and the white area is large in magnitude there. And that defines a single feature.
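A two-rectangle feature can be sketched with plain array sums. The function name two_rect_feature and the pixel values are made up for illustration, and the layout (black rectangle on top, white rectangle below) is an assumption. Note that with raw intensities, where dark pixels are small numbers, the value comes out negative, so it is the magnitude that signals a strong response.

```python
import numpy as np

def two_rect_feature(window):
    """Sum of the top half (black area) minus the sum of the bottom half (white area)."""
    half = window.shape[0] // 2
    black = int(window[:half].sum())
    white = int(window[half:2 * half].sum())
    return black - white

# A dark band (eyes) above a bright band (cheeks), and a uniform patch for comparison.
eyes_over_cheeks = np.vstack([np.full((2, 4), 50), np.full((2, 4), 200)])
flat = np.full((4, 4), 120)

print(two_rect_feature(eyes_over_cheeks))  # -1200: large magnitude, strong response
print(two_rect_feature(flat))              # 0: no response on a uniform patch
```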

Integral Image

An integral image is a way to calculate rectangle features quickly.

\[ii(x, y) = \sum_{x{'} \le x, y{'} \le y} i(x', y')\]

Every position holds the sum of all the values above it and to its left, including itself. Note that the code below is written for clarity; there are more efficient ways to write it. In this case you will be feeding a normalized image to the function. The purpose of normalizing first is to reduce the effect of different lighting conditions.

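import numpy as np

def my_intergral_image(img):
    
    intergral_img = np.zeros(img.shape)
    for x in range(img.shape[1]):
        for y in range(img.shape[0]):
            # add up every pixel above and to the left of (x, y), inclusive
            for i in range(x + 1):
                for j in range(y + 1):
                    # numpy indexes as [row, column], so the row index j comes first
                    intergral_img[y, x] += img[j, i]
    
    # pad a row and a column of zeros so lookups at the top and left edges work
    zero_padded_intergral_image = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    
    zero_padded_intergral_image[1:img.shape[0] + 1, 1:img.shape[1] + 1] = intergral_img
                    
    return zero_padded_intergral_image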
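The nested loops can be replaced by two cumulative sums. This is a minimal sketch assuming img is a 2D NumPy array; the function name integral_image is my own, and it produces the same zero-padded layout as the loop version is meant to.

```python
import numpy as np

def integral_image(img):
    """Zero-padded integral image computed with cumulative sums along both axes."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    # one extra row of zeros on top and one extra column of zeros on the left
    return np.pad(ii, ((1, 0), (1, 0)), mode="constant")

img = np.arange(12).reshape(3, 4)
ii = integral_image(img)

print(ii[-1, -1])  # 66, the sum of the whole image
print(ii[2, 3])    # 18, the sum of the first 2 rows and first 3 columns
```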

The whole purpose of the integral image is to calculate the sum of the pixels inside a given rectangle quickly.

We know the values p, q, r and s in the integral image, and each represents the sum of all the pixels above and to the left of its position. Our goal is to find the pixel sum of D.

\[\begin{aligned} p &= A \\ q &= A + B \\ r &= A + C \\ s &= A + B + C + D \end{aligned}\]

So \(D = (p + s) - (q + r)\), which significantly reduces the number of steps needed to calculate the sum of the pixels in a given rectangle.

Here’s the equivalent Python code.

def sum_region(integral_img, top_left, bottom_right):
    # convert (x, y) points to numpy (row, column) indices
    top_left = (top_left[1], top_left[0]) 
    bottom_right = (bottom_right[1], bottom_right[0])
    if top_left == bottom_right:
        return integral_img[top_left]
    # the two remaining corners of the rectangle, also as (row, column)
    bottom_left = (bottom_right[0], top_left[1])
    top_right = (top_left[0], bottom_right[1])
    # s + p - (q + r)
    return integral_img[bottom_right] + integral_img[top_left] - integral_img[top_right] - integral_img[bottom_left]

AdaBoost (Adaptive Boosting)

Now we need a way to select, from all possible features, the best ones that can correctly classify a face. The AdaBoost algorithm was formulated by Yoav Freund and Robert Schapire in 1997 and won the prestigious Gödel Prize in 2003. This elegant machine learning approach can be applied to a wide range of problems, not only image detection.

I’ll start with the idea of weak and strong classifiers.

Weak Classifier - A classifier that is a little bit better than random guessing.
Strong Classifier - A combination of weak classifiers. It represents the wisdom of a weighted crowd of experts.

AdaBoost by itself is an inherently incomplete algorithm, which is why it’s called a meta-algorithm. Let’s express the idea of a strong classifier mathematically.

\[H(x) = sign\Big( h_1(x) + h_2(x) + ... +h_T(x)\Big)\]

H(x) is a strong classifier which classifies according to the sign of the sum of the weak classifiers. Every weak classifier is a Haar feature combined with some other parameters which I’ll detail in the next few paragraphs. Suppose there are only 3 weak classifiers, so \(T = 3\). Every classifier outputs +1 or -1, so the sum of the weak classifiers has either a + or a - sign.

\[H(x) = sign\Big( +1 + -1 + -1 \Big)\]

Here the sign of the strong classifier is negative, so it’s probably not a face. I started with this simplified picture, but we also have to weight the weak classifiers because some classifiers may have a stronger influence than others.

By adding weights the strong classifier gets a little more complicated, but it’s nothing more than a simple inequality check.

\[H(x) = \begin{cases} +1, \text{if}\ \sum_{t = 1}^{T} \alpha_th_t(x) \ge \frac{1}{2} \sum_{t = 1}^{T} \alpha_t \\ -1, \text{otherwise} \end{cases}\]
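The weighted vote is a direct translation of the inequality. In this sketch the function name strong_classify and the \(\alpha\) values are invented for illustration, and the weak classifier outputs are assumed to be 0 or 1, which is the convention the \(\frac{1}{2}\sum \alpha_t\) threshold is built for.

```python
def strong_classify(weak_outputs, alphas):
    """Return +1 if the weighted votes reach half of the total weight, else -1.

    weak_outputs: one 0/1 vote per weak classifier
    alphas: the weight of each weak classifier
    """
    weighted_votes = sum(a * h for a, h in zip(alphas, weak_outputs))
    threshold = 0.5 * sum(alphas)
    return 1 if weighted_votes >= threshold else -1

alphas = [0.9, 0.4, 0.3]  # hypothetical weights, the first classifier is the most reliable

print(strong_classify([1, 0, 0], alphas))  # +1: the single reliable vote (0.9) beats 0.8
print(strong_classify([0, 1, 1], alphas))  # -1: two weaker votes (0.7) do not reach 0.8
```

So a majority of classifiers is not enough on its own: what counts is the weight behind the votes.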

Now how do we decide which weak classifiers to use and which \(\alpha\) weights to use? That’s why we need to train on existing labeled data. Suppose an image is \(x_i\) and its label is \(y_i\), which is 1 for a face and 0 for a non-face. Given example images \((x_1, y_1), ... , (x_m, y_m)\) we initialize a weight for each example. Don’t confuse this weight with the \(\alpha\) weight discussed before. This algorithm has two kinds of weights. One is the \(\alpha\) weight for each selected weak classifier. The other is a weight for each example at each step, denoted by \(w_{t,i}\). Now we initialize the latter weights for step one, \(t = 1\):

\[w_{1, i} = \frac{1}{m}\]

Here the weights are normalized so that their sum is 1. We normalize at every step to make sure the weight distribution always adds up to 1.

\[\sum_{i = 1}^{m} w_{t, i} = 1\]

The general normalization step is

\[w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j = 1}^{m}w_{t,j}}\]

Next we loop over all the features to select the best weak classifier, the one which minimizes the error rate.

\[\epsilon_t = \min_{f, p, \theta} \sum_{i = 1}^m w_{t, i} | h(x_i, f, p, \theta) - y_i |\]

Here we are calculating the sum of the weights of the misclassified examples, which is the error rate represented by epsilon \(\epsilon\). Notice that \(h(x_i, f, p, \theta) - y_i\) is 1 or -1 if the example is misclassified and 0 if it is correctly classified. We take the absolute value so that the weight gets multiplied by 1 if the example is misclassified. Let’s dive into the definition of the weak classifier.

\[h(x) = h(x_i, f, p, \theta)\]

\(x_i\) is the \(i\)th image example. \(f\) is the Haar feature. \(p\) is the polarity, either -1 or +1, which defines the direction of the inequality. \(\theta\) is the threshold.

\[h(x, f, p , \theta) = \begin{cases} 1, \text{if } pf(x) \lt p\theta \\ 0, \text{otherwise} \end{cases}\]

To select the feature \(f\) we need to loop over all the Haar features \(f(x)\). To select the threshold we need to minimize the following error.

\[error_{\theta} = \min\Big(S_+ + (T_- - S_-),\ S_- + (T_+ - S_+)\Big)\]

Here are the definitions of the symbols:

\(T_+\) is the total sum of the positive sample weights
\(T_-\) is the total sum of the negative sample weights
\(S_+\) is the sum of the positive sample weights below the threshold
\(S_-\) is the sum of the negative sample weights below the threshold
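The threshold search for a single feature can be sketched with sorted samples and running sums. This is an illustration under my own assumptions: the function name best_threshold is invented, candidate thresholds are the sample values themselves, and "below the threshold" means strictly below.

```python
import numpy as np

def best_threshold(values, labels, weights):
    """Return (threshold, error) minimizing the weighted error for one feature.

    values: feature value f(x_i) per sample, labels: 1 (face) or 0 (non-face),
    weights: current sample weights w_{t,i}.
    """
    order = np.argsort(values)
    values, labels, weights = values[order], labels[order], weights[order]

    pos = weights * (labels == 1)
    neg = weights * (labels == 0)
    t_pos, t_neg = pos.sum(), neg.sum()

    # weight strictly below each candidate threshold
    s_pos = np.cumsum(pos) - pos
    s_neg = np.cumsum(neg) - neg

    errors = np.minimum(s_pos + (t_neg - s_neg),   # everything below is called non-face
                        s_neg + (t_pos - s_pos))   # everything below is called face
    best = int(np.argmin(errors))
    return float(values[best]), float(errors[best])

values = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([1, 1, 0, 0])
weights = np.full(4, 0.25)

print(best_threshold(values, labels, weights))  # (3.0, 0.0): the faces are exactly the samples below 3
```

Sorting once and using running sums avoids recomputing the four quantities from scratch for every candidate threshold.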

After finding the minimized error \(\epsilon_t\) for step \(t\) we can find the weights for the next step \(t + 1\)

\[w_{t+1, i} = w_{t, i} \Big(\frac{\epsilon_t}{1 - \epsilon_t}\Big)^{1 -e_i}\]

Where \(e_i = 0\) if sample image \(x_i\) is correctly classified, and \(1\) otherwise. Since \(\epsilon_t < \frac{1}{2}\) for a useful weak classifier, the factor \(\frac{\epsilon_t}{1 - \epsilon_t}\) is less than 1, so the weights of correctly classified samples decrease while the weights of misclassified samples stay the same. After normalization the misclassified samples therefore carry relatively more weight, so the next round is unforgiving to weak classifiers that classify the same samples incorrectly. As a result each step tends to choose a different weak classifier with a different feature.

Finally calculate \(\alpha_t\) for the selected classifier.

\[\alpha_t = \log\Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)\]

Notice how \(\alpha_t\) is larger when the error rate is small, so that weak classifier contributes more to the strong classifier.
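One round of the weight update and the \(\alpha_t\) calculation can be sketched as follows. The function name boosting_update and the four-sample data are made up for illustration, and the sketch assumes the error is strictly between 0 and 1 so that the divisions are defined.

```python
import numpy as np

def boosting_update(weights, predictions, labels):
    """Return (new_weights, alpha, epsilon) after one boosting round."""
    weights = weights / weights.sum()                # normalize so the weights sum to 1
    e = (predictions != labels).astype(float)        # e_i = 0 if correct, 1 if misclassified
    epsilon = float(np.sum(weights * e))             # weighted error of the chosen classifier
    beta = epsilon / (1 - epsilon)
    new_weights = weights * beta ** (1 - e)          # shrink only the correctly classified samples
    new_weights = new_weights / new_weights.sum()    # normalize again for the next round
    alpha = float(np.log(1 / beta))                  # same as log((1 - epsilon) / epsilon)
    return new_weights, alpha, epsilon

weights = np.full(4, 0.25)
labels = np.array([1, 1, 0, 0])
predictions = np.array([1, 1, 0, 1])  # only the last sample is misclassified

new_weights, alpha, epsilon = boosting_update(weights, predictions, labels)
print(epsilon)      # 0.25
print(new_weights)  # the misclassified sample now carries half of the total weight
print(alpha)        # log(3), about 1.0986
```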

So finally the algorithm needs to loop for \(T\) steps to find \(T\) classifiers to get good results.

Cascading

You may have noticed how many loops we have in the algorithm, so AdaBoost along with Haar features is computationally expensive for real-time detection. So we use an attentional cascade to avoid unnecessary computation. The cascade is constructed so that negative sub-windows are rejected as early as possible. Every stage of the cascade is a strong classifier, so all the features are grouped into several stages where each stage has a certain number of features.

For the cascade we need the following parameters

Number of stages in the cascade
Number of features in each stage
Threshold of each strong classifier

Finding optimum values for the above parameters is a difficult task. Viola and Jones introduced a simple method to find a good combination.

Select \(f_i\), the maximum acceptable false positive rate per stage
Select \(d_i\), the minimum acceptable true positive (detection) rate per stage
Select \(F_{target}\), the target overall false positive rate

Now we keep adding new stages until the predefined \(F_{target}\) is met. Within each stage we keep adding features until \(f_i\) and \(d_i\) are met. By doing this we create a cascade of strong classifiers.
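The early rejection that makes the cascade fast can be sketched at detection time like this. Everything here is a stand-in: cascade_predict is my own name, and the lambda stages with their "contrast" and "symmetry" fields are hypothetical placeholders for real strong classifiers over Haar-like features.

```python
def cascade_predict(window, stages):
    """Run the stages in order and stop at the first one that rejects the window.

    Returns (is_face, number_of_stages_evaluated).
    """
    for count, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, count
    return True, len(stages)

# Hypothetical stand-in stages, ordered from cheapest to most expensive.
stages = [
    lambda w: w["contrast"] > 10,
    lambda w: w["contrast"] > 50,
    lambda w: w["symmetry"] > 0.8,
]

background = {"contrast": 3, "symmetry": 0.1}
face_like = {"contrast": 90, "symmetry": 0.9}

print(cascade_predict(background, stages))  # (False, 1): rejected by the first stage
print(cascade_predict(face_like, stages))   # (True, 3): had to pass every stage
```

Most sub-windows in an image are background, so most of them cost only the first, cheapest stage.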

Additional Resources

Paper (Revised) Viola Jones 2001.
Viola Jones Python Implementation Github
To learn more about AdaBoost read the book Boosting Foundations and Algorithms. It’s written by the original authors of the algorithm.
Source code for the face detection in this post - Realtime face detection
Pull requests are welcome if you find anything wrong in the post - Blog

Introduction to Machine Learning with Linear Regression

2017-10-05T00:00:00+00:00

Machine learning is all about computers learning from data by themselves in order to make predictions about new data. There are two main categories of machine learning:

Supervised Learning
Unsupervised Learning

In supervised learning the computer learns from inputs and outputs, creating a general model which can be used to predict the output for a new input. Unsupervised learning is about discovering patterns in data without knowing explicit labels beforehand. In this post I’m going to talk about linear regression, which is a supervised learning method.

Jupyter Notebook

Jupyter Notebook is an interactive Python environment. You can install Jupyter Notebook with Anaconda, which will also install all the necessary packages. After installation just run jupyter notebook on the command line, which will open a Jupyter Notebook instance in the browser. Click New and create a Python 2 notebook.

Enter the following lines of code in a cell and press Shift + Enter, which will execute the code. The full notebook of the code I used in this post is also available on GitHub.

import numpy as np
import matplotlib.pyplot as plt

NumPy is used for matrix operations. It’s the most important library in the Python scientific computing ecosystem.

Matplotlib is a plotting library. We will use it to visualize our data.

Data

In this post I’m not going to use any real data set. Instead I’m going to generate some random data.

rng = np.random.RandomState(42)
x = rng.rand(100) * 10
x = x[:, np.newaxis]
b = rng.rand(100) * 5
b = b[:, np.newaxis]
y = 3 * x + b

Here we generate 100 rows of random data, so \(m = 100\). \(m\) is usually used in machine learning to denote the number of data points, in this case pairs of \(x\) and \(y\). Now we are going to plot the dataset using Matplotlib.

plt.scatter(x, y)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()

This is the plot generated from the code. Every blue dot is a data point. There are 100 blue dots here.

The data is distributed in a roughly linear fashion because of the way we generated it.

Linear Regression

The purpose of linear regression is to find the equation of a line that does justice to all the data points. We will define the equation of the line as \(h(x) = \theta_0 + \theta_1x\). In machine learning \(h(x)\) is called the hypothesis function. But this is just a fancy way of writing the old-school \(y = mx + c\) where \(c = \theta_0\) and \(m = \theta_1\) (this slope \(m\) is not the number of data points from before). Now our goal is to find \(\theta_0\) and \(\theta_1\). If we know \(\theta_0\) and \(\theta_1\) we can draw the line. To find them we are going to use a function called the cost function.

Cost Function

We are going to use the following equation, which is called the least squares method.

\[J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]

The goal is to minimize \((h(x) - y)\), so that the difference between the hypothesis function output and \(y\) is as small as possible. Suppose \(\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\) is \(0\). That means \(J(\theta_0, \theta_1) = 0\), so the data fits the straight line \(h(x) = \theta_0 + \theta_1x\) perfectly. But for a real dataset that may not be the case. If the data is spread all over the plot this value may be high. OK, you now get what \((h(x) - y)\) means, but why square it? We square it because \(h(x^{(i)}) - y^{(i)}\) can be negative or positive depending on how the data is distributed and on the values of \(\theta_0\) and \(\theta_1\). By squaring we make sure there are no negative values, so the differences can’t cancel each other out. That way \(0\) really means that the line fits the data distribution. In a real-world situation where the data is not perfectly linear we can’t get \(J(\theta_0, \theta_1)\) to zero, but we try to get the minimum value possible. We call this ‘minimizing the cost function’. We divide by \(m\) to get the average, a relatively small value. We divide by 2 because it cancels the factor of 2 that appears when we take the derivative later, which makes that equation simpler.

Now we are going to implement the cost function in python. First we implement the hypothesis function \(h(x) = \theta_0 + \theta_1x\).

def h(x, theta0, theta1):
    return theta0 + theta1 * x

If we pass normal Python integer values for x, theta0 and theta1 it will output an integer. What happens if we change x to a NumPy array? All values get multiplied by theta1, then theta0 is added to each of them. The output will be a NumPy array. This happens because of a NumPy feature called broadcasting. It saves us from writing a for loop to calculate all the values step by step. Broadcasting is a very powerful concept and is used everywhere when doing machine learning in Python. Next we implement the cost function in Python.

def cost(x, theta0, theta1):
    # divide by 2m, matching J above; y is the global array generated earlier
    return np.sum(np.power(h(x, theta0, theta1) - y, 2)) / (2 * x.shape[0])

The np.sum() function takes the sum of all the values inside a NumPy array and returns a scalar value. np.power() raises every value to a power, in this case squaring them, and returns an array of the same size with the modified values. We pass a NumPy array as x, so x.shape[0] is the number of elements in x, which is also equal to \(m\).
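A tiny hand-checkable input shows both the broadcasting and the \(\frac{1}{2m}\) scaling. This sketch is self-contained and differs slightly from the code above: its cost takes y as an explicit parameter (my choice, to avoid relying on a global), and the three data points are made up.

```python
import numpy as np

def h(x, theta0, theta1):
    return theta0 + theta1 * x

def cost(x, y, theta0, theta1):
    m = x.shape[0]
    return np.sum((h(x, theta0, theta1) - y) ** 2) / (2 * m)

x = np.array([0.0, 1.0, 2.0])
y = 1 + 3 * x                 # points lying exactly on the line with theta0=1, theta1=3

print(h(2, 1, 3))             # 7: plain integers in, plain integer out
print(h(x, 1, 3))             # [1. 4. 7.]: broadcasting applies the formula to every element
print(cost(x, y, 1, 3))       # 0.0: a perfect fit
print(cost(x, y, 0, 3))       # 0.5: every residual is -1, so J = 3 / (2 * 3)
```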

Now we need to find the theta0 and theta1 which minimize the cost function. One approach is to brute-force a range of theta0 and theta1 values and pick the ones where the return value of the cost function is minimal. But can we do better?

Gradient Descent

Gradient descent is an iterative algorithm to find out the minimum of a function. This is what we are going to do.

\[\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\] \[\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)\]

I will explain step by step. First, \(:=\) is the assignment operator. In programming languages we use the \(=\) operator for assignment, but in math \(=\) means equality, so \(:=\) is used for assignment in mathematics. \(\alpha\) is called the learning rate, which I will explain later in detail. \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\) is the partial derivative of the cost function with respect to \(\theta_0\).

\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]

This is what the plot of the cost function against \(\theta_0\) looks like after 100,000 iterations.

The derivative measures the slope. You can see the slope is negative where the plot is going down. \(\alpha\) is the learning rate, which is \(0.001\) here. So

\[\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\]

You can see \(\theta_0\) will increase because \(\alpha\) is positive and \(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\) is negative. The slope flattens out as we approach the minimum, so \(\theta_0\) settles to a stable value.

Here are the derivative steps for \(\theta_0\).

\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\]

I’ve taken the \(\frac{1}{2m}\) out of the derivative. Since \(h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}\) we can write

\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]

Now we are ready to take the derivative. By the chain rule each squared term contributes a factor of 2, which cancels the \(\frac{1}{2}\).

\[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})\]

Next, the derivative steps for \(\theta_1\).

\[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_1} \sum_{i=1}^m(h(x^{(i)}) - y^{(i)})^2\] \[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_1} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2\]

After taking the derivative,

\[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)}\]
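A quick way to gain confidence in these two formulas is to compare them with a numerical derivative. The sketch below uses central finite differences; the function names, the random seed, the evaluation point and the step size are all arbitrary choices of mine.

```python
import numpy as np

def cost(x, y, theta0, theta1):
    m = x.shape[0]
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

def gradients(x, y, theta0, theta1):
    """The two partial derivatives derived above."""
    m = x.shape[0]
    residual = theta0 + theta1 * x - y
    return np.sum(residual) / m, np.sum(residual * x) / m

rng = np.random.RandomState(0)
x = rng.rand(50) * 10
y = 3 * x + rng.rand(50) * 5

theta0, theta1, step = 0.5, 1.0, 1e-5
numeric0 = (cost(x, y, theta0 + step, theta1) - cost(x, y, theta0 - step, theta1)) / (2 * step)
numeric1 = (cost(x, y, theta0, theta1 + step) - cost(x, y, theta0, theta1 - step)) / (2 * step)
analytic0, analytic1 = gradients(x, y, theta0, theta1)

print(analytic0, numeric0)  # the two values should agree to several decimal places
print(analytic1, numeric1)
```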

Now let’s look at the equivalent python code.

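theta0 = 0
theta1 = 0
alpha = 0.001
iteration = 0
m = x.shape[0]  # number of data points

while True:    
    # 1.0 / m avoids integer division when running under Python 2
    temp0 = (1.0 / m) * np.sum(h(x, theta0, theta1) - y)
    temp1 = (1.0 / m) * np.sum((h(x, theta0, theta1) - y) * x)
    
    # update both parameters only after both gradients are computed
    theta0 = theta0 - alpha * temp0
    theta1 = theta1 - alpha * temp1
        
    iteration += 1
    
    if iteration > 100000:
        print("theta0: " + str(theta0) + "theta1:" + str(theta1))
        break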

We initialize theta0 and theta1 to 0 and the learning rate alpha to 0.001. You must choose a small value, otherwise the algorithm may not converge smoothly. Also note that you must use temporary values so that both parameters are updated simultaneously from the old values.

Now that we have found the optimal theta0 and theta1 values shown below, we can draw the line on top of the first plot.

theta0: 2.567988282theta1:2.98323417806

You can see that the algorithm learned to draw the best-fitting line over the data distribution.

Prediction

Here comes the easy part. Now we have a model with learned parameters which can be used to predict. We can use our hypothesis function.

h(3, theta0, theta1)

11.517690816184686

Scikit Learn

Scikit-learn is a Python library containing many of the most used classical machine learning algorithms. Now you know what happens under the hood of a linear regression function, so you are ready to use a library to do production-level linear regression.

Here’s how you import the library

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x, y)

The fit() function trains the model on the data. predict() expects a 2D array with one row per sample.

model.predict([[3]])

11.51769082

You can see that the values are similar to those from our own implementation.
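The values agree because both approaches minimize the same least squares cost. As a sketch of what a direct solution looks like, the snippet below solves the least squares problem with NumPy instead of iterating. It regenerates the data with the same recipe as the Data section but keeps x one-dimensional, and the variable names A and theta are mine.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(100) * 10
b = rng.rand(100) * 5
y = 3 * x + b

# Design matrix with a column of ones for theta0 and a column of x for theta1.
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta)                     # [theta0, theta1], close to what gradient descent found
print(theta[0] + theta[1] * 3)   # prediction for x = 3
```

At the solution both partial derivatives of the cost are zero, which is the point gradient descent slowly walks towards.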

The GitHub repo of the example code is available here