Linear regression is a method for modeling dependence relationships: a linear model establishes the relationship between a dependent variable y (the target) and one or more independent variables X (the inputs).
Let us walk through a simple example to get an idea of linear regression. Our sample data set is shown below, followed by a plot of the data.
The values in the column ‘GrLivArea’ are independent variables: they are pre-determined, and we cannot change them through the other variables. The values in the column ‘Price’ are dependent variables, since we can predict them from changes in the independent variables. Observing the diagram, there seems to be a positive impact of GrLivArea (X) on the Price (y) of the houses. We could try to fit a line to the graph that describes the ‘average’ impact of X on y. But how do we find the straight line that fits the data best?
Basically, we can use either the Ordinary Least Squares (OLS) method or an algorithm called Gradient Descent.
1. Ordinary Least Squares (OLS) Method
1.1. Sum of Squared Deviations
The sum of squared deviations, also called the sum of squared errors, measures how far the observations fall from the fitted line: it is the sum of the squared differences between the actual values (each observation) and the predicted values (the points on the line). Let us take a look at the mathematical solution.
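In symbols, writing the predicted value on the line as ŷᵢ = m·xᵢ + b, the quantity we want to make as small as possible is:

$$\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2$$

Each term is one observation’s vertical distance to the line, squared so that deviations above and below the line do not cancel out.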
1.2. Minimizing the sum of squared deviations
The following is the process by which we derive the formulas for m and b; with simple linear regression, we can then use the resulting formulas directly.
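Setting the partial derivatives of the sum of squared deviations with respect to m and b to zero gives the standard closed-form solution (with x̄ and ȳ denoting the sample means):

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\,\bar{x}$$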
Now that we have the solution for the values of m and b, let us see how it works in our example:
Using the example data set and the formulas above, we derive m = 142.02 and b = -11954; below is the scatter plot with the best-fit line.
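The closed-form OLS computation can be sketched in a few lines of Python. Note that the GrLivArea/Price values below are hypothetical placeholders, not the article’s actual sample data set, so the fitted m and b will differ from the values derived above.

```python
# A minimal sketch of the closed-form OLS solution for simple linear regression.

def ols_fit(x, y):
    """Return slope m and intercept b minimizing the sum of squared deviations."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # m = sum((x_i - x_mean)(y_i - y_mean)) / sum((x_i - x_mean)^2)
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = sum((xi - x_mean) ** 2 for xi in x)
    m = numerator / denominator
    b = y_mean - m * x_mean
    return m, b

# Hypothetical GrLivArea (sq ft) and Price values for illustration only:
gr_liv_area = [1000, 1500, 2000, 2500, 3000]
price = [130000, 200000, 275000, 340000, 415000]

m, b = ols_fit(gr_liv_area, price)
print(f"m = {m:.2f}, b = {b:.2f}")
```

The function mirrors the formulas term by term: the numerator and denominator are the two sums over the centered data, and the intercept follows from the means.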
The Ordinary Least Squares (OLS) method is an efficient way to deal with simple linear regression problems (i.e. a single independent variable and a single dependent variable). It can handle linear regression with multiple variables as well. We can substitute b and m with θ0 and θ1; thus, the model can be denoted as:
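In this θ notation, the simple model is h(x) = θ0 + θ1·x, and with multiple independent variables x1, …, xn it extends naturally to:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$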
2. Gradient Descent
2.1. The Hypothesis
In machine learning, we have the notion of a hypothesis, represented as h, which is a function that takes X (the inputs) and tries to output an estimated value of y. Therefore, we can present our model as follows:
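For simple linear regression with one input, the hypothesis is the familiar straight line, now written in θ notation:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Here θ0 plays the role of the intercept b and θ1 the role of the slope m from the OLS section.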
2.2. Cost Function
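For linear regression, the standard cost function that gradient descent minimizes is the mean of the squared errors over the n training examples (conventionally with a factor of 1/2, which simplifies the derivative):

$$J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

This is the same sum-of-squared-deviations idea as in the OLS section, expressed as a function of the parameters θ0 and θ1.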
2.3. The Algorithm
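Gradient descent starts from an initial guess for the parameters and repeatedly steps against the gradient of the cost function, where α is the learning rate and all parameters are updated simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

The variants below differ only in how much of the training data is used to compute the gradient at each step.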
There are three main variants of the Gradient Descent algorithm:
Batch Gradient Descent: In Batch Gradient Descent, calculating the gradient of the cost function requires summing over all the training examples at every step. Therefore, if we have a huge data set, this method is not appropriate.
Stochastic Gradient Descent (SGD): Instead of using all the training examples at each step as Batch Gradient Descent does, SGD uses one randomly selected training example per step. Depending on the problem, it can sometimes be much faster than Batch Gradient Descent. However, the error rate may jump around instead of decreasing smoothly.
Mini-Batch Gradient Descent: This method is a combination of Batch Gradient Descent and SGD; it uses a randomly selected subset of the data at every step. Mini-Batch Gradient Descent strikes a balance between the stability of Batch Gradient Descent and the efficiency of SGD.