Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.
Here, for simplicity, we refer to the dependent variable as the response and to the independent variables as features.
To build a basic understanding of linear regression, we start with its most basic version: simple linear regression.
- Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature. It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).
Let us consider a dataset where we have a value of the response y for every value of the feature x:
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
y | 1 | 3 | 2 | 5 | 7 | 8 | 8 | 9 | 10 | 12 |
For generality, we define:
x as the feature vector, i.e. $x = [x_1, x_2, \ldots, x_n]$,
y as the response vector, i.e. $y = [y_1, y_2, \ldots, y_n]$,
for n observations (in the above example, n = 10).
Plotted as a scatter plot, the points of the above dataset lie roughly along a rising straight line.
Now, the task is to find the line that best fits this scatter plot, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).
This line is called the regression line.
The equation of the regression line is represented as: $Y=a+bX$
Here X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
The Linear Regression Equation
Linear regression is a way to model the relationship between two variables. You might also recognize the equation as the slope-intercept form of a straight line. The equation has the form Y=a+bX, where Y is the dependent variable (the variable plotted on the Y axis), X is the independent variable (plotted on the X axis), b is the slope of the line and a is the y-intercept. For n data points, a and b are computed from the data as:
$a=\frac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}$
$b=\frac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2}$
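The two formulas above translate almost directly into code. The following is a minimal Python sketch, not a definitive implementation: the function name `fit_line` is hypothetical (it does not come from the original), and it is applied here to the ten-point dataset from the simple linear regression section.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x.

    `fit_line` is an illustrative name; it applies the sum-based
    formulas for a and b given above.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Shared denominator of both formulas: n * sum(x^2) - (sum(x))^2
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
    b = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a, b


# The ten-point example dataset (x = 0..9).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]

a, b = fit_line(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")  # prints: a = 1.2364, b = 1.1697
```

For this dataset the fitted line is therefore approximately Y = 1.2364 + 1.1697X.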
The first step in finding a linear regression equation is to determine if there is a relationship between the two variables. This is often a judgment call for the researcher. You’ll also need a list of your data in x-y format (i.e. two columns of data — independent and dependent variables).
Points to be Considered:
Just because two variables are related, it does not mean that one causes the other. For example, although there is a relationship between high GRE scores and better performance in graduate school, it doesn’t mean that high GRE scores cause good graduate school performance.
If you try to find a linear regression equation for a set of data (especially through an automated program like Excel or a TI-83), you will find one, but that does not necessarily mean the equation is a good fit for your data. One technique is to make a scatter plot first, to see whether the data roughly fits a line before you try to find a linear regression equation.
Step 1: Make a chart of your data, filling in the columns
Subject | Age (X) | Glucose level (Y) | XY | $X^2$ | $Y^2$ |
---|---|---|---|---|---|
1 | 43 | 99 | 4257 | 1849 | 9801 |
2 | 21 | 65 | 1365 | 441 | 4225 |
3 | 25 | 79 | 1975 | 625 | 6241 |
4 | 42 | 75 | 3150 | 1764 | 5625 |
5 | 57 | 87 | 4959 | 3249 | 7569 |
6 | 59 | 81 | 4779 | 3481 | 6561 |
$\sum$ | 247 | 486 | 20485 | 11409 | 40022 |
From the above table, $\sum x = 247$, $\sum y = 486$, $\sum xy = 20485$, $\sum x^2 = 11409$, $\sum y^2 = 40022$. n is the sample size (6, in our case).
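The column totals can be double-checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the variable names (`ages`, `glucose`, and so on) are illustrative and do not come from the original.

```python
# Age (X) and glucose level (Y) for the six subjects in the table.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

n = len(ages)                                       # sample size
sum_x = sum(ages)                                   # sum of X
sum_y = sum(glucose)                                # sum of Y
sum_xy = sum(x * y for x, y in zip(ages, glucose))  # sum of XY
sum_x2 = sum(x * x for x in ages)                   # sum of X^2
sum_y2 = sum(y * y for y in glucose)                # sum of Y^2

print(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
# prints: 6 247 486 20485 11409 40022
```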
Step 2: Use the following equations to find a and b.
$a=\frac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}$
$b=\frac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2}$
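This gives a = 65.1416 and b = 0.385225, as the calculations below show.
Find a:
$a = ((486 \times 11{,}409) - (247 \times 20{,}485)) / (6(11{,}409) - 247^2)$
$= (5{,}544{,}774 - 5{,}059{,}795) / (68{,}454 - 61{,}009)$
$= 484{,}979 / 7{,}445$
$= 65.14$
Find b:
$b = (6(20{,}485) - (247 \times 486)) / (6(11{,}409) - 247^2)$
$= (122{,}910 - 120{,}042) / (68{,}454 - 61{,}009)$
$= 2{,}868 / 7{,}445$
$= 0.385225$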
Step 3: Insert the values into the equation $y'=a+bx$:
$y'=65.14+0.385225x$
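Once a and b are known, the equation can be used to predict the glucose level for a given age. The sketch below recomputes a and b from the table totals and evaluates the line; the helper name `predict_glucose` and the age of 55 are assumptions made purely for illustration and are not part of the original example.

```python
# Totals taken from the table in Step 1.
n = 6
sum_x, sum_y, sum_xy, sum_x2 = 247, 486, 20485, 11409

# Shared denominator: n * sum(x^2) - (sum(x))^2 = 7445
denom = n * sum_x2 - sum_x ** 2

a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
b = (n * sum_xy - sum_x * sum_y) / denom       # slope


def predict_glucose(age):
    """Evaluate the fitted line y' = a + b*x (hypothetical helper name)."""
    return a + b * age


print(round(a, 4), round(b, 6))        # prints: 65.1416 0.385225
print(round(predict_glucose(55), 2))   # prints: 86.33 (55 is an illustrative age)
```

Predictions are most trustworthy for ages inside the range covered by the data (21 to 59 here); the line says little about ages far outside it.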
Regression Coefficient
A regression coefficient is the same thing as the slope of the line of the regression equation. The equation for the regression coefficient that you’ll find on the AP Statistics test is:
$B_1=b_1=\frac{\sum [(x_i-\bar{x})(y_i-\bar{y})]}{\sum (x_i-\bar{x})^2}$
Here $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x.
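This mean-deviation form gives the same slope as the sum-based formula for b used earlier. As a quick check, the following Python sketch applies it to the age and glucose data; the function name `slope_from_deviations` is hypothetical and chosen only for this illustration.

```python
def slope_from_deviations(xs, ys):
    """Slope b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)."""
    x_bar = sum(xs) / len(xs)  # mean of x
    y_bar = sum(ys) / len(ys)  # mean of y
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Age (X) and glucose level (Y) from the worked example.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

b1 = slope_from_deviations(ages, glucose)
print(round(b1, 6))  # prints: 0.385225, matching b from Step 2
```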