written 5.6 years ago
Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables.
This technique is used for forecasting, time series modelling and exploring cause-and-effect relationships between variables.
For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve or line to the data points in such a manner that the distances of the data points from the curve or line are minimized. I’ll explain this in more detail in the coming sections.
Why do we use Regression Analysis?
As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s say you want to estimate the growth in sales of a company based on current economic conditions.
You have recent company data which indicates that the growth in sales is around two and a half times the growth in the economy. Using this insight, we can predict future sales of the company based on current and past information.
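As a rough illustration of the sales example above, the ratio can be estimated from past data and then used for prediction. The figures below are hypothetical, chosen so that sales growth is exactly 2.5 times economic growth; real data would be noisier, and `predict_sales_growth` is just an illustrative helper name.

```python
# Hypothetical yearly growth figures in percent (illustrative only).
economy_growth = [1.0, 2.0, 3.0, 4.0]
sales_growth = [2.5, 5.0, 7.5, 10.0]

# Least-squares slope of a line through the origin: sum(x*y) / sum(x*x).
ratio = (sum(x * y for x, y in zip(economy_growth, sales_growth))
         / sum(x * x for x in economy_growth))


def predict_sales_growth(economy_pct):
    """Predict sales growth (%) from economic growth (%) using the fitted ratio."""
    return ratio * economy_pct


print(ratio)                      # estimated sales-to-economy growth ratio
print(predict_sales_growth(2.0))  # predicted sales growth for 2% economic growth
```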
There are multiple benefits of using regression analysis. They are as follows:
It indicates the significant relationships between the dependent variable and the independent variables.
It indicates the strength of impact of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities.
These benefits help market researchers, data analysts and data scientists evaluate candidate variables and eliminate the weak ones, so that the best set of variables is used for building predictive models.
1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics people pick up while learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.
• There must be a linear relationship between the independent and dependent variables.
• Multiple regression can suffer from multicollinearity, autocorrelation and heteroskedasticity.
• Linear regression is very sensitive to outliers. They can badly distort the regression line and, eventually, the forecasted values.
• Multicollinearity can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable.
• In case of multiple independent variables, we can use forward selection, backward elimination or a stepwise approach to select the most significant independent variables.
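The sensitivity to outliers mentioned above can be seen in a minimal sketch of simple linear regression fitted by ordinary least squares. The data is made up for illustration, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]         # points lying exactly on y = 2x
with_outlier = [2, 4, 6, 8, 30]  # same points, last one replaced by an outlier

clean_fit = fit_line(xs, clean)
outlier_fit = fit_line(xs, with_outlier)
print(clean_fit)    # slope 2, intercept 0
print(outlier_fit)  # slope 6, intercept -8
```

A single outlying point triples the slope here, which is why it is worth plotting the data before trusting the fitted line.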
2. Logistic Regression
• It is widely used for classification problems
• Logistic regression doesn’t require a linear relationship between the dependent and independent variables. It can handle various types of relationships because it applies a non-linear log transformation to the predicted odds (the logit).
• To avoid over fitting and under fitting, we should include all significant variables. A good approach to ensure this practice is to use a stepwise method to estimate the logistic regression.
• It requires large sample sizes because maximum likelihood estimates are less powerful at small sample sizes than ordinary least squares estimates.
• The independent variables should not be correlated with each other, i.e. no multicollinearity. However, we have the option to include interaction effects of categorical variables in the analysis and in the model.
• If the values of the dependent variable are ordinal, it is called ordinal logistic regression.
• If the dependent variable is multi-class, it is known as multinomial logistic regression.
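To make the log-odds idea concrete, here is a minimal sketch of binary logistic regression with one predictor, fitted by plain gradient descent. The data (hours studied versus pass/fail), the learning rate and the number of steps are assumptions chosen for illustration; in practice a statistics library would usually do the fitting.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=20000):
    """Fit p(y=1|x) = sigmoid(b + w*x) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical data: hours studied and whether the student passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)


def predict_proba(x):
    """Predicted probability of passing after x hours of study."""
    return sigmoid(b + w * x)


print(predict_proba(1.0), predict_proba(4.0))
```

The model is linear in the log-odds, `log(p / (1 - p)) = b + w*x`, even though the predicted probability itself follows an S-shaped curve.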
3. Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. For example, y = a + b*x^2 is a polynomial regression equation.
While there might be a temptation to fit a higher-degree polynomial to get a lower error, this can result in over-fitting. Always plot the relationship to see the fit, and make sure that the curve fits the nature of the problem.
• Especially look out for the curve towards the ends and see whether those shapes and trends make sense. Higher-degree polynomials can end up producing weird results on extrapolation.
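A minimal sketch of a polynomial fit, assuming the simple form y = a + b*x^2: the squared predictor is treated as the single feature of an ordinary least squares line. The data is generated from y = 1 + 3*x^2 purely for illustration, and `fit_quadratic` is a hypothetical helper.

```python
def fit_quadratic(xs, ys):
    """Fit y = a + b * x**2 by least squares, treating x**2 as the single feature."""
    zs = [x ** 2 for x in xs]
    n = len(zs)
    mean_z = sum(zs) / n
    mean_y = sum(ys) / n
    b = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
         / sum((z - mean_z) ** 2 for z in zs))
    a = mean_y - b * mean_z
    return a, b


xs = [-2, -1, 0, 1, 2]
ys = [13, 4, 1, 4, 13]  # generated from y = 1 + 3 * x**2

a, b = fit_quadratic(xs, ys)
print(a, b)

# Extrapolation: x = 10 is far outside the observed range [-2, 2],
# so any error in b would be multiplied by 100 in this prediction.
extrapolated = a + b * 10 ** 2
print(extrapolated)
```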
4. Stepwise Regression
This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, which involves no human intervention.
The aim of this modeling technique is to maximize the prediction power with the minimum number of predictor variables. It is one of the methods for handling high-dimensional data sets.
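One way to picture the automatic selection is a forward-selection loop: start with no predictors and repeatedly add the one that reduces the residual sum of squares (RSS) the most. The sketch below is an assumption-laden illustration, not a standard implementation: the stopping rule (stop when the gain is under 1% of the total sum of squares), the helper names and the data are all made up, and real stepwise procedures typically use criteria such as significance tests or information criteria instead.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on the columns plus an intercept."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def forward_selection(features, y, min_gain=0.01):
    """Greedily add the predictor that lowers RSS most; stop when the gain is small."""
    total = rss([], y)  # intercept-only model: total sum of squares
    selected, best = [], total
    while len(selected) < len(features):
        scores = {name: rss([features[n] for n in selected + [name]], y)
                  for name in features if name not in selected}
        name = min(scores, key=scores.get)
        if best - scores[name] < min_gain * total:
            break
        selected.append(name)
        best = scores[name]
    return selected


# Hypothetical data: sales depends on price and promo, not on the noise column.
features = {
    "price": [1, 2, 3, 4, 5, 6, 7, 8],
    "promo": [3, 1, 4, 1, 5, 9, 2, 6],
    "noise": [2, 7, 1, 8, 2, 8, 1, 8],
}
sales = [2 * p + q for p, q in zip(features["price"], features["promo"])]

selected = forward_selection(features, sales)
print(selected)
```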
5. Ridge Regression
Ridge regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which can push the estimated values far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
• The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed.
• It shrinks the values of the coefficients but never to exactly zero, so it performs no feature selection.
• This is a regularization method and uses L2 regularization.
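The effect of the L2 penalty on two highly correlated predictors can be sketched as follows. This is a simplified illustration that assumes no intercept and solves the 2x2 system (X'X + lam*I) w = X'y directly; the data, the penalty value and the helper name `ridge_two_features` are hypothetical.

```python
import math


def ridge_two_features(x1, x2, y, lam):
    """Solve (X'X + lam*I) w = X'y for two predictors, no intercept, by Cramer's rule."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det


# Hypothetical data: x2 is almost a copy of x1, and y is roughly x1 + x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
y = [2.0, 4.1, 6.1, 7.9, 10.2]

ols = ridge_two_features(x1, x2, y, lam=0.0)    # lam = 0 is ordinary least squares
ridge = ridge_two_features(x1, x2, y, lam=1.0)  # lam = 1 adds the L2 penalty

print("OLS:  ", ols)
print("Ridge:", ridge)
print("Coefficient norms:", math.hypot(*ols), math.hypot(*ridge))
```

With these numbers the OLS coefficients come out quite unequal even though the two predictors are nearly interchangeable, while the ridge coefficients are smaller in overall size, closer to each other, and still non-zero.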
6. Lasso Regression
Similar to ridge regression, lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients.
• The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed.
• It shrinks some coefficients to exactly zero, which helps in feature selection.
• This is a regularization method and uses L1 regularization.
• If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.
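Why the L1 penalty can produce exact zeros is easiest to see in the simplified case of one predictor and no intercept, where the lasso solution has a closed form based on soft-thresholding. The sketch below assumes the objective 0.5 * sum((y - w*x)^2) + lam * |w|; the data and helper names are made up for illustration.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_one_feature(xs, ys, lam):
    """Minimize 0.5 * sum((y - w*x)**2) + lam * abs(w) for a single predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized coefficient is 2

for lam in (0.0, 14.0, 28.0, 40.0):
    print(lam, lasso_one_feature(xs, ys, lam))
```

Once the penalty is large enough (28 or more for this data), the coefficient is exactly zero rather than merely small, which is what makes lasso usable for feature selection.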
7. ElasticNet Regression
ElasticNet is a hybrid of the lasso and ridge regression techniques. It is trained with both L1 and L2 penalties as regularizers.
• It encourages a group effect in case of highly correlated variables.
• There are no limitations on the number of selected variables.
• It can suffer from double shrinkage.
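The double shrinkage mentioned above can be illustrated in the simplified case of one predictor and no intercept, assuming the objective 0.5 * sum((y - w*x)^2) + l1 * |w| + 0.5 * l2 * w^2. The data, penalty values and helper name are hypothetical.

```python
def elastic_net_one_feature(xs, ys, l1, l2):
    """Minimize 0.5*sum((y - w*x)**2) + l1*abs(w) + 0.5*l2*w**2 for one predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - l1, 0.0)   # L1 part: soft-thresholding
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / (sxx + l2)  # L2 part: further shrinkage


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

unpenalized = elastic_net_one_feature(xs, ys, l1=0.0, l2=0.0)
lasso_only = elastic_net_one_feature(xs, ys, l1=14.0, l2=0.0)
ridge_only = elastic_net_one_feature(xs, ys, l1=0.0, l2=14.0)
both = elastic_net_one_feature(xs, ys, l1=14.0, l2=14.0)

print(unpenalized, lasso_only, ridge_only, both)
```

Each penalty alone halves the coefficient here, and applying both quarters it: the estimate is shrunk twice, which is the double shrinkage effect.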