Linear Regression Validation


What is Multicollinearity?

Multicollinearity occurs when two or more predictors are highly correlated. As a result, the standard errors of the coefficients become large, so the coefficient estimates are unstable and may not be accurate. In the extreme case of perfect correlation between predictors, you may not be able to calculate the coefficients at all. Despite these problems, the predicted Y will generally still be reliable. If you mainly care about the predicted Y, you should be less concerned, but if you want to know how each predictor influences the dependent variable, multicollinearity may become a problem. For example, when X1 and X2 are highly correlated, you may get any one of the following equations from only slightly different sets of data:


Because X1 and X2 are highly correlated, their values move together, so the predicted Y will be similar under any of the options above.
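The instability described above can be reproduced with a small simulation. This is only a minimal sketch on synthetic data, assuming NumPy is available. The helper names `fit_ols` and `simulate`, the "true" coefficients (1, 2, 3), and the noise levels are illustrative assumptions, not values from the original example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*X1 + b2*X2; returns [b0, b1, b2]."""
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def simulate(seed, n=50):
    """Draw one synthetic sample in which X2 is almost a copy of X1."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)  # very high correlation with x1
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)
    return np.column_stack([x1, x2]), y

# Fit the same model on several samples drawn from the same process.
fits = [fit_ols(*simulate(seed)) for seed in range(5)]
for b0, b1, b2 in fits:
    print(f"Y = {b0:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2   (b1 + b2 = {b1 + b2:.2f})")

# Predict Y at X1 = 1, X2 = 1 with each fitted equation.
new_point = np.array([1.0, 1.0, 1.0])  # [intercept, X1, X2]
predictions = [float(new_point @ coef) for coef in fits]
print("Predicted Y at X1 = X2 = 1:", np.round(predictions, 2))
```

In a run like this, the individual coefficients b1 and b2 should differ a lot between samples, while their sum and the predicted Y stay close to each other. That is the pattern the paragraph describes.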

How to find multicollinearity?

You might think of simply looking at the correlation matrix of the predictors. This can reveal a high correlation between two independent variables, but multicollinearity may also be caused by a relationship among more than two variables, such as X3 = 2X2 + X1. A simple way to detect multicollinearity is to calculate the Variance Inflation Factor (VIF) for each predictor: run a multiple regression with that predictor as the dependent variable and the remaining independent variables as predictors, then compute:
VIFj = 1 / (1 - R²j), where R²j is the R-squared of the regression of predictor Xj on the other predictors.

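For example, for the model Y = b0 + b1X1 + b2X2 + b3X3, calculate the R²j values for the following regression models:

X1 regressed on X2 and X3 (gives R²1)
X2 regressed on X1 and X3 (gives R²2)
X3 regressed on X1 and X2 (gives R²3)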
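The VIF procedure described here can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available. The function names `r_squared` and `vif` and the synthetic data (including the near-exact relationship X3 = 2X2 + X1 plus a little noise) are assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least-squares regression of y on X (with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def vif(X):
    """VIFj = 1 / (1 - R2j) for each column j, regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)   # all predictors except column j
        r2 = r_squared(others, X[:, j])
        # Note: with perfect collinearity r2 is 1 and this division breaks down.
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x2 + x1 + rng.normal(scale=0.1, size=n)  # nearly a combination of x1 and x2
x4 = rng.normal(size=n)                           # unrelated to the others

collinear = vif(np.column_stack([x1, x2, x3]))
independent = vif(np.column_stack([x1, x2, x4]))
print("VIF with X3 = 2*X2 + X1 + noise:", np.round(collinear, 1))
print("VIF with unrelated predictors:  ", np.round(independent, 2))
```

With the three-variable relationship present, every predictor involved should receive a large VIF, while the unrelated predictors should stay close to 1.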


What is big VIF?

There is no clear-cut rule for the correct VIF threshold, but some rules of thumb are common. If the VIF is less than 2.5, you do not have a problem. If it is between 2.5 and 5, you should look into it, but it is probably not a problem. When the VIF is greater than 5, you should probably remove the problematic variable from the model, and if it is greater than 10, you definitely need to act.
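These rules of thumb can be written as a small helper. This is only a sketch: the function name `interpret_vif` and the wording of the returned labels are assumptions, and a value of exactly 5 is treated here as belonging to the "look into it" band.

```python
def interpret_vif(vif):
    """Map a VIF value to the rule-of-thumb bands described above."""
    if vif > 10:
        return "definitely act"
    if vif > 5:
        return "probably remove the variable"
    if vif >= 2.5:
        return "look into it, probably not a problem"
    return "no problem"

for value in (1.3, 3.0, 7.5, 25.0):
    print(value, "->", interpret_vif(value))
```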

Do not worry about the following cases: