Linear Regression Validation

Multicollinearity

What is Multicollinearity?

Multicollinearity occurs when two or more predictors are highly correlated. As a result, the standard errors of the coefficients are inflated, and the coefficient estimates become unstable and unreliable. In extreme cases, with a perfect correlation between predictors, you may not be able to calculate the coefficients at all. Despite these problems, the predicted Y will still be correct. If you mainly care about the predicted Y, you should be less concerned, but if you want to know how each predictor influences the dependent variable, multicollinearity may become a problem. For example, when X₁ and X₂ are highly correlated, you may get any one of the following equations from only slightly different sets of data:

Y=1X₁+1X₂+3X₃
Y=0.1X₁+1.9X₂+3X₃
Y=1.8X₁+0.2X₂+3X₃

Because of the high correlation, the values of X₁ and X₂ will be similar, so the predicted Y will be similar under any of the above equations.
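The effect described above can be illustrated with a small simulation. This is a minimal sketch, not a definitive procedure: the sample size, the noise levels, the random seed, and the helper name `fit_ols` are all assumptions chosen for illustration, and the first equation above is taken as the assumed true model. Two data sets that differ only in the random noise are fitted separately. The individual coefficients of X₁ and X₂ can differ a lot between the fits, while their sum and the predicted Y stay close.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares coefficients [b0, b1, ..., bk] for y ~ X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)          # assumed seed, for reproducibility
n = 500                                 # assumed sample size
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)     # X2 is almost a copy of X1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Assumed true model: Y = 1*X1 + 1*X2 + 3*X3, plus noise.
signal = 1.0 * x1 + 1.0 * x2 + 3.0 * x3
y_a = signal + 0.5 * rng.normal(size=n)  # first data set
y_b = signal + 0.5 * rng.normal(size=n)  # slightly different data set

coef_a = fit_ols(X, y_a)
coef_b = fit_ols(X, y_b)

# Predictions of both fitted models on the same predictors.
design = np.column_stack([np.ones(n), X])
pred_a = design @ coef_a
pred_b = design @ coef_b
mean_gap = np.mean(np.abs(pred_a - pred_b))

print("fit A: b1, b2, b3 =", np.round(coef_a[1:], 2))
print("fit B: b1, b2, b3 =", np.round(coef_b[1:], 2))
print("b1 + b2 (A, B):", round(coef_a[1] + coef_a[2], 2), round(coef_b[1] + coef_b[2], 2))
print("mean absolute difference in predicted Y:", round(mean_gap, 3))
```

Only the sum b₁+b₂ is well determined by the data here, which is why the predictions agree even when the individual coefficients do not.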

How to find multicollinearity?

You may think of just looking at the correlation matrix of the predictors. This way you can identify a high correlation between two independent variables, but multicollinearity may also be caused by a relationship among more than two variables, such as X₃=2X₂+1X₁. A simple way to detect multicollinearity is to compute the Variance Inflation Factor (VIF) for each predictor. Run a multiple regression for each predictor, treating it as the dependent variable and the remaining predictors as the independent variables.
VIF_j = 1 / (1 - R²_j), where R²_j is the R² of the regression of predictor j on the remaining predictors.

Example
For the model Y=b₀+b₁X₁+b₂X₂+b₃X₃, calculate the R²_j values from the following regression models:

X₁=b₀+b₂X₂+b₃X₃.
X₂=b₀+b₁X₁+b₃X₃.
X₃=b₀+b₁X₁+b₂X₂.
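The auxiliary regressions above can be sketched in code as follows. This is an illustrative sketch using only NumPy; the function name `vif`, the simulated data, the noise level, and the extra unrelated predictor `x4` are assumptions added for the example. Each predictor is regressed on the remaining predictors (with an intercept), and the resulting R²_j is plugged into VIF_j = 1 / (1 - R²_j).

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = np.empty(k)
    for j in range(k):
        target = X[:, j]                       # predictor j as dependent variable
        others = np.delete(X, j, axis=1)       # the remaining predictors
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ss_res / ss_tot
        result[j] = 1.0 / (1.0 - r2_j)         # very large under near-perfect collinearity
    return result

# Hypothetical data: X3 is (almost) 2*X2 + 1*X1, X4 is unrelated.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x2 + 1.0 * x1 + 0.1 * rng.normal(size=n)
x4 = rng.normal(size=n)

vif_values = vif(np.column_stack([x1, x2, x3, x4]))
print(np.round(vif_values, 1))
```

In this setup X₁, X₂ and X₃ all receive a large VIF because each can be predicted from the other two, while the unrelated X₄ stays close to 1. For real work, an established implementation such as the one in statsmodels may be preferable to this hand-written loop.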

What is a big VIF?

There is no clear-cut rule for the correct VIF threshold. If the VIF is less than 2.5, you do not have a problem. If it is between 2.5 and 5, you should look into it, but it is probably not a problem. When the VIF is greater than 5, you should probably remove the problematic variable from the model, and if it is greater than 10, you definitely need to act.
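The rule of thumb above can be written as a small helper. This is only a sketch: the function name `interpret_vif` and the returned wording are assumptions, and the handling of values exactly on a boundary (2.5, 5, 10) is a choice made for this example, since the thresholds themselves are not clear-cut.

```python
def interpret_vif(vif):
    """Map a VIF value to the rule-of-thumb advice described in the text."""
    if vif < 2.5:
        return "no problem"
    if vif <= 5:
        return "look into it, probably not a problem"
    if vif <= 10:
        return "probably remove the problematic variable"
    return "definitely act"

for value in (1.3, 3.0, 7.5, 13.0):
    print(value, "->", interpret_vif(value))
```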

Do not worry about the following cases:

The collinearity is built into the model by design, as with an interaction term X₁X₂ or a squared term X₁².
When the problematic variable is a control variable, and the non-control variables do not have high multicollinearity.
The model contains a dummy variable with more than two categories, and the reference category's proportion is small (the category that does not get a dummy variable).

In the last case, the reference category is rarely observed, so in most observations exactly one of the dummy variables equals one. You can therefore predict any dummy variable almost perfectly from the remaining dummy variables, which produces a high VIF.
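The dummy-variable case can be checked with a small calculation. This is a sketch under assumed numbers: a variable with three categories (a reference category plus categories A and B), the category counts, and the helper name `dummy_vif` are all hypothetical. With only two dummy variables, the R² of regressing one on the other equals their squared correlation, so the VIF is 1 / (1 - r²).

```python
import numpy as np

def dummy_vif(n_ref, n_a, n_b):
    """VIF of the two dummies of a 3-category variable with the given counts."""
    # Dummy for category A and dummy for category B; the reference gets none.
    d_a = np.concatenate([np.zeros(n_ref), np.ones(n_a), np.zeros(n_b)])
    d_b = np.concatenate([np.zeros(n_ref), np.zeros(n_a), np.ones(n_b)])
    r = np.corrcoef(d_a, d_b)[0, 1]
    return 1.0 / (1.0 - r ** 2)

vif_small_ref = dummy_vif(2, 49, 49)    # reference category is 2% of the data
vif_balanced = dummy_vif(33, 33, 33)    # reference category is one third of the data

print("small reference category:", round(vif_small_ref, 3))
print("balanced categories:", round(vif_balanced, 3))
```

With the small reference category the VIF is well above 10, while the balanced split gives a VIF close to 1, even though nothing is wrong with the model in either case.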