# Linear Regression Validation

## Multicollinearity

### What is Multicollinearity?

Multicollinearity occurs when two or more predictors are highly correlated. As a result, the standard errors of the coefficients become inflated and unreliable, so the coefficient estimates will not be accurate. In extreme cases of perfect correlation between predictors, the coefficients cannot be calculated at all. Despite these problems, the predicted Y will still be correct. If you mainly care about the predicted Y, you can be less concerned; but if you want to know how each predictor influences the dependent variable, multicollinearity becomes a problem. For example, with a high correlation between X_{1} and X_{2}, you may get any of the following equations based on only slightly different sets of data:

Y = 1X_{1} + 1X_{2} + 3X_{3}

Y = 0.1X_{1} + 1.9X_{2} + 3X_{3}

Y = 1.2X_{1} + 0.8X_{2} + 3X_{3}

Because of the high correlation, the values of X_{1} and X_{2} will be similar, hence the predicted Y will be nearly the same under any of the above equations.
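To see this numerically, here is a minimal sketch with simulated data (the coefficient values are illustrative, not taken from a real dataset): two very different splits of the same total weight between nearly identical predictors X_{1} and X_{2} produce almost the same predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# X1 and X2 are nearly identical (very high correlation); X3 is independent.
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.01, size=1000)
x3 = rng.normal(size=1000)

# Two different splits of the same total weight (2.0) between X1 and X2.
y_a = 1.0 * x1 + 1.0 * x2 + 3.0 * x3
y_b = 0.1 * x1 + 1.9 * x2 + 3.0 * x3

# Because x1 is almost equal to x2, the predictions barely differ.
print(np.corrcoef(x1, x2)[0, 1])   # very close to 1
print(np.max(np.abs(y_a - y_b)))   # tiny compared with the scale of y
```

The coefficients on X_{1} and X_{2} individually are meaningless here; only their combined effect is identified by the data.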

### How to Find Multicollinearity?

You may think of just looking at the correlation matrix of the predictors. That way you can identify a high correlation between two predictors, but multicollinearity may also be caused by a relationship among more than two variables, such as X_{3} = 2X_{2} + 1X_{1}. A simple way to find multicollinearity is the Variance Inflation Factor (VIF) for each predictor: run a multiple regression with each predictor as the dependent variable and the remaining predictors as the independent variables, and use the resulting R^{2}_{j}:

VIF_{j} = 1 / (1 - R^{2}_{j})

**Example**

For the model Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}, calculate the R^{2}_{j} values from the following auxiliary regressions:

X_{1} = b_{0} + b_{2}X_{2} + b_{3}X_{3}.

X_{2}=b_{0}+b_{1}X_{1}+b_{3}X_{3}.

X_{3}=b_{0}+b_{1}X_{1}+b_{2}X_{2}.
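The procedure above can be sketched directly with NumPy on simulated data (the `vif` helper below is written for this example, not a standard library function): regress each predictor on the others, compute R^{2}_{j}, and apply VIF_{j} = 1/(1 - R^{2}_{j}).

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress column j on the other columns,
    then VIF_j = 1 / (1 - R^2_j)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.5, size=500)  # strongly correlated with x1
x3 = rng.normal(size=500)                  # independent of the others
X = np.column_stack([x1, x2, x3])
print(vif(X))  # x1 and x2 get large VIFs; x3 stays near 1
```

In this simulation, x1 and x2 flag each other (VIF around 5), while the independent x3 gets a VIF near 1.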

### What is a Big VIF?

There is no clear-cut threshold for a problematic VIF. If it is less than **2.5**, you do not have a problem. If it is between **2.5** and **5**, you should look into it, but it is probably not a problem. When the VIF is greater than **5**, you should probably remove the problematic variable from the model, and if it is greater than **10**, you definitely need to act.

### Do not worry about the following cases:

- If you built the collinearity into the model yourself, for example with an interaction or polynomial term such as X_{1}X_{2} or X_{1}^{2}.
- When the problematic variable is a control variable, and the non-control variables do not have high multicollinearity.
- When the model contains dummy variables for a categorical variable with more than two categories, and the proportion of the reference category (the category that does not get a dummy variable) is small.
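For the first case, a squared term is strongly correlated with the original variable whenever that variable is strictly positive; this structural collinearity is expected, and (as a side note) centering the variable before squaring largely removes it. A small sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.uniform(1, 5, size=1000)  # strictly positive, so x1 and x1^2 move together

raw_corr = np.corrcoef(x1, x1 ** 2)[0, 1]

# Centering x1 before squaring breaks most of the structural correlation.
centered = x1 - x1.mean()
centered_corr = np.corrcoef(centered, centered ** 2)[0, 1]

print(raw_corr)       # close to 1
print(centered_corr)  # close to 0
```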

The dummy-variable case arises because, when the reference category is rare, almost every observation has exactly one dummy variable equal to one; hence any one dummy variable can be predicted almost perfectly from the rest, which creates high multicollinearity among the dummies.
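A quick simulated illustration (the category shares are assumed for the example): with a rare reference category, one dummy is almost perfectly determined by the others.

```python
import numpy as np

rng = np.random.default_rng(2)

# Four categories; the reference category 0 is rare (2% of rows),
# and categories 1-3 each get a dummy variable.
cats = rng.choice(4, size=2000, p=[0.02, 0.40, 0.30, 0.28])
d1 = (cats == 1).astype(int)
d2 = (cats == 2).astype(int)
d3 = (cats == 3).astype(int)

# Whenever the row is not in the reference category, exactly one dummy is 1,
# so d3 == 1 - d1 - d2; only the rare reference rows break this identity.
predictable = np.mean(d3 == 1 - d1 - d2)
print(predictable)  # about 0.98
```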