# Correlation Coefficient Calculator (and covariance)

The **correlation calculator** and **covariance calculator** calculate the correlation and test the significance of the result.

## Information

### What is covariance?

The covariance measures how two variables change together.

The covariance range is unlimited, from negative infinity to positive infinity. For independent variables, the **covariance is zero**.

- **Positive covariance** - changes go in the same direction: when one variable increases, the second variable usually increases too, and when one variable decreases, the second variable usually decreases too.
- **Negative covariance** - changes go in opposite directions: when one variable increases, the second variable usually decreases, and when one variable decreases, the second variable usually increases.

#### How to calculate the covariance

The covariance formula is:

$$\operatorname{Cov}(X,Y) = E\big[(X-E[X])(Y-E[Y])\big]$$

$$\operatorname{Cov}(X,Y) = E[XY]-E[X]E[Y]$$

$S_{XY}$ - the sample covariance between X and Y.

$$S_{XY} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{n-1}$$
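The sample covariance formula can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `sample_covariance` and the toy data are assumptions, not part of the calculator.

```python
def sample_covariance(x, y):
    """Sample covariance S_XY, using the n - 1 denominator."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of the paired deviations from the means.
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / (n - 1)


# Toy data (hypothetical): y increases with x, so the covariance is positive.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(sample_covariance(x, y))

# Reversing y flips the direction, so the covariance becomes negative.
print(sample_covariance(x, y[::-1]))
```

For the toy data the sum of cross-products is 10 and n - 1 = 3, so the first value is 10/3 and the second is -10/3.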

### What is correlation?

You may say that there is a correlation between two variables, or statistical association, when the value of one variable may at least partially predict the value of the other variable.

The correlation is a standardized covariance; its range is between -1 and 1.

The correlation ignores the question of cause and effect: whether X depends on Y, Y depends on X, or both variables depend on a third variable Z.

Similarly to the covariance, for independent variables, the **correlation is zero**.

- **Positive correlation** - changes go in the same direction: when one variable increases, the second variable usually increases too, and when one variable decreases, the second variable usually decreases too.
- **Negative correlation** - changes go in opposite directions: when one variable increases, the second variable usually decreases, and when one variable decreases, the second variable usually increases.
- **Perfect correlation** - when you know the value of one variable, you can calculate the exact value of the second variable. For a perfect positive correlation r = 1, and for a perfect negative correlation r = -1.

### What is the Pearson correlation coefficient?

The Pearson correlation coefficient is a type of correlation that measures the linear association between two variables.

#### How to calculate the Pearson correlation?

##### Population Pearson correlation formula

$$\rho_{XY} = \frac{E\big[(X-E[X])(Y-E[Y])\big]}{\sigma_X \sigma_Y}$$

$$\rho = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

##### Sample Pearson correlation formula

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

$$r = \frac{S_{XY}}{S_X S_Y}$$
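As a minimal sketch, the sample Pearson formula translates directly into Python. The function name `pearson_r` and the toy data are assumptions made for illustration; the n - 1 factors cancel between numerator and denominator, so the code works with the raw sums.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation from the sums of squares and cross-products."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        # A constant variable has no spread, so r is undefined.
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Toy data (hypothetical): an exact increasing line, then an exact decreasing line.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The two calls illustrate the perfect positive and perfect negative cases, where every point lies exactly on a straight line.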

## Assumptions

- **Continuous variables** - The two variables are continuous (ratio or interval).
- **Outliers** - The sample correlation value is sensitive to outliers. We check for outliers at the pair level, on the linear regression residuals.
- **Linearity** - A linear relationship between the two variables. The correlation is the effect size of the linearity (the commonly used effect size $f^2$ is derived from $R^2$, and in simple linear regression $R^2$ equals $r^2$).
- **Normality** - Bivariate normal distribution. Instead of checking for bivariate normality, we calculate the linear regression and check the normality of the residuals.
- **Homoscedasticity** (homogeneity of variance) - The variance of the residuals is constant and does not depend on the independent variables $X_i$.

## Tests

When the null hypothesis is $\rho_0 = 0$, i.e. the variables are uncorrelated, and X and Y have a bivariate normal distribution or the sample size is large, you may use the t-test.

When $\rho_0 \neq 0$, the sampling distribution of r is not symmetrical, hence you can't use the t-distribution. In this case, you should use the Fisher transformation to transform the distribution.

After the transformation, the sampling distribution tends toward the normal distribution.
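For reference, the Fisher transformation in its standard textbook form is shown below. The standard deviation $\sigma'$ is the usual approximation for the Pearson correlation under bivariate normality; whether the tool uses exactly this value (especially for Spearman's rank correlation) is an assumption.

$$r' = \operatorname{arctanh}(r) = \frac{1}{2}\ln\frac{1+r}{1-r}, \qquad \rho'_0 = \operatorname{arctanh}(\rho_0), \qquad \sigma' = \frac{1}{\sqrt{n-3}}$$

For example, with $r = 0.5$ and $n = 28$: $r' = \frac{1}{2}\ln 3 \approx 0.549$ and $\sigma' = 1/\sqrt{25} = 0.2$.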

### What is Spearman's rank correlation coefficient?

Spearman's rank correlation coefficient is a non-parametric statistic that measures the monotonic association between two variables.

What is a monotonic association? When one variable increases, the second variable usually also increases, or when one variable increases, the second variable usually decreases.

You may use Spearman's rank correlation when the two variables do not meet the Pearson correlation assumptions, as in the following cases:

- Ordinal discrete variables
- Non-linear data
- The data distribution is not bivariate normal.
- The data contains outliers.
- The data doesn't meet the homoscedasticity assumption: the variance of the residuals is not constant.

#### How to calculate Spearman's rank correlation?

Rank the data separately for each variable and then calculate the **Pearson correlation** of the ranked data.

The smallest value gets rank 1, the second smallest gets rank 2, etc. Ranking the opposite way, with the largest value as 1, gives the same correlation value, as long as both variables are ranked in the same direction.

##### Ties

When the data contains repeated values, each of the tied values gets the average of their ranks. In the example below, the value 8 appears twice in X and occupies ranks 4 and 5, hence both occurrences get the average rank: (4 + 5)/2 = **4.5**.

#### Example

__Data__

X | Y |
---|---|
7.3 | 7 |
8 | 6.6 |
5.4 | 5.4 |
2.7 | 3.7 |
8 | 9.9 |
9.1 | 11 |

__Ranks__

X | Y |
---|---|
3 | 4 |
4.5 | 3 |
2 | 2 |
1 | 1 |
4.5 | 5 |
6 | 6 |
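The ranking procedure, including the average rank for ties, can be sketched in Python using the example data above. The helper names (`average_ranks`, `pearson_r`, `spearman_rho`) are assumptions made for illustration, not the calculator's own code.

```python
import math


def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end hold ranks start+1..end+1; take their mean.
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranked data."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [7.3, 8, 5.4, 2.7, 8, 9.1]
y = [7, 6.6, 5.4, 3.7, 9.9, 11]
print(average_ranks(x))  # [3.0, 4.5, 2.0, 1.0, 4.5, 6.0]
print(average_ranks(y))  # [4.0, 3.0, 2.0, 1.0, 5.0, 6.0]
print(round(spearman_rho(x, y), 4))
```

The printed ranks match the Ranks table, with both occurrences of 8 in X receiving 4.5.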

## Assumptions

- **Ordinal / Continuous** - The two variables should be ordinal or continuous (ratio or interval).
- **Monotonic association**

### Distribution

When $\rho_0 \neq 0$, the distribution is not symmetric; in this case, the tool uses the normal distribution on the Fisher transformation.

When $\rho_0 = 0$, you have several options:

- **Automatic** - Uses the t-test, and uses the Fisher transformation for the confidence interval.
- **T-distribution** - Uses the t-test and a confidence interval based on the t-distribution.
- **Z-distribution** - Uses the Fisher transformation for the z-test and the confidence interval.
- **Exact** - Relevant only for Spearman's rank correlation. When the sample size is small, the t-distribution or z-distribution is not a good enough approximation, hence you should use the exact value, taken from a pre-calculated table. In this case, the p-value is accurate for the values in the following list:

  [0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001, 0.0005]

  Any p-value in between is only an interpolation, but this usually will not change the result, as all the common significance levels listed above are accurate.

The confidence interval based on the Fisher transformation generally gives better results.

$$H_0: \rho = \rho_0$$

$$H_1: \rho \neq \rho_0$$

**We usually test for $\rho_0 = 0$**, hence use the t-test.

__T-test__

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
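A small sketch of the t statistic follows, assuming the standard form t = r√(n - 2)/√(1 - r²) with n - 2 degrees of freedom. The function name `correlation_t_statistic` and the input values are hypothetical. The p-value lookup is left out, since Python's standard library has no t-distribution.

```python
import math


def correlation_t_statistic(r, n):
    """Return (t, degrees of freedom) for testing rho_0 = 0."""
    if n < 3:
        raise ValueError("need at least 3 pairs")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, df


# Hypothetical input: r = 0.6 from n = 18 pairs.
t, df = correlation_t_statistic(0.6, 18)
print(t, df)
```

Here √(n - 2) = 4 and √(1 - r²) = 0.8, so t is 0.6 · 4 / 0.8 = 3 with 16 degrees of freedom. The resulting t would then be compared with the t-distribution on those degrees of freedom.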

__Z-test on Fisher transformation__

$$z = \frac{r' - \rho'_0}{\sigma'}$$
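The z-test can be sketched as follows, assuming the standard Fisher transformation r' = arctanh(r), ρ'₀ = arctanh(ρ₀) and σ' = 1/√(n - 3). These definitions are the usual ones for the Pearson correlation and may differ from what the tool uses, particularly for Spearman's rank correlation. The function name `fisher_z_test` and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_test(r, n, rho0=0.0, confidence=0.95):
    """Two-sided z-test and confidence interval on the Fisher-transformed r."""
    if n < 4:
        raise ValueError("need at least 4 pairs")
    r_prime = math.atanh(r)              # assumed transform: arctanh(r)
    rho0_prime = math.atanh(rho0)
    sigma_prime = 1 / math.sqrt(n - 3)   # assumed standard deviation of r'
    z = (r_prime - rho0_prime) / sigma_prime
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Build the interval on the transformed scale, then transform back with tanh.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = math.tanh(r_prime - z_crit * sigma_prime)
    upper = math.tanh(r_prime + z_crit * sigma_prime)
    return z, p_value, (lower, upper)


# Hypothetical input: r = 0.5 from n = 28 pairs, tested against rho_0 = 0.3.
z, p_value, interval = fisher_z_test(0.5, 28, rho0=0.3)
print(z, p_value, interval)
```

Because the interval is built on the transformed scale and mapped back with tanh, it always stays inside (-1, 1) and is not symmetric around r.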

__Spearman rank exact__